By Wolfgang Unger
Introduction
Is your AWS account and architecture really set up correctly?
Is it really secure, resilient and reliable?
After about 10 years of working with AWS on many different customer projects,
I have seen quite a lot of typical and common errors and anti-patterns.
It is really not unusual to find critical security issues and bad practices in production accounts.
This blog will cover the most common issues you will normally find on AWS.
Are you in doubt whether your AWS architecture was set up with best practices?
You can use this article as a checklist and do a self-review of your account and architecture.
If you find some of the mentioned issues in your account, you should take action immediately.
The 5 most common errors and problems we will have a closer look at are:
- IAM and Security Issues
- No Snapshots and Backups - Reliability, Failure management (RPO)
- Non-resilient architectures, single points of failure - Reliability: Workload architecture
- No usage of IaC (RTO)
- No Monitoring, Alarming and Logging
To understand this blog, you should know what RPO and RTO are, if not, please have a look in this blog from our site:
Disaster Recovery - RPO and RTO
IAM and Security Issues
Unfortunately, serious security issues and bad practices are really common on AWS accounts.
I have written a dedicated blog about security best practices; please read it to get more
information if you find some of the following issues.
Security Best Practices on AWS
In here we will keep it a little shorter and focus on the most critical and most widespread errors.
Avoid IAM Users, use Roles instead.
This is valid for real users and for application permissions.
Real users need only one IAM user in the master account; in all other accounts they can use switch role.
This means each user has only one Access Key & Secret Key pair, needs to store only one (long & strong) password,
and of course has MFA enabled.
Applications don't need IAM users to work properly and therefore don't need Access Keys (for example
passed as environment variables in cleartext or in a properties file inside the application).
Use roles instead, no Access Keys required!
If you want to perform some AWS commands with an SDK, for example boto or the .NET SDK, you don't need
to pass a user or credentials when constructing the client.
Use the default constructor and the client will use the attached instance role.
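A minimal sketch in Python with boto3 (the bucket listing is just an illustration):

```python
import boto3

# No access keys are passed here: boto3 resolves credentials automatically,
# e.g. from the EC2 instance role or ECS task role attached to the workload.
s3 = boto3.client("s3")

# Any call now runs with the role's permissions.
for bucket in s3.list_buckets()["Buckets"]:
    print(bucket["Name"])
```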
If you have Access Keys & Secrets in cleartext in your environment variables, properties files or Git, fix this immediately.
Violation of the principle of least privilege
Grant least access when creating IAM Policies, allowing only the necessary actions.
Don't assign the Administrator Policy to anybody, and also not to your applications! Only assign rights which are really needed.
If your application has the Administrator Policy attached and gets hacked, the attacker can delete almost anything in your account.
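As a sketch with boto3, a least-privilege policy for an application that only needs to read objects from one bucket (bucket and policy names are hypothetical):

```python
import json
import boto3

iam = boto3.client("iam")

# The application may only read objects from one specific bucket, nothing else.
policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::my-app-bucket/*",
        }
    ],
}

iam.create_policy(
    PolicyName="AppReadOnlyS3Policy",
    PolicyDocument=json.dumps(policy_document),
)
```

Attach this policy to the application's role instead of the Administrator Policy.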
Internet-facing Databases, Instances and Applications
Take care that only the web tier is internet-facing, not the application or database tier.
Unfortunately, this is a very common and widespread anti-pattern.
Your database does not have to be (and must not be) open to 0.0.0.0/0 (the whole internet)!
If your developers must access the database with their local DB clients, they should do it over a bastion host.
Also an application server must not be internet-facing.
If you need to SSH into a server, use Session Manager; don't open port 22 to the internet.
Only the web server needs to be publicly available.
In fact the best solution is that only the load balancers are internet-facing; they will forward
traffic to the servers, which are protected in private subnets.
Keep your databases in private subnets, same for application servers and all clusters, and ideally also for
your web servers. Put a load balancer in front of them.
Enable WAF for your internet-facing resources if you want to be really secure.
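To check whether anything in your account is open to the whole internet, a small boto3 sketch can scan your security groups for 0.0.0.0/0 ingress rules:

```python
import boto3

ec2 = boto3.client("ec2")

# Flag security groups that allow inbound traffic from the whole internet.
for sg in ec2.describe_security_groups()["SecurityGroups"]:
    for rule in sg["IpPermissions"]:
        for ip_range in rule.get("IpRanges", []):
            if ip_range.get("CidrIp") == "0.0.0.0/0":
                print(f"Open to the internet: {sg['GroupId']} ({sg['GroupName']})")
```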
No Snapshots and Backups - Reliability, Failure management (RPO)
Also not rare to see: the complete absence of snapshots and backups.
No database snapshots, no EBS backups of stateful or manually configured servers, no backups of the data and storage in AWS.
If you want to take backups seriously, take a look into this blog of mine:
AWS Backup - Multi Region and Multi Account Setup
You don't have to back up multi-region or multi-account (this of course depends on your requirements), but you should back up.
You should create daily snapshots of your databases and of any servers which cannot simply be restarted after a loss of data or a crash.
We are talking here about RPO (Recovery Point Objective), the data loss after an incident you can tolerate.
If you do nightly snapshots, let's say at 1 AM, and there is an incident with your database at 11 AM and you have
to restore a snapshot, your RPO will be 10 hours. With nightly snapshots your maximum RPO is 24 hours.
If you do no backups and snapshots at all and you have an incident, you will be in real trouble.
Good luck trying to restore your data! Your RPO will probably be weeks, or worse, your data will be completely lost.
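As a minimal sketch with boto3 (the instance identifier is hypothetical; in practice you would rather enable automated backups or an AWS Backup plan):

```python
import boto3
from datetime import datetime

rds = boto3.client("rds")

# Create a manual snapshot of the database, named with today's date.
snapshot_id = f"mydb-{datetime.now():%Y-%m-%d}"
rds.create_db_snapshot(
    DBSnapshotIdentifier=snapshot_id,
    DBInstanceIdentifier="mydb",
)
```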
If you store important data in S3, think about cross-region or cross-account replication.
Enable delete protection or versioning.
Without this, you or somebody else could simply delete the data in S3, and
there is no chance to restore a non-versioned object.
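Enabling versioning is a one-liner with boto3 (the bucket name is a placeholder):

```python
import boto3

s3 = boto3.client("s3")

# With versioning enabled, deleted or overwritten objects
# can be restored from earlier versions.
s3.put_bucket_versioning(
    Bucket="my-important-data",
    VersioningConfiguration={"Status": "Enabled"},
)
```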
Non-resilient architectures, single points of failure - Reliability: Workload architecture
When it comes to reliability, you might have read about Multi Availability Zone or even cross-region failover architectures.
Well, this is a really important topic, but let's focus on an even worse and more likely failure scenario.
With Multi AZ or cross-region setups we want to protect our infrastructure from the failure of
an AWS Availability Zone or a complete AWS region.
Unless WW3 breaks out in the near future, a meteor hits the earth or Godzilla appears, this is not the most likely scenario.
What is much more likely is the failure of a single server, in particular if it is a Windows server.
Do you have single points of failure in your infrastructure?
Many services on AWS are resilient by design, like for example S3.
S3 is not implemented as a single server storing your data on its disk; it is designed Multi AZ with an availability of 99.99%.
Same for DynamoDB and Lambda. Load balancers are not just single servers, they are a managed service which will continue to work,
and a NAT Gateway is a managed service with the same architecture, not comparable with a single NAT instance.
Cluster services are also safe in terms of reliability and availability, for example ECS, ElastiCache and more.
Let's focus on EC2 instances and RDS instances (which are basically also just dedicated EC2 instances).
If you have your flagship application running on just a single EC2 instance, this cannot be considered a
reliable and highly available architecture.
The same applies if you just have a single database instance without Multi AZ or a standby or reader instance.
You have to take action here.
For EC2 there is a great feature: Auto Scaling Groups and Launch Templates.
Instead of a single EC2 instance you define a Launch Template. You can create a Launch Template from your existing EC2 instance.
Then you define an Auto Scaling Group for this Launch Template, which will take care that there are always instances
running to match the 'Desired Capacity'.
If you just need one instance, no problem, you can set 'Desired Capacity' to 1.
Your costs will not increase.
But your reliability and availability will.
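A sketch with boto3 (names and subnet IDs are placeholders; the Launch Template must already exist):

```python
import boto3

autoscaling = boto3.client("autoscaling")

# A group of exactly one instance: if it fails, the ASG replaces it,
# and the subnets below spread replacements across two AZs.
autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="my-app-asg",
    LaunchTemplate={"LaunchTemplateName": "my-app-template", "Version": "$Latest"},
    MinSize=1,
    MaxSize=1,
    DesiredCapacity=1,
    VPCZoneIdentifier="subnet-aaaa1111,subnet-bbbb2222",
)
```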
For database instances, think about setting up a cluster instead of a single instance.
Make your choice whether you need a read replica or Multi AZ; this depends on your use case.
Multi AZ is for high availability and reliability; a read replica serves more to increase the performance of your database reads.
But it will also increase reliability, because you can promote the read replica to be the new primary in case of an incident.
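Converting an existing single instance to Multi AZ can be done with one boto3 call (the instance identifier is assumed):

```python
import boto3

rds = boto3.client("rds")

# Enable Multi AZ on an existing instance; AWS provisions a
# synchronously replicated standby in another Availability Zone.
rds.modify_db_instance(
    DBInstanceIdentifier="mydb",
    MultiAZ=True,
    ApplyImmediately=True,
)
```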
No usage of IaC (RTO)
Are you already using CloudFormation, CDK or Terraform?
Perfect, then you can skip this chapter.
Just be aware of one point.
Drifts!
If you are using IaC, use it strictly.
I have seen many times that customers use for example CloudFormation, but in time-critical situations
they fix or change something in the web console and don't include these changes in their IaC files.
Now your IaC is no longer in sync with your deployed infrastructure, and this is a real problem.
In CloudFormation, for example, this is called a drift.
First, your IaC no longer represents your real infrastructure.
It might look correct and safe: if you check your code, everything looks good,
but someone opened the database Security Group in the web console, so your real infrastructure is not safe!
Second, these drifts can cause serious problems when updating a CloudFormation Stack (or trying a Terraform apply).
Your stacks might get stuck; in the worst case, where you have no chance to update or roll back again, the
only remaining solution will be destroying the stack and redeploying.
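CloudFormation can detect such drifts for you; a minimal boto3 sketch (the stack name is a placeholder, and detection runs asynchronously):

```python
import boto3

cfn = boto3.client("cloudformation")

# Start drift detection for a stack ...
detection = cfn.detect_stack_drift(StackName="my-stack")

# ... then poll the result; StackDriftStatus appears once detection finishes.
status = cfn.describe_stack_drift_detection_status(
    StackDriftDetectionId=detection["StackDriftDetectionId"]
)
print(status["DetectionStatus"], status.get("StackDriftStatus"))
```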
If you are not using IaC and do everything manually in the web console, you have a problem.
Start using an IaC tool immediately.
If you have no clue which one to use, take a look into this blog:
Infrastructure as Code
Why should you use IaC?
Well, for a POC, or to just try out whether your application would also work with DynamoDB, the web console is fine
to achieve quick results.
You don't have to code this if you do not yet know whether you will use it in production. Perfect.
But your production environment?
You must be able to re-create it quickly after an incident.
If anything was created in the web console, probably without documentation, or maybe by somebody
who is no longer in the company, you will have a lot of fun re-creating your infrastructure in this critical situation,
where every second counts.
We are talking here about RTO (Recovery Time Objective) or how quickly after an outage an application must be available again.
With CloudFormation, Terraform or CDK and a pipeline to deploy them, your RTO might be 1-2 hours (including restoring a
database snapshot and re-deploying your application code).
Without them, your RTO might be days or weeks.
Can you deal with this?
If not, please use IaC.
No Monitoring, Alarming and Logging
“Everything fails all the time.” (Werner Vogels, CTO of AWS)
Also worth mentioning: many systems on AWS do not have monitoring, alarming and logging set up properly.
Are you logging your important servers to CloudWatch or S3?
The CloudWatch Log Agent is really easy to set up on an EC2 instance.
RDS has native log features, as do most of the services, but they have to be enabled and configured.
There are VPC Flow Logs you can set up, and a lot more.
Do you have CloudTrail set up in your account to monitor and record the activities in the account?
Do you monitor your critical systems and get SNS notifications if a server becomes unhealthy, or
do you wait until customers call you to tell you your site is unavailable?
CloudWatch is the tool to use to set up alarms and notifications.
Use health checks and SNS notifications to get informed about failures.
And to avoid these situations before they turn into failures, implement the hints mentioned under Reliability and Workload architecture.
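As an example, a CloudWatch alarm on EC2 status checks that notifies an SNS topic (instance ID and topic ARN are placeholders):

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm if the instance fails its status checks for two minutes in a row,
# then notify the operations team via SNS.
cloudwatch.put_metric_alarm(
    AlarmName="ec2-status-check-failed",
    Namespace="AWS/EC2",
    MetricName="StatusCheckFailed",
    Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],
    Statistic="Maximum",
    Period=60,
    EvaluationPeriods=2,
    Threshold=1,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    AlarmActions=["arn:aws:sns:eu-central-1:123456789012:ops-alerts"],
)
```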
Various other issues
These were the top 5 common problems, but there are more points to observe.
I will not address them here in detail, but you should also take a look at:
Automation and CI/CD
Of course you can create your Lambda zip or your Elastic Beanstalk application zip in your IDE.
You can also build the Docker containers for your ECS/EKS services on your command line and push them
manually to ECR.
But you shouldn't.
The only person in the company with the know-how to do it might be on holiday when the disaster occurs.
You have to redeploy your application code but you have no idea how to build and upload it.
If you have your CI/CD pipelines prepared, you just have to press one button.
You should have pipelines for your IaC, be it CloudFormation, Terraform or CDK.
Same for your application code, Lambda code and Docker containers.
Any tool will do: AWS CodePipeline, Azure DevOps, GitLab, Jenkins.
Your choice. But use one of them.
Unused resources running
Often you find resources running which are no longer in use.
They were set up by a developer in the web console for a POC or a test and never cleaned up.
These resources cause costs and, even worse, might be a security risk, for
example an EC2 instance with an open Security Group.
Clean up your account!
Delete everything that is obsolete. And take care it doesn't happen again.
Best practice is to tag everything which is really needed.
Add at least an 'Owner' and a 'Project' tag.
This also helps identifying the responsible person in case you don't know whom to ask.
The Project tag can be used for cost allocation and cost separation.
Anything not tagged can and should be deleted.
If you want to automate this, a Lambda could do the job.
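A sketch of such a Lambda handler (the mandatory tag key is an assumption; a real cleanup would likely notify owners before deleting anything):

```python
import boto3

def handler(event, context):
    """Report EC2 instances that are missing the mandatory 'Owner' tag."""
    ec2 = boto3.client("ec2")
    untagged = []
    for reservation in ec2.describe_instances()["Reservations"]:
        for instance in reservation["Instances"]:
            tags = {t["Key"] for t in instance.get("Tags", [])}
            if "Owner" not in tags:
                untagged.append(instance["InstanceId"])
    print(f"Instances without an Owner tag: {untagged}")
    return untagged
```

Schedule it with an EventBridge rule and send the result to an SNS topic instead of just printing it.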
Missing cost optimization and cost monitoring
Many companies waste a lot of money by not optimizing the costs on AWS.
Please have a look into this blog to check whether you are doing everything right:
Cost Optimization Best Practices on AWS
Conclusion
Did you check your AWS account and architecture and found none of the mentioned issues?
Perfect, then you don't need to worry, and you or your implementation partner did a great job.
If you have any of these problems in your account, you have to fix them. Urgently.
Find someone with the know-how to do it right; of course you can also contact us,
we would proudly help to get your architecture safe and up to date with all best practices.
Contact us