By Wolfgang Unger
What exactly does Reliability, one of the 5 pillars of the Well-Architected Framework, mean?
Let's take a closer look at what Reliability means in the context of AWS and cloud computing.
The AWS documentation describes Reliability like this:
"The reliability pillar encompasses the ability of a workload to perform its intended function
correctly
nd consistently when it’s expected to.
This includes the ability to operate and test the workload through its total lifecycle.
This paper provides in-depth, best practice guidance for implementing reliable workloads on AWS"
There are 4 major topics to look at in detail:
Shared Responsibility Model for Resiliency
Design principles
Definitions
Understanding availability needs
Shared Responsibility Model for Resiliency
Resiliency is a shared responsibility between AWS and you.
It is important that you understand how disaster recovery (DR) and availability, as part of
resiliency, operate under this shared model.
AWS responsibility - Resiliency of the cloud
AWS is responsible for resiliency of the infrastructure that runs all of the services offered in the
AWS Cloud.
This infrastructure comprises the hardware, software, networking, and facilities that run AWS Cloud
services.
AWS uses commercially reasonable efforts to make these AWS Cloud services available, ensuring
service availability meets or exceeds AWS Service Level Agreements (SLAs).
Customer responsibility - Resiliency in the cloud
Your responsibility is determined by the AWS Cloud services that you select.
This determines the amount of configuration work you must perform as part of your resiliency
responsibilities.
For example, a service such as Amazon Elastic Compute Cloud (Amazon EC2) requires the customer to
perform all of the necessary resiliency configuration and management tasks.
Design principles
In the cloud, there are a number of design principles that can help you increase reliability.
Let's look at 5 of them:
Automatically recover from failure:
By monitoring a workload for key performance indicators (KPIs), you can run automation when a
threshold is breached.
This allows for automatic notification and tracking of failures, and for automated recovery
processes that
work around or repair the failure. With more sophisticated automation, it’s possible to anticipate
and remediate failures before they occur.
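The pattern can be sketched in plain Python. Note that the KPI, the threshold, and the recovery actions below are hypothetical placeholders; in a real workload you would wire a CloudWatch alarm to an automated action instead:

```python
# Sketch of threshold-driven automatic recovery. The error-rate KPI and the
# recovery steps are invented placeholders, not a real AWS integration.

ERROR_RATE_THRESHOLD = 0.05  # KPI threshold: 5% failed requests

def evaluate_kpi(failed: int, total: int) -> float:
    """Compute the error-rate KPI from raw request counters."""
    return failed / total if total else 0.0

def recover_workload(events: list) -> None:
    """Placeholder recovery automation: notify, track, repair."""
    events.append("notified on-call")
    events.append("restarted unhealthy instance")

def monitor(failed: int, total: int, events: list) -> bool:
    """Run the recovery automation when the KPI breaches the threshold."""
    if evaluate_kpi(failed, total) > ERROR_RATE_THRESHOLD:
        recover_workload(events)
        return True
    return False
```

With 10 failures out of 100 requests the threshold is breached and the recovery steps run; with 1 out of 100 nothing happens.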
Test recovery procedures:
In on-premises environments, testing is not typically used to validate recovery strategies. In the cloud, you can test how your
workload fails,
and you can validate your recovery procedures.
You can use automation to simulate different failures or to recreate scenarios that led to failures
before.
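A toy fault-injection sketch of this idea follows; the instance fleet here is just a dict, and the injection function is invented for illustration (against real infrastructure you would use a tool such as AWS Fault Injection Simulator):

```python
import random

# Toy fault injection: "terminate" a random instance and verify the
# workload still serves traffic. Instances are modeled as dict entries.

def inject_failure(instances: dict, rng: random.Random) -> str:
    """Simulate a failure by marking a randomly chosen instance unhealthy."""
    victim = rng.choice(sorted(instances))
    instances[victim] = False
    return victim

def workload_available(instances: dict) -> bool:
    """The workload is available while at least one instance is healthy."""
    return any(instances.values())
```

Running this repeatedly against a model of your architecture is the essence of a recovery test: inject the failure, then assert the workload's availability invariant still holds.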
Scale horizontally to increase aggregate workload availability:
Replace one large resource with multiple small resources to reduce the impact of a single failure on
the overall workload.
Distribute requests across multiple, smaller resources to ensure that they don’t share a common
point of failure.
If these instances are auto scaled based on metrics like CPU utilization, this also helps you save
costs.
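A quick way to see the benefit of many small resources: if a single instance is up with probability a, then N independent instances behind a load balancer (any one of which can serve) are all down only with probability (1-a)^N. This is a standard rule of thumb, assuming independent failures:

```python
def fleet_availability(instance_availability: float, n: int) -> float:
    """Availability of N independent instances where any single one
    suffices to serve traffic: 1 - P(all n instances fail at once)."""
    return 1 - (1 - instance_availability) ** n
```

For example, three instances at 99% each already push the fleet well past 99.99% availability, since all three would have to fail at the same time.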
Stop guessing capacity:
In the cloud, you can monitor demand and workload utilization, and automate the addition or removal
of resources
to maintain the optimal level to satisfy demand without over- or under-provisioning.
There are still limits, but some quotas can be controlled and others can be managed.
You no longer need to do the guesswork of capacity planning.
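The core idea behind target-tracking scaling can be illustrated with the simple proportional rule it is based on (a sketch only; real Auto Scaling adds cooldowns, warm-up, and its own rounding behavior):

```python
import math

def desired_capacity(current: int, metric_value: float, target: float) -> int:
    """Proportional target-tracking rule: adjust capacity so the observed
    metric (e.g. average CPU) moves toward its target value."""
    return max(1, math.ceil(current * metric_value / target))
```

With 4 instances at 75% average CPU and a 50% target, the rule asks for 6 instances; at 20% average CPU it shrinks the fleet to 2.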
Manage change through automation:
Changes to your infrastructure should be made using automation.
The changes that need to be managed include changes to the automation, which then can be tracked and
reviewed.
You should set up a pipeline to deploy your infrastructure as code (IaC), for example with CloudFormation or the CDK.
Definitions
This whitepaper covers reliability in the cloud, describing best practice for these four areas:
Foundations
Workload Architecture
Change Management
Failure Management
Take a look at the whitepaper PDF for more details on these areas.
Understanding availability needs
Most applications today aren’t monolithic systems with a single availability target.
Generally speaking, you can break an application down into several components, each with different
availability requirements.
The microservice architecture that has become established in recent years has replaced the monolithic
design pattern.
By breaking your systems down into these component pieces, you can better understand your
availability needs,
and design your systems to meet those targets in a cost-effective way.
Many AWS customers take a similar approach to critically evaluating their applications and
identifying subcomponents
with different availability needs.
Availability design goals are then tailored to the different aspects, and the appropriate work
efforts are performed to engineer
the system.
AWS has significant experience engineering applications with a range of availability
design goals,
including services with 99.999% or greater availability.
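For components that are hard dependencies of one another, the availabilities multiply, which is why per-component targets matter. A small check of this standard rule of thumb (the component figures below are purely illustrative):

```python
def serial_availability(*components: float) -> float:
    """Availability of a workload whose hard dependencies must all be up:
    the product of the individual component availabilities."""
    result = 1.0
    for a in components:
        result *= a
    return result

def downtime_minutes_per_year(availability: float) -> float:
    """Expected yearly downtime implied by an availability figure."""
    return (1 - availability) * 365 * 24 * 60
```

Two chained components at 99.99% each yield roughly 99.98% overall, and a 99.9% workload already implies about 525 minutes of downtime per year; breaking the application apart lets you spend the engineering effort only where the tighter targets are needed.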