Analysis: Rethinking cloud architecture after the outage of Amazon Web Services

Editor’s Note: Brian Guy is Director of Cloud Services for Dev9, a custom software development firm focused on cloud services.

Thanks to innovation from companies such as Amazon with AWS, Microsoft with Azure, and Google with Google Cloud Platform (GCP), organizations of all sizes are increasingly agile and competitive. Cloud provider partners like Dev9 enable organizations to optimize their journey to the Cloud.

But on Tuesday, February 28, 2017, many people found that their smart phone applications were no longer working properly, many web sites were down and the Internet in general just seemed broken. This is what happens when AWS, the largest Cloud provider, experiences a “service disruption.”

What makes this past week’s outage unique is that, unlike prior outages (“service disruptions” or “service events,” as Amazon calls them), the web site outages and mobile application failures were not the result of organizations failing to follow Amazon’s best practices, otherwise known as the “Well-Architected Framework.”

In prior AWS outages, such as the 2016 “Service Event in the Sydney Region” where an entire Availability Zone (AZ) failed, organizations that followed Amazon’s Well-Architected best practices were not negatively impacted. This 2017 outage will no doubt cause Amazon to reassess its Well-Architected Framework and introduce new best practices focused on S3 availability.

Indeed, even the AWS Service Health Dashboard (SHD) itself was impacted due to its dependency on S3 in a single region. Amazon has now re-architected its dashboard to be multi-region.

What Is the AWS Well-Architected Framework and Why Does It Matter?

The success of AWS is largely dependent on the success of its customers. If customers do not architect and implement optimally, it hurts the reputation of AWS. If customers have outages, poor performance, very high spend or security issues on AWS, this similarly hurts AWS.

To help its customers and by extension help itself, AWS introduced the Well-Architected Framework on October 2, 2015. The AWS Well-Architected Framework initially focused on four pillars:

Security
Reliability
Performance Efficiency
Cost Optimization

In November 2016, after a year of thousands of reviews carried out by AWS Solutions Architects, a fifth pillar was added:

Operational Excellence

For startups and smaller organizations that do not have a significant investment in on-premises hardware, deploying to the Cloud is often the default decision in 2017. The agility and flexibility of Cloud computing, combined with the lower startup costs, help these organizations spend less time and money on infrastructure and computing resources.

For large enterprises, however, deploying to the Cloud is a significant incremental cost until redundant on-premises resources are retired. This process can take years, resulting in higher costs in the near term over 3-5 years (or more).

Moreover, the paradigm shift from capital expenditures to higher operating expenses (CapEx to OpEx) can face internal political pressure that can slow down the entire process of migrating and modernizing legacy applications, while also innovating new applications in the Cloud.

Smaller, more nimble competitors instantly have access to the global infrastructure of AWS or Azure, thereby eliminating a prior significant barrier to entry.

These new competitors force enterprise customers to adopt the Cloud (despite it being an incremental cost in the near term) in order to innovate and obtain the agility required to effectively and efficiently compete in 2017. Amazon’s Well-Architected Framework helps these organizations set up for success in the Cloud.

What is Amazon S3?

Amazon Simple Storage Service (S3) is object storage. In modern computing, storage is typically divided into file level storage, block storage and object storage.

File level storage is found on Network Attached Storage (NAS) and typically works in conjunction with a protocol such as SMB (think Windows shares) or NFS (popular in Unix and Linux environments). Amazon Elastic File System (EFS), which is similar to NFS, and Azure File Storage, which uses SMB, are examples of Cloud-based file storage.

Block storage is what you find in your PC or local storage on a server, and it usually – but not always – includes a file system (e.g., NTFS, FAT32, ext3, Btrfs) on top. Some database servers, such as Microsoft SQL Server and Oracle Database, are capable of writing directly (referred to as a RAW partition) without needing the overhead of the file system on top.

In AWS, local instance storage (also called ephemeral storage) and Elastic Block Store (EBS) are examples of block storage. In Azure, Premium Storage is an example of block storage on SSD. Storage area networks (SANs) also use block storage.

Object storage, unlike file level storage and block storage, does not need to be accessed via an operating system like Linux or Windows. It can be accessed directly via APIs or via http(s), making it optimal for web applications.
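To make the "accessed via http(s)" point concrete, here is a minimal sketch of how an object's web address can be assembled. It assumes the virtual-hosted-style URL form that S3 uses; the bucket name and key are hypothetical, and a plain unsigned request like this only works for objects that have been made publicly readable.

```python
from urllib.parse import quote


def object_url(bucket: str, region: str, key: str) -> str:
    """Build a virtual-hosted-style HTTPS URL for an object in S3."""
    # Percent-encode the key but keep "/" so key prefixes still read like folders.
    return f"https://{bucket}.s3.{region}.amazonaws.com/{quote(key, safe='/')}"


# Hypothetical bucket and key, for illustration only.
print(object_url("example-photos", "us-west-2", "albums/2017/beach day.jpg"))
```

A web page can reference such a URL directly in an image tag, which is why no operating system or mounted file system sits between the browser and the object.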

As storage costs dropped, as megapixels on cameras and phones continued to increase and as users started wanting to store and share gigabytes and even terabytes of large objects, object storage met a need that is not efficiently met by block storage or file level storage. While block storage is excellent for operating system files, relational database records and Office documents, it is not optimal for a feature-length HD movie (think Netflix’s needs).

Object storage allows Netflix to store its movies, allows photo sharing sites to store your photos, allows music streaming services to store their songs, allows iCloud to store a backup of your iPhone, allows video game publishers to store their games for download…and much more.

Amazon S3 is Amazon’s object store. In Azure, this service is referred to as Blob Storage (blob = Binary Large Object).

Why Did So Many Things Break When Just Object Storage Had a Problem?

AWS promotes a best practice of moving all web static content – such as images and style sheets – off of more expensive EC2 instances and on to S3. Amazon EC2 (Elastic Compute Cloud) is Amazon’s name for a virtual server. EC2 expense is frequently a large portion of an organization’s AWS spend, so offloading work from EC2 to S3 can be a best practice for cost optimization.

For example, a web site running on a fleet of EC2 instances might link to S3 for all of its images and other static content, and it might rely on EC2 itself only for dynamic content creation. This removes load from the EC2 instances, thereby potentially decreasing the number of EC2 instances needed and/or decreasing the size and specifications of the EC2 instances.

Similarly, if a content delivery network (CDN) such as CloudFront, Azure CDN or Akamai can pull static assets off of S3 instead of from EC2 instances, this reduces load on more expensive virtual servers.

In addition, for static web pages that only have client-side scripting and do not need server-side dynamic content, the entire page can be hosted on S3. In other words, S3 can even act as a simple web server, completely removing the need for any EC2 instances.

Lastly, many AWS services are dependent on S3. Therefore, when S3 is down, other AWS services may not work as expected. According to Amazon, “Other AWS services in the US-EAST-1 Region that rely on S3 for storage, including the S3 console, Amazon Elastic Compute Cloud (EC2) new instance launches, Amazon Elastic Block Store (EBS) volumes (when data was needed from a S3 snapshot), and AWS Lambda were also impacted while the S3 APIs were unavailable.”

The US-EAST-1 (N. Virginia) region is one of the most heavily-used regions in the AWS global infrastructure, so the outage occurring in this region likely impacted a higher number of customers than if the outage had occurred in a different, smaller region.

Why Weren’t Some Sites and Applications Impacted?

It is misleading to state that AWS was down or that the Internet was down when in fact a single AWS service in a single region was experiencing a disruption.

The media tends to report an outage as being larger than it is, primarily because so many popular web sites and mobile applications are impacted. The fact is that an optimal multi-region architecture would not have been impacted by Tuesday’s S3 outage.

When a single Availability Zone (AZ) in the Sydney region experienced an outage in 2016, optimally architected sites and applications continued to run as designed. But popular sites and applications that were not designed optimally indeed triggered reports that all of Amazon – even that the Internet itself – was down in Australia.

In addition, sites and applications that use S3 in a different region were not impacted since the outage was isolated to a single region. And sites and applications that do not rely on S3 were not impacted, unless they relied on another AWS service that was impacted by the S3 outage.

What Happened? What Caused This Rare S3 Outage?

Amazon touts S3 as having an impressive 11 nines (99.999999999%) of durability, so how could this happen? And what does 11 nines really mean?

We first need to differentiate durability from availability. Durability refers to the lack of data loss, whereas availability refers to the data being available.

Eleven nines of durability effectively means you are very unlikely to lose any data on S3 – even if you use S3 in a single region – if you choose the Standard storage class.

The Reduced Redundancy storage class decreases durability to 99.99%. It is assumed there was no data loss in the February 2017 S3 outage. Amazon disclosed that a small number of customers lost data from EBS volumes during the Sydney outage in 2016.
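The durability figures are easier to appreciate as expected losses. The sketch below is a back-of-the-envelope calculation, not an Amazon formula: it assumes the durability percentage is an annual, per-object figure and that objects are lost independently of each other. The object count of 10 million is an arbitrary example.

```python
from fractions import Fraction


def expected_annual_losses(object_count: int, durability_nines: int) -> Fraction:
    """Expected number of objects lost per year at a durability of N nines."""
    # N nines of durability leaves a 1-in-10^N chance of losing a given object.
    annual_loss_probability = Fraction(1, 10 ** durability_nines)
    return object_count * annual_loss_probability


standard = expected_annual_losses(10_000_000, 11)  # Standard: 11 nines
reduced = expected_annual_losses(10_000_000, 4)    # Reduced Redundancy: 99.99%

print(f"Standard: one object lost every {float(1 / standard):,.0f} years on average")
print(f"Reduced Redundancy: about {float(reduced):,.0f} objects lost per year")
```

Exact fractions are used instead of floats because subtracting 0.99999999999 from 1 in floating point introduces rounding noise at this scale.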

While durability refers to the risk of data loss, availability refers to the risk of an outage or the service not being available. S3, when used in a single region, is “designed for” 99.99% (4 nines) of availability and includes a service level agreement (SLA) for 99.9% (3 nines) of availability.

By comparison, EC2 provides an SLA of 99.95% (what I call 3½ nines) when deployed to at least two Availability Zones (AZs).

High Availability and All Those Nines

Three nines (99.9%) of uptime means that the service can be unavailable no more than 8.76 hours over the course of a year. Four nines (99.99%) means it can be unavailable no more than 52.56 minutes per year. Five nines (99.999%) means 5.26 minutes per year.

Note the significant difference of about 9 hours versus about 5 minutes when comparing 3 nines to 5 nines. That extra 0.099% adds significant cost and complexity to a solution.
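The downtime figures follow directly from the number of hours in a year. As a quick sketch (assuming a 365-day year and ignoring leap years), the allowance for any availability percentage can be computed like this:

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years are ignored


def max_downtime_hours(availability_percent: float) -> float:
    """Maximum unavailable hours per year for a given availability percentage."""
    return HOURS_PER_YEAR * (100 - availability_percent) / 100


for percent in (99.9, 99.95, 99.99, 99.999):
    hours = max_downtime_hours(percent)
    print(f"{percent}%: {hours:.2f} hours/year ({hours * 60:.2f} minutes)")
```

The same function covers in-between targets such as the 99.95% EC2 SLA, which falls between three and four nines.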

This outage was caused partially by human error and partially by not doing regular testing of system restarts. Amazon’s postmortem indicates it had been several years since the S3 systems in this region had been restarted. Testing restarts is a fairly common best practice but also a fairly common source of problems when not conducted on a regular basis.

It is of course preferred to identify post-restart problems as a part of a scheduled test, for example after applying updates, and not after an unexpected or unplanned restart.

Some headlines blame the outage on a simple typo, but of course the problem is much more complex and involves process, testing, scripting, fault domains and customers employing multi-region redundancy when appropriate.

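A more detailed postmortem explanation of what happened is available in Amazon’s “Summary of the Amazon S3 Service Disruption in the Northern Virginia (US-EAST-1) Region,” listed under Additional Reading.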

It is worth noting that AWS is consistent in sharing its findings a few days after an incident. This transparency benefits the entire cloud community and helps everyone to architect for inevitable incidents.

S3 in Multiple Regions as a Best Practice

Historically, Amazon’s position has been that multi-region architectures are not necessary for a Highly Available (HA) solution. Instead, the advice has been to deploy a Virtual Private Cloud (VPC) to at least two Availability Zones (AZs) within a region.

AZs are located miles apart from each other, are on separate flood plains and have redundant power and Internet connectivity.

In last year’s Sydney outage, one AZ became unavailable during an extreme weather event. Customers in Sydney who architected to multiple AZs as Amazon recommends remained up. Customers who relied on a single AZ and had selected the impacted AZ experienced downtime.

But S3 does not reside within an AZ. With S3, you only select a Region.

You can indeed have an S3 endpoint within an AZ, but your data in S3 reside within a Region, not within an AZ. It can therefore act as a potential single point of failure (SPOF) when only a single region is utilized.

So far, Amazon has not prescribed multi-region S3 implementations as a part of its “Well-Architected Framework” in order to eliminate this SPOF. Perhaps because of S3’s track record of stability and availability, architects have generally been complacent with using S3 in a single region.

Multi-region implementations of any service introduce cost and complexity, and high availability has always been a balancing act of trading off expensive complexity against an additional 9 of availability. But the fact remains that S3 has a 99.9% (3 nines) SLA, so some added complexity may be required if your business requirements exceed 3 nines. If your application is largely based on S3 – a photo sharing web site or mobile application, for example – introducing a second region for S3 could double your storage costs.

But the added storage costs may be required in order to reduce risk of an outage. Perhaps only a subset of your object data need the higher availability provided by a second S3 region given the significant durability offered by a single S3 region.
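The trade-off between replicating everything and replicating only a critical subset can be roughed out in a few lines. This is a sketch under loud assumptions: the per-GB price is a placeholder rather than an actual AWS rate, both regions are assumed to charge the same price, and data transfer and request charges are ignored.

```python
def monthly_storage_cost(total_gb: float, price_per_gb: float,
                         replicated_fraction: float = 0.0) -> float:
    """Storage cost when a fraction of the data is also kept in a second region."""
    if not 0.0 <= replicated_fraction <= 1.0:
        raise ValueError("replicated_fraction must be between 0 and 1")
    # The primary copy is always paid for; the replica adds its share on top.
    return total_gb * price_per_gb * (1 + replicated_fraction)


PRICE = 0.02  # hypothetical price per GB-month, not an actual AWS rate

single_region = monthly_storage_cost(50_000, PRICE)
replicate_all = monthly_storage_cost(50_000, PRICE, replicated_fraction=1.0)
critical_only = monthly_storage_cost(50_000, PRICE, replicated_fraction=0.1)
print(single_region, replicate_all, critical_only)
```

Replicating everything doubles the storage line item, while replicating a tenth of the data adds only a tenth to it.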

Because its own dashboard was also impacted, Amazon itself is now introducing multi-region S3 for its AWS Service Health Dashboard (SHD). We can expect to see new guidance introduced in a revision to the Well-Architected Framework as a result of this outage.

S3 as a SPOF Problem

An infrastructure is only as good as its weakest link, and many implementations today include S3 as a single point of failure. This is sometimes on the advice of Amazon to move static web sites and static content to S3, but there is currently no recommendation or prescribed best practice to make S3 multi-region.

Based on customer feedback, Amazon is rolling out more multi-region functionality, and it is indeed now simpler (but not necessarily inexpensive) to replicate S3 content to another region. But the primary message from AWS remains that deploying to multiple Availability Zones within a single region is sufficient.

Then, use a content delivery network (CDN), such as CloudFront, to get that content closer to end users. The fact that this does not address the high availability of S3, and the fact that S3 remains a single point of failure, remains an outstanding risk today.

Another explanation for many organizations keeping S3 in a single region is that there can be significant legal considerations when replicating customer data across countries. While it is straightforward to replicate data within a country, it is more complex to replicate data out of a country.

For example, some countries do not allow certain data to enter the United States. Replicating data across regions – especially across geopolitical boundaries – typically requires assistance from legal counsel. Corporate counsel is not known for its agility or fast pace.

With All This Info, What Can You Do?

Option 1: Do Nothing; It’s Good Enough

One valid option is to do nothing, and keep S3 in a single region. So many sites and applications go down when AWS has an outage that your customers will forgive you as soon as they realize much of the Internet is down. Just be sure to deploy VPC services such as EC2 to at least two Availability Zones, since social media is unforgiving to organizations that only deploy to one AZ or otherwise violate well-known best practices (see sample tweets in our seminar about the Sydney outage, referenced above).

But do consider at least replicating data to another region for disaster recovery purposes, even if that data will not be accessed by your application. A best practice is for the disaster recovery region to be in a completely different AWS account in order to minimize the “blast radius” if your primary account is compromised.

Option 2: Multi-Region S3 and DNS with Health Checks

It is fairly straightforward to replicate your S3 objects to a second (or third) region using S3’s cross-region replication feature, but then how do you get your application to only utilize an S3 region that is healthy? How do you handle an outage like the one that occurred in February?

One option is to update your application logic, but this can be expensive, and some application code is simply too risky to modify.
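For the replication step itself, here is a minimal sketch using the AWS SDK for Python (boto3) and its put_bucket_replication call. The role ARN, account number and bucket names are hypothetical. It also assumes that versioning is already enabled on both buckets, which S3 requires before replication can be configured, and that the IAM role grants S3 permission to copy objects.

```python
def replication_config(role_arn: str, destination_bucket: str) -> dict:
    """Build a replication configuration that copies every object."""
    return {
        "Role": role_arn,  # IAM role that S3 assumes when replicating objects
        "Rules": [
            {
                "ID": "replicate-everything",
                "Prefix": "",  # an empty prefix matches all keys
                "Status": "Enabled",
                "Destination": {"Bucket": f"arn:aws:s3:::{destination_bucket}"},
            }
        ],
    }


def enable_replication(source_bucket: str, config: dict) -> None:
    """Apply the configuration; this needs boto3 and valid AWS credentials."""
    import boto3  # imported here so the builder above runs without the SDK

    boto3.client("s3").put_bucket_replication(
        Bucket=source_bucket, ReplicationConfiguration=config
    )


# Hypothetical role and bucket names, for illustration only.
config = replication_config(
    "arn:aws:iam::123456789012:role/example-replication-role",
    "example-photos-us-west-2",
)
print(config["Rules"][0]["Destination"]["Bucket"])
```

Calling enable_replication("example-photos-us-east-1", config) would then apply the rule to the source bucket. Narrowing the prefix is one way to replicate only the subset of objects that needs the extra availability.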

In these situations, DNS is your friend. DNS (Domain Name System) is the magic that converts an address like “www.google.com” into an actual IP address of 216.58.216.164, so that you can use friendly names instead of actual IP addresses. More importantly, this also allows the underlying IP addresses to change without breaking the application.

If your application points to a DNS address such as photos.objectstore.mycompany.com in order to load or save your user’s photos, for example, then this address can resolve to one or more S3 regions.

Which S3 regions are actually used when retrieving photos can be determined by Health Checks. Some DNS services – like offerings from AWS and Azure – can automatically remove unhealthy destinations if they do not pass a health check. For example, the N. Virginia region of S3 could have been automatically removed during its outage, and applications could have diverted to a different region instead.
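Conceptually, a health-checked DNS service filters its list of destinations before answering. The sketch below illustrates only that idea and is not how Route 53 or Azure Traffic Manager are implemented; the endpoint names are hypothetical, and the probe is a stand-in for a real check such as an HTTPS request for a known object.

```python
from typing import Callable, Iterable, List


def healthy_endpoints(endpoints: Iterable[str],
                      is_healthy: Callable[[str], bool]) -> List[str]:
    """Return the endpoints that pass the health check, in priority order."""
    return [endpoint for endpoint in endpoints if is_healthy(endpoint)]


# Hypothetical regional endpoints, listed in order of preference.
ENDPOINTS = [
    "photos-us-east-1.objectstore.mycompany.com",
    "photos-us-west-2.objectstore.mycompany.com",
]


def fake_probe(endpoint: str) -> bool:
    """Stand-in health check that simulates the N. Virginia outage."""
    return "us-east-1" not in endpoint


answers = healthy_endpoints(ENDPOINTS, fake_probe)
print(answers[0] if answers else "no healthy region")
```

With the simulated outage, only the Oregon endpoint is returned, so clients resolving the name would be diverted there without any application change.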

Health Checks are not supported by all DNS servers, so multiple options are discussed below.

If you are using Amazon Route 53 as your DNS solution, this is as simple as configuring Health Checks and DNS failover as described in the Route 53 documentation.

Microsoft Azure has a similarly simple solution, and this feature is broken out into a separate service called Azure Traffic Manager. Similar to Route 53, you can configure Azure Traffic Manager to only send traffic to healthy endpoints. Since Azure Traffic Manager supports non-Azure endpoints, it should (in theory, not yet tested) be possible to have Azure Traffic Manager manage multiple Amazon S3 endpoints (in the case of a multi-cloud environment), only sending traffic to healthy S3 regions. Details on Azure Traffic Manager can be found in Microsoft’s documentation.

If you are using an older DNS solution, such as BIND, that doesn’t support Health Checks, then there are a few options.

The first option is very simple but not ideal: implement DNS Round Robin between multiple S3 regions, and then manually remove the IP address for an unhealthy S3 endpoint when that region fails.

This is not technically Highly Available since manual intervention is required, and it is not a best practice. But it is an option that can minimize downtime versus having everything in one S3 region. In this scenario, it is important to have low TTL values for DNS caching.

This option may be appropriate for a hobby web site or other web site where you do not want added complexity and you are comfortable simply editing the DNS records when an outage occurs. This option is not appropriate for production web sites or applications that require true high availability with automated failover.

With this option, there may still be a small amount of downtime for a subset of users (the subset who received the IP address of the failed region until you remove it), but it is better than all users being pointed to a single region that is down.
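The size of that affected subset can be estimated with a small simulation. This is a simplified sketch: it assumes lookups are spread evenly across the records in strict rotation and ignores caching, and the IP addresses are documentation-range placeholders rather than real S3 addresses.

```python
from itertools import cycle, islice


def round_robin_answers(records, lookups):
    """Hand out DNS records in rotation, one per lookup."""
    return list(islice(cycle(records), lookups))


def affected_share(records, failed, lookups=1000):
    """Fraction of lookups that were answered with a failed record."""
    answers = round_robin_answers(records, lookups)
    return sum(answer in failed for answer in answers) / lookups


# Documentation-range IP addresses standing in for two S3 regions.
RECORDS = ["192.0.2.10", "198.51.100.20"]

before_edit = affected_share(RECORDS, failed={"192.0.2.10"})
after_edit = affected_share(["198.51.100.20"], failed={"192.0.2.10"})
print(before_edit, after_edit)
```

With two regions, roughly half of the lookups land on the failed region until its record is removed. In practice, resolvers that cached the old answer keep using it until the TTL expires, which is why low TTL values matter here.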

Another option is a third-party solution that does health checks and sits in between your DNS and the application. This would be similar to how Amazon’s Elastic Load Balancer (ELB) can route traffic to only healthy EC2 instances. ELB does not currently support S3.

HAProxy is a popular choice for on-premises solutions and could be adapted to manage traffic to S3 regions. However, it would be important not to make this layer a new single point of failure. Over time, we may see Amazon evolve ELB to also support multi-region S3.

Option 3: Select a Lower Risk Region

Unless you have a need to use the US-EAST-1 (N. Virginia) region, use US-WEST-2 (Oregon) instead. There are indeed valid reasons to select US-EAST-1, and I frequently advise customers to choose this region, but if there is no need, then default to Oregon.

Amazon’s position is that all regions are equal, and my recommendation is not an Amazon stated best practice. My recommendation is based on my own observations working with AWS since 2008.

Conclusion

It will be interesting to see what Amazon recommends now that S3 has presented itself as a single point of failure not currently addressed in Amazon’s Well-Architected Framework. Amazon itself has made an architectural change and enhanced its status page to now be multi-region given its own outage due to the S3 single point of failure.

Amazon moves quickly, so we can expect some new guidance and an update to the Well-Architected Framework sooner rather than later.

While learning from this outage and from prior Cloud outages, it is worth calling out the relatively rapid time to resolution and the significant army of resources that Amazon or Microsoft can deploy to a problem.

You do not pay for these resources directly, and their depth and breadth cannot be matched in-house, unless perhaps you are Facebook or Google. Most readers of this document are not Facebook or Google and benefit significantly from the resources and expertise that Amazon or Microsoft provide to organizations that embrace their Cloud platforms.

Despite the incidents that have occurred and will continue to occur, the major Cloud platforms are still significantly more reliable, significantly more secure, and in the long run, more cost effective than on-premises architectures.

These outages should not give pause to Cloud adoption but rather should highlight the fast time to resolution and the Cloud expertise your organization gains when partnering with one of the major Cloud providers.

Additional Reading

Jeff Barr’s blog post, “Are You Well-Architected?”

Jeff Barr’s blog post one year later, “Well-Architected, Working Backward to Play it Forward” announcing the new Operational Excellence pillar.

The AWS Well-Architected Framework whitepaper.

The “Summary of the Amazon S3 Service Disruption in the Northern Virginia (US-EAST-1) Region”

The “Summary of the AWS Service Event in the Sydney Region”
