Amazon details cause of AWS outage that hobbled thousands of online sites and services

A past AWS re:Invent conference. (GeekWire Photo)

A “relatively small addition of capacity” to the Amazon Kinesis real-time data processing service triggered a widespread Amazon Web Services outage last week, the company said in a detailed technical analysis over the weekend.

The addition “caused all of the servers in the fleet to exceed the maximum number of threads allowed by an operating system configuration,” the post said, describing a cascade of resulting problems that took down thousands of sites and services.

The outage impacted online services from big tech companies such as Adobe, Roku, Twilio, Flickr, Autodesk, and others, including New York City’s Metropolitan Transit Authority. The Washington Post, which is owned by Amazon CEO Jeff Bezos, was also impacted by the outage.

It was an especially ill-timed incident for Amazon, coming just days before its annual AWS re:Invent cloud conference, which kicks off Tuesday morning as a virtual event. Reliability has been a hotly debated topic among Amazon, Google, Microsoft and other major players in the cloud, each of which experiences periodic outages.

The explanation underscores the interdependent nature of cloud services, as the problems with Kinesis impacted the Amazon Cognito authentication service, CloudWatch monitoring technology, Lambda serverless computing infrastructure, and other Amazon services.

“In the very short term, we will be moving to larger CPU and memory servers, reducing the total number of servers and, hence, threads required by each server to communicate across the fleet,” the company said, describing one of the lessons learned from the incident. “This will provide significant headroom in thread count used as the total threads each server must maintain is directly proportional to the number of servers in the fleet.”
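To illustrate the scaling relationship Amazon describes, here is a minimal sketch in Python. It assumes each server keeps a fixed number of threads for every other server in the fleet, which is one simple way to read "directly proportional." The function names, the fleet sizes, and the thread limit are all hypothetical and chosen only for illustration; none of them are figures from Amazon's analysis.

```python
def threads_per_server(fleet_size: int, threads_per_peer: int = 1) -> int:
    """Threads one server needs to talk to every other server in the fleet.

    Assumption (not from Amazon's post): a fixed number of threads
    per peer, so the count grows linearly with fleet size.
    """
    return threads_per_peer * (fleet_size - 1)


def exceeds_limit(fleet_size: int, thread_limit: int, threads_per_peer: int = 1) -> bool:
    """True if a server in a fleet of this size would pass the OS thread limit."""
    return threads_per_server(fleet_size, threads_per_peer) > thread_limit


# Hypothetical numbers, for illustration only.
THREAD_LIMIT = 1000

# A fleet of 1,000 servers sits just under the limit...
print(threads_per_server(1000), exceeds_limit(1000, THREAD_LIMIT))

# ...and a small capacity addition pushes every server over it at once.
print(threads_per_server(1002), exceeds_limit(1002, THREAD_LIMIT))

# Fewer, larger servers leave far more headroom under the same limit.
print(threads_per_server(250), exceeds_limit(250, THREAD_LIMIT))
```

Under these assumptions, the sketch shows why a small addition can affect the whole fleet at the same moment: every server's thread count depends on the same fleet size, so all of them cross the limit together.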

Amazon apologized and said it would apply lessons learned to further improve its reliability: “While we are proud of our long track record of availability with Amazon Kinesis, we know how critical this service, and the other AWS services that were impacted, are to our customers, their applications and end users, and their businesses. We will do everything we can to learn from this event and use it to improve our availability even further.”
