What happened at Amazon Web Services to take down the Netflix streaming service for hours on Christmas Eve? It was a combination of human error and flawed access controls: An AWS developer was able to accidentally delete a key data set in the AWS Elastic Load Balancing Service, causing the disruption.
That’s the word from AWS in a detailed post that describes what happened inside Amazon’s operations on Christmas Eve.
The post apologizes for the outage and explains the root cause: “The data was deleted by a maintenance process that was inadvertently run against the production ELB state data. This process was run by one of a very small number of developers who have access to this production environment. Unfortunately, the developer did not realize the mistake at the time.”
Translation: Somebody at Amazon Web Services did not have a very good Christmas Day.
The outage has received attention in part because Amazon separately competes with Netflix through its Instant Video streaming service, which remained online.
The Amazon post, which doesn’t mention Netflix by name, explains the steps that Amazon is taking to prevent similar situations in the future, including tightening access to its live data to prevent inadvertent changes. The company says it will also be able to move faster to fix similar mistakes next time, if there is one.
“Last, but certainly not least, we want to apologize,” the post says. “We know how critical our services are to our customers’ businesses, and we know this disruption came at an inopportune time for some of our customers. We will do everything we can to learn from this event and use it to drive further improvement in the ELB service.”
In a separate post, Netflix cloud architect Adrian Cockcroft says Netflix is also looking to avoid a repeat. He writes:
“It is still early days for cloud innovation and there is certainly more to do in terms of building resiliency in the cloud. In 2012 we started to investigate running Netflix in more than one AWS region and got a better gauge on the complexity and investment needed to make these changes.
“We have plans to work on this in 2013. It is an interesting and hard problem to solve, since there is a lot more data that will need to be replicated over a wide area and the systems involved in switching traffic between regions must be extremely reliable and capable of avoiding cascading overload failures.”
All in all, not the way that Netflix or AWS would have wanted to wrap up 2012.