Amazon.com has issued a formal apology to customers who suffered through last week’s Elastic Block Store outage, offering a 10-day credit to customers whether they were affected or not. Is that good enough to restore confidence in the cloud computing services from Amazon? It’s probably too early to say, but Amazon’s post-mortem is a step in the right direction.
The company writes in a long explanation:
“We know how critical our services are to our customers’ businesses and we will do everything we can to learn from this event and use it to drive improvement across our services. As with any significant operational issue, we will spend many hours over the coming days and weeks improving our understanding of the details of the various parts of this event and determining how to make changes to improve our services and processes.”
One of the biggest criticisms of Amazon.com during last week’s outage was the lack of transparency about what was happening. That was highlighted in a guest post on GeekWire from BigDoor CEO Keith Smith titled: “Amazon.com’s real problem isn’t the outage, it’s the communication.”
Maybe Amazon was reading (or at least listening) to the outrage of customers because the company says it plans to improve the communication flow when problems occur.
“In addition to the technical insights and improvements that will result from this event, we also identified improvements that need to be made in our customer communications. We would like our communications to be more frequent and contain more information. We understand that during an outage, customers want to know as many details as possible about what’s going on, how long it will take to fix, and what we are doing so that it doesn’t happen again.”
The message also goes into great technical detail on what went wrong. Here’s just a part of the explanation.
Two factors caused the situation in this EBS cluster to degrade further during the early part of the event. First, the nodes failing to find new nodes did not back off aggressively enough when they could not find space, but instead, continued to search repeatedly. There was also a race condition in the code on the EBS nodes that, with a very low probability, caused them to fail when they were concurrently closing a large number of requests for replication. In a normally operating EBS cluster, this issue would result in very few, if any, node crashes; however, during this re-mirroring storm, the volume of connection attempts was extremely high, so it began triggering this issue more frequently. Nodes began to fail as a result of the bug, resulting in more volumes left needing to re-mirror. This created more “stuck” volumes and added more requests to the re-mirroring storm.
Not sure I can make sense out of that, but maybe some computer scientists in the audience can offer more insights. Full message here.
Previously on GeekWire: “Jeff Bezos to shareholders: Invention is in our DNA”