Amazon.com has issued a formal apology to customers who suffered through last week’s Elastic Block Store outage, offering a 10-day credit to customers whether they were affected or not. Is that good enough to restore confidence in Amazon’s cloud computing services? It’s probably too early to say, but Amazon’s post-mortem is a step in the right direction.

The company writes in a long explanation:

“We know how critical our services are to our customers’ businesses and we will do everything we can to learn from this event and use it to drive improvement across our services. As with any significant operational issue, we will spend many hours over the coming days and weeks improving our understanding of the details of the various parts of this event and determining how to make changes to improve our services and processes.”

One of the biggest criticisms of Amazon.com during last week’s outage was the lack of transparency about what was happening. That was highlighted in a guest post on GeekWire from BigDoor CEO Keith Smith titled: “Amazon.com’s real problem isn’t the outage, it’s the communication.”

Maybe Amazon was reading (or at least listening to) the outrage of customers, because the company says it plans to improve the communication flow when problems occur.

“In addition to the technical insights and improvements that will result from this event, we also identified improvements that need to be made in our customer communications. We would like our communications to be more frequent and contain more information. We understand that during an outage, customers want to know as many details as possible about what’s going on, how long it will take to fix, and what we are doing so that it doesn’t happen again.”

The message also goes into great technical detail on what went wrong. Here’s just a part of the explanation.

Two factors caused the situation in this EBS cluster to degrade further during the early part of the event. First, the nodes failing to find new nodes did not back off aggressively enough when they could not find space, but instead, continued to search repeatedly. There was also a race condition in the code on the EBS nodes that, with a very low probability, caused them to fail when they were concurrently closing a large number of requests for replication. In a normally operating EBS cluster, this issue would result in very few, if any, node crashes; however, during this re-mirroring storm, the volume of connection attempts was extremely high, so it began triggering this issue more frequently. Nodes began to fail as a result of the bug, resulting in more volumes left needing to re-mirror. This created more “stuck” volumes and added more requests to the re-mirroring storm.

Not sure I can make sense out of that, but maybe some computer scientists in the audience can offer more insights. Full message here.
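
For readers trying to unpack it, the first factor is a classic retry-storm problem: nodes that couldn’t find space for a new replica kept asking again immediately instead of waiting. The sketch below is not Amazon’s code; the function names and timing parameters are hypothetical, and it only illustrates what “not backing off aggressively enough” means in practice, next to the textbook fix of capped exponential backoff with jitter.

```python
import random
import time


def find_replica_node():
    """Hypothetical stand-in for an EBS node asking its cluster for spare
    capacity to re-mirror a volume; here it simulates a mostly full cluster."""
    return "node-42" if random.random() < 0.01 else None


# The behavior the post-mortem describes: no pause between attempts, so every
# stuck volume hammers the cluster again immediately, adding to the storm.
def remirror_tight_loop():
    while True:
        node = find_replica_node()
        if node is not None:
            return node


# The textbook mitigation: wait between attempts, doubling the wait up to a
# cap and adding jitter so thousands of stuck volumes don't retry in lockstep.
def remirror_with_backoff(base=0.5, cap=60.0):
    delay = base
    while True:
        node = find_replica_node()
        if node is not None:
            return node
        time.sleep(random.uniform(0, delay))  # jittered wait
        delay = min(cap, delay * 2)           # capped exponential backoff
```

The second factor, the low-probability race condition hit while closing many replication requests at once, doesn’t lend itself to a short sketch; the point is that the retry storm pushed a rarely exercised code path hard enough to expose it.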

Previously on GeekWire: “Jeff Bezos to shareholders: Invention is in our DNA”

Comments

  • Guest

    Congratulations and thank you to Amazon for providing this thorough explanation. I hope commentators will be cool with the fact that it’s too long for a tweet.

    • victor

      Is this a joke? Why do people give congrats to just about every story this site posts?

  • Ken Smith (http://www.facebook.com/smithkl42)

    After you wade through the details in their post-mortem, it sounds like the problem was, effectively, a self-denial-of-service. In other words, they got themselves into a position where the traffic load prevented them from being able to easily execute the changes necessary to recover from an initial minor hiccup. I’ve had that happen on my watch before, and it’s not fun. It’s also hard to prevent, especially when you’re dealing with the insanely large traffic volumes that they are.

    They seem to have learned almost all the lessons you’d hope. They’re doing their best to address problems that were exposed at every level of the stack, from the network architecture, up through the control plane and continuing on to the processes and procedures that they had in place to deal with outages — and even with how their own customers have architected their solutions. Amazon has identified changes they’re going to make at each of those levels, and that’s the right thing.

    I’m glad that they aren’t standing on their silly SLA, as under its terms, this technically wasn’t an outage. But that does point to the one major element missing from their post-mortem: I didn’t see anything about changing the actual terms of the SLA, so that EBS and RDS outages are included, instead of just EC2 outages. The fact that Amazon had to ignore its SLA in this instance is pretty good evidence that it’s insufficient, and I think their customers will likely be demanding something more comprehensive.

    Going beyond that, my suspicion is that despite the many lessons learned in this instance, there will be more cloud outages in the future. It’s simply very difficult to do what they’re doing, and subtle code and configuration changes can often have butterfly-wing effects that don’t become apparent until you move them into production and place them under significant load. Hopefully those outages will have less impact in the future, but I have to imagine that we will see more of them. You can make your architecture as redundant as possible, but there will always be two distinct single points of failure in even the most redundant system: (1) the code that it runs on, and (2) the configuration data used by that code. You simply can’t get rid of those two single points of failure, no matter how hard you try.
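
One common guard against the self-denial-of-service dynamic Ken Smith describes above is admission control: under overload, the service sheds routine work so that recovery and control-plane operations still get through. The sketch below is purely illustrative and hypothetical, not anything described in Amazon’s post-mortem; the class, priorities, and threshold are invented.

```python
PRIORITY_RECOVERY = 0   # control-plane / recovery actions
PRIORITY_ROUTINE = 1    # e.g. ordinary re-mirroring negotiation


class AdmissionController:
    """Toy admission control: once in-flight work crosses a threshold, reject
    routine requests but keep accepting recovery requests, so an overload
    can't lock out the very operations needed to fix it."""

    def __init__(self, shed_threshold=1000):
        self.shed_threshold = shed_threshold
        self.in_flight = 0

    def try_admit(self, priority):
        overloaded = self.in_flight >= self.shed_threshold
        if overloaded and priority != PRIORITY_RECOVERY:
            return False  # shed the request; the caller should back off and retry
        self.in_flight += 1
        return True

    def done(self):
        self.in_flight = max(0, self.in_flight - 1)
```

A real implementation would need locking, metrics, and smarter prioritization, but the shape is the same: when the backlog is the problem, the system has to be able to say no to the traffic that created it.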

  • Keith Smith (http://twitter.com/ChiefDoorman)

    My sense is that catastrophic events like this tend to demonstrate to the outside world what the true culture of the company is. If the company has a culture of ignoring problems and only focusing on the good stuff they do, then that will likely become apparent. (Amazon could certainly take this approach because AWS makes up a very small piece of an otherwise mostly very healthy business – so this issue could easily be swept under the rug and forgotten about.) However, companies with winning cultures turn catastrophes into seminal and defining moments, and use the learnings from bad things to make huge improvements and advances.

    Today’s apology and public explanation of the “Great Re-Mirroring Storm of 2011” gives me hope that Amazon still has a winning culture at its core.

    I was very glad to hear Amazon address not only their technical issues but also tackle their abysmal communication throughout the outage. I’ve been personally involved in (and responsible for) far more outages than I like to admit, so I have a good sense of how difficult it is to communicate clearly during triage moments. Here are five little tips for the good folks at AWS regarding technical crisis communication that I’ve picked up along the way:

    1. Put someone’s name on the communication. When we have a big outage (we had one last year that was super painful), I make sure that I send communication to our partners with my name on it. Rather than hiding behind the comfy anonymity of corporate communication, make it personal: have someone senior take responsibility for the communication and sign it. Yes, it is uncomfortable to do so, but it will force you to be human and help you focus on trying to be accurate and helpful.

    2. If you don’t understand the gravity of the situation, say so. Customers certainly need to know that you have a good handle on things, but they will be (mostly) understanding if, during brief periods of chaos, you admit that you don’t have a complete handle on the situation. This is far better than confidence that is proven false in time.

    3. It’s okay to be optimistic about when you think you’ll be back up, as long as you caveat the optimism and handicap it. In other words, if you think there is a 60% chance of being back up in two hours, then tell your customers exactly that. The old airline trick of “it’ll be another 30 minutes,” relayed every 30 minutes while waiting on the tarmac for hours, only makes your customers think you have no respect for their intelligence.

    4. Give more details than you think are necessary, and give your partners a channel to ask questions. With the AWS outage, many customers could have brought servers back up if they had insight into the “stuck volume” issue. This is far preferable to having customers just sitting around and waiting.

    5. Triage during the outage, and keep your communication focused on anything and everything that will help put the “fire” out. Your customers aren’t thinking about your SLA during the outage, so you shouldn’t be either. What struck a negative nerve with many AWS customers several hours into the outage was that the folks communicating about the outage already seemed to be thinking about their SLA while their customers just wanted to have service restored.

    We remain huge fans of AWS and we will continue to host the bulk of our systems with them for the foreseeable future. These guys are working hard and innovating their platform rapidly. We applaud that, and are confident that their communication will improve at the same speed.

    –Keith
