Amazon.com’s real problem isn’t the outage, it’s the communication

Keith Smith

Guest Commentary: Like many companies running on Amazon’s Web Services, BigDoor has been affected by the AWS outage today. And like most startups, we are braced for bad stuff to happen, and we do our best to learn from the painful stuff. We spent the better part of the day in constant contact via Twitter, emails and phone calls, apologizing to and updating our more than 250 publishers that were affected. Today has provided plenty of lessons, and because transparency is fast becoming the lifeblood of Seattle’s startup community, we thought we’d pass a few along.

At BigDoor, we made the decision early on to host everything we do with Amazon Web Services.  Generally speaking, we have been huge fans, and we regularly find ourselves singing their praises to anyone who will listen.  AWS has allowed us to scale a complex system quickly, and extremely cost effectively.

At any given point in time, we have 12 database servers, 45 app servers, six static servers and six analytics servers up and running. Our systems auto-scale when traffic or processing requirements spike, and auto-shrink when not needed in order to conserve dollars.

In the ten months since we launched the public beta of our free, self-serve gamification platform we have handled over one billion API calls. Without AWS, that simply would not have been possible with our small team and limited budget.  Many others have realized similar benefits from the cloud, and AWS has quickly become a critical part of the startup ecosystem.

That’s not to say everything has been perfect.

The most notable ongoing issue we’ve experienced has been unreliable disk IO, specifically resulting in periodic slow disk writes.  This is an understandable technical complexity given the nature of cloud computing, but early on we realized that this problem was being seriously exacerbated due to Amazon’s unwillingness to publicly admit and discuss the severity of the issue.

We’ve managed to find workarounds to the technical challenges. But it was disconcerting to us that Amazon’s otherwise stellar system was being marred not so much by a temporary technical issue as by what seemed like an unwillingness to embrace transparency.

Today that lack of transparency has continued.

As problems continued throughout the day, we experienced the obvious frustration from the system failure. But Amazon’s communication failure was even more alarming.

Our API, our main website and our publisher admin services have been almost entirely offline since around 1:30 a.m. PST, with brief periods of momentary availability. This has effectively rendered us powerless to service our customers.

Starting at 1:41 a.m. PST, Amazon’s updates read as if they were written by their attorneys and accountants who were hedging against their stated SLA rather than being written by a tech guy trying to help another tech guy.

We aren’t just sitting around waiting for systems to recover. We are actively moving instances to areas within the AWS cloud that are actually functioning. If Amazon had been more forthcoming with what they are experiencing, we would have been able to restore our systems sooner.

There are a lot of really obvious and relatively easy things that any startup can do to avoid an all-out reliance on any single cloud provider, but those things take additional time and money – two of the most important things that every startup is constrained by.

We absolutely love AWS because of the pace of innovation and scale that it has allowed us to accomplish. But after today’s episode is over, we will have a big decision to make.

We can spend cycles designing and building technical belts and suspenders that will help us avoid a massive failure like this in the future, or we can continue to rely on a single huge partner and also continue our break-neck pace of iteration and product development.

I can’t tell you today which option we will choose. But I’m sure it will be the question on the mind of many startups across the country.

If we come up with a good answer, we will be sure to be good members of Seattle’s startup ecosystem by being transparent and we’ll share our solution.  I encourage all other startups to do the same, and I hope this serves as a public request for Amazon to join the crowd.

Keith Smith is CEO of BigDoor, a Seattle startup that builds game mechanics into online publishers’ Web sites. You can follow the company on Twitter @bigdoormedia.

  • http://twitter.com/chrisamccoy Chris McCoy

    +1 Keith on putting this piece together.

  • http://twitter.com/ChiefDoorman Keith Smith

    Update: After trying for 16 hours we finally got someone within Amazon to speak candidly to us about what is going on. It was informative and is helping us recover more of our systems. This further underscores the point that they do have information that would be valuable to everyone that is down. Hopefully Amazon will start to let their tech folks write their status updates and keep their lawyers and accountants away from communicating with their customers.

  • http://www.facebook.com/smithkl42 Ken Smith

    +1 on keeping the lawyers out of customer communication during an outage.

  • http://www.centernetworks.com centernetworks

    This is such an interesting post to me – just yesterday I wrote about how Media Temple handled their outage and communications and how they did an excellent job at it. Here’s my post:
    http://www.centernetworks.com/media-temple-dns-downtime-issues

    Media Temple had their staff responding on Twitter and they posted a video in the middle of the outage to clearly explain what was going on.

    I think Amazon’s health dashboard is great. Not sure if I agree totally that the updates read like lawyers’ notes, but they were very technical, which is something I think we probably all wanted.

    Let’s see what happens now with their explanations about what happened today.

  • http://www.facebook.com/Wemps Jeremy Wemple

    Thanks for taking the time to write that out. I’m interested to see what you guys decide upon to avoid this mess in the future. Keep us updated in the office – i know several companies there use AWS.

  • http://www.facebook.com/people/Jeff-Nolan/567846457 Jeff Nolan

    I realize that Amazon’s team was primarily focused today on recovering the service but hours would go by with no update. Get Satisfaction didn’t go down fully today but we did measure an appreciable service slowdown as a result of Amazon’s problems. I would have appreciated a steady stream of updates from Amazon even if for no other reason than to fill the vacuum that their silence created.

    Good post Keith, I would bet that a lot of companies will be having some serious discussions about strategic options in the days ahead.

  • Chintan

    Great post! It conveys the same feelings we went through all day. We’ve decided on the latter approach, to take matters into our own hands. Here is the mantra to help you make the decision: “Customer always comes first.” Think about it: if your customers aren’t happy, what good will it do to have the best scalable cloud infrastructure but nobody to use it?

  • http://twitter.com/duckyforce Duckyforce

    +1 on keeping the lawyers off the status page. For a great example of how to properly handle the communication of a widespread outage- see status.heroku.com

    Heroku was completely wiped out this afternoon, but they were still providing useful updates every half hour. That’s how it should be.

  • Anon

    You can fit your 64 servers in a single IBM BladeCenter with eight dual quad-core blades, and still have room for more. What’s more, they make a dual-socket 10-core blade.

  • http://sco.tt Scott Yates

    I’m just glad that when Quora came back up the only question on it wasn’t: “Where is Sarah Connor?”

  • http://www.hicksnewmedia.com jameshicks

    and THIS transparency and willingness to share relevant information is EXACTLY why I am a client of Keith Smith and BigDoor!

    Well said my good man

  • http://barkles.com Diesel Laws

    Great post Keith! I too hope more companies explain their issues quickly and with a much more open and honest approach. Thank you for the information.

  • http://www.secondhack.com/ Madhav Tripathi

    So you are not happy at all with the Amazon AWS because there is no such support or communication.

  • http://www.facebook.com/profile.php?id=527736855 Mel Hamilton

    great piece there, a part of my website is down too. I use the free site About Me in the about me section. It’s still down. You can see it here

    http://www.risetv1.com

  • Bigpinots

    Thanks for this. We’ve just put a proposal together for a multi-national corporation with billion-dollar turnover, which recommended we test cloud ASAP (with a small site running on it). We also suggested that AWS be used as the failover for their main sites in case of a severe outage at the dedicated server facility.

    I was feeling a bit disappointed that they weren’t more open to consider AWS as their primary server option, but understood their concerns due to lack of experience/track record with AWS.

    AWS’s lack of service in problem times seems to justify their feelings and gives me concerns over whether they should even be used for the failover option.

    Can you tell me what level of service option you’d taken with AWS?

  • http://www.facebook.com/bill.harding2 Bill Harding

    Just posted a blog on this very topic that may be helpful for those evaluating their host: http://www.williambharding.com/blog/rants/ec2-vs-heroku-vs-blue-box-group-for-rails-hosting/

    Definitely would be interested to hear how others’ perceptions compare to mine.

  • http://profiles.google.com/mike.mainguy Mike Mainguy

    Aside from amazon’s lack of transparency, it’s pretty important to design for cloud computing. Don’t trust your cloud provider, or at best, trust but verify.

  • Schulzklaus

    Keith is the man. Couldn’t have said it better. Keith must have a journalism degree :)

  • http://twitter.com/rzeligzon Ron Zeligzon

    Spot On Keith! I think Amazon as an entity needs to be a bit more transparent, not just AWS. Lack of transparency nowadays, only hurts companies. Keith you shouldn’t be the one apologizing, Amazon should be the one apologizing.

  • http://freepository.com John Minnihan

    More than 24 hrs later, AWS is still working thru the issue(s). Clearly, this was a catastrophic, cascading type of failure that was either completely unthinkable (& thus had no contingency in place to handle), or the process(es) that did kick in simply failed (buggy, race condition, etc.).

    Planning for events like this is almost impossible. We may find out later that this was a human-error triggered event (why did an otherwise healthy subsystem fail diags?), but since the impact exposed so many of AWS’ customers to outages of their own, Keith + the rest of us are right to look at how we can continue to utilize AWS while minimizing risk.

    George Reese (@georgereese, founder of enStratus, author of O’Reilly’s Cloud Application Architectures) has some pretty good thoughts on this + can articulate them better than me, at least WRT using AWS + other cloud providers in a mixed-vendor architecture.

  • JO

    I have had zero problems with my server and EBS volumes on Amazon us-east-1. That said, it is very disconcerting to hear nothing from Amazon for the 1000s of others who are having problems. Amazon support (which is usually good) abandoned the community support forums since yesterday midday. The utter silence from Amazon is deafening…

  • srs

    Can’t rely on just one cloud vendor. Check out this simple animation that shows how to avoid these types of problems:

    You want to look at the “Complete in the cloud IT Organization” at the link below.
    http://www.batblue.com/usecases.php?first=499

  • Jacob Mason

    I completely agree. I was considering using AWS for an upcoming project. The reason I’ve decided against that now is not the downtime, but the abysmal communication.

  • http://profiles.google.com/smithkl42 Ken Smith

    According to what I’m reading elsewhere (http://blogs.gartner.com/lydia_leong/2011/04/21/amazon-outage-and-the-auto-immune-vulnerabilities-of-resiliency/), this outage doesn’t technically count against Amazon’s SLA, which only applies to EC2, and not to EBS or RDS. If Amazon wants to continue their amazing run of really bad PR, all they have to do is insist that they’ve actually fulfilled their SLA and not give any refunds. George Orwell would be proud :-).

  • http://twitter.com/ZacharyRD Zachary Reiss-Davis

    Thanks a lot; I really like your take on this and placing the emphasis on Amazon.com’s lack of transparency, as product failures are one thing, but not communicating to your customers about what’s going on is something else entirely. Your blog post also helped inspire mine: http://blogs.forrester.com/tim_harmon/11-04-22-good_proactive_marketing_cant_fix_problems_like_amazons_ec2_outage .

  • ejg

    Excellent post Ken. I experienced a website being down for an extended period of time yesterday for a pharmaceutical company in which I am an investor and was searching for some timely info. I searched some message boards and determined they may have been affected by this problem. I have since sent an email to investor relations asking if they were affected and why they didn’t post something on their home page apologizing for the outage.

  • Bob

    To people complaining out there … as if you’ve never done anything wrong in life. You’re always welcome to leave Amazon and go elsewhere

  • Madame Hardy

    This is an excellent and informative post. Thank you.

  • http://twitter.com/danielkushner Daniel Kushner

    Nice piece, but communication or not, what’s your plan B? Nolio customers were able to automatically deploy their applications to US West as well as other cloud providers – http://www.noliosoft.com

  • Marston Gould

    Having this outage was the best thing that could have happened to Amazon. I bet they learn more from this mistake than their successes. They will eventually figure out their issues and work diligently to harden their services more. That’s a good thing for everyone.

    • Weisscrow

      I sell on Amazon. Their communication is mired in some sort of legalese. They can’t tell me to leave, as that flags some sort of internal controls. But their customer service staff isn’t qualified on any level to answer real issues. This breakdown is recent, and reflects a change in staff or policy. It is like they get paid merely for quashing inquiries, with no weight placed on real solutions. They are simply too big to have to care. The worst part is there is no ability to escalate an issue, so at some point they make it clear: put up and shut up. I would not expose my servers to this black hole of communications. No way. Their lack of discussion equates to an admission on their part. Take your biz elsewhere.
