Amazon.com's real problem isn't the outage, it's the communication

Guest Commentary: Like many companies running on Amazon’s Web Services, BigDoor has been affected by the AWS outage today. Like most startups, we are braced for bad stuff to happen, and we do our best to learn from the painful stuff. We spent the better part of the day in constant contact via Twitter, emails and phone calls, apologizing to and updating the more than 250 of our publishers that were affected. Today has provided plenty of lessons, and because transparency is fast becoming the lifeblood of Seattle’s startup community, we thought we’d pass a few along.

At BigDoor, we made the decision early on to host everything we do with Amazon Web Services. Generally speaking, we have been huge fans, and we regularly find ourselves singing their praises to anyone who will listen. AWS has allowed us to scale a complex system quickly, and extremely cost effectively.

At any given point in time, we have 12 database servers, 45 app servers, six static servers and six analytics servers up and running. Our systems auto-scale when traffic or processing requirements spike, and auto-shrink when the extra capacity is no longer needed in order to conserve dollars.

In the ten months since we launched the public beta of our free, self-serve gamification platform, we have handled over one billion API calls. Without AWS, that simply would not have been possible with our small team and limited budget. Many others have realized similar benefits from the cloud, and AWS has quickly become a critical part of the startup ecosystem.

That’s not to say everything has been perfect.

The most notable ongoing issue we’ve experienced has been unreliable disk IO, specifically periodic slow disk writes. This is an understandable technical complexity given the nature of cloud computing, but early on we realized that the problem was being seriously exacerbated by Amazon’s unwillingness to publicly admit and discuss the severity of the issue.

We’ve managed to find workarounds to the technical challenges. But it was disconcerting to us that Amazon’s otherwise stellar system was being marred, not so much by a temporary technical issue as by what seemed like an unwillingness to embrace transparency.

Today that lack of transparency has continued.

As problems continued throughout the day, we experienced the obvious frustration from the system failure. But Amazon’s communication failure was even more alarming.

Our API, our main website and our publisher admin services have been almost entirely offline since around 1:30 a.m. PST, with only brief periods of availability. This has effectively rendered us powerless to service our customers.

Starting at 1:41 a.m. PST, Amazon’s updates read as if they were written by attorneys and accountants hedging against the company’s stated SLA, rather than by a tech guy trying to help another tech guy.

We aren’t just sitting around waiting for systems to recover. We are actively moving instances to areas within the AWS cloud that are actually functioning. If Amazon had been more forthcoming about what it was experiencing, we would have been able to restore our systems sooner.

There are a lot of really obvious and relatively easy things that any startup can do to avoid an all-out reliance on any single cloud provider, but those things take additional time and money – two of the most important things that every startup is constrained by.

We absolutely love AWS because of the pace of innovation and scale that it has allowed us to accomplish. But after today’s episode is over, we will have a big decision to make.

We can spend cycles designing and building technical belts and suspenders that will help us avoid a massive failure like this in the future, or we can continue to rely on a single huge partner and keep up our breakneck pace of iteration and product development.

I can’t tell you today which option we will choose. But I’m sure it will be the question on the mind of many startups across the country.

If we come up with a good answer, we will be sure to be good members of Seattle’s startup ecosystem by being transparent and sharing our solution. I encourage all other startups to do the same, and I hope this serves as a public request for Amazon to join the crowd.

Keith Smith is CEO of BigDoor, a Seattle startup that builds game mechanics into online publishers’ Web sites. You can follow the company on Twitter @bigdoormedia.
