Amazon Web Services: A year after the big outage, one startup's perspective

It’s hard to believe that it’s been a year since the big April 2011 Amazon Web Services outage. But we figured the anniversary of that painful week would be a good time to provide an update on our experiences since we got knocked offline for several days.

When Keith Smith and I launched BigDoor almost three years ago, we chose to drive our business using many of the principles from Eric Ries’ Lean Startup. We iterate on our product weekly, continually innovating on behalf of our customers. We run a very small team, and always look to reduce waste.

With regard to operations, from the outset I was dead-set on minimizing the time we spent in rack-stack-and-cable mode, in order to maximize our product focus time. Amazon Web Services allowed us to do that, accelerating our time to market and giving us a level of agility that only server virtualization can.

Looking back over the past twelve months, and revisiting the changes we made in response to the outage, brings some worthwhile observations to light:

AWS took action to address their communication and transparency issues. (See Keith Smith’s original guest post on GeekWire: Amazon.com’s real problem isn’t the outage, it’s the communication)
If you need or want what Elastic Block Store provides, options haven’t changed, but risk may have.
EBS IOPs performance remains the primary concern for us and for every other AWS EBS user I’ve talked to.

AWS communication: We now have recurring meetings with our AWS account rep and a great solutions architect. These meetings have been huge for us, largely because we get a sense for what’s going on inside Amazon.com, and to a lesser degree because we can get answers to myriad questions that would take forever to iron out over email.

But that’s just us. (I’d love your comments if you’ve had a different experience). In any case, there are still two big things that Amazon could do that I think would make a huge difference:

1. Publish what they did over the last year to address what happened, referencing their own root cause analysis and its listed remedies.
2. Make some form of their product roadmap available publicly, or at least to all AWS customers.

For the #2 ask, we all get the idea behind keeping this kind of thing under wraps and away from competitors’ prying eyes. But there must be a low-risk way to give us all a better idea of what’s coming, and when.

Amazon is in the unique position to not have to worry about competitors catching up – as far as I’m aware, no one is even close to providing the breadth of services they are. (Related story: Stat of the day: A third of all Internet users visit a site that uses Amazon’s infrastructure)

Elastic Block Store options: If you want the convenience, persistence and back-up-ability of EBS drives, not much has changed as a result of last year’s outage. No new mitigating products or features are available. There’s always ephemeral storage, and you can even get better performance out of it with software RAID. But for us, ephemeral-backed instances take too long to start (2-3x longer, in our testing) to handle the needs of our spiky traffic and app host scaling groups, and using non-persistent RAID storage on our database hosts doesn’t fly for various reasons.

One potential option here: making EBS resources cheaper and easier to manage across regions would go a long way toward helping.

While we took a number of smaller steps in reaction to the outage, ultimately we’ve still got a same-ish EBS risk profile; we chose to continue our break-neck pace of iteration and product development, betting on Amazon’s ability to prevent this from happening again. A year later, that bet seems to have paid off.

Has the overall risk of using EBS changed? See my #1 ask above.

Elastic Block Store Input/Output Operations Per Second performance: I attended and spoke at the Percona Live MySQL Conference last week (slides, if you’re interested), and amongst all the discussion around sharding and database performance, there was plenty around AWS and virtualization in general. The big gripe, almost a foregone conclusion in many minds at this point, is still that EBS IOPs, particularly writes, are slow and highly variable: their only predictable aspect is their unpredictability.

We do a lot to mitigate this (caching, buffering, etc). We sharded our database systems last year, going from a single monolithic primary database host to 32 sharded nodes, and we’re testing a new real-time custom BI platform, all built with help from EBS. In terms of HTTP requests, our API now handles in a couple of months the amount of traffic it handled in the entire year prior to the outage, with low latency (~0.4s avg). But we’re still bitten on occasion by transient, bigger dips in EBS IOPs performance.
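For readers wondering what routing across 32 shards can look like, here is a minimal sketch. It is not our production scheme: the hash-modulo approach, the `shard_for` and `shard_host` helpers, the key format and the host naming pattern are all assumptions made up for illustration. Only the shard count of 32 comes from the text above.

```python
import hashlib

# The shard count matches the 32 nodes mentioned above; everything else
# in this sketch is a hypothetical illustration.
NUM_SHARDS = 32


def shard_for(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a key to a shard index using a stable hash.

    hashlib is used instead of the built-in hash() because hash() is
    randomized per process for strings, which would send the same key
    to different shards on different hosts.
    """
    digest = hashlib.md5(key.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % num_shards


def shard_host(key: str) -> str:
    """Return a (hypothetical) database host name for the given key."""
    return f"db-shard-{shard_for(key):02d}.example.internal"


if __name__ == "__main__":
    for user_key in ("user:1", "user:2", "user:3"):
        print(user_key, "->", shard_host(user_key))
```

One tradeoff worth noting: plain hash-modulo routing is simple, but changing the shard count remaps most keys, so a scheme like this tends to suit a fixed number of shards chosen with headroom.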

Has performance in aggregate actually gotten better over the last year, though? See my #1 ask above.

My guess is that Amazon is working hard to address the IOPs performance issues, but it would be great if we all knew – see my #2 ask above.

Even in consideration of the insane new Washington state taxes we’re having to pay now for AWS usage, the result of horrible decision-making on the part of our short-term-thinking legislators, it’s been a great year, and 2012 is looking even better for both BigDoor and for AWS.

So, my final request for Amazon Web Services: how about a little retrospective, and a look ahead, from your side?

Jeff Malek is Co-Founder and CTO of BigDoor, a Seattle startup that offers a way for websites to easily create their own Gamified Rewards Program. You can follow the company on Twitter @bigdoor and Jeff @jpmalek.
