As some sites continue to show signs of Amazon.com’s cloud computing glitches, the company updated its public Amazon Web Services Health Dashboard a short time ago to say that it’s “starting to see more meaningful progress in restoring volumes” in the area of its Elastic Compute Cloud (EC2) service that has been suffering problems since early yesterday.

The company said in an earlier post on the dashboard that it’s waiting for complete recovery to conduct a full post mortem, but AWS customers are already reflecting on what happened and discussing how they can avoid falling victim to future outages.

One of the biggest surprises is that glitches in Amazon’s Northern Virginia data center affected other zones of the service, despite Amazon’s strategy of “insulating” individual zones from problems in others. Jon Brodkin of NetworkWorld has a good post on that topic.

Many sites that rely on Amazon Web Services have recovered, some using workarounds. Others, such as youth sports site Blue Sombrero, haven’t been so lucky. The message posted on its site this morning captures the mood on Day 2:

We have been up all night hammering at the Amazon support techs. Unfortunately, we are at their mercy at the moment. As one of our customers yesterday put it, the “internet gods” are not happy right now.

Amazon is the best, most reliable cloud service provider in the world, and though that isn’t much of a consolation right now, we haven’t faced any issues with them since moving to their platform over two years ago. In all of our contingency planning, we did not foresee an event like this. There is no data loss at all here, just complete internet outage!

We hope for a resolution soon, but we will continue to work through the day, night and weekend to get all our services back online. The techs at Amazon are working on this round the clock and are making progress. We know how important it is to get things back up and running as soon as possible, and we can assure you that we are doing everything in our power.

A guest post on GeekWire last night by Keith Smith of Seattle gamification startup BigDoor, calling on Amazon Web Services to be more transparent in its communications about the problem, has been receiving widespread attention across the web. Smith writes that Amazon’s updates “read as if they were written by their attorneys and accountants who were hedging against their stated SLA (service level agreement) rather than being written by a tech guy trying to help another tech guy.”

Also see interesting coverage and perspective on Mashable and AllThingsD.com.

Comments

  • http://www.atebymonsters.com Matt

    It’s not really going to change … my company uses Microsoft Online Services and we constantly fight the battle of outage communication. It really takes Microsoft months before they send out any kind of resolution, and that’s a report you need to request from them. Their health dashboard is also greatly delayed.

    I remember my piece I wrote for you guys at TechFlash about the issues with the Cloud. This might prove the point a little more that you can’t put all your eggs into one basket.

  • http://profiles.google.com/smithkl42 Ken Smith

    There’s been some good news on this front, actually. It turns out that according to Amazon’s SLA, there hasn’t actually been any downtime. So there isn’t anything to worry about. http://blogs.gartner.com/lydia_leong/2011/04/21/amazon-outage-and-the-auto-immune-vulnerabilities-of-resiliency/
