Beginning a little after 9:30 a.m. Pacific time Tuesday and lasting close to five hours, Amazon Web Services’ S3 cloud storage service experienced “high error rates.” The outage knocked out access to a long list of websites and apps that run on AWS, including Expedia, Slack, Medium and the U.S. Securities and Exchange Commission. It even temporarily affected the AWS service health dashboard, which displays outages and events.
The dashboard not changing color is related to S3 issue. See the banner at the top of the dashboard for updates.
— Amazon Web Services (@awscloud) February 28, 2017
Amazon has not fully detailed what caused the high error rates. Nick Kephart, senior director of product marketing for San Francisco-based network intelligence company ThousandEyes, monitored the outage throughout the day. He said traffic could reach Amazon’s overall network, but attempting to establish a network connection with the S3 servers was like hitting a wall: it stopped all traffic dead in its tracks. As a result, any site or app that hosted data, images or other information on S3 was affected.
Without access to Amazon’s servers, Kephart couldn’t say why it became impossible to connect with the S3 servers. He said it isn’t clear whether human error, an infrastructure failure, a configuration problem or an automation issue caused the problem. But he theorized that it was a fairly complicated malfunction, given how widely the outage spread.
“It wasn’t just the system completely misbehaving but something deeper in the infrastructure that caused these problems,” Kephart said.
ThousandEyes also produced a visualization showing the extent of the outage and the interactions within the AWS network.
As to why the outage was so widespread, Amazon’s status as cloud king, with a market share of more than 40 percent, comes into play. Another factor, Kephart said, is the way AWS services are built on top of each other, meaning that when S3 goes down, other services go down with it.
“Amazon Web Services builds many of their individual services on building blocks built on each other,” Kephart said. “S3 is one of the very fundamental building blocks of AWS. When S3 fails, many, many, many other services fail alongside because they are all built on top of S3.”
Now that the issues have been worked out, the question turns to what can be learned from this outage. Several experts surveyed by GeekWire say the most important takeaway from this event is the necessity of redundancy in cloud storage.
Shawn Moore, CTO of Orlando-based web experience platform Solodev, said all technology fails at some point. Large swaths of the internet went down in Tuesday’s outage, but other sites and apps didn’t experience any disruption. Those are the ones that had their data spread across multiple regions.
“The ones who have fully embraced Amazon’s design philosophy to have their website data distributed across multiple regions were prepared,” Moore said. “This is a wakeup call for those hosted on AWS and other providers to take a deeper look at how their infrastructure is set up and emphasizes the need for redundancy – a capability that AWS offers, but it’s now being revealed how few were actually using.”
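The multi-region idea Moore describes can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how any particular site or AWS itself implements it: `RegionStore` and `MultiRegionStore` are hypothetical in-memory stand-ins rather than real AWS calls, and the region names are only labels. Every write goes to each reachable region, and a read falls through to the next region when one is down.

```python
class RegionUnavailable(Exception):
    """Raised when a (simulated) region cannot serve requests."""


class RegionStore:
    """Hypothetical in-memory stand-in for one region's object store."""

    def __init__(self, name):
        self.name = name
        self.objects = {}
        self.available = True  # flip to False to simulate an outage

    def put(self, key, data):
        if not self.available:
            raise RegionUnavailable(self.name)
        self.objects[key] = data

    def get(self, key):
        if not self.available:
            raise RegionUnavailable(self.name)
        return self.objects[key]


class MultiRegionStore:
    """Writes to every reachable region; reads from the first that answers."""

    def __init__(self, regions):
        self.regions = list(regions)

    def put(self, key, data):
        written = 0
        for region in self.regions:
            try:
                region.put(key, data)
                written += 1
            except RegionUnavailable:
                continue  # skip the region that is down
        if written == 0:
            raise RegionUnavailable("no region accepted the write")
        return written  # number of regions now holding a copy

    def get(self, key):
        for region in self.regions:
            try:
                return region.get(key)
            except (RegionUnavailable, KeyError):
                continue  # region down or missing this object: try the next
        raise LookupError(f"{key!r} could not be read from any region")


if __name__ == "__main__":
    east = RegionStore("us-east-1")  # illustrative labels only
    west = RegionStore("us-west-2")
    store = MultiRegionStore([east, west])
    store.put("logo.png", b"image-bytes")
    east.available = False  # simulate a regional outage
    print(store.get("logo.png"))  # the second region still serves the object
```

The trade-off in this sketch is cost and complexity: every object is stored more than once, and a real deployment would also need to re-sync a region that missed writes while it was down.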
David Linthicum is senior vice president at Cloud Technology Partners, a company based in Boston that helps enterprises migrate their data to cloud storage providers like AWS, Microsoft Azure and Google Cloud. He said the outage seems like an isolated incident, something that is bound to happen occasionally.
“Systems fail, and from time to time clouds will fail,” he said. “Amazon’s ability to get things up and running quickly, and get back to business, will be the real test.”
Linthicum went on to say that he doesn’t think Tuesday’s outage will keep people from using cloud storage.
“Amazon Web Services, and the other public cloud providers, pretty much stay on top of their operations,” he said. “Certainly much better than enterprises do.”
In addition to pushing redundancy and hosting data at multiple centers in different regions, experts emphasized using multiple cloud providers to store data. Not only does that protect customers from a system-wide outage, it can also let users switch between providers as cost dictates.
Akash Nankani, a former lead program manager at Microsoft, founder of NanSoft Studios and creator of the government filing tracking site SECGems, said he tries to make his products “provider agnostic,” so that if an incident like Tuesday’s AWS outage went on for a long time, he could make a quick change to remove the AWS dependency.
“In my view, every business should ask this question to themselves: ‘If tomorrow, for whatever reason (valid or invalid), if Amazon (or any other provider that you depend on) decides to ban/blacklist my account or business, how will I deal with it? How soon before I can recover from it? And have I pro-actively tested this scenario before it occurs?'”
“While I have a great deal of respect for Amazon/Microsoft/Google/IBM Bluemix/OVH, etc. and have used/experimented with all of them, from a business continuity perspective, I think investing in ‘multi-provider’ support is more important than ‘multi-region.’ This also comes with the benefit of dynamically switching to lowest cost provider as well as dealing with provider/regional outage.”
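A “provider agnostic” setup of the kind Nankani describes might look something like the Python sketch below. It is an illustration only, not his implementation: the `Provider` and `ProviderAgnosticStore` classes, the provider names and the prices are all hypothetical, and the providers are simulated in memory rather than backed by any real cloud API. The application talks to a single interface, data is replicated to every healthy provider, and reads go to the cheapest healthy provider that holds the object.

```python
from dataclasses import dataclass, field


@dataclass
class Provider:
    """Hypothetical stand-in for one cloud storage vendor."""
    name: str
    cost_per_gb: float  # made-up price, for illustration only
    healthy: bool = True
    objects: dict = field(default_factory=dict)

    def put(self, key, data):
        self.objects[key] = data

    def get(self, key):
        return self.objects[key]


class ProviderAgnosticStore:
    """Single interface in front of several interchangeable providers."""

    def __init__(self, providers):
        self.providers = list(providers)

    def healthy_by_cost(self):
        # Healthy providers only, cheapest first
        healthy = [p for p in self.providers if p.healthy]
        return sorted(healthy, key=lambda p: p.cost_per_gb)

    def put(self, key, data):
        targets = self.healthy_by_cost()
        if not targets:
            raise RuntimeError("no healthy provider available")
        for provider in targets:
            provider.put(key, data)  # replicate so any provider can serve reads
        return [p.name for p in targets]

    def get(self, key):
        # Prefer the cheapest healthy provider that actually has the object
        for provider in self.healthy_by_cost():
            if key in provider.objects:
                return provider.name, provider.get(key)
        raise KeyError(key)


if __name__ == "__main__":
    a = Provider("provider_a", cost_per_gb=0.02)
    b = Provider("provider_b", cost_per_gb=0.03)
    store = ProviderAgnosticStore([a, b])
    store.put("filing.json", b"{}")
    print(store.get("filing.json"))  # served by the cheaper provider
    a.healthy = False                # simulate a provider-wide outage
    print(store.get("filing.json"))  # served by the remaining provider
```

Keeping the provider behind one small interface is what makes the “quick change” possible: dropping a vendor means removing it from the list, not rewriting the application.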