Outage reports issued by most cloud companies tend to be meaningless exercises in brevity, designed by lawyers and put into prose by engineers rarely interested in explaining exactly what happened. Kudos to Microsoft Azure’s North Europe team for bucking that trend.
As spotted by The Register, the report issued by Microsoft following a nearly seven-hour outage last week that affected customers of Azure’s North Europe region is a model for what cloud customers should expect in an outage report. Maintenance workers at a data center in the region accidentally released fire suppressant materials, which, in order to contain the release, triggered a shutdown of the cooling units that keep data centers up and running. That meant that “the ambient temperature in isolated areas of the impacted suppression zone rose above normal operational parameters,” Microsoft said in its report.
Servers and data storage units are designed to recognize when the temperature rises and either shut down or reboot themselves in order to prevent failures. But some of those units “did not shutdown in a controlled manner. As a result, additional time was required to troubleshoot and recover the impacted resources,” Microsoft said. The outage knocked a host of services, including Virtual Machines, Azure Backup, and Azure Functions, offline for several hours until Microsoft could get enough servers and storage units working again to handle everyone’s workloads in the region.
The whole report is available here. Anyone who lost revenue during the outage is unlikely to be totally satisfied with merely a comprehensive report, to be sure. But cloud customers should demand this level of detail from their providers after any significant outage. Scroll down the page from the North Europe report and read the reports on other outages that affected Azure regions over the last few weeks: they’re not exactly detailed. Amazon Web Services and Google are hardly better when it comes to explaining exactly why their services were offline.
Whether an outage was caused by human error, an act of God, or something in between, cloud computing is still run by human beings for human beings. Building trust in that relationship is a good way to retain customers, and detailed reports are a good reminder for customers that always-on cloud computing is an extremely difficult thing to do at scale, even at the most state-of-the-art data centers.