A series of failures at Microsoft’s San Antonio data centers continued to cause issues for some customers into the second day back from the holiday weekend, although it seems the worst is over.
Azure’s status page continues to warn of potential problems with nearly everything run out of its South Central US cloud region, which was knocked offline Tuesday after an overnight lightning strike interrupted power to the facility’s cooling systems. The company was forced to shut down systems to prevent damage from overheating, and it is taking a long time to restore some of the storage services associated with that region.
The good news (depending on your outlook) is that Active Directory and Visual Studio Team Services are working again in regions outside the South Central US, allowing software developers using those services to get back to their jobs. Visual Studio Team Services, a cloud-based development environment that is an important cog in the software development process, remains sluggish and sporadically unavailable out of the South Central US data center, according to that division’s blog.
Hopefully Microsoft follows the example it set last year after an outage in Northern Europe caused by a similar weather incident took down service for seven hours, and puts together a public, comprehensive post-mortem on what happened and how it will be avoided in the future. There are always going to be outages with cloud services from any vendor, especially when severe weather is the underlying cause, but this is a pretty bad one.
And while the exact details of which cooling systems on which buildings failed remain unclear, this service disruption could be a good advertisement for the benefits of availability zones, which allow cloud customers to spread their workloads across several separate buildings within a given cloud computing region in hopes of avoiding issues with a single data center building.
Availability zones were not a part of Microsoft’s infrastructure strategy until last year, and the company has rolled them out to all of its customers in only three of the 54 regions it operates worldwide (they are available as a preview in East US 2 and Southeast Asia). They have long been part of rival Amazon Web Services’ infrastructure strategy, however, and after this event they are likely to come up more often in cloud deal negotiations.
Microsoft, through its public-relations agency, has not responded to requests for more information on the details of the issues in San Antonio. The Azure status page, which has been updated multiple times over the last 36 hours, says to expect the next update by 1pm Pacific Time.