Microsoft tonight detailed and apologized for a series of glitches that caused a huge email backlog in its cloud-based Microsoft Exchange Online subscription service — in some cases delaying messages for as long as three to nine hours.
The company’s Business Productivity Online Services (BPOS) team, which runs the service, also acknowledged a separate problem that hampered access to its Outlook Web Access service and Outlook email program, and caused problems for Exchange ActiveSync devices.
It’s the latest high-profile problem to hit a major cloud-computing service, although the problems weren’t as public or as apparent to outsiders as the recent Amazon Web Services outage, which affected a series of major web sites. Microsoft’s BPOS customers drew attention to the problems through forum posts and complaints that garnered media attention earlier today.
Microsoft tonight promised service credits for affected customers, vowed to address the underlying technical problems, and said it would improve communication when incidents arise in the future.
“As I’ve said before, all of us in the BPOS team and at Microsoft appreciate the serious responsibility we have as a service provider to you, and we know that any issue with the service is a disruption to your business – that’s not acceptable,” said Dave Thompson, corporate vice president of Microsoft Online Services, in the blog post explaining what happened. (Note: Name corrected since original post.)
More background on the problems is available in an earlier post from Mary Jo Foley of ZDNet. Here are the technical details from tonight’s Microsoft post.
On Tuesday at 9:30am PDT, the BPOS-S Exchange service experienced an issue with one of the hub components due to malformed email traffic on the service. Exchange has the built-in capability to handle such traffic, but encountered an obscure case where that capability did not work correctly. The result was a growing backlog of email. By 12:00am PDT, the malformed traffic was isolated and the mail queues cleared. The delays encountered by customers varied, on the order of 6-9 hours. Short term mitigation was implemented and a fix was under development.
At 9:10am PDT today, service monitoring again detected malformed email traffic on the service. The problem was resolved at 10:03am, but users experienced email delays of up to 45 minutes during this time. A second but related issue was detected via monitoring at 11:35am PDT, resulting in email stuck in some end users’ outboxes. The issue was remediated at 12:04pm PDT. During this time, more than 1.5 million messages had queued on the service awaiting delivery. The backlog was 90% clear by 4:12pm, but because of this large backlog of email, customers may have experienced delays of as long as 3 hours. We are implementing a comprehensive fix to both problems.
In an unrelated incident, starting at 1:04am PDT, service monitoring detected a failure in the Domain Name Service (DNS) hosting the http://mail.microsoftonline.com domain. This failure prevented users from accessing Outlook Web Access hosted in the Americas, and partially impacted some functionality of Microsoft Outlook and Microsoft Exchange ActiveSync devices. The team diagnosed and fixed an underlying problem in the servers hosting DNS for the http://mail.microsoftonline.com domain, and restored service at 4:52am PDT. The team identified a number of improvements in our handling of problems associated with DNS, and will make a full post mortem of this incident available through Microsoft Support.
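For readers who want to put the quoted times side by side, the following is a rough sketch, not anything published by Microsoft, that turns the same-day clock times from the post into durations. The helper names (`to_minutes`, `minutes_between`) are made up for illustration. The drain-rate figure rests on two loud assumptions: that the backlog was exactly 1.5 million messages (the post says "more than") and that draining began at 12:04pm.

```python
def to_minutes(clock: str) -> int:
    """Convert a 12-hour clock string such as '9:10am' to minutes after midnight."""
    time_part, suffix = clock[:-2], clock[-2:].lower()
    hours, minutes = map(int, time_part.split(":"))
    hours = hours % 12 + (12 if suffix == "pm" else 0)
    return hours * 60 + minutes


def minutes_between(start: str, end: str) -> int:
    """Minutes elapsed between two clock times on the same day."""
    return to_minutes(end) - to_minutes(start)


# Clock times as quoted in the Microsoft post (all PDT, each pair on one day).
windows = {
    "malformed traffic recurrence": ("9:10am", "10:03am"),
    "stuck outbox issue": ("11:35am", "12:04pm"),
    "backlog drain to 90%": ("12:04pm", "4:12pm"),
    "DNS failure": ("1:04am", "4:52am"),
}

durations = {name: minutes_between(*span) for name, span in windows.items()}

# Assumption: treat "more than 1.5 million" as exactly 1.5 million, and
# assume the queue started draining when the issue was remediated at 12:04pm.
QUEUED = 1_500_000
drain_rate = 0.9 * QUEUED / durations["backlog drain to 90%"]

for name, mins in durations.items():
    print(f"{name}: {mins} min")
print(f"approximate drain rate: {drain_rate:,.0f} messages/min")
```

Under those assumptions, the DNS outage lasted a little under four hours and the queue took just over four hours to reach 90% clear, which is in line with the delays of up to three hours that Microsoft describes.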