Update 6:13pm PT: Google sounded the all-clear at 5:09pm PT, saying that all service had been restored. Our original story follows below.
Six weeks after Google dubiously claimed it ran the most reliable cloud computing service of the Big Three cloud providers, a widespread networking issue took out Google Cloud service on the East Coast of the U.S. and parts of Europe Sunday, according to the company's status page and frustrated users on Twitter.
At 12:25pm PT, Google Cloud acknowledged that it was having issues with the core Google Compute Engine service, later updating its status page to pin the blame on unspecified problems with Google Cloud Networking. Details are scant, but at first blush the outage sounds similar to the networking-related outage Microsoft suffered last month, which affected service for hours.
Major Google services including Gmail, Google Calendar, and Google Hangouts were also affected by the outage, according to the G Suite status page. Service appeared to be working normally on the West Coast of the U.S. with the exception of Google Analytics, which was not working in Portland Sunday afternoon.
Widely used services such as YouTube and Snapchat were also down in the affected regions, which appeared to include parts of the Northeast and Europe. DownDetector, which tracks user-submitted reports of outages, observed a large outage in Europe as of 1:25pm PT Sunday for Snapchat, one of Google Cloud’s largest customers.
The outage is a black eye for Google’s attempts to portray itself as the most reliable cloud computing service during its Google Cloud Next event in April. The sources cited by Google on a presentation slide making that claim during a major keynote address were unwilling to back up Google’s conclusions when contacted by GeekWire after the event.
As of 1:21pm PT, Google said that engineers were working to mitigate the issues and promised to share more details by 2pm PT. We’ll update this post as more information becomes available.
Update 1:44pm PT: In an update to the Google Compute Engine status page, Google shared a little more information about the problems and promised a fix was on the way.
We are experiencing high levels of network congestion in the eastern USA, affecting multiple services in Google Cloud, GSuite and YouTube. Users may see slow performance or intermittent errors. We believe we have identified the root cause of the congestion and expect to return to normal service shortly.
Update 3:06pm PT: Google updated its status page again: “Our engineering teams have completed the first phase of their mitigation work and are currently implementing the second phase, after which we expect to return to normal service. We will provide an update at 16:00 US/Pacific.”
Update June 4th: Google provided a few details about the cause of the outage Tuesday, and said a formal incident report will be issued shortly. The short version:
In essence, the root cause of Sunday’s disruption was a configuration change that was intended for a small number of servers in a single region. The configuration was incorrectly applied to a larger number of servers across several neighboring regions, and it caused those regions to stop using more than half of their available network capacity. The network traffic to/from those regions then tried to fit into the remaining network capacity, but it did not. The network became congested, and our networking systems correctly triaged the traffic overload and dropped larger, less latency-sensitive traffic in order to preserve smaller latency-sensitive traffic flows, much as urgent packages may be couriered by bicycle through even the worst traffic jam.
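To make the triage behavior Google describes a little more concrete, here is a minimal sketch of priority-based admission under reduced capacity. It is an illustration only, not Google's actual mechanism: the `Flow` type, the flow names, the bandwidth units, and the capacity figures are all hypothetical assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    name: str                # hypothetical label, for illustration only
    demand: int              # bandwidth demand in arbitrary units
    latency_sensitive: bool


def triage(flows, capacity):
    """Admit flows until capacity runs out, preferring latency-sensitive
    flows first and smaller flows second. Returns (admitted, dropped)."""
    ordered = sorted(flows, key=lambda f: (not f.latency_sensitive, f.demand))
    admitted, dropped, used = [], [], 0
    for flow in ordered:
        if used + flow.demand <= capacity:
            admitted.append(flow)
            used += flow.demand
        else:
            dropped.append(flow)
    return admitted, dropped


# Hypothetical traffic mix; none of these numbers come from Google.
flows = [
    Flow("interactive-request", 5, True),
    Flow("video-stream", 40, False),
    Flow("bulk-backup", 60, False),
    Flow("health-check", 1, True),
]

full_capacity = 120
# Assumed stand-in for losing "more than half" of capacity: 50 units remain.
reduced_capacity = full_capacity // 2 - 10

ok_admitted, ok_dropped = triage(flows, full_capacity)
bad_admitted, bad_dropped = triage(flows, reduced_capacity)

print("full capacity, dropped:", [f.name for f in ok_dropped])
print("reduced capacity, dropped:", [f.name for f in bad_dropped])
```

In this toy model, everything fits at full capacity. Once capacity is cut, the small latency-sensitive flows still get through while the largest bulk flow is dropped, which is the rough shape of the behavior the statement describes.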