It’s kind of amazing how reliable modern internet services have become over the last decade, which is why we freak out when they go down. But for Slack’s Julia Grace, each outage is an opportunity to learn something new about how to scale massive web services without throwing stones at the responsible team.
Grace, director of infrastructure for one of the buzziest enterprise collaboration startups on the planet, walked Structure 2017 attendees on Wednesday through the company’s Halloween 2017 outage, the most significant downtime event it experienced in 2017. Slack users can spend up to 10 hours a day connected to the workplace collaboration service, and “if the product isn’t up, if it isn’t fast, if it isn’t reliable, you can’t do your job,” she said.
That afternoon, Slack deployed an infrastructure change that it had thoroughly tested, Grace said, but the change caused “a mass disconnect, and then a mass reconnect.” Basically, everybody got kicked off Slack and then everybody tried to load Slack at the same time; “it’s similar to DDoSing yourself,” Grace said, referring to the distributed denial-of-service attacks used by those who wish to force websites offline by bombarding them with traffic. The company fixed the problems after a couple of hours, but it wasn’t exactly a treat.
This is a process that almost every company operating at massive scale on the internet has gone through; before it became a disinformation conduit and safe space for Nazis, Twitter’s outage troubles nearly derailed all of the company’s momentum until it discovered ways to finally bury the iconic “fail whale.” Everyone learns from the companies that have scaled before them, and can often rely on open-source technologies that were developed in response to those problems, but they also encounter brand-new problems unique to their services that require new ways of thinking about infrastructure.
“We’re building very very complex systems,” Grace said. “Things are going to fail, unknown unknowns are going to happen. Our infrastructure is in a better state today for having gone through all this.”
Slack was pretty conservative in its early days when it came to infrastructure decisions, opting for tried-and-true technologies over the enterprise computing flavor of the week. That’s starting to change a little under Grace’s watch, after she built a 45-person team to focus on infrastructure strategy and maintenance within a company that had been hyper-focused on product engineering until she arrived two years ago.
The company has developed new technology to handle some of its unique problems, such as an edge caching system that stores critical data in resources that are closer to end users around the world. Acknowledging the debt that Slack, like nearly every modern enterprise software company, owes to the open-source community, she said Slack hopes to open-source that caching system at some point. But the company wants to make sure the tech is mature and that Slack is ready to be “good maintainers” of an open-source project.