Yesterday, Microsoft Teams—a combination instant messaging, chat, and collaboration package competing with Slack and the new version of Google Hangouts—was inaccessible for several hours, from approximately 8:30am to 11:30am ET.
By 10:30am, Microsoft acknowledged on Twitter that the outage was the result of an expired SSL certificate. Approximately an hour later, the company had secured a replacement certificate and begun deploying it in production, with service widely restored by Monday afternoon.
This isn’t Microsoft’s first major public embarrassment due to a renewal failure. The company was responsible for one of the most famous “oops, we accidentally the whole domain” incidents in 1999, when it allowed the domain registration for passport.com to expire. The domain was responsible for authentication for a variety of Microsoft services, including Hotmail.com and Microsoft Messenger.
Shortly after Passport.com’s expiration made the front page at Slashdot, a Hotmail user who says he “wanted to see what would happen” paid the $35 renewal fee himself, restoring service. Microsoft later reimbursed the good Samaritan, Linux consultant Michael Chaney, with a $500 check that he in turn auctioned off on eBay for $7,100, donating the proceeds to charity.
A few years later, Microsoft dropped the ball on domain registration again, allowing hotmail.co.uk to go dark in 2003—and this time, the private individual didn’t just pay the fee, they actually bought the domain. Happily, the anonymous individual who purchased the expired domain did not change its DNS records and transferred it back to Microsoft shortly afterward.
The company has not yet made a postmortem analysis of yesterday’s Teams failure available. Most reports have characterized it as somebody forgetting to renew the certificate, but it’s equally possible that an automated renewal system failed and no one at the company detected the problem until after the service was widely reported as down.
We won’t presume to tell Microsoft how to run a 20-million-user service, but smaller operations can easily avoid similar issues. The EFF’s Certbot automates renewal of free Let’s Encrypt SSL certificates, and the Nagios monitoring system includes a plugin that automatically tests deployed SSL certificates and warns its operator if they are approaching their expiration date.
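For readers who want to see what that kind of expiration check involves, here is a minimal sketch in Python using only the standard library. It is not the Nagios plugin or anything Microsoft runs. The function names (`fetch_not_after`, `days_remaining`, `status`, `check_host`) and the 30-day warning threshold are our own assumptions, chosen purely for illustration.

```python
import socket
import ssl
import time

WARN_DAYS = 30  # hypothetical threshold; pick whatever lead time suits you


def fetch_not_after(host, port=443, timeout=10.0):
    """Connect to host:port over TLS and return the certificate's notAfter string."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            # getpeercert() returns a dict; notAfter looks like
            # "Jan  5 09:34:43 2018 GMT"
            return tls.getpeercert()["notAfter"]


def days_remaining(not_after, now=None):
    """Return days (possibly fractional or negative) until the notAfter time."""
    expires = ssl.cert_time_to_seconds(not_after)
    if now is None:
        now = time.time()
    return (expires - now) / 86400


def status(days_left, warn_days=WARN_DAYS):
    """Map the remaining days onto a simple three-level status."""
    if days_left < 0:
        return "EXPIRED"
    if days_left < warn_days:
        return "WARNING"
    return "OK"


def check_host(host):
    """Fetch the live certificate for host and return (status, days_left)."""
    left = days_remaining(fetch_not_after(host))
    return status(left), left
```

Calling `check_host("example.com")` from a daily scheduled job and alerting on anything other than `"OK"` would cover the basic case. One caveat: because the sketch uses Python's default verification settings, a certificate that has already expired will cause the TLS handshake itself to raise `ssl.SSLCertVerificationError` rather than return `"EXPIRED"`, so a real check should treat that exception as a failure too.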