Yesterday, Facebook users everywhere experienced an unexpected and prolonged service blackout that affected access to all of its apps, including WhatsApp, Instagram, and Messenger. In the time since, Facebook has published two blog posts explaining what happened.
Late on Monday evening, the company published the first blog post explaining what caused the dramatic problem. Santosh Janardhan, VP of infrastructure at Facebook, wrote that “the root cause of this outage was a faulty configuration change,” elaborating that the issues occurred in “configuration changes on the backbone routers that coordinate network traffic between Facebook’s data centers.”
That network traffic disruption not only halted services on Facebook-owned apps such as WhatsApp but also “impacted many of the internal tools and systems we use in our day-to-day operations, complicating our attempts to quickly diagnose and resolve the problem,” Janardhan adds.
Late this afternoon, Facebook published a second, more detailed blog post explaining exactly what went wrong. In it, Janardhan writes that “the backbone” he previously referenced “is the network Facebook has built to connect all our computing facilities together,” linking all of Facebook’s data centers across the world through physical wires and cables. These data centers are responsible for storing data, keeping the platform running, and connecting Facebook’s network to the rest of the internet.
“The data traffic between all these computing facilities is managed by routers, which figure out where to send all the incoming and outgoing data. And in the extensive day-to-day work of maintaining this infrastructure, our engineers often need to take part of the backbone offline for maintenance — perhaps repairing a fiber line, adding more capacity, or updating the software on the router itself,” Janardhan explained.
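The routers Janardhan describes decide where to send each packet by finding the most specific route that matches its destination, a process known as longest-prefix matching. A minimal sketch of the idea, using a toy routing table with made-up prefixes and next-hop names (none of these values reflect Facebook’s actual topology):

```python
import ipaddress

# Toy routing table: prefix -> next hop. All values are illustrative.
ROUTES = {
    ipaddress.ip_network("10.0.0.0/8"): "backbone-router-a",
    ipaddress.ip_network("10.1.0.0/16"): "datacenter-b",
    ipaddress.ip_network("0.0.0.0/0"): "internet-gateway",
}

def next_hop(destination: str) -> str:
    """Pick the most specific (longest-prefix) matching route for an IP."""
    addr = ipaddress.ip_address(destination)
    matches = [net for net in ROUTES if addr in net]
    best = max(matches, key=lambda net: net.prefixlen)
    return ROUTES[best]

print(next_hop("10.1.2.3"))  # most specific match wins: datacenter-b
print(next_hop("8.8.8.8"))   # falls through to the default route
```

When engineers take part of the backbone offline for maintenance, routes like these are updated so traffic flows around the piece being worked on.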
But yesterday, during a routine maintenance job, “a command was issued with the intention to assess the availability of global backbone capacity,” but it “unintentionally took down all the connections in our backbone network, effectively disconnecting Facebook data centers globally” from each other, and severing their connection to the internet. To make matters worse, the audit tool that usually prevents mistakes like this didn’t catch the problem, due to a bug.
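Facebook says an audit tool exists precisely to block commands like this before they run, but a bug let the command through. A hypothetical sketch of that kind of pre-flight check, with all names and rules invented for illustration:

```python
# Hypothetical pre-flight audit: a change that would take down every
# active backbone link should be rejected before it executes.
# All names, fields, and thresholds here are illustrative assumptions.

def audit(command: dict, active_links: int) -> bool:
    """Return True only if the command leaves some backbone capacity up."""
    links_affected = command.get("links_affected", 0)
    return links_affected < active_links

capacity_check = {"action": "assess_capacity", "links_affected": 4}
if audit(capacity_check, active_links=4):
    print("command allowed")
else:
    print("command blocked: would sever the entire backbone")
```

In Facebook’s account, a check along these lines should have stopped the command, but a bug in the audit tool prevented it from doing so.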
A related issue involves two other pieces of internet architecture: the Domain Name System (DNS) servers that translate names like facebook.com into server addresses, and the Border Gateway Protocol (BGP), which advertises the routes to those DNS servers to the rest of the internet.
“The end result was that our DNS servers became unreachable even though they were still operational. This made it impossible for the rest of the internet to find our servers,” Janardhan wrote. “The total loss of DNS broke many of the internal tools we’d normally use to investigate and resolve outages like this.”
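The distinction Janardhan draws, servers that are operational but unreachable, comes down to routing: without BGP advertisements, the rest of the internet has no path to the DNS servers, so no one can even ask them where facebook.com lives. A simplified model of that failure chain, with every address and name invented for illustration:

```python
# Illustrative model: BGP advertisements tell the rest of the internet
# how to reach a network. If the routes are withdrawn, the DNS servers
# keep running but no packets can reach them. All values are made up.

advertised_routes = {"dns-prefix": "203.0.113.0/24"}  # example prefix

def resolve(name: str, routes: dict) -> str:
    """A resolver can only query a DNS server it has a route to."""
    if "dns-prefix" not in routes:
        raise ConnectionError("DNS servers unreachable: no BGP route")
    return "198.51.100.7"  # placeholder answer for the queried name

print(resolve("facebook.com", advertised_routes))  # works while advertised

advertised_routes.pop("dns-prefix")  # routes withdrawn during the outage
try:
    resolve("facebook.com", advertised_routes)
except ConnectionError as err:
    print(err)  # servers still running, but nobody can reach them
```

This is also why Facebook’s own internal tools broke: they depended on the same DNS infrastructure that the rest of the internet could no longer reach.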