The seven-hour outage also took down WhatsApp and Instagram, thanks to a botched update in the Border Gateway Protocol.

Pádraig Belton, Contributor, Light Reading

October 4, 2021

The seven-hour glitch: How a DNS fail felled Facebook

It was the day the poking stopped.

And as it turns out, the temporary dive off the Internet by Facebook and the Instagram and WhatsApp services it owns happened when the company made a botched update affecting the domain name system (DNS), experts tell Light Reading.

DNS is the Internet's telephone directory, which translates alphabetical web addresses into numeric IP addresses.
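As a rough illustration of that directory lookup, the Python sketch below asks the operating system's resolver for the addresses behind a hostname. The helper name `resolve` is made up for this example; `socket.getaddrinfo` is the standard-library call doing the work. A failed lookup returns an empty list, which is roughly what client software saw during the outage.

```python
import socket
from typing import List


def resolve(hostname: str) -> List[str]:
    """Return the sorted, de-duplicated IP addresses the system resolver reports."""
    try:
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        # The name could not be resolved (no answer, no route to a nameserver, etc.).
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    # "localhost" is used so the example does not depend on any outside service.
    print(resolve("localhost"))
```

Swapping in any public hostname shows the same translation step, provided the machine can reach a working DNS resolver.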

Figure 1: Happier times: As social media addicts around the world struggled to cope without Facebook and Instagram, Twitter was the definite winner. (Source: Tim Bennett on Unsplash)

At fault was a routine update that introduced a mistake into the company's Border Gateway Protocol (BGP) routing, which tells the rest of the Internet how to reach the IP addresses of Facebook's DNS nameservers.
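To see why a routing mistake can look like a DNS failure, here is a toy Python sketch. It is not Facebook's actual setup: the `ToyInternet` class, the hostname `social.example` and the addresses (taken from ranges reserved for documentation) are all assumptions made for illustration. The point is only that a nameserver can be perfectly healthy and still useless once the route to it is withdrawn.

```python
import ipaddress


class ToyInternet:
    """Toy model: a set of advertised prefixes plus some DNS nameservers."""

    def __init__(self):
        self.routes = set()      # prefixes currently advertised (stand-in for BGP)
        self.nameservers = {}    # nameserver IP -> {hostname: IP address}

    def advertise(self, prefix):
        self.routes.add(ipaddress.ip_network(prefix))

    def withdraw(self, prefix):
        self.routes.discard(ipaddress.ip_network(prefix))

    def reachable(self, ip):
        """An address is reachable only if some advertised prefix covers it."""
        addr = ipaddress.ip_address(ip)
        return any(addr in prefix for prefix in self.routes)

    def lookup(self, hostname):
        """Ask each nameserver we can still route to; None means the lookup failed."""
        for ns_ip, records in self.nameservers.items():
            if self.reachable(ns_ip) and hostname in records:
                return records[hostname]
        return None


if __name__ == "__main__":
    net = ToyInternet()
    net.advertise("198.51.100.0/24")  # hypothetical prefix holding the nameserver
    net.nameservers["198.51.100.53"] = {"social.example": "192.0.2.10"}

    print(net.lookup("social.example"))  # prints 192.0.2.10

    net.withdraw("198.51.100.0/24")      # the botched update, in miniature
    print(net.lookup("social.example"))  # prints None
```

The nameserver's records never change in this sketch. Only the route to it disappears, and every lookup fails as a result.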

"It seems that Facebook have made changes to their network infrastructure affecting DNS," says Adam Leon Smith, the chair of BCS (formerly the British Computer Society), and chief technology officer for the technology consultancy firm Dragonfly.

"DNS is a single point of failure, and the result is that anything with a facebook.com or fb.com email address is now completely down," he says.

And so for a period of about seven hours, Facebook, Instagram and WhatsApp became inaccessible globally for large numbers of their users, starting at 11:40 a.m. on the East Coast of the US.

The result was the high comedy of Facebook stooping to the depths of using Twitter to communicate with its users.

Meanwhile, making all this slightly trickier to fix, the DNS failure meant "all internal systems are down, email is down, even door locks," Facebook employees posted on the forum site Reddit.

Snow day like today

So the people with remote access to reverse the botched update were locked out by the update. The people with physical access didn't have authorization on the servers.

Oops.

On top of this, thanks to the coronavirus, Facebook has moved to a largely virtual workforce, and the engineers who know how to fix this kind of problem were in all likelihood not near a data center.

So solving this kind of problem requires significant team effort using collaboration tools "we don't really use in tech, like landlines," Smith says.

The internal as well as external failure of the DNS servers servicing Facebook and the other apps it owns led to its staff twiddling their thumbs in between bouts of heading over to Twitter to communicate with the public.

"It does feel like a snow day," tweeted Instagram CEO Adam Mosseri.

"This is another example of how fragile the Internet is, and how easily a single mistake can stop the world," he adds.

It's all a similar catch-22 to Google Cloud's epic fail, in 2019, when Google engineers couldn't get online to fix the Google Cloud outage because the Google Cloud outage kept them offline.

Other apps like Gmail, TikTok and Snapchat also began to show significant slowdowns, and users flooded over to other social media platforms.

"Hello literally everyone," tweeted Twitter an hour into the outage.

If you have an Oculus, it too became an awkwardly shaped paperweight, so closely related are its services to a Facebook account.

The stop whistle

Zuckerberg's personal wealth fell by more than $6 billion as Facebook's stock plummeted by 4.9%, on top of a 15% drop since mid-September.

Facebook has also been reeling in that time from the revelations of a whistleblower, former product manager Frances Haugen, about how the company dealt with misinformation about the January 6 Capitol riots, and Instagram's detrimental effects on the mental health of teenage girls.

Haugen worked on Facebook's civic integrity team until she quit in May, claiming the company repeatedly prioritized growth over safety.

She took with her a cache of internal memos and documents which the Wall Street Journal has released in batches over the last three weeks.

With lawmakers, prosecutors and regulators already growing skeptical of Facebook around the globe, the botched BGP update may ultimately prove the least of Zuckerberg's worries.

Meanwhile, if you felt a slight sense of déjà vu today, it's only because it really has all happened before: On April 14, 2019, the Facebook-owned platforms went down too.

Pádraig Belton, contributing editor special to Light Reading
