In early October, Facebook and its apps – including Instagram, WhatsApp and Messenger – effectively disappeared from the internet for more than five hours. Best estimates are that this outage affected over 3.5 billion (yes, “billion” with a “b”) people around the world.
While Facebook and its apps returned to service within the day, it was one of the longest social media outages with the greatest impact on the internet we’ve ever seen.
Naturally, it raises the questions: What caused it? And were Facebook’s users ever at any risk while it was down?
Let’s answer the easy question first: There was no risk to users during this outage. All the information that Facebook has put out and the discussion around it by cybersecurity experts indicate that, given the specifics of the situation, there was no risk to users or their data during this time.
While that’s comforting for Facebook customers, it makes the question of “what happened” even more important.
To understand that, you have to understand a bit about how the internet and the sites on it work. There are explanations out there, but most of them assume you already know what things like the “Border Gateway Protocol (BGP)” are.
If you do understand that, this explanation isn’t for you. Here’s a good, detailed technical write-up that you might find helpful.
If you don’t know what BGP is (and don’t necessarily care to) but still want to understand this outage, read on.
When we say that Facebook and its apps seemed to disappear from the internet, that’s not just hyperbole: it actually gets to the heart of what happened. Put simply, there was an issue that made it impossible for all the systems and devices on the internet to actually locate and speak with the servers that make up Facebook and its apps.
When you communicate with Facebook and its apps, your computer or device exchanges network packets with Facebook’s servers. While we talk about “the cloud,” the internet is still, at heart, a physical thing made up of computers, devices, and servers, and of the network packets that travel between them.
You can think of a network packet almost like an airplane that carries information from your device to those servers. The information you get back travels the same way: in network packets carrying the response you’re waiting for.
When we fly long distance, the airplane we’re on talks to local air traffic control (ATC), which provides directions on how and where to fly to reach the final destination. On long-distance flights, your airplane talks with a series of ATCs in a sequence of “hand-offs.” Each ATC knows only enough to guide the plane to the next controller. This works because, collectively, the controllers know the chain of hand-offs that gets you to your final destination.
If you’re flying from London to Rome, for example, your flight may talk with air traffic control in London, then Paris, then Frankfurt, then Milan, then Rome. The London ATC won’t talk to Rome directly, but it knows that to get your plane to Rome, it has to talk to Paris; Paris talks to Frankfurt, Frankfurt to Milan, and Milan to Rome.
The same thing happens on the internet with your network packets. Your packets are passed from network to network. And while each network may not know directly how to get your packets to Facebook, they do know which other networks will know and can route those packets appropriately.
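For readers who want to see this hand-off idea in runnable form, here is a toy sketch of hop-by-hop forwarding. This is not real BGP, and the network names are purely illustrative; each “network” knows only the next hop toward a destination, not the full path.

```python
# Toy hop-by-hop forwarding sketch (NOT real BGP; names are illustrative).
# Each network's table maps a destination to the next network to hand off to.
NEXT_HOP = {
    "London":    {"Rome": "Paris"},
    "Paris":     {"Rome": "Frankfurt"},
    "Frankfurt": {"Rome": "Milan"},
    "Milan":     {"Rome": "Rome"},
}

def route(dest, start):
    """Follow next-hop entries until the packet reaches its destination."""
    path, current = [start], start
    while current != dest:
        hops = NEXT_HOP.get(current, {})
        if dest not in hops:
            return path, False      # no route known: the packet is dropped
        current = hops[dest]
        path.append(current)
    return path, True

print(route("Rome", "London"))
# (['London', 'Paris', 'Frankfurt', 'Milan', 'Rome'], True)
```

No single network in the sketch knows the whole path; the delivery emerges from each one knowing only its next hop, which is the same basic idea behind internet routing.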
For all of this to work, the air traffic control system has to know which controller is the right final one for your destination and how to contact it.
This is where the problem happened for Facebook. For five hours, an error in Facebook’s own systems essentially removed the information that all the other networks needed to successfully route traffic to it.
Going back to our example, it would be as if Rome’s air traffic control went offline and stopped talking to every other controller out there.
When something like this happens with airplanes, we see disruptions and cancellations, since planes can’t get to their final destinations.
Basically, the same thing happened here: The network packets going to Facebook couldn’t be routed correctly.
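Extending the earlier toy sketch, we can show what a withdrawal of routes looks like: once the destination’s entries disappear from the other networks’ tables, packets to it simply can’t be delivered. Again, this is an illustration, not Facebook’s actual infrastructure, and the names are made up.

```python
# Toy illustration of a route withdrawal (NOT real BGP; names are made up).
NEXT_HOP = {
    "YourISP":  {"Facebook": "Backbone"},
    "Backbone": {"Facebook": "Facebook"},
}

def route(dest, start):
    path, current = [start], start
    while current != dest:
        next_hop = NEXT_HOP.get(current, {}).get(dest)
        if next_hop is None:
            return path, False   # no route announced: the packet is dropped
        current = next_hop
        path.append(current)
    return path, True

before = route("Facebook", "YourISP")
print(before)   # (['YourISP', 'Backbone', 'Facebook'], True)

# The destination withdraws its routes: every network forgets how to reach it.
for table in NEXT_HOP.values():
    table.pop("Facebook", None)

after = route("Facebook", "YourISP")
print(after)    # (['YourISP'], False) — unreachable, though the servers still exist
```

Note that nothing happens to the destination’s own machines in this sketch; they become unreachable only because the rest of the network no longer knows how to get packets to them.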
This also explains why it took so long for Facebook to fully recover. Just like it takes time for the new information about air traffic control to filter out through the system, it takes time for the updated network routing information to filter out. And just like we see with air traffic control outages, in the immediate aftermath of an outage there’s a lot of disrupted and backed up traffic that needs to get through the system before things return to normal.
You can see how this outage had nothing to do with what’s going on with your computer or device, beyond their being unable to reach Facebook’s systems. You can also see how it didn’t involve the personal data that Facebook holds. That’s why there was really no danger to Facebook’s users or their data: for a few hours, no one could actually get to Facebook’s servers.
While it may be surprising to realize how fragile parts of the internet are, it’s a fact. Networking issues like this happen from time to time, and there’s no hacking or attacking involved; it’s just the result of an error. In this case, it was an error that cascaded in such a way that it became a major, serious outage.
The takeaway is that while situations like this are frustrating, they don’t represent a danger to you or your data. And the odds are good that something like this will happen again, though maybe not so spectacularly in terms of impact. Parts of the internet’s infrastructure are weaker than we’d expect, and, as a result, they will sometimes break down.
*** This is a Security Bloggers Network syndicated blog from blog.avast.com EN authored by Avast Blog. Read the original post at: https://blog.avast.com/understanding-the-facebook-outage-avast