I can’t call Grandma! Facebook was down
Unless you are one of the Sentinelese, the most isolated people on the planet (and we are pretty sure that at least one of them is on TikTok or YouTube), you will have noticed that Facebook, WhatsApp, and Instagram were down on 4 October 2021 from about 15:50 UTC until 21:20 UTC.
That would be a pretty major outage for a small business, so it is a massive deal for the Facebook juggernaut ('juggernaut' is derived from a Sanskrit word, who knew?), not to mention for the billions of people who use it for fun, frolics and, in some cases, very serious business and care communications, for instance, calling your Grandmother.
Back in the day, many of us worked on Enterprise systems: we built them, supported them, architected them (back when 'architected' became a verb), then managed them (including change control and ITIL), then lost the will to live.
It has ever been the case that any issue, large, small, or in between, is nearly always blamed on the network, and a lot of us types started out in networking. In fact, the root cause of many issues was, and still is, network-based, especially when changes are made.
The questions 'Why were the changes made at that specific time? Why were they made at all? How were they approved? And where is the rollback plan?' turned into what is now known as ITIL, which, given some of our experiences of, for instance, sawing off the branch you are sitting on, is a good thing.
For those not familiar with the 'sawing off the branch you are sitting on' thing: it is an expression for when you configure a remote device (usually a Cisco router), the change you make causes a network failure, and you cannot reconnect to the device to remedy the situation, so you need someone to access the device physically. Often tricky.
To mitigate this risk, engineers often instruct a device to reboot after a certain amount of time, so that it comes back online with the last working configuration, or they use systems that have commit-and-rollback functionality (like Juniper, for instance).
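That commit-and-rollback safety net can be sketched in a few lines. This is a toy model, not any vendor's actual CLI; the class and method names are entirely our own invention:

```python
import time
from threading import Timer

class ConfigSession:
    """Toy model of commit-and-rollback. Illustration only: real
    devices do this in the OS, not in a Python class."""

    def __init__(self, running_config):
        self.running = running_config
        self._backup = None
        self._timer = None

    def commit_confirmed(self, new_config, rollback_after=5.0):
        """Apply new_config, but revert automatically unless
        confirm() is called within rollback_after seconds."""
        self._backup = self.running
        self.running = new_config
        self._timer = Timer(rollback_after, self._rollback)
        self._timer.start()

    def confirm(self):
        """The operator still has access: keep the new config."""
        if self._timer:
            self._timer.cancel()
        self._backup = None

    def _rollback(self):
        self.running = self._backup

# A change that locks us out: nobody confirms, so it reverts.
session = ConfigSession("bgp neighbor 192.0.2.1")
session.commit_confirmed("bgp neighbor 203.0.113.9", rollback_after=0.1)
time.sleep(0.3)          # simulate losing our session to the device
print(session.running)   # back to the original, working config
```

Juniper's `commit confirmed` and Cisco's `reload in N` are the real-world equivalents: if the change severs your session, the device quietly puts things back the way they were.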
In the case of serious, professionally designed, deployed, and managed networks, a management network is implemented, connected to the management ports of devices, with guaranteed accessibility if the primary network fails. This is standard stuff. Very standard stuff. It is called Out-of-Band Management and has been around for ages. Servers have 'lights out' (iLO) ports, routers and switches have console ports, and best practice is to have these accessible in the event of a disaster. We have built and used many of these successfully in anger (proper anger) for years.
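The out-of-band idea boils down to 'always have a second road to the device'. A minimal sketch, with entirely hypothetical names and a fake connect function standing in for a real SSH session:

```python
def reach_device(device, connect):
    """Try the primary (in-band) path first, then fall back to the
    out-of-band management network. 'connect' is any callable that
    raises OSError on failure; in real life it would be an SSH or
    console session. All names here are hypothetical."""
    for path in ("in-band", "out-of-band"):
        try:
            return connect(device, path)
        except OSError:
            continue
    raise OSError(f"{device} unreachable on all paths")

# Simulate a total failure of the primary network:
def connect(device, path):
    if path == "in-band":
        raise OSError("primary network is down")
    return f"console on {device} via {path}"

print(reach_device("edge-router-1", connect))
# -> console on edge-router-1 via out-of-band
```

The point is that the second path shares no fate with the first: if your management network rides over the production network, it fails at exactly the moment you need it.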
Apologies if that was all a bit geeky but the long and short of it is this: Proper networks are built to be accessible for configuration in a total failure situation.
Which brings us to the following questions:
- What caused all of the Facebook applications to go off air?
- Why for so long?
- How were they recovered?
- What measures will be put in place to prevent a recurrence in terms of people, processes, and technology?
And of course:
- Is there foul play?
- Will we ever know?
Clearly, this issue has been, and will continue to be, covered in detail by the big boys. The most interesting pieces we have seen are from Cloudflare (or 1.1.1.1 as they are sometimes called) and by the esteemed (we are not worthy) Brian Krebs.
Summarising the currently known facts:
- Changes were made to Facebook’s Border Gateway Protocol (BGP) configuration at about 15:45 UTC. It is not for this blog to explain BGP which, as you probably know, is how one autonomous system (like Facebook/WhatsApp, or anybody with their own IP space) connects to the Internet, which is a network of networks and is possibly more complex than DNS, which we discussed last week.
- These changes caused all of Facebook’s systems to be unreachable because their name servers became unreachable.
- This in turn caused a storm of traffic generated by man+dog’s browsers and apps desperately seeking (Susan) FB.
- At around 21:00 Facebook’s systems generated BGP updates bringing services back online.
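To see why withdrawing BGP routes makes everything vanish at once, here is a toy routing table in Python. The prefixes and AS numbers are documentation examples, not Facebook's real ones:

```python
import ipaddress

# Toy routing table: the Internet only knows how to reach prefixes
# that are currently announced via BGP.
table = {}

def announce(prefix, origin_as):
    table[ipaddress.ip_network(prefix)] = origin_as

def withdraw(prefix):
    table.pop(ipaddress.ip_network(prefix), None)

def lookup(address):
    """Longest-prefix match, roughly as a router would do it."""
    addr = ipaddress.ip_address(address)
    matches = [p for p in table if addr in p]
    if not matches:
        return None  # no route: the address is simply unreachable
    best = max(matches, key=lambda p: p.prefixlen)
    return table[best]

announce("192.0.2.0/24", 64500)   # the 'name servers' live here
print(lookup("192.0.2.53"))       # -> 64500 (reachable)

withdraw("192.0.2.0/24")          # the fateful configuration change
print(lookup("192.0.2.53"))       # -> None (unreachable)
```

Once the prefix is withdrawn there is no route left to match, so nothing on the Internet can even begin to reach the name servers, let alone the services behind them.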
There are some very serious questions that need addressing around the root cause and the time taken to resolve this issue.
As Krebs points out, this outage occurred:
…just hours after CBS’s 60 Minutes aired a much-anticipated interview with Frances Haugen, the Facebook whistleblower who recently leaked a number of internal Facebook investigations showing the company knew its products were causing mass harm, and that it prioritized profits over taking bolder steps to curtail abuse on its platform — including disinformation and hate speech.
Facebook is claiming that it took so long to resolve because the outage prevented physical access to the affected devices: key cards no longer functioned. Having worked in the world’s largest Data Centers for many a year, we have to say this reeks of BS.
So let us have a little game. We can call it Precludeo (tm):
- A genuine configuration mistake that locked Facebook engineers out of their edge routers, physically and remotely?
  - If this is the case, FB should give us a call!
- Malicious action by an employee or ex-employee, perhaps associated with the CBS interview (above)?
  - If this is the case, FB will be giving its ‘Human Capital’ department and/or the FBI a call!
- An automated change facilitated by Machine Learning, Artificial (low) Intelligence, and complacency?
- A total lack of correctly managed process, with unforeseen circumstances due to, frankly, poor design?
- Malicious activity by parties third?
- Nation state? Show of force?
- Some form of ransom?
What is fairly certain is that we are unlikely to ever know the truth.
We can tell you that we have experience of pressing the wrong runes in a very large BGP configuration and taking down a significant network. Making a ‘simple’ change outside of change control, since you ask.
It was back up within 15 minutes, by which time nails had been bitten, the lamentation of the responsible engineer heard, burnt offerings made to the Gods of routing, and contracts cancelled. The whole bit.
Amongst the lessons we need to learn is that we cannot rely on unregulated services to deliver to a service level in the way that, say, the Public Switched Telephone Network (PSTN) of most countries does, interconnects included.
A small change can have massive repercussions.
Change control is so vital in supporting the reliability of unregulated services. It must not be overlooked.
At Tiberium, with a forward-thinking, automate, cloud-first philosophy, we have learnt from the school of hard knocks. We have made all the mistakes above so you do not have to.
This is significant experience that makes us different and enables us to support you effectively, even when the fan is on, chopping up the brown stuff, and matters have gone ‘the way of the pear’.
Want to know more? Let us tell you about our services and share some experiences with you.
The song Down and Out was written by Paul Williams and is a fantastic part of the film Bugsy Malone.
It starts with ‘Down, down, down, down, down, down, down and out’.
Enough said. Thanks for reading.