Facebook has said that a global outage that recently took its services and internal communications tools offline for several hours was due to a “faulty configuration change” to its routers.
Although all affected apps are now back online, many are still wondering what happened, whether the situation could have been avoided, and whether a similar outage could happen again anytime soon.
The company revealed that a misconfiguration within its BGP routing design was allowed to propagate across its routing fabric, first internally (iBGP) and then externally (eBGP). Ronan David, VP of Business Development and Marketing at EfficientIP, said that while global DNS servers were able to resolve requests for Facebook domains, the public IPs returned in the DNS responses could not be used to route the ensuing external client traffic into Facebook's systems. This was exacerbated by Facebook's internal DNS architecture, which was itself impacted by the BGP misconfiguration.
BGP did it
BGP (Border Gateway Protocol) is the protocol used to route traffic between the independent networks (autonomous systems) that make up the public internet; interior protocols such as RIP and OSPF handle routing within a single network. BGP is responsible for selecting the best available routes to carry data from a source to a specific destination. David commented:
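As a rough illustration of "selecting the best available route", the sketch below implements two of BGP's tie-breakers (highest local preference, then shortest AS path) in Python. It is a deliberately simplified model, not a real BGP implementation, and the example prefixes and most of the AS numbers are invented:

```python
# Simplified sketch of BGP best-path selection (illustrative only:
# real BGP applies many more tie-breakers, e.g. origin type and MED).
from dataclasses import dataclass

@dataclass
class Route:
    prefix: str            # destination network, e.g. "157.240.0.0/16"
    as_path: list[int]     # autonomous systems the route traverses
    local_pref: int = 100  # operator-set preference (higher wins)

def best_route(routes: list[Route]) -> Route:
    """Prefer the highest local-pref, then the shortest AS path."""
    return max(routes, key=lambda r: (r.local_pref, -len(r.as_path)))

candidates = [
    Route("157.240.0.0/16", as_path=[64500, 64501, 32934]),
    Route("157.240.0.0/16", as_path=[64502, 32934]),
]
print(best_route(candidates).as_path)  # shorter AS path wins: [64502, 32934]
```

When routers exchange and re-evaluate these preferences automatically, a single bad configuration change can flip the chosen routes across an entire fabric, which is the propagation effect described above.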
“Due to Facebook’s continued improvement in reducing their attack surface the issue was further compounded by an inability to access their internal management network (OOB – Out-of-Band), significantly delaying the time to resolve the issue due to not being able to access their own network and fix the configuration; a bit like forgetting your root or admin password and irreversibly losing access to your workstation, though at global internet scale.”
Facebook’s authoritative name servers are advertised to the rest of the internet via BGP. David explained that, to ensure reliable operation, Facebook’s DNS servers disable their BGP advertisements if they themselves cannot speak to the data centres. He said:
“In the recent outage, the entire backbone was removed from operation, making these locations declare themselves unhealthy and withdraw those BGP advertisements. The end result was that their DNS servers became unreachable even though they were still operational. This made it impossible for the rest of the internet to find their servers.”
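The "withdraw when unhealthy" behaviour described above can be modelled in a few lines. This is a toy sketch under the assumption that each DNS node runs a health check against the backbone and stops advertising its prefix when the check fails; the class and field names are invented, not Facebook's actual implementation:

```python
# Toy model of a DNS edge node that advertises its service prefix only
# while its backbone health check passes. Illustrative only; names and
# behaviour are simplified assumptions, not Facebook's real system.
class AnycastDnsNode:
    def __init__(self, prefix: str):
        self.prefix = prefix
        self.advertising = True  # BGP advertisement currently announced

    def health_check(self, backbone_up: bool) -> None:
        if not backbone_up:
            # Backbone unreachable: withdraw the route. The DNS server
            # keeps running, but nobody outside can reach it any more.
            self.advertising = False
        else:
            # Backbone healthy again: re-announce the prefix.
            self.advertising = True

node = AnycastDnsNode("192.0.2.0/24")
node.health_check(backbone_up=False)
print(node.advertising)  # False: operational but unreachable
```

When every node fails its health check at once, as happened when the entire backbone was removed from operation, every advertisement is withdrawn simultaneously and the still-running name servers simply vanish from the internet.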
The Domain Name System (DNS) is the internet’s equivalent of a phone’s contact list: it translates the human-readable hostname in a URL into the numeric IP address a browser needs to connect. Lavell Juan, CEO of vertically integrated social network company Brag House, said that the foundation of any good social network is usability and scalable infrastructure. He added:
“Ensuring the design and frontend development engages the user and makes all the functionality easily accessible is just as important as having an infrastructure that can grow with and support the user base. From there, it’s all about finding the right languages and frameworks to best create your end product. The most common misconfigurations involve software and data servers. The software can be outdated or missing a key security patch, while servers could require an upgrade or be incorrectly sized. The best way to avoid these issues is through proper documentation and automating processes to reduce manual work.”
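The hostname-to-address translation at the heart of all this can be reproduced with a few lines of Python's standard library. The sketch below uses the operating system's resolver; "localhost" is used so the lookup also works offline, but any public hostname such as facebook.com could be substituted:

```python
import socket

def resolve(hostname: str) -> list[str]:
    """Ask the system resolver for the IP addresses behind a hostname."""
    infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address itself.
    return sorted({info[4][0] for info in infos})

print(resolve("localhost"))  # typically includes "127.0.0.1"
```

During the outage it was this step's result that became useless: even where a resolver could produce an address, no route existed to carry traffic to it.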
Preventing further outages
Even with cloud-scale network fabrics that lean heavily on automation, both to enable scale and to remove human error, there is still a human component to the overall process. David explained that 'guardrails', controls that ensure critical infrastructure changes are validated before being deployed, are vital to the stability and continuity of services at internet scale. He continued:
“Guardrails apply not only to the cloud service providers’ management of infrastructure but also to the businesses that build upon these platforms. Owners of websites need to be careful about cloud vendor lock-in and design-in the ability to migrate their business assets and processes to other competing cloud platforms, which in turn puts pressure on these cloud service providers to provide the best possible service or lose their clients.”
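One way to picture such a guardrail is a pre-deployment check that refuses a routing change capable of withdrawing every advertised prefix at once. The sketch below is hypothetical, not any vendor's actual tooling; the thresholds and prefixes are invented for illustration:

```python
# Hypothetical pre-deployment guardrail for a routing change: reject
# any change that withdraws all advertised prefixes, and flag one that
# withdraws more than half. Thresholds and prefixes are invented.
def validate_change(current: set[str], proposed: set[str]) -> None:
    if current and not proposed:
        raise ValueError("change withdraws ALL prefixes; refusing to deploy")
    removed = current - proposed
    if current and len(removed) / len(current) > 0.5:
        raise ValueError(
            f"change withdraws {len(removed)} of {len(current)} prefixes; "
            "manual approval required"
        )

advertised = {"157.240.0.0/16", "129.134.0.0/17", "185.60.216.0/22"}
validate_change(advertised, advertised - {"185.60.216.0/22"})  # small change: OK
try:
    validate_change(advertised, set())  # would take everything offline
except ValueError as err:
    print(err)  # change withdraws ALL prefixes; refusing to deploy
```

A check like this cannot judge whether a change is *correct*, only whether its blast radius is plausible, which is exactly the role of a guardrail: catching the catastrophic case before automation faithfully deploys it everywhere.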
Juan attributed the majority of interruptions to human error and suggested that testing should be an integral part of the development process, catching the vast majority of these misconfigurations before they are pushed to production. While 'gatekeeper' may be too strong a term, Amazon, Facebook, Apple and Google have become custodians of access to some of the largest marketplaces today. As much as the fear of vendor lock-in is acknowledged, there is also a fear of marketplace lock-out. David noted:
“All of these businesses apply the tactics and economics of platform strategy – each provides technologies such as identity and authentication for their user base, enabling their users to access apps within and across internet ecosystems; without access, how will the businesses built upon those platforms reach their customers? Companies have little choice but to ensure they are integrated into these ecosystems and are already, in many cases, entirely dependent on them for their own success. Multi-provider strategies are key, though within the oligopoly of Facebook, Apple, Amazon and Google, technology alone will not be a panacea for mitigating these risks.”
This does not mean that there are no options available, although these options are more likely to be viable for larger businesses. The world has already seen the likes of Netflix and Dropbox migrate away from Amazon to run their own cloud infrastructures. The key takeaway, David says, is that much of the know-how and technology has been commoditized and is available to businesses; however, the highly skilled workforce needed to take advantage of it is still in short supply. He concluded:
“Investing in training and organic growth to ensure companies can compete at the same levels of technological maturity must be prioritized as a competitive strategy for tomorrow’s businesses.”
So, in short, can big outages like those of Facebook and WhatsApp happen again? Yes. But because such outages stem from underlying technological issues such as bugs or human error, minimizing these disruptions is achievable through rigorous testing throughout the development process.