Why CenturyLink's Network Suffered a Christmas Hangover
As 2018 was winding down, CenturyLink experienced what it calls a "network event," an outage that interrupted or, in some cases, impaired services all over the US. The carrier tells Light Reading that the source of the outage was an electronic (not virtual) network element in its transport network, where a third-party network management card began creating and spreading "invalid frame packets," congesting the CPUs in its network and locking them up.
"This event was not caused by a source external to the CenturyLink network, such as a security incident," CenturyLink said, in an email to Light Reading.
The outage impacted "voice, IP, and transport services for some of our customers," CenturyLink said in its email. [Ed. note: So, everything, pretty much.] "The event also impacted CenturyLink's visibility into our network management system, impairing our ability to troubleshoot and prolonging the duration of the outage."
That included an interruption in at least some wireless services and 9-1-1 emergency services in several states. Verizon, for instance, told the Associated Press it had service interruptions in Albuquerque, New Mexico and parts of Montana as a result of issues with CenturyLink.
From behind his giant, clownlike coffee mug on December 28, Federal Communications Commission Chairman Ajit Pai announced that the FCC would investigate the CenturyLink outage because it interrupted 9-1-1 services "across the country."
"I've directed the Public Safety and Homeland Security Bureau to immediately launch an investigation into the cause and impact of this outage," the FCC head said in a statement, several days after the government was shut down over an omnibus funding bill. "This inquiry will include an examination of the effect that CenturyLink's outage appears to have had on other providers' 911 services."
Who is to blame?
CenturyLink told Light Reading that "a faulty network management card from a third-party equipment vendor" caused the outage. Light Reading pressed for more details. We first thought the gear at fault might have been a virtualized network function running on a commercial, off-the-shelf platform. But CenturyLink explained otherwise, saying that the "source was an electronic network element within the transport layer of the CenturyLink network driven by a card supplied by a third-party equipment vendor."
CenturyLink engineers have identified a network element that was impacting customer services and are addressing the issue in order to fully restore services. We estimate services will be fully restored within 4 hours. We apologize for any inconvenience this caused our customers.— CenturyLink (@CenturyLink) December 28, 2018
What happened with the network management card? It went a bit bonkers. [Ed. note: And that's us editorializing, not CenturyLink.]
The problem originated in Denver, CenturyLink said in its email to Light Reading. That's where the network card in question began "propagating invalid frame packets that were encapsulated and then sent over the network via secondary communication channels. Once on the secondary communication channel, the invalid frame packets multiplied, forming loops and replicating high volumes of traffic across the network." In turn, this "congested controller card CPUs (central processing units) network-wide, causing functionality issues and rendering many nodes unreachable," CenturyLink explained.
With the network management card acting up, CenturyLink was then faced with a troubling issue -- it had to find the problem and then figure out how to clear out the network traffic that had been created by the malfunctioning network management card. From CenturyLink's description, this involved undoing stuff that had been replicated because, we assume, the network management card was part of the transport network, which is subject to 1-to-1 redundancy.
We discovered some additional technical problems as our service restoration efforts were underway. We continue to make good progress with our recovery efforts and we are working tirelessly until restoration is complete. We apologize for the disruption.— CenturyLink (@CenturyLink) December 28, 2018
Not an easy fix
"Locating the network management card that was sending invalid frame packets across the network took significant analysis and packet captures to be identified as the source as the card was not indicating a malfunction," CenturyLink told Light Reading. "Even after the network management card was removed, the CenturyLink network continued to rebroadcast the invalid packets through the redundant (secondary) communication routes. These invalid frame packets did not have a source, destination, or expiration and had to be cleared out of the network via the application of the polling filters and removal of the secondary communication paths between specific nodes to fully restore service."
As it went along, the repairs got more complicated. "In addition, as repair actions were underway, it became apparent that additional restoration steps were required for certain nodes, which included either line card resets or field operations dispatches for local equipment login," CenturyLink said, adding that its teams "worked around the clock until the issue was resolved."
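To see why frames with no expiration are so hard to flush, and why removing the secondary paths helps, here is a toy model of the behavior CenturyLink describes. It is a sketch under our own assumptions, not a reconstruction of the carrier's network: the four-node topology, the node names and the rule that every node re-sends a frame on all links except the one it arrived on are hypothetical, and the optional `ttl` argument stands in for the "expiration" the invalid frames lacked.

```python
from collections import Counter

# Hypothetical topology: a primary chain plus secondary links that close loops.
PRIMARY = [("A", "B"), ("B", "C"), ("C", "D")]
SECONDARY = [("A", "C"), ("B", "D"), ("A", "D")]
MESH = PRIMARY + SECONDARY


def flood(links, origin, steps, ttl=None):
    """Return the number of frame copies in flight after each step.

    Assumed forwarding rule: a node re-sends every frame it receives on all
    of its links except the one the frame arrived on, and it keeps no memory
    of frames it has already seen. ttl=None means the frame never expires.
    """
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    # Key: (current node, node it came from, remaining hops); value: copies.
    in_flight = Counter({(origin, None, ttl): 1})
    history = []
    for _ in range(steps):
        next_round = Counter()
        for (node, came_from, remaining), copies in in_flight.items():
            if remaining == 0:
                continue  # frame has expired and is dropped
            new_remaining = None if remaining is None else remaining - 1
            for peer in neighbors.get(node, ()):
                if peer != came_from:
                    next_round[(peer, node, new_remaining)] += copies
        in_flight = next_round
        history.append(sum(in_flight.values()))
    return history


print(flood(MESH, "A", 6))         # loops present, no expiration
print(flood(PRIMARY, "A", 6))      # secondary links removed
print(flood(MESH, "A", 6, ttl=3))  # loops present, frames expire after 3 hops
```

In this model a single frame doubles every step once the secondary links form loops, while the same frame dies out on its own when only the loop-free primary chain remains or when frames carry a hop limit. Real transport gear is far more complicated, but the sketch shows why pulling the faulty card alone would not stop traffic that was already circulating.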
Even as services were being restored, some problems lingered. As is the case with most telco networks, CenturyLink's has varying generations of equipment with diverse operational processes that all somehow work in harmony (most times) to provide what looks like, to the consumer, a single, homogeneous service. When stuff goes wrong, of course, you need just as many fixes as you have different ways of doing the same thing. "Lingering outages for a small subset of clients were experienced following that time," CenturyLink said. "The remaining impacts were investigated at the individual circuit level and resolved on a case-by-case basis to restore all services to a stable state."
We are aware of some 911 service disruptions affecting various areas through the United States. In case of an emergency, customers should use their wireless phones to call 911 or drive to the nearest fire station or emergency facility. Technicians are working to restore services.— CenturyLink (@CenturyLink) December 28, 2018
The fix has been ongoing, and CenturyLink had to come up with a plan to spot the issue more quickly should it happen again.
"Secondary communication channels that enabled invalid traffic replication have been disabled networkwide," the carrier told Light Reading. "CenturyLink has established a network monitoring plan for key parameters that can cause this type of outage, based on advice from the third-party equipment vendor. Improvements to the existing monitoring and audits of memory and CPU utilization for this type of issue have been put into place.
"Enhanced visibility processes will quickly identify and terminate invalid packets from propagating the network. This will be jointly and regularly evaluated by the third-party equipment vendor in conjunction with CenturyLink network engineering to ensure the health of the affected nodes," the carrier said, acknowledging that its vendor is actively involved in fixing the problem caused by its gear.
The network event experienced by CenturyLink Thursday has been resolved. Services for business and residential customers affected by the event have been restored.— CenturyLink (@CenturyLink) December 29, 2018
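CenturyLink did not share the parameters or thresholds in its new monitoring plan, so the following is only an illustration of the general idea behind auditing CPU utilization: flag a node when its controller CPU stays pegged across several consecutive polls, not on a single spike. The node names, the 90% threshold and the three-sample window are all our assumptions.

```python
from collections import defaultdict


def sustained_high_cpu(samples, threshold=90.0, min_consecutive=3):
    """Return nodes whose CPU stayed at or above threshold for
    min_consecutive polls in a row, in the order they were first flagged.

    samples is an iterable of (node_name, cpu_percent) pairs in poll order.
    The threshold and window are illustrative values, not CenturyLink's.
    """
    streaks = defaultdict(int)
    flagged = []
    for node, cpu_percent in samples:
        if cpu_percent >= threshold:
            streaks[node] += 1
            if streaks[node] >= min_consecutive and node not in flagged:
                flagged.append(node)
        else:
            streaks[node] = 0  # a healthy poll resets the streak
    return flagged


# Hypothetical poll results for two made-up nodes.
polls = [
    ("denver-1", 95.0), ("omaha-2", 40.0),
    ("denver-1", 97.0), ("omaha-2", 92.0),
    ("denver-1", 99.0), ("omaha-2", 35.0),
]
print(sustained_high_cpu(polls))
```

Requiring a streak keeps a brief, harmless spike (like the second node's single high reading) from raising an alarm, while a node that stays saturated gets flagged within a few polling intervals.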
"Affected services began to restore as of December 28, and the network traffic had normalized as of December 29," the carrier said.
— Phil Harvey, US News Editor, Light Reading