Monitoring & Assurance

Why CenturyLink's Network Suffered a Christmas Hangover

As 2018 was winding down, CenturyLink experienced what it calls a "network event," an outage that interrupted or, in some cases, impaired services across the US. The carrier tells Light Reading that the culprit was an electronic (not virtual) network element in its transport network: a third-party network management card began creating and spreading "invalid frame packets," congesting CPUs across the network and locking them up.

"This event was not caused by a source external to the CenturyLink network, such as a security incident," CenturyLink said, in an email to Light Reading.

The outage impacted "voice, IP, and transport services for some of our customers," CenturyLink said in its email. [Ed. note: So, everything, pretty much.] "The event also impacted CenturyLink's visibility into our network management system, impairing our ability to troubleshoot and prolonging the duration of the outage."

Included in that was an interruption in at least some wireless services and 9-1-1 emergency services in several states. Verizon, for instance, told the Associated Press it had service interruptions in Albuquerque, New Mexico and parts of Montana as a result of issues with CenturyLink.

From behind his giant, clownlike coffee mug on December 28, Federal Communications Commission Chairman Ajit Pai announced that the FCC would investigate the CenturyLink outage because it interrupted 9-1-1 services "across the country."

"I've directed the Public Safety and Homeland Security Bureau to immediately launch an investigation into the cause and impact of this outage," the FCC head said in a statement, several days after the government was shut down over an omnibus funding bill. "This inquiry will include an examination of the effect that CenturyLink's outage appears to have had on other providers' 911 services."

This is either a stock photo or it's from today's editorial meeting at Light Reading.

Who is to blame?
CenturyLink told Light Reading that "a faulty network management card from a third-party equipment vendor" caused the outage. Light Reading pressed for more details. We first thought the gear at fault might have been a virtualized network function running on a commercial, off-the-shelf platform. But CenturyLink explained otherwise, saying that the "source was an electronic network element within the transport layer of the CenturyLink network driven by a card supplied by a third-party equipment vendor."

What happened with the network management card? It went a bit bonkers. [Ed. note: And that's us editorializing, not CenturyLink.]

The problem originated in Denver, CenturyLink said in its email to Light Reading. That's where the network card in question began "propagating invalid frame packets that were encapsulated and then sent over the network via secondary communication channels. Once on the secondary communication channel, the invalid frame packets multiplied, forming loops and replicating high volumes of traffic across the network." In turn, this "congested controller card CPUs (central processing units) network-wide, causing functionality issues and rendering many nodes unreachable," CenturyLink explained.
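CenturyLink hasn't published the topology or protocol involved, but the dynamic it describes is a classic broadcast storm. A toy Python model (the mesh, forwarding rule, and frame format here are our assumptions, not CenturyLink's) shows why frames with no expiration multiply without bound once a loop exists, while a TTL would let the storm burn itself out:

```python
def simulate_storm(adjacency, origin, rounds, ttl=None):
    """Count frames in flight each round when every node re-forwards
    every frame it receives on all of its other links. ttl=None models
    the frames CenturyLink describes, which had no expiration; a finite
    ttl lets the storm die out on its own."""
    # Each frame in flight is (current_node, link_it_arrived_on, remaining_ttl).
    in_flight = [(origin, None, ttl)]
    counts = []
    for _ in range(rounds):
        nxt = []
        for node, came_from, t in in_flight:
            if t is not None and t <= 0:
                continue  # expired frame is dropped
            for neigh in adjacency[node]:
                if neigh != came_from:  # re-forward on every other link
                    nxt.append((neigh, node, None if t is None else t - 1))
        counts.append(len(nxt))
        in_flight = nxt
    return counts

# Four fully meshed nodes: primary paths plus "secondary" cross-links
# form loops, so each frame spawns two copies per round.
adj = {0: [1, 2, 3], 1: [0, 2, 3], 2: [0, 1, 3], 3: [0, 1, 2]}
no_ttl = simulate_storm(adj, 0, 5)          # traffic doubles every round
with_ttl = simulate_storm(adj, 0, 5, ttl=2)  # storm collapses after 2 hops
```

With no TTL the per-round frame count doubles indefinitely, which is consistent with CenturyLink's account of traffic "replicating high volumes" until CPUs choked.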

With the network management card acting up, CenturyLink faced a two-part problem: it had to find the faulty card and then figure out how to clear out the traffic that card had created. From CenturyLink's description, this meant undoing traffic that had been replicated because, we assume, the card sat in the transport network, which is subject to 1-to-1 redundancy.

Not an easy fix
"Locating the network management card that was sending invalid frame packets across the network took significant analysis and packet captures to be identified as the source as the card was not indicating a malfunction," CenturyLink told Light Reading. "Even after the network management card was removed, the CenturyLink network continued to rebroadcast the invalid packets through the redundant (secondary) communication routes. These invalid frame packets did not have a source, destination, or expiration and had to be cleared out of the network via the application of the polling filters and removal of the secondary communication paths between specific nodes to fully restore service."
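CenturyLink doesn't say what its "polling filters" actually matched on, but given the description of the frames (no source, no destination, no expiration), a filter keyed on that profile is easy to imagine. A minimal sketch, with the frame fields and filter logic entirely our assumption:

```python
def is_valid_frame(frame):
    """Keep only frames that carry a source, destination, and TTL.
    The invalid frames CenturyLink describes lacked all three."""
    return all(frame.get(k) is not None for k in ("src", "dst", "ttl"))

# Hypothetical frames: one normal, one matching the described invalid profile.
frames = [
    {"src": "den-01", "dst": "abq-03", "ttl": 64, "payload": "user traffic"},
    {"src": None, "dst": None, "ttl": None, "payload": "invalid frame"},
]
kept = [f for f in frames if is_valid_frame(f)]  # the invalid frame is dropped
```

The real cleanup also required tearing down the secondary communication paths between specific nodes, since filtering alone couldn't stop frames already looping on those channels.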

As it went along, the repairs got more complicated. "In addition, as repair actions were underway, it became apparent that additional restoration steps were required for certain nodes, which included either line card resets or field operations dispatches for local equipment login," CenturyLink said, adding that its teams "worked around the clock until the issue was resolved."

Even as services were being restored, complications remained. Telco networks run varying generations of equipment with diverse operational processes that all somehow work in harmony (most of the time) to provide what looks, to the consumer, like a single homogenous service. When stuff goes wrong, of course, you need just as many fixes as you have different ways of doing the same thing. "Lingering outages for a small subset of clients were experienced following that time," CenturyLink said. "The remaining impacts were investigated at the individual circuit level and resolved on a case-by-case basis to restore all services to a stable state."

The fix has been ongoing, and CenturyLink has had to come up with a plan to spot the issue more quickly, should it start happening again.

"Secondary communication channels that enabled invalid traffic replication have been disabled networkwide," the carrier told Light Reading. "CenturyLink has established a network monitoring plan for key parameters that can cause this type of outage, based on advice from the third-party equipment vendor. Improvements to the existing monitoring and audits of memory and CPU utilization for this type of issue have been put into place.

"Enhanced visibility processes will quickly identify and terminate invalid packets from propagating the network. This will be jointly and regularly evaluated by the third-party equipment vendor in conjunction with CenturyLink network engineering to ensure the health of the affected nodes," the carrier said, acknowledging that its vendor is actively involved in fixing the problem caused by its gear.
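The carrier doesn't detail which "key parameters" it now monitors, but sustained CPU congestion is the symptom it describes. As a rough sketch (the thresholds and sample format are invented, not CenturyLink's), a watchdog that flags a node when utilization stays high across several consecutive polls might look like:

```python
def cpu_alarm(samples, threshold=0.9, sustained=3):
    """Flag a node once CPU utilization exceeds `threshold` for
    `sustained` consecutive polling samples, so a single spike
    doesn't trigger a false alarm."""
    streak = 0
    for s in samples:
        streak = streak + 1 if s > threshold else 0
        if streak >= sustained:
            return True
    return False

cpu_alarm([0.50, 0.95, 0.97, 0.99])  # True: three straight samples over 90%
cpu_alarm([0.95, 0.50, 0.97, 0.99])  # False: the streak was broken
```

Requiring a sustained streak rather than a one-off spike is the usual tradeoff between catching a storm early and paging engineers for transient load.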

"Affected services began to restore as of December 28, and the network traffic had normalized as of December 29," the carrier said.

— Phil Harvey, US News Editor, Light Reading

Duh! 1/2/2019 | 4:16:16 PM
So many questions Well, that was clear as mud.

What exactly is a "third-party network management card" and what is an "invalid frame packet"? [I think I can guess] And how did a broadcast storm in a "secondary channel" (OTN General Communications Channel?) wipe out working paths?

That's the first few that came to mind.  Lots more where they came from.
Phil Harvey 1/2/2019 | 4:58:04 PM
Re: So many questions Hi, Duh!

Yes, CTL isn't going to directly call out the specific vendor product, sku or software provider. We'll find out at some point and report it.

I think the takeaway is that, as my colleague Ray put it, a company that manages networks for a living just had a massive network management problem that it couldn't find, didn't know how to fix, etc. 

That said, please keep the questions coming. We're hoping to have more to report in the next few days.

Keebler 1/2/2019 | 5:07:07 PM
Reminiscent of TARP storms of old The event reminds me of the TARP storms that plagued SONET networks when TARP was first introduced. TARP messages would replicate at the gateways to rings in both directions, circulate the ring, and get replicated again. Even with time-to-live settings, the amount of traffic quickly overwhelmed the systems and resulted in outages. It was hard to find and nontrivial to fix.

Sounds like those who forget history are doomed to repeat it. Or something along those lines.

Anyone taking bets yet on who the third party equipment vendor was this time?
Phil Harvey 1/2/2019 | 5:08:42 PM
Re: Reminiscent of TARP storms of old Good call back. Was that something on one of the old RBOC networks -- US West or SBC? 
Keebler 1/2/2019 | 5:12:46 PM
Re: Reminiscent of TARP storms of old It was definitely on an RBOC network. Around 1997 I believe. My memory isn't quite good enough to recall exactly which one, but maybe Ameritech? That could be completely off. I usually throw out the Ameritech name just to confuse the youngsters.
Phil Harvey 1/2/2019 | 5:17:01 PM
Re: Reminiscent of TARP storms of old That was even years before Verizon started hiring extra creepy white guys as their star pitchmen. 
brooks7 1/2/2019 | 5:37:45 PM
Re: Reminiscent of TARP storms of old The first thing I thought was...somebody still has x.25 in their oss network.


Phil Harvey 1/2/2019 | 5:56:05 PM
Re: Reminiscent of TARP storms of old And that (x.25) is a pre-IP networking way of getting switches to connect to OSS systems/carrier back offices?

If so, then there would be some kind of gateway sitting between the (presumably really old) switch and the IP network? 

brooks7 1/2/2019 | 7:14:36 PM
Re: Reminiscent of TARP storms of old That is correct, Phil.  I recently looked at a network that had such gear still in place with an X.25 switch maker that went out of business 20 years ago or so.


Edit:  And yes it was used to connect to the systems for OSMINE.
Duh! 1/2/2019 | 10:49:57 PM
Re: Reminiscent of TARP storms of old X.25 is the absolute last protocol suite I would ever associate with packet storms. Heavyweight flow control was one of its main architectural principles.