
Why CenturyLink's Network Suffered a Christmas Hangover

Phil Harvey
1/2/2019

As 2018 was winding down, CenturyLink experienced what it calls a "network event," an outage that interrupted or, in some cases, impaired services all over the US. The carrier tells Light Reading that the culprit was an electronic (not virtual) network element in its transport network, where a third-party network management card began creating and spreading "invalid frame packets," congesting and locking up CPUs across its network.

"This event was not caused by a source external to the CenturyLink network, such as a security incident," CenturyLink said, in an email to Light Reading.

The outage impacted "voice, IP, and transport services for some of our customers," CenturyLink said in its email. [Ed. note: So, everything, pretty much.] "The event also impacted CenturyLink's visibility into our network management system, impairing our ability to troubleshoot and prolonging the duration of the outage."

Included in that was an interruption to at least some wireless services and 9-1-1 emergency services in several states. Verizon, for instance, told the Associated Press it had service interruptions in Albuquerque, New Mexico, and parts of Montana as a result of issues with CenturyLink.

From behind his giant, clownlike coffee mug on December 28, Federal Communications Commission Chairman Ajit Pai announced that the FCC would investigate the CenturyLink outage because it interrupted 9-1-1 services "across the country."

"I've directed the Public Safety and Homeland Security Bureau to immediately launch an investigation into the cause and impact of this outage," the FCC head said in a statement, several days after the government was shut down over an omnibus funding bill. "This inquiry will include an examination of the effect that CenturyLink's outage appears to have had on other providers' 911 services."

This is either a stock photo or it's from today's editorial meeting at Light Reading.

Who is to blame?
CenturyLink told Light Reading that "a faulty network management card from a third-party equipment vendor" caused the outage. Light Reading pressed for more details. We first thought the gear at fault might have been a virtualized network function running on a commercial off-the-shelf platform. But CenturyLink explained otherwise, saying that the "source was an electronic network element within the transport layer of the CenturyLink network driven by a card supplied by a third-party equipment vendor."

What happened with the network management card? It went a bit bonkers. [Ed. note: And that's us editorializing, not CenturyLink.]

The problem originated in Denver, CenturyLink said in its email to Light Reading. That's where the network card in question began "propagating invalid frame packets that were encapsulated and then sent over the network via secondary communication channels. Once on the secondary communication channel, the invalid frame packets multiplied, forming loops and replicating high volumes of traffic across the network." In turn, this "congested controller card CPUs (central processing units) network-wide, causing functionality issues and rendering many nodes unreachable," CenturyLink explained.
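
To make that failure mode concrete: a frame that carries no expiration and gets re-broadcast over a looped channel multiplies on every hop. Below is a minimal sketch of that dynamic in Python; to be clear, the three-node ring topology is our toy illustration, not CenturyLink's network or code.

def flood_step(frames_per_node, topology):
    # Each node re-broadcasts every frame it holds to all of its neighbors.
    # Nothing is ever dropped: per CenturyLink's description, the frames
    # carry no expiration (TTL) and no source or destination to filter on.
    next_round = {node: 0 for node in topology}
    for node, count in frames_per_node.items():
        for neighbor in topology[node]:
            next_round[neighbor] += count
    return next_round

# A tiny looped "secondary channel": three nodes wired in a ring.
ring = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B"]}
frames = {"A": 1, "B": 0, "C": 0}  # one invalid frame injected at node A

for step in range(10):
    frames = flood_step(frames, ring)
    print(step, sum(frames.values()))  # in-flight copies roughly double each step

Run it and the total number of in-flight copies roughly doubles each round, which is the "replicating high volumes of traffic across the network" part of CenturyLink's account.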

With the network management card acting up, CenturyLink faced a two-part problem: it had to find the faulty card and then figure out how to clear out the traffic that card had already created. From CenturyLink's description, this meant undoing traffic that had been replicated because, we assume, the card sat in the transport network, which is subject to 1-to-1 redundancy.

Not an easy fix
"Locating the network management card that was sending invalid frame packets across the network took significant analysis and packet captures to be identified as the source as the card was not indicating a malfunction," CenturyLink told Light Reading. "Even after the network management card was removed, the CenturyLink network continued to rebroadcast the invalid packets through the redundant (secondary) communication routes. These invalid frame packets did not have a source, destination, or expiration and had to be cleared out of the network via the application of the polling filters and removal of the secondary communication paths between specific nodes to fully restore service."

As the work went along, the repairs got more complicated. "In addition, as repair actions were underway, it became apparent that additional restoration steps were required for certain nodes, which included either line card resets or field operations dispatches for local equipment login," CenturyLink said, adding that its teams "worked around the clock until the issue was resolved."

Even as services were being restored, there was a further complication: telco networks run varying generations of equipment with diverse operational processes that all somehow work in harmony (most of the time) to provide what looks, to the consumer, like a single, homogeneous service. When stuff goes wrong, of course, you need just as many fixes as you have different ways of doing the same thing. "Lingering outages for a small subset of clients were experienced following that time," CenturyLink said. "The remaining impacts were investigated at the individual circuit level and resolved on a case-by-case basis to restore all services to a stable state."

The fix has been ongoing, and CenturyLink has had to come up with a plan for spotting the issue more quickly, should it start happening again.

"Secondary communication channels that enabled invalid traffic replication have been disabled networkwide," the carrier told Light Reading. "CenturyLink has established a network monitoring plan for key parameters that can cause this type of outage, based on advice from the third-party equipment vendor. Improvements to the existing monitoring and audits of memory and CPU utilization for this type of issue have been put into place.

"Enhanced visibility processes will quickly identify and terminate invalid packets from propagating the network. This will be jointly and regularly evaluated by the third-party equipment vendor in conjunction with CenturyLink network engineering to ensure the health of the affected nodes," the carrier said, acknowledging that its vendor is actively involved in fixing the problem caused by its gear.

"Affected services began to restore as of December 28, and the network traffic had normalized as of December 29," the carrier said.

— Phil Harvey, US News Editor, Light Reading

brooks7
User Rank: Light Sabre
1/2/2019 | 7:14:36 PM
Re: Reminiscent of TARP storms of old
That is correct, Phil.  I recently looked at a network that still had such gear in place from an x.25 switch maker that went out of business 20 years ago or so.

seven

Edit:  And yes it was used to connect to the systems for OSMINE.
Phil Harvey
User Rank: Light Sabre
1/2/2019 | 5:56:05 PM
Re: Reminiscent of TARP storms of old
And that (x.25) is a pre-IP networking way of getting switches to connect to OSS systems/carrier back offices?

If so, then there would be some kind of gateway sitting between the (presumably really old) switch and the IP network? 

 
brooks7
User Rank: Light Sabre
1/2/2019 | 5:37:45 PM
Re: Reminiscent of TARP storms of old
The first thing I thought was...somebody still has x.25 in their oss network.

 

seven
Phil Harvey
User Rank: Light Sabre
1/2/2019 | 5:17:01 PM
Re: Reminiscent of TARP storms of old
That was even years before Verizon started hiring extra creepy white guys as their star pitchmen. 
Keebler
User Rank: Moderator
1/2/2019 | 5:12:46 PM
Re: Reminiscent of TARP storms of old
It was definitely on an RBOC network. Around 1997 I believe. My memory isn't quite good enough to recall exactly which one, but maybe Ameritech? That could be completely off. I usually throw out the Ameritech name just to confuse the youngsters.
Phil Harvey
User Rank: Light Sabre
1/2/2019 | 5:08:42 PM
Re: Reminiscent of TARP storms of old
Good call back. Was that something on one of the old RBOC networks -- US West or SBC? 
Keebler
User Rank: Moderator
1/2/2019 | 5:07:07 PM
Reminiscent of TARP storms of old
The event reminds me of the TARP storms that plagued SONET networks when TARP was first introduced. TARP messages would replicate at the gateways to rings in both directions, circulate the ring, and get replicated again. Even with time-to-live settings, the amount of traffic quickly overwhelmed the systems and resulted in outages. It was hard to find and nontrivial to fix.

Sounds like those who forget history are doomed to repeat it. Or something along those lines.

Anyone taking bets yet on who the third party equipment vendor was this time?
Phil Harvey
User Rank: Light Sabre
1/2/2019 | 4:58:04 PM
Re: So many questions
Hi, Duh!

Yes, CTL isn't going to directly call out the specific vendor product, SKU or software provider. We'll find out at some point and report it.

I think the takeaway is that, as my colleague Ray put it, a company that manages networks for a living just had a massive network management problem that it couldn't find, didn't know how to fix, etc. 

That said, please keep the questions coming. We're hoping to have more to report in the next few days.

-ph
Duh!
User Rank: Blogger
1/2/2019 | 4:16:16 PM
So many questions
Well, that was clear as mud.

What exactly is a "third-party network management card" and what is an "invalid frame packet"? [I think I can guess] And how did a broadcast storm in a "secondary channel" (OTN General Communications Channel?) wipe out working paths?

Those are the first few that came to mind.  Lots more where they came from.