CenturyLink CEO: Outage Could Have Been Bigger

CenturyLink's network outage in December would have been much worse if the carrier's patchwork quilt of acquisitions and regional networks were fully integrated. (See Why CenturyLink's Network Suffered a Christmas Hangover).

That bit of reassurance came from Jeff Storey, CenturyLink's president and CEO, who spoke at the 2019 Citi Global TMT West Conference earlier this week in Las Vegas.

While addressing the question of whether merger activity and network integration were partially at fault for the outage, Storey said, "no." He explained that "our approach to integration made it [the outage] smaller than it would have been otherwise."

Storey added: "A lot of companies come in and integrate everything together and make one platform out of all the -- we don't do that. We segment our network because our network is so large, so significant, that we want to have it segmented because things happen. You get fiber cuts, you get equipment failures. And so, our approach to integration actually facilitated it not being bigger than it was by making sure that we don't just haphazardly integrate everything together. We operate a segmented network."

CenturyLink's Broomfield campus. Image courtesy of CenturyLink.

CenturyLink told Light Reading earlier that the outage originated in Denver. That's close to both Level 3's former headquarters and the assets of the former TW Telecom, which Level 3 purchased in 2014. Qwest (formerly US West) is the other massive company in the Denver metro area that CenturyLink acquired in 2011.

Without going into technical detail, Storey did explain what happened to the audience of investors in Las Vegas. He reiterated that the outage was caused by a single piece of equipment from a US vendor. He also noted that, apparently, CenturyLink and its vendor couldn't diagnose or troubleshoot the issue remotely.

"The source of the outage was a particular equipment vendor and a malfunction with one of those cards -- I'm not going to get into all the details, but it created an inability for the system to continue to process capacity and it blocked our ability to control those nodes," Storey said.

"And so, we had to physically go out and shut things down, restart them on that transport layer ... It wasn't something associated with human error; it wasn't an architectural issue. It was an equipment failure that had a more dramatic impact than we would have wanted it to have," Storey said.

He didn't name the vendor but said it "is a US-based company that has been part of our network for a long time. Several of our companies had bought equipment from them, historically, and they had been a great partner for us over a pretty long period of time."

Storey didn't frame the scope of the outage by talking about the number of customers, cities or states affected and he didn't reference the number of 911 calls dropped during the outage. Instead, he compared it to the overall capacity of the carrier's network, a measure that made the outage sound more palatable. "It was a relatively small percentage of our capacity and infrastructure that was down," Storey said. "It was a single platform from a single vendor ... one of our single legacy companies, rather than anything affecting the rest of the transport systems."

Though the term "transport" can be used to generically refer to any internal network problem at a telco, Storey's remarks to investors seem to point to the carrier's optical transport equipment. The December 27 outage, which affected both IP and TDM services, was "at a transport layer and, if you think about the way networks are designed, transport is the fundamental element and other products right on top of that," Storey said.

— Phil Harvey, US News Editor, Light Reading

Clifton K Morris 1/10/2019 | 5:31:04 PM
Re: Is the issue based on unregulated “information service” delivering a regulated service? Well, call it what you like, but anytime there’s equipment that fails in Denver, Colorado, it should probably not interfere with emergency services in Boston, Massachusetts... that’s not the right kind of redundancy. Still, as companies shift to third parties for cost savings, perhaps the branded LECs and CLECs and wireless providers like Verizon should be held accountable for negotiating third-party contracts to deliver calls that ideally would have remained local in scope.

Still, in the wireless situation I mentioned in an earlier comment, they paid for a charter flight to get new linecards and routers. 3 hours downtime. But because CenturyLink took two days, they must have hired the late Uber driver I used in Miami last week. Twitter isn’t a redundant communications platform. When seconds count, CenturyLink took to Twitter and essentially told customers that 9-1-1 help is only a 2-hour round trip drive away. 😉👍

brooks7 1/10/2019 | 5:10:33 PM
Re: Is the issue based on unregulated “information service” delivering a regulated service? I had an off-forum conversation with Duh! about a network that I was in recently that had gear in it that was EOL before Cisco was formed. That was what I was musing about in the other thread. In this case, I was thinking about X.25 gear that was in the OSS network of this carrier.

And just remember lots of old voice equipment had real PROM based firmware that was not changeable...(go back and look at POTS card requirements in TR-57 for example).

Clifton K Morris 1/10/2019 | 4:07:42 PM
Re: Is the issue based on unregulated “information service” delivering a regulated service? Well, one would think so. However, this isn’t the first time I’ve read about a line card going bad. Another issue is that many manufacturers thought lasers and LEDs wouldn’t dim after thousands of hours of use, and they were marketed as products that never need replacement.

Still, something similar occurred with new Juniper equipment deployed in an Ericsson/wireless setting which also relied on Level 3 as a service provider.

After a lot of blame games and finger-pointing between light-wave service providers, the formal notes filed with the FCC placed responsibility squarely on “firmware issues”; however, the line cards were replaced too.

As linecards start to age, especially those that connect legacy networks, I expect to see these types of occurrences become more frequent, especially in companies that lack a robust CMDB (Configuration Management Database) and an operating-company policy that mandates a record of changes. Two days to find out why 9-1-1 isn’t routing correctly is a long time and somewhat indicative of a lack of asset information.

It’s not like Storey could say the blame is due to a hurricane or weather and, by virtue of being an “Internet Wholesaler,” it’s not like Storey has the power to issue a bill credit to a Verizon customer who was affected by the outage.

When I was at Company X, they had a network based on legacy Cisco equipment. Eventually, the hardware was EOL’ed. Instead of making it a priority to rip and replace the EOL’ed network hardware with something covered, the people at X started sourcing parts from eBay. On the day I was about to give a report, I learned of an FBI bust where DoJ and FBI had deployed parts from a vendor that was sourcing “new-old-stock” from eBay that were actually knockoffs. Suddenly every project was green-lit for rip-and-replace. Company X had “Gold Status” on Cisco RMAs and didn’t question any return/repair.

Some of this early equipment will eventually go out. It should be a priority to replace as part of a normal operating budget and before the “rip” is mandated. Citation: “Departments of Justice and Homeland Security Announce International Initiative Against Traffickers in Counterfeit Network Hardware”

brooks7 1/10/2019 | 2:40:57 PM
Re: Is the issue based on unregulated “information service” delivering a regulated service? Theoretically, there is greater redundancy in the ip network than in the voice network.


Clifton K Morris 1/10/2019 | 2:00:04 PM
Is the issue based on unregulated “information service” delivering a regulated service? This is one of those reasons why voice services for 9-1-1 are generally provided on dedicated, redundant circuits. When an “information services” provider such as Level(3) or CL translates those calls to VoIP or data service, a similar quality of service (including redundancy) should be required, similar to that required for the regulated telecom service counterpart.

Duh! 1/10/2019 | 1:36:07 PM
More clarity The story has changed subtly since the day after the event. Now the malfunction was in one of the vendor's cards, not in a third party's. That plus some additional bits gives credence to Fred Goldstein's theory that it was a hardware problem that caused a GMPLS issue that propagated through the control plane.

It also narrows the field as to which vendor it was (and I'm not naming names). 
