Who Broke WorldCom's Backbone?
WorldCom issued a statement late yesterday attributing the problem to a “route table issue.” (See WorldCom: Feeling Better Now.) The outage was reportedly caused by a software and routing table update late Wednesday night. When the new software code went live at 8:00 a.m. on Thursday, traffic stopped flowing, according to Matrix Netsystems, a company that measures Internet traffic. Hundreds of routers at 53 points of presence across the country were affected.
Because the outages seemed to occur at peering points -- places in the network where one carrier's traffic is handed over to another carrier -- some people have speculated that the problem was related to Border Gateway Protocol (BGP) routing. They suspect that WorldCom received a bad batch of new software from its routing equipment supplier, and that this software caused the problems in the routing tables. “At about 8 a.m., WorldCom’s packet loss went from zero to 22 percent. That’s huge,” says Tom Ohlsson, vice president of marketing and business development for Matrix Netsystems. “They were not passing any traffic.” Most network architects notice performance degradation even with 1 to 2 percent packet loss, he adds.
Nobody is saying for sure which routers caused the outage, but the list of candidates is short. WorldCom has only two suppliers for routing gear: Cisco Systems Inc. (Nasdaq: CSCO) and Juniper Networks Inc. (Nasdaq: JNPR). Both companies have counted WorldCom and its Internet backbone subsidiary UUNet as major customers for years. Until its recent financial problems, WorldCom had consistently been a 10 percent customer for Juniper. Cisco’s relationship with UUNet is also strong. The service provider has been using Cisco's GSR core routers since they were first introduced in 1997. Back then, Cisco announced a $50 million contract with WorldCom to supply it with GSRs and 7500 edge routing gear.
Considering that the problem was likely caused by a software glitch in an upgrade to the operating system, many people have pointed to Cisco gear as the culprit. Cisco’s IOS software, the operating system used to run all of its networking gear, is made up of thousands of lines of code. For this reason, bugs in the software are common. Cisco confirmed that WorldCom suffered another major outage in April when a bug in one of its versions of IOS surfaced (see WorldCom's IP Outages: Whodunnit?).
“Odds are it’s probably a Cisco problem,” says Dave Passmore of Burton Group. “Historically, it seems that people have had more problems with Cisco software upgrades. It could be that there are just more of them deployed, so we hear more about it. I don’t know enough about this situation to really say one way or the other if it was Cisco or Juniper.”
Cisco would not comment for this story, but a Juniper spokesperson said that its equipment was not involved in the outage.
Whoever is at fault, the problem was widespread and could potentially cost WorldCom a sizable chunk of change as customers call in for refunds on their service-level agreements. In a statement issued last night, the carrier said that roughly 20 percent of its IP customers were hit by the outage. But the actual number is likely much higher considering that WorldCom also hosts Web servers. Providers AT&T Corp. (NYSE: T) and Sprint Corp. (NYSE: FON) both say their customers experienced delays yesterday in accessing Websites hosted by UUNet. Although these carriers tried to downplay the effects on their own networks, Ohlsson of Matrix Netsystems says that the impact was significant.
“AT&T was hammered just as hard as WorldCom -- and AT&T didn’t do anything wrong,” he says.
He adds that carriers such as Avantal, a service provider in Mexico, suffered massive disruptions. The Mexican carrier partners with UUNet and uses its backbone to carry most of its Internet traffic across the U.S. Roughly 65 to 75 percent of all Internet traffic traverses UUNet’s backbone, says Ohlsson.
Businesses across the country were affected. Some had no Internet access, while others experienced delays for most of the day. Companies such as Verisign reportedly lost thousands of transactions yesterday, costing them substantial business. Light Reading experienced sporadic problems accessing Web servers at its hosting provider, which connects to the Internet through UUNet.
A big question is whether the outage may have been precipitated by recent troubles at WorldCom. Some suspect that WorldCom’s network management groups are understaffed and that their engineers are overworked. It also seems strange that the company would attempt a major upgrade in the middle of the week.
“That’s the kind of thing that is usually done at midnight on a Saturday or Sunday night,” says one telecom engineer.
— Marguerite Reardon, Senior Editor, Light Reading