WorldCom's IP Outages: Whodunnit?
The outages spurred a flurry of debate and speculation among email posters on the North American Network Operators' Group (NANOG) mailing list.
WorldCom officials blame the problem on a train derailment in Ohio, 50 miles south of Toledo, that resulted in fiber cuts. Meanwhile, independent engineers point to a problem with Cisco Systems Inc. (Nasdaq: CSCO) routers, a problem Cisco officials later confirmed. But the bottom line may be: If there's a fiber cut or router problem, isn't the network supposed to stay up anyway?
According to Linda Laughlin, a spokesperson for WorldCom, two separate fibers were cut. One was damaged when the train accident occurred at around 6:30 AM Central Time. The other fiber was cut later in the day, when cleanup crews were removing wreckage from the site. WorldCom immediately deployed crews to repair the fibers, says Laughlin.
But ISP network engineers on NANOG say that UUNet engineers are telling them a different story. They say the issue is linked to the Cisco routers deployed in UUNet’s network.
Cisco confirms there were problems with its routers in the UUNet network today. According to Martin McNealis, director of IP product management, there was a bug in an older version of Cisco's IOS routing software that only appears in certain instances when the IS-IS routing protocol is running. McNealis says Cisco discovered the problem well over a year ago and has fixed it in its more recent versions of IOS. But he says UUNet was running an older version of the software that did not have the patch.
The bug caused memory corruption in several Cisco routers, wiping out entire routing tables and causing delays while routers rebooted and repopulated their routing tables. The problem continued all morning, affecting ISPs across the country from Boston to Memphis to San Francisco (www.lightreading.com was among those affected).
Richard Steenbergen, an independent network engineering consultant, says he experienced a similar situation with another intra-domain routing protocol, OSPF, which crashed several Cisco GSR 12000 routers at another large tier-one carrier a couple of years back. He says the bug and the series of events that triggered it would not likely appear in testing.
Steenbergen blames Cisco’s apparent router instability on its IOS routing software.
"Because of its monolithic design and lack of protected memory space for individual components, IOS is notorious for bringing down the entire router if so much as a single error occurs," he says.
But Cisco's McNealis says that if the same problem occurred in any other router, such as one from Juniper Networks Inc. (Nasdaq: JNPR), it would have had the same effect.
"When you have a memory corruption problem and you lose the routing tables, it takes time for the routers to talk to each other," he says. "There may be variations in recovery time, but in a similar situation an outage would have also occurred in a Juniper router."
Officially, WorldCom is sticking to its story and has not issued any statement about a router problem. But McNealis says this is the first he has heard of a fiber cut.
— Marguerite Reardon, Senior Editor, Light Reading