IP Reliability

A survey of how vendors are trying to achieve 'five 9s' of availability * Why it's hot * What needs fixing * Who's doing what

March 6, 2003

What will it take for Internet Protocol (IP) to finally live up to its promise of being able to radically simplify telecom networks?

The answer comes down to one word: reliability. When (and if) IP networks start achieving the magic “five 9s” of availability already achieved in legacy ATM and TDM networks, carriers might start countenancing radical simplification of their network architectures – simplification that will lead to big improvements in profitability.

For a kickoff, if the average router was out of commission for less than five minutes every year (which is what 99.999% availability means), then carriers might be able to abandon their current practice of deploying routers in pairs – one acting as a hot standby for the other.

And that’s just for starters. If router networks hardly ever went wrong, carriers might also start giving serious thought to eliminating whole layers of underlying infrastructure, such as Asynchronous Transfer Mode (ATM) equipment.

On top of that, it might encourage them to move towards the idea of carrying all types of media – voice, data, and video – over a common IP backbone. That equates to being able to roll out new services (and generate new revenues) at the same time as cutting costs.

Right now, we're a long way from reaching these IP goals, but router vendors are putting a lot of effort into improving reliability. In fact, it was the No. 1 expenditure for most vendors' R&D departments last year.

Alcatel SA (NYSE: ALA; Paris: CGEP:PA), Avici Systems Inc. (Nasdaq: AVCI; Frankfurt: BVC7), and Cisco Systems Inc. (Nasdaq: CSCO) are among those making the most noise about IP reliability. These companies, along with many others, have announced improvements to their gear that will tune routing protocols that IP routers use to communicate with one another. By doing so, they claim, routers will no longer drop packets and will be better able to switch over traffic when a failure occurs, without an interruption in service. These advances are enticing to most carriers, but it’s still very early days.

So what are vendors doing to improve IP reliability?

Three things:

  • They’re working on network-layer standards (such as Multiprotocol Label Switching (MPLS) fast reroute);

  • They’re also making improvements to their router code;

  • And they are developing new hardware configurations for their products.

Simply put, the big difference among these approaches is that the last two – proprietary hardware and software – typically address device reliability, or how to make sure that the router itself does not fail. The first one is concerned with network reliability, or what to do when a network outage occurs.

In order to attain five-9s reliability, service providers need both. For example, it doesn’t make a great deal of difference how reliable a router itself is if a network connection is cut and there’s no way to route around that outage. Conversely, you can have the most reliably configured network in the world, but that won’t do any good if your router keeps crashing.

This report focuses on vendor hardware and software and gives a broad overview of network-level reliability. Here’s a summary of the sections:

  • The Goal: Five 9s Reliability

  • Causes of Router Failure

  • Router Redundancy: A Pricey Fix

  • The Players

  • Protection Approaches

  • Nonstop Routing

  • Software and Hardware Approaches

  • Network Reliability

Introduction by Peter Heywood, Founding Editor, Light Reading

Report by Marguerite Reardon, Senior Editor, Light Reading

IP vendors are targeting so-called five 9s of reliability. This chart shows what that means in terms of downtime on a service provider network.

Table 1: Percentage of Uptime = Network Availability

Percentage of Uptime    Downtime Per Year
90                      up to 36 days
99                      up to 3.7 days
99.9                    up to 9 hours
99.99                   up to 53 minutes
99.999                  up to 5 minutes



A network that is unavailable for a period of five minutes or less in a year has a rating of 99.999 percent – i.e., five 9s.

While a network that is up for 99.9 percent of the time may sound pretty reliable, in fact, this translates to nine hours of downtime in a year.
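The figures in Table 1 follow from simple arithmetic: the allowed downtime is the unavailable fraction of a year. Here is a minimal Python sketch of that calculation, assuming a 365-day year; the function name is illustrative and not taken from the report.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: a non-leap year


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Maximum downtime per year, in minutes, for a given uptime percentage."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * MINUTES_PER_YEAR


for pct in (90, 99, 99.9, 99.99, 99.999):
    minutes = downtime_minutes_per_year(pct)
    print(f"{pct}% uptime: {minutes:.1f} minutes ({minutes / 60:.2f} hours) per year")
```

Under that assumption, five 9s works out to roughly 5.3 minutes a year and three 9s to roughly 8.8 hours, which the table rounds to "up to 5 minutes" and "up to 9 hours."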

“IP was designed as a best-effort technology,” says Dave Garbin, vice president of network strategy at Cable & Wireless (NYSE: CWP). “But as IP takes the place of traditional ATM and Frame Relay services and ultimately carries voice and video traffic, too, it will have to match the reliability of traditional telco systems.” That means five 9s, because that's what legacy ATM and Frame Relay equipment typically achieves.

The problem is that traditional IP routing equipment doesn’t currently come close to achieving 99.999 percent availability. In fact, IP networks still suffer from significant downtime – on average, from two to six hours per year, or only three 9s of reliability.

This is partly because IP equipment is inherently more unstable and unpredictable than circuit technologies like ATM and Frame Relay, which were designed from the ground up with reliability in mind. IP gear also has to cope with keeping vast address tables updated using BGP and other routing protocols.

Of course, to say that IP networks are completely unreliable is not true either. In carrier networks, IP runs on top of a resilient optical transmission layer, which provides physical-layer resiliency. Carriers also have improved reliability in the core by building highly meshed networks with redundant routers on standby for failures. Yet even so, IP still has a long way to go before it matches up to service providers' expectations.

“High availability has never been at the core of the IP community,” says Mark Seery, recently an analyst with RHK Inc. who is now an independent consultant. “Their answer has traditionally been to let another router reroute traffic around a problem. But that approach causes an interruption in service, and that’s unacceptable for carriers who want to provide SLAs for 99.999 percent availability.”

The challenge faced by network equipment vendors today is how to make comparatively inexpensive and unstable IP gear as reliable as its circuit-based counterparts – without redesigning the technology from scratch, or making it unfeasibly expensive.

“The best way to do that is to borrow from the traditional telephony world, where they don’t deploy two Class 5 switches side-by-side for redundancy,” says Richard Norman, president and CTO of core routing startup Hyperchip Inc. “They make everything inside the box redundant.”

Cable & Wireless’s Garbin agrees that much of today’s work on IP reliability must focus on making the IP routing gear more reliable and resilient – something he says is not the case today, as many routers still lack key redundant hardware components like route processors and line cards.

So what is the root cause of the problem? It turns out that there are several, according to Network Strategy Partners LLC. In a study conducted for Alcatel, the firm found that 31 percent of outages occur during either hardware or software upgrades.

The single biggest culprit, however, is the router software itself, which causes 25 percent of all outages.

“Generally speaking, the hardware isn’t the issue,” says analyst Seery. “It’s the software. There has already been a lot of work done on the hardware side. More is still needed, but the real work that is left to do is in software.”

Within these software-related outages, it’s the control plane that is the biggest problem – causing 60 percent of glitches.

“The control plane of an IP router is different from a Frame Relay or ATM switch control plane,” says Carey Parker, vice president of marketing for Chiaro Networks. “Those control planes are basically static, whereas an IP control is constantly dealing with updates.”

As noted, carriers need more reliable IP equipment if they want to make money from next-generation IP services. But problems with IP reliability are also costing them money right now in two ways.

In order to maintain even a semblance of respectable uptime, Tier 1 carriers are deploying two core routers at each site for redundancy, as shown in the figure below. This adds a lot of additional capital and operational costs. In fact, not only does it double the capital expenditure required when upgrading networks to a new platform, but carriers take a hit on the operational side as well. More boxes take up more space. They require more electricity and more man-hours to manage and maintain.

Eliminating dual router configurations would allow carriers to reduce combined capex and opex by an average of 20 percent to 44 percent over a five-year period, according to Network Strategy Partners. Those savings can in turn be reinvested in the network – helping to generate new revenues.

“We are trying to achieve high availability with too much equipment,” says C&W's Garbin. “If we can design highly reliable networks with fewer boxes, and it’s cheaper than how we do it today, we’ll do that. Over time, I think this will be the case.”

Some experts also argue that not only is the redundant router configuration expensive, it doesn’t really fix the problem.

“It can take anywhere between five and 15 minutes for traffic to fail over from the main router to its hot standby,” says John Nakulski, product manager with test equipment vendor Agilent Technologies Inc. (NYSE: A). “In that time, route flapping can occur, making the network unstable while the network tries to reconverge. Often, a second outage occurs.”

And on top of equipment and maintenance costs, network downtime also costs big bucks in its own right.

Again according to Network Strategy Partners, in any year an average Tier 1 carrier experiences nearly 300 outages, costing $13.9 million in service-level agreement penalties, loss of productivity, and network churn. NSP says that using highly reliable IP equipment would reduce the total number of outages to about 20 per year at a cost of $962,000 – a reduction of more than 90 percent.
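The "more than 90 percent" figure can be checked directly from the numbers quoted above. The short sketch below does so in Python; it treats "nearly 300" and "about 20" as exactly 300 and 20, which is an approximation, and the helper name is illustrative.

```python
def percent_reduction(before: float, after: float) -> float:
    """Reduction from `before` to `after`, as a percentage of `before`."""
    return (before - after) / before * 100.0


# Figures quoted from Network Strategy Partners in the text above.
cost_reduction = percent_reduction(13_900_000, 962_000)
# Approximation: "nearly 300" and "about 20" outages taken as round numbers.
outage_reduction = percent_reduction(300, 20)

print(f"Cost reduction: {cost_reduction:.1f}%")
print(f"Outage reduction: {outage_reduction:.1f}%")
```

Both the cost and the outage count fall by roughly 93 percent on these inputs.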

If and when routers attain the same levels of reliability as ATM and TDM networks, the savings for carriers could be very significant, notes Garbin: “The bottom line is that we won’t have to keep building out and maintaining separate ATM and Frame Relay networks and a separate TDM network for voice. We can build one highly reliable IP/MPLS core to run services. We’ll then be able to put more resources into developing more IP services.”

Garbin adds that it will still be a long time before redundant routers are completely eliminated from the network. He says carriers will need to test and retest gear to make sure it's reliable enough to stand on its own. But, he adds, improved router reliability is part of the expected evolution, and he and other providers are looking for this now when they evaluate gear.

Every single IP routing vendor and Layer 3 switch vendor interviewed by Light Reading for this report claims to be working on IP reliability improvements. That’s not really surprising. But what is surprising is that reliability was also the No. 1 expenditure for most vendors’ R&D departments in 2002.

“I think 2002 was a watershed year for router IP reliability,” says analyst Seery. “Vendors made huge progress in terms of developing features. We’ll have to see in 2003 if it all works.”

Some of these vendors’ products are designed to operate in different parts of the network, such as the edge or the core, but many IP reliability features apply equally well wherever the product is installed.

For instance, one key method used by all the vendors to improve reliability is to separate the data plane from the control plane on the route processor. This ensures that the data plane, which forwards packets, can continue functioning even if the control plane, which is essentially the brains of the router, fails.

Similarly, many of the other features – such as software modularity, hitless software upgrades, and redundant hardware – are just as relevant at the edge as they are in the core.

Here is a table that surveys the different IP routing and Layer 3 switching players in the market.



Dynamic Table: Competitive Analysis of Vendors & Products

Table fields: Vendor, URL, Device Type, Company Status, Product Name(s), Protocol Extensions, Nonstop Routing, Software Modularity, Hitless Software Upgrades, Hot Stand-By Route Processors, Logical Router, MPLS Fast Reroute, Customer Deployments

Understanding IP reliability isn’t easy, especially when vendors use similar terms to refer to different things. For example, Cisco talks about Stateful Switchover, which is different from Stateful Failover, a term favored by Avici. Then there’s Cisco’s Nonstop Forwarding. This is different from Nonstop Routing, a term that both Avici and Alcatel use to describe their solutions.

The rest of this report will go through the various reliability implementations, explaining exactly what each one means and listing who is doing what.

Protocol extension and nonstop routing (covered on the next page) are designed to do the same thing: ensure that a router can continue to forward traffic, even in the event that there is a problem with its control plane or route processor.

This diagram shows how the two approaches, which are being most hotly debated within the industry today, work.

So how do they work? With the first approach – protocol extension, shown at the top of the diagram – a router whose route processor or control plane has failed first terminates all sessions between itself and other network devices.

It then uses so-called “extensions” or additions to routing protocols (BGP, OSPF, IS-IS, LDP) to automatically signal the other routers in a network that they should continue forwarding and receiving packets to and from it, which prevents a complete or prolonged outage.

The router with the failed processor continues to forward packets using the old routing information stored in its routing tables, even though its route processor isn’t working. When it returns to full service, it notifies the surrounding peer routers that it is functioning properly again, and they then send routing table updates to it so that it can build a new, updated routing table.

Nonstop routing, the second approach, works differently. In this case, the router uses one or more backup or hot standby route processors. When a failure occurs, the router seamlessly fails over from the primary to the secondary processor, which maintains a copy of the device’s routing information or “routing state.”

This approach works without the need to terminate routing sessions. As a result, surrounding routers never know a failure has occurred. And there’s no interruption to service.
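The contrast between the two approaches can be made concrete with a toy model. The Python sketch below is purely illustrative: `ToyRouter`, its fields, and the prefix and peer names are hypothetical, and no vendor's implementation works this simply. It models only what the text describes: the data plane keeps forwarding from its existing table in both cases, while the approaches differ in whether peers see sessions drop and where the routing state comes from afterwards.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRouter:
    """Hypothetical model: one data-plane table plus control-plane state."""
    has_standby: bool                                      # True = nonstop routing
    forwarding_table: dict = field(default_factory=dict)   # data plane
    routing_state: dict = field(default_factory=dict)      # primary route processor
    standby_state: dict = field(default_factory=dict)      # copy held by the standby
    sessions_up: bool = True                               # what peer routers see

    def learn_route(self, prefix: str, next_hop: str) -> None:
        self.routing_state[prefix] = next_hop
        self.forwarding_table[prefix] = next_hop
        if self.has_standby:
            self.standby_state[prefix] = next_hop           # mirror to the standby

    def forward(self, prefix: str):
        # The data plane answers from its own table, whatever the control plane is doing.
        return self.forwarding_table.get(prefix)

    def fail_route_processor(self) -> None:
        self.routing_state = {}                             # primary state is lost
        if self.has_standby:
            # Nonstop routing: standby takes over with its copy; sessions stay up.
            self.routing_state = dict(self.standby_state)
        else:
            # Protocol extension: sessions drop, peers are asked to keep forwarding.
            self.sessions_up = False

    def recover(self, updates_from_peers: dict) -> None:
        # Protocol extension: sessions return and peers resend routing updates.
        self.sessions_up = True
        for prefix, next_hop in updates_from_peers.items():
            self.learn_route(prefix, next_hop)


restarting = ToyRouter(has_standby=False)
nonstop = ToyRouter(has_standby=True)
for router in (restarting, nonstop):
    router.learn_route("10.0.0.0/8", "peer-A")
    router.fail_route_processor()

print(restarting.forward("10.0.0.0/8"), restarting.sessions_up)
print(nonstop.forward("10.0.0.0/8"), nonstop.sessions_up)
restarting.recover({"10.0.0.0/8": "peer-B"})
print(restarting.forward("10.0.0.0/8"), restarting.sessions_up)
```

In this model both routers keep forwarding toward peer-A after the failure, but only the router without a standby shows its sessions as down, and it goes on using the old next hop until its peers resend their updates.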

The benefit of using protocol extension is that it can be implemented by carriers on existing routers without installing backup route processors, thereby saving money.

Cisco and Juniper Networks Inc. (Nasdaq: JNPR), two key vendors using this approach, argue that its greatest benefit is simplicity. The extensions being proposed are straightforward, they say, and much easier to implement than keeping track of all the routing information in a network, one of the requirements of nonstop routing.

But there are also a couple of potential drawbacks with protocol extension. The biggest issue, according to critics, is that the technique could potentially cause routing loops or "black holes" in the network if routing information changes before the recovered router is able to complete its updates and convergence.

“It’s important to check how fast the reconvergence happens,” says Nakulski from Agilent.

Another drawback of this implementation is that router vendors must implement extensions for every routing protocol they support – BGP, OSPF, IS-IS, LDP, and so on. And ultimately, all of these extensions will have to be standardized. Today, the Internet Engineering Task Force (IETF) is working on several drafts:



Because nonstop routing doesn’t require protocol extensions or any external communications with other routers, all implementations can be proprietary. There is no need for standards development or risk of interoperability problems.

Still, it’s worth noting that while all three of the service providers interviewed for our report said they liked the concept of nonstop routing very much, two of them expressed the concern that it was as yet unproven and might be overly complicated to deploy.

“If vendors can really do nonstop routing and failover without any interruption of service, that would be great,” says Prodip Sen, director of data and service architecture for Verizon Communications Inc. (NYSE: VZ). “It may look great on paper, but I’m not sure it can be achieved.”

Furthermore, critics claim that nonstop routing has its own technical problems – specifically, that the process of mirroring or copying routing table updates on a backup processor could cause a software bug affecting the primary processor to be copied to the backup, thereby replicating the problem and causing the router to fail. Others question the scalability of this solution.

All of the router vendors supporting nonstop routing say that they have developed enhancements to their products that overcome the problems of replication and scalability.

For example, Alcatel uses sophisticated software and high-speed hardware to scan all routing information prior to mirroring it onto the standby unit. During a control-plane failure, bad packets, or peers detected as causing the failure of the control plane, are automatically dropped and prevented from being restarted on the standby control plane. All of this occurs quickly enough so that the backup control plane is placed online without a single TCP session being interrupted.

The approach employed by Avici requires the two processors to be more loosely coupled. In this solution, routing table updates are stored in a repository somewhere else in the router. The backup processor stores enough information about the router’s state to keep the routing sessions alive. When a failure occurs, the primary immediately fails over to the backup processor and packets continue to be forwarded using the old routing table information. The backup processor then accesses the repository to relearn the updated routing information.

This solution helps prevent problems faced in the mirroring approach. But, as with protocol extensions, it can cause routing loops and black holes by using out-of-date routing information until the new routes are learned. This interval is much shorter, however, with nonstop routing solutions because the processor does not have to reboot or relearn routes from peers.

The networking industry is split on which of these two approaches to deploy. Avici, Alcatel, Caspian Networks, Chiaro, Charlotte’s Web Networks Ltd., and Hyperchip all use variations of nonstop routing. It’s worth noting, though, that, while these companies recommend that their customers implement nonstop routing, they also plan to support protocol extensions.

There are other software and hardware features that are important for enhancing IP reliability.

Software Modularity

Several companies separate individual software processes on their products so that a failure in one will not affect others on the same platform. With this approach, MPLS, IS-IS, OSPF, and BGP could all run as four separate processes on the control plane. In the event of a failure, a glitch in a software upgrade, or even a table corruption, the faulty process can be taken offline, upgraded, or completely reloaded and restarted without having an impact on the other code modules running on the system.
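The isolation idea can be sketched with a toy supervisor. The Python below is a simplified illustration only: `ProtocolProcess` and `ModularControlPlane` are hypothetical names, and plain objects stand in for what would be separate, memory-protected operating system processes on a real router.

```python
class ProtocolProcess:
    """Stand-in for one routing protocol running as its own process."""

    def __init__(self, name):
        self.name = name
        self.running = True
        self.restarts = 0


class ModularControlPlane:
    """Hypothetical supervisor that restarts only the processes that failed."""

    def __init__(self, protocol_names):
        self.processes = {name: ProtocolProcess(name) for name in protocol_names}

    def crash(self, name):
        # A fault in one process leaves the others untouched.
        self.processes[name].running = False

    def restart_failed(self):
        restarted = []
        for process in self.processes.values():
            if not process.running:
                process.running = True
                process.restarts += 1
                restarted.append(process.name)
        return restarted

    def running(self):
        return sorted(n for n, p in self.processes.items() if p.running)


control_plane = ModularControlPlane(["MPLS", "IS-IS", "OSPF", "BGP"])
control_plane.crash("BGP")
still_running = control_plane.running()
restarted = control_plane.restart_failed()
print(still_running, restarted)
```

Here a BGP fault leaves MPLS, IS-IS, and OSPF running, and only BGP is restarted. In a monolithic design the equivalent fault would take down all four.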

Typically, older systems such as Cisco’s IOS platform tend to be more monolithic, while newer routing players like Alcatel and Redback claim to have more modular software code.

Alcatel, Allegro Networks Inc., Avici, Caspian, Charlotte’s Web, Force10 Networks Inc., Hyperchip, Laurel Networks Inc., and Redback claim that their routing code runs each routing protocol as a separate process.

Juniper and Chiaro say their software has some modularity, but they do not separate the individual routing protocols into single processes.

“It goes back to having a carrier-class approach,” says Vinay Rathore, Alcatel's director of strategic marketing. “Both software and hardware need to be designed so that things can be upgraded and taken out of service without impacting the network.”

Hitless Software Upgrades

Being able to upgrade a router or switch without losing packets, or even worse, service, is crucial for carrier-class products. Scheduled software upgrades are a leading cause of router outages. A “hitless” software upgrade means that the software can be upgraded without taking the product out of service, and without packet loss.

It’s no coincidence that vendors that support nonstop routing also support this technique – that’s because they load software upgrades onto the backup route processors first. Assuming a successful download, the router then switches over to using the software image running on the backup processor without missing a beat.

Alcatel, Avici, Caspian, Charlotte’s Web Networks, Chiaro, and Hyperchip claim to support hitless upgrades.

Redundant Hardware

This is basically a no-brainer. All the vendors interviewed for this report offer redundant hardware elements – including power supplies, control plane, and switch fabrics. And all of them claim that these elements can be installed in a hot-swappable configuration, meaning that any element can be pulled out of the box and packets will continue to be forwarded.

Logical Routing

Another feature used by some routing vendors to enhance reliability is logical routing. Allegro, Avici, Charlotte’s Web, and Hyperchip all say they support this feature.

For example, Hyperchip says that it can establish separate logical routers in a single chassis that share a single data plane for forwarding packets, but they maintain their own control planes and routing table information. This feature is useful for carriers that don’t want to abandon their dual-router architectures, because it still allows them to run redundant routers.

“This feature might allow carriers to vertically collapse the number of routers in the network by connecting the multiple route control processors over a backplane,” says Hadriel Kaplan, product line manager for Avici.

But some vendors, including Alcatel, argue that logical routing has little to do with reliability and is more useful for scalability.

Most of this report has focused on features that can be added to routers to improve IP reliability, but there are several standards and IETF drafts that describe ways for improving reliability at the network level, the most important of which is MPLS Fast Reroute.

This technique promises Sonet-like node protection speeds for IP/MPLS-based networks. While IP does eventually reroute around failures, it can take several seconds or even minutes to do this. MPLS Fast Reroute can restore connectivity within 50 to 100 milliseconds.
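The speed comes from computing the detour before anything fails, so that the router next to the failure can switch traffic locally instead of waiting for the whole network to reconverge. The Python sketch below illustrates only that idea, using link protection on a made-up five-node topology. The function names and the hop-count search are assumptions for illustration; real MPLS Fast Reroute relies on pre-signaled backup label-switched paths, which this does not model.

```python
from collections import deque


def shortest_path(graph, src, dst, excluded_link=None):
    """Breadth-first shortest path by hop count, optionally avoiding one link."""
    banned = {excluded_link, excluded_link[::-1]} if excluded_link else set()
    queue = deque([[src]])
    visited = {src}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == dst:
            return path
        for neighbor in graph[node]:
            if neighbor not in visited and (node, neighbor) not in banned:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None  # no path avoids the excluded link


def precompute_backups(graph, primary_path):
    """For each link on the primary path, find a detour around it in advance."""
    backups = {}
    for upstream, downstream in zip(primary_path, primary_path[1:]):
        link = (upstream, downstream)
        backups[link] = shortest_path(graph, upstream, downstream, link)
    return backups


def reroute(primary_path, backups, failed_link):
    """Splice the precomputed detour into the primary path at the failed link."""
    detour = backups.get(failed_link)
    if detour is None:
        return None  # this link has no protection in the topology
    i = primary_path.index(failed_link[0])
    return primary_path[:i] + detour + primary_path[i + 2:]


# Hypothetical topology: A-B-C-D in a line, with E offering a detour between B and C.
graph = {
    "A": ["B"],
    "B": ["A", "C", "E"],
    "C": ["B", "D", "E"],
    "D": ["C"],
    "E": ["B", "C"],
}
primary = shortest_path(graph, "A", "D")
backups = precompute_backups(graph, primary)  # done ahead of time, not at failure
protected = reroute(primary, backups, ("B", "C"))
print(primary, protected)
```

When the B-C link fails, B already knows to send traffic via E, so the path becomes A, B, E, C, D without any new route computation. The A-B and C-D links have no alternative in this topology and so cannot be protected.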

MPLS Fast Reroute is usually thought of as a core technology, which means typically only vendors with core routing products support it.

Alcatel, Avici, Caspian, Charlotte’s Web, Chiaro, Cisco, Hyperchip, and Juniper all say they support MPLS Fast Reroute.

Further reading on MPLS Fast Reroute:

  • MPLS Demo: Some Answers, Some MIAs

  • MPLS Vendors Demo Fast Reroute

  • MPLS Fast Reroute Gains Momentum

  • MPLS Fast Reroute Gets a Boost

  • MPLS Traffic Engineering
