Testing Cisco's Media-Centric Data Center
Now that we had tested the individual components and features of the data center, it was time to test Cisco’s data center in the context of a full IP video "medianet." This meant using the traffic profiles of both business and residential applications to test the IP Video Service Delivery Network. We sent emulated Digital Signage, TelePresence, and Video Surveillance traffic into the network through the Nexus 5000. IP video, Internet, and VoIP traffic entered the network via the Cisco Nexus 7000, as depicted in the image below. To round out the service offering, VoD and pay-per-view traffic was also transported in the network, but these services entered via the ASR 9010 and did not traverse the data center.
Cisco designed its IP video-centric data center to address a key concern for service providers: avoiding a single point of failure. In the Service Delivery Network topology, customers in a wide range of distant locations access a single data center. It is therefore very important to install and configure the proper redundancy mechanisms in the data center in order to avoid upsetting not merely one of your customers, but all of them.
One such redundancy mechanism often seen in data centers is the IEEE’s Link Aggregation Group (LAG, defined in IEEE 802.3ad). This mechanism allows the network administrator to bind several physical links together into a group. When links within the group fail, the remaining active links carry the traffic, as long as there is enough capacity left in the group. This protects against the failure of individual links between two switches. But what happens when an entire switch fails?
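As a brief illustration before returning to that question: a LAG is typically built by assigning several physical interfaces to one logical port channel and letting LACP, the 802.3ad control protocol, negotiate membership. The sketch below shows the general shape of such a configuration in Cisco CLI terms; the interface and channel-group numbers are hypothetical and do not reflect the tested setup.

```
! Minimal LAG/LACP sketch (hypothetical interface and group numbers).
! On NX-OS, "feature lacp" must be enabled before LACP can be used.
interface Ethernet1/1
  channel-group 10 mode active    ! "active" = negotiate membership via LACP
interface Ethernet1/2
  channel-group 10 mode active
!
interface port-channel 10
  switchport mode trunk           ! the bundle behaves as one logical link
```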
Using the “traditional” LAG mechanism, we would experience a complete loss of services. To solve this issue and thereby increase the level of resiliency in the data center, Cisco used a new feature, available in the latest Nexus software release NX-OS 4.1(5), that is similar to LAG but is configured across three devices.
Cisco calls this feature Virtual Port Channel (VPC). A VPC is a virtual port group whose member links can be distributed across multiple devices, allowing the full bandwidth capacity of all links to be used. In addition, the physical (hardware) and logical (OS) resources of the Nexus 7000 switches can be virtualized, so that any set of ports can become members of a Virtual Device Context (VDC, a virtual switch). These two complementary configurations were used to virtualize all business and residential traffic in the test. The figure below depicts the virtual port channels configured in the test:
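To make the mechanism more concrete, the sketch below shows the general shape of a VPC pair on two Nexus 7000s, based on Cisco's publicly documented NX-OS configuration model. The domain number, addresses, and interface numbers are hypothetical; Cisco's actual configuration for this test was not disclosed.

```
! Hedged VPC sketch for a Nexus 7000 pair (mirrored on both peers, apart
! from the keepalive addresses). All numbers and addresses are hypothetical.
feature vpc

vpc domain 1
  peer-keepalive destination 10.0.0.2 source 10.0.0.1   ! heartbeat between the peers

interface port-channel 1
  vpc peer-link            ! inter-chassis link that synchronizes the two peers

interface port-channel 20
  switchport mode trunk
  vpc 20                   ! the same vpc number on both peers joins the two
                           ! local port channels into one virtual port channel

! A Virtual Device Context partitions the chassis into virtual switches,
! for example (hypothetical name and port range):
vdc Business
  allocate interface Ethernet2/1-4
```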
We tested that the system could effectively provide resilient connectivity to the data center by sending the same traffic we used in our service delivery network tests. Our goal was to verify that if a link between two Nexus devices failed, the traffic being carried over the failed link would still flow using a different path.
Three emulated business applications were attached to the network on the Nexus 5000 switch using four 10-Gigabit Ethernet interfaces. Cisco configured each incoming 10-Gigabit Ethernet port on the Nexus 5000 to accept traffic for a single business service. The Nexus 5000 was then configured to split all traffic for each service evenly across the two downstream Nexus 7000 switches.
In essence, a VPC with one link to each Nexus 7000 was configured for each data center business service, with the exception of TelePresence, which used two VPCs for this test. This configuration would be reasonable for an operator that knows its data center traffic utilization very well. Since trending and capacity planning are best practices in service provider networks and data centers, we accepted the configuration. Cisco explained that if a single link in a VPC failed, the other link would transport the full load for that service, with traffic rerouted to the proper Nexus 7000 via the links between the two Nexus 7000 devices. The diagram above displays this forwarding behavior.
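To illustrate this per-service layout from the access side, the sketch below shows how such a design might look on the Nexus 5000: one port channel per business service, each with one member link towards each Nexus 7000 (which, on the 7000 pair, would carry a matching vpc number). The interface, VLAN, and channel numbers are hypothetical, and the actual configuration was not published in this report.

```
! Hedged Nexus 5000 sketch: one two-link bundle per business service,
! with one member link toward each Nexus 7000. All numbers are hypothetical.
interface port-channel 101
  description Digital-Signage
  switchport mode trunk
  switchport trunk allowed vlan 101

interface Ethernet1/1
  channel-group 101 mode active   ! uplink to Nexus 7000 "A"
interface Ethernet1/2
  channel-group 101 mode active   ! uplink to Nexus 7000 "B"

! Video Surveillance would follow the same pattern; TelePresence, as noted
! above, used two such bundles in this test.
```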
Our initial plan was to fail all links on one side of the data center and verify that the other side could maintain the services and perform the switchover. We were informed by the Cisco engineering team that recovery from such a multiple failure was not possible, so we were left with the usual failure scenario: a single link failure.
We tested the virtual port channel’s ability to recover from the failure of a single group member by using the full service delivery network traffic profile and disconnecting a single downstream link from the Nexus 5000. Having understood the way Cisco configured the data center, we expected only one service to be affected. Cisco’s claim was that the service would recover from the broken link in less than one second. The figure below shows the link that was disconnected to simulate a link failure in the data center.
EANTC’s standard failover test procedure calls for each test to be repeated three times in order to give the results a minimum of statistical significance. In this particular case, due to inconsistent results in one test run, we decided to perform two additional runs, bringing us to five test runs in total (or, more precisely, five failover test runs and five recovery test runs).
The link we failed served exactly four business ports attached to uPE1. We expected that the business ports attached to the second uPE would show no negative effect when we failed the link, and the results showed almost exactly that: a few frames were still lost on ports where we did not expect loss, which Cisco explained as an effect of the hashing process used to distribute traffic across the port channel members. The graph below shows the highest out-of-service time recorded on a single port among the four ports where we expected to lose traffic. In fact, the loss observed across these four ports was as consistent as expected, never differing by more than 20 lost frames. The news was positive regardless: in all test runs, the recovery times on all ports never exceeded the one second claimed by Cisco.
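To relate the frame-loss figures to recovery times: with a constant-rate test stream, out-of-service time is conventionally derived from the number of lost frames, so a spread of 20 frames corresponds to a negligible difference in measured downtime. The rate used below is illustrative only; the actual stream rates are not restated in this section.

$$ t_{\text{out-of-service}} = \frac{N_{\text{lost frames}}}{R_{\text{offered frame rate}}} $$

At an illustrative rate of 100,000 frames per second, for example, a difference of 20 lost frames translates into only 0.2 ms of additional measured outage.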
In the world of backbone routing, the results we show above might seem disheartening. Compared with the data center resiliency mechanisms available so far, however, such as the various Spanning Tree Protocols and Link Aggregation Groups, Cisco claims that the results shown here are an improvement. From our testing experience, Spanning Tree Protocols can indeed require several seconds to converge, which supports Cisco's claim; VPC is clearly another valid data center resiliency mechanism available to service providers.
Next Page: Results: In-Service Software Upgrade (ISSU)