Virtualized Quantum Packet Core (VQPC), Cisco's virtualized version of the 3G/LTE mobile packet core solution.
Cisco NFVi overview
The NFV Industry Specification Group (ISG) hosted by ETSI (the European Telecommunications Standards Institute) has defined an "NFV Infrastructure" (NFVi) block in its reference model. The NFVi includes the hardware resources (in this instance, Cisco UCS servers) and the virtualization platform (in this evaluation, the OpenStack Juno release, including KVM and the virtual switch).
We evaluated both an Open vSwitch (OVS) implementation using Intel's Data Plane Development Kit (DPDK) and a proprietary Cisco vSwitch implementation.
From a functional perspective, the NFVi provides the technology platform with a common execution environment for NFV use cases, such as business, video, residential, and mobility services. The NFVi provides the infrastructure that supports one or more applications simultaneously and is dynamically reconfigurable between these use cases through the installation of different VNFs.
Figure 1: Cisco's NFV infrastructure (NFVi)
The NFVi is typically bundled with a virtualized infrastructure manager (VIM). In Cisco's NFVi, the VIM functions are provided by OpenStack, complemented by proprietary Cisco software for management and orchestration (for details, see Part II of the evaluation report).
The orchestration platform and SDN controller functions -- Cisco WAN Automation Engine and Cisco Network Services Orchestrator -- were evaluated by EANTC in the previous evaluation, so we did not focus on them this time.
Next page: The virtual switch performance test introduction
The virtual switch performance test introduction
For the successful deployment of virtualized network functions, the most important aspect is network connectivity -- to other VNFs, the rest of the data center, services and customers. Often, VNFs are primarily network data filters or relays: A virtual router, a virtual broadband network gateway, a virtual firewall or even a virtual mobile packet gateway -- they are all big packet movers with some add-on functionality.
This is a major difference compared with enterprise cloud data centers. For enterprises, the most important aspect is often the compute and storage function of a virtual machine (VM). Salesforce, Oracle databases, Citrix remote access and others -- they are all big data or compute workhorses that require and produce only a relatively small amount of network data.
What is a virtual switch?
The virtual switch (vSwitch) is the broker of all network communications within a server and to the outside world. It connects physical Ethernet ports to virtual services, multiplexes services onto VLANs, and connects virtual services to other virtual services internally (informally called "service chaining" -- the official ETSI names are more complex and confusing, so we'll address them later).
By nature of the standard x86 environment, the vSwitch is a software component. In its basic variant, the vSwitch moves packets by copying them around: from the Ethernet network interface card (NIC) to kernel memory, from kernel memory to user memory, and vice versa. In this vanilla form it is an awfully slow process: it is somewhat depressing that an industry which has created ultra-low-latency, hardware-based data center switches with terabits per second of throughput needs to go back to software switching.
Initially, vendors found a very simplistic solution to the throughput problem: circumvent the vSwitch and allow the VNF to access the hardware directly. This feature is called "passthrough." It is a great marketing invention, but it does not have a place in an NFV world, as it violates a number of layer abstraction models and, specifically, does not allow service chaining.
vSwitch solution evaluation
Recently, there has been a lot of focus on improving vSwitch performance and validating it. Intel has published the open source Data Plane Development Kit (DPDK), which provides fast packet processing libraries. And the Open vSwitch (OVS) project has developed an open source reference implementation of a virtual switch. This implementation supports DPDK as well, improving performance greatly -- the combination is called OVS-DPDK.
We tested OVS-DPDK performance on Cisco UCS hardware (for the details, see the test configuration tables on the next page), comparing two Intel CPU generations (Haswell and Sandy Bridge). In addition, we compared OVS performance with Cisco's own virtual switching implementation, called Vector Packet Processing (VPP).
How does innovation work (and pay back the innovators' efforts) in an open source world? At the forefront of development, there are often commercial implementations that are later (and/or with limited functionality) released as open source. This is what Cisco has done by developing a proprietary, virtualized forwarder -- VPP. This technology, as Cisco explained, is included in the VMS (Virtual Managed Services) product and in other Cisco products such as the IOS XRv 9000 virtual router, covered in this test as well. VPP uses proprietary algorithms to further improve packet forwarding for service provider-specific requirements. VPP runs as a Linux user-space process on the host as well as in a guest VM. When running on the host, it uses drivers to access NICs over PCI; when running in a guest, it accesses the NICs via PCI passthrough. VPP integrates the DPDK poll-mode drivers.
VPP and OVS are not exactly comparable regarding feature sets: OVS is much more of a complete, standalone vSwitch implementation than VPP, which is more of an advanced technology building block. The Cisco team explained that VPP is used in Cisco products where its high performance features are required.
Next page: Test goals and configuration
Test goals
We measured the maximum packet forwarding performance of the Open vSwitch and VPP implementations on UCS hardware (both for the Haswell and Sandy Bridge architectures) for the following reasons:
Provide per-component and system-level performance benchmarking: System users do not experience individual component performance, but that of the system as a whole; optimizing the system, however, requires an understanding of the performance of its components. We attempt to provide that understanding.
Provide comparison of bare metal and virtualized systems: A common and important question is, "How much faster or slower will a virtualized system be? What performance tax will I pay for the flexibility of virtualization or what greater speed will I benefit from?" We provide some baseline bare-metal testing to enable this comparison.
Establish NFV benchmark best practices that can be used as a reference: This can be achieved through testing and cooperation with the NFV community, striving to narrow down the number of parameters for further testing. In other words, once we know an option is correct, we do not need to test the incorrect alternatives. And when testing a new combination of elements, we start with a baseline from previous testing. The aim is to automate as much of the testing as possible, for correctness, repeatability and testing efficiency.
Develop a reproducible vSwitch testing process: The NFV community, network designers and engineers, Cisco and the entire industry are all dependent on accurate information on how new NFV-based systems will perform as they scale. This information is currently not available -- we thought it was important to contribute to this much needed knowledge base.
Test configuration
Cisco provided a couple of UCS C240 M4SX and UCS C240 M3S rack servers for the tests. On the physical hardware, Cisco installed Ubuntu 14.04.3 LTS with KVM as the virtual machine manager (VMM), along with DPDK 2.0 and Open vSwitch 2.4.0.
Here are the full details of the hardware and software components:
Figure 2: The Evaluation Hardware
Figure 3: The Evaluation Software
Next page: Test methodology and setup
Test methodology and setup
We used Intel's Ethernet frame forwarder tool (l2fwd) as a reference VNF to verify the forwarding performance of the virtual switches. It was installed in the guest operating system (OS) with the "hugepages" option enabled, which improves the efficiency of memory management.
Cisco decided to "pin" three CPU cores for the guest OS. Pinning is a technique typically used to statically assign CPU cores to threads that have near real-time requirements; without pinning, the time-sharing of CPU cores between threads would create delay variation in switching and reduce efficiency. DPDK vhost-user ports (user-space socket servers) were configured for the data-plane connectivity between the virtual switch and the guest OS.
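In the test bed, pinning was configured at the hypervisor level (mapping vCPUs to physical cores). As a purely conceptual illustration of the same idea, Linux lets any process restrict itself to a fixed set of cores; the minimal Python sketch below uses hypothetical core IDs and is not part of the actual test configuration.

```python
import os

# Hypothetical example: restrict this process (pid 0 means "self") to cores 2, 3 and 4,
# so the scheduler never migrates its threads onto other, busier cores.
dedicated_cores = {2, 3, 4}
os.sched_setaffinity(0, dedicated_cores)

# Confirm the affinity mask now in effect.
print("running on cores:", sorted(os.sched_getaffinity(0)))
```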
We exercised multiple scenarios, as shown in the diagram below. In all cases, traffic was generated by an Ixia load generator with two 10GigE interfaces connected to the UCS system.
Topology 1: VNF-to-VNF, measuring the performance of pure virtual switching -- important for service chaining, for example. Traffic was sent by passthrough (bypassing the vSwitch) to l2fwd instances, which interfaced with the vSwitch.
Topology 2: Virtual path performance across NIC, vSwitch and VNF -- suitable for regular, single-VNF full-path network scenarios.
Topology 3: Standalone vSwitch forwarding between physical NIC ports, a reference test for vSwitch performance (in this instance, used for lookup tests only).
Topology 4: Standalone VNF forwarding between physical NIC ports without a vSwitch, to get a baseline figure of physical passthrough performance without vSwitch.
Figure 4: vSwitch Test Topologies (from left to right): VM-to-VM; full virtual path; vSwitch only; passthrough evaluation of Cisco's IOS XRv 9000 virtual router.
All tests were carried out in line with the methodology specified by the OPNFV VSPERF initiative for vSwitch performance testing. These methods are mostly based on one of the foundations of IP benchmarking standards, IETF RFC 2544 (Benchmarking Methodology for Network Interconnect Devices).
In the course of the joint pre-staging, we discovered that virtual switches behave very differently from hardware-based switches, as they are based on non-real-time environments. As a result, we adapted the methodology, including: a) multiple runs to confirm results statistically; b) accepting a minimal loss threshold (0.001 %) for some test runs; c) running linear sweep tests. For more details, see the final page of this report, "Cisco and EANTC evolved vSwitch test methodology."
Next page: VM-to-VM performance
VM-to-VM performance
Evaluation overview: Cisco's VPP reached up to 10 Gbit/s (1.6 million frames/second) throughput, Open vSwitch up to 7 Gbit/s (1.09 million frames/second) -- each with a single core. VPP showed more predictable behavior than OVS when both were brought to their limits. Performance on the Sandy Bridge and Haswell architectures differed only slightly.
Test topology 1 verified switching performance between virtual network functions via the virtual switch and their vhost-user virtual links. This will be one of the most important use cases in the future, when multiple VNFs will share host resources.
It is of course crucial that the vSwitch does not take all of the host's compute power away from the VNFs. We discussed with Cisco how many cores should be allocated to a vSwitch and agreed to use just a single core for the vSwitch.
This was a good decision for two reasons. First, this is a good baseline that can be scaled to multiple cores later. Second, OVS 2.4.0 interacting with DPDK 2.0 and QEMU (the machine emulator used with KVM) was lacking a specific multi-queue capability required for multi-core switching support. We will report this issue back to the relevant projects.
The throughput performance results were very interesting, so let's discuss the following graphs:
Figure 5: VM-to-VM throughput performance on Haswell architecture.
Figure 6: VM-to-VM throughput performance on Sandy Bridge architecture.
With our standard setting accepting 4 parts per million (4 ppm, or 0.0004%) packet loss, Open vSwitch reached up to 1.1 million Ethernet frames per second (Mfps) with small frames, or 7 Gbit/s throughput with large frames. Cisco's commercial platform, VPP, reached up to 1.2 Mfps with small frames, or 10 Gbit/s with large frames. Naturally, the forwarding effort depends primarily on the number of frames, so throughput in Gbit/s was much lower with smaller frames of 64 and 256 bytes. We also used standard Internet Mix (IMIX) frame sizes, which yielded 2 Gbit/s for Open vSwitch and 3 Gbit/s for VPP. These numbers would likely scale with a higher number of cores.
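The Mfps and Gbit/s figures are two views of the same limit: each Ethernet frame also occupies 20 bytes of overhead on the wire (preamble, start-of-frame delimiter and inter-frame gap), so the frame rate, not the bit rate, is what the vSwitch has to work for. A short Python sketch of the conversion at a 10 Gbit/s reference rate (illustrative arithmetic only, not our measured data):

```python
def ethernet_fps(line_rate_bps, frame_size_bytes):
    """Theoretical frames/second at a given line rate, including the
    20 bytes/frame of preamble, SFD and inter-frame gap overhead."""
    return line_rate_bps / ((frame_size_bytes + 20) * 8)

for size in (64, 256, 512, 1518):
    fps = ethernet_fps(10e9, size)            # 10 Gbit/s reference link
    frame_gbps = fps * size * 8 / 1e9         # frame bits only, excluding wire overhead
    print(f"{size:5d} bytes: {fps / 1e6:6.2f} Mfps, {frame_gbps:5.2f} Gbit/s of frames")
```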
It was a big surprise that the throughput increased substantially when we accepted slightly higher frame loss. We noticed that a non-zero, but small, loss ratio persisted over a broad range of throughput: for example, accepting a higher loss of 0.01% (100 ppm), Cisco's VPP yielded 14 Gbit/s throughput instead of 10 Gbit/s with 1518-byte frames. The reason is that a non-real-time system may lose a small number of packets to buffer overflows even when it is not fully loaded. In contrast, hardware switches typically have a real-time approach and a throughput limit beyond which the packet loss rate increases quickly and linearly.
How much packet loss is acceptable depends on the application scenario and it should be noted that the vSwitch is only one component contributing to the end-to-end solution. For example, video-over-IP can typically accept 0.1% or 1000ppm loss (depending on the codec and transport stream settings); a vSwitch contributing 10% of that limit (0.01% loss or 100ppm) will likely be contributing too much loss for that application. (Note, though, that the industry has not converged on specific acceptable loss values yet.)
Next, we measured the forwarding latency as an indication of how long frames are stuck in buffers in the system. Naturally, latency was higher for larger packets due to serialization delays (the time it takes to get a packet off and back on the wire) and the time required to read and copy packets in memory. For small 64-byte frames, both Open vSwitch and VPP showed a latency of 20 microseconds, which is comparable to hardware switches. 1518-byte frames were handled with an average latency of 100 microseconds. The maximum latency varied quite a bit, though: while VPP serviced all packets within at most 200 microseconds, OVS sometimes took more than 400 microseconds for the same task.
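For scale, the serialization component alone is small; most of the latency growth with frame size comes from copying the larger frames through the buffers. A quick sketch of the arithmetic on a 10 Gbit/s interface (illustrative, not measured):

```python
def serialization_delay_us(frame_size_bytes, line_rate_bps=10e9):
    """Time to clock one frame onto the wire, in microseconds."""
    return frame_size_bytes * 8 / line_rate_bps * 1e6

for size in (64, 1518):
    print(f"{size:5d} bytes: {serialization_delay_us(size):.2f} microseconds per hop")
# 64 bytes: ~0.05 us; 1518 bytes: ~1.21 us -- a small share of the measured 20-100 us averages.
```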
Both the sensitivity to loss and the maximum latency of Open vSwitch showed that the open-source implementation is less predictable at this point. Cisco's VPP technology excelled, behaving in a more controlled way (see graph below).
Figure 7: VM-to-VM latency performance comparison at maximum throughput.
Next page: Full virtual path performance
Full virtual path performance
Evaluation overview: In the full virtual path scenario, Cisco's VPP reached up to 20 Gbit/s (2.5 million frames/second) throughput with a single core. Open vSwitch provided between 20 and 40 Gbit/s (4 to 6 million frames/second), varying greatly across measurements.
While the previous test scenario focused on vSwitch throughput only, we aimed to cover the full virtual path performance in the second scenario. The full virtual path includes the hardware (the network interface card), the virtual switch and a reference VNF. This scenario is important for single virtual network functions aiming to use the vSwitch, as all VNFs should eventually do (instead of bypassing the virtualization infrastructure). In this scenario, each frame passes the vSwitch twice -- in the vSwitch throughput results, we therefore multiply the load generator's frame counts by two.
The vSwitch environment was configured identically to the previous scenario.
We started with linear loss rate tests and standard RFC 2544 lossless single-run tests for both the VPP and OVS-DPDK solutions.
The VPP implementation yielded stable results up to 20 Gbit/s throughput with large packets and 2.5 Mfps with 64-byte packets.
However, we quickly noticed inconsistencies in the throughput performance of the OVS-DPDK virtual switch. OVS-DPDK showed only 8 Gbit/s throughput with large packets, which seemed unreasonably low. When we reran the test, the results were different. Because of spurious, transient and very small packet losses with the OVS-DPDK implementation, the standard RFC 2544 results were simply not reproducible and often unreasonably bad. To adapt to the non-real-time software environment, we subsequently changed the test methodology to "BestN/WorstN" to yield statistically significant results. We applied five test runs (N=5) for each test scenario. (For more details, see the final page of this report, "Cisco and EANTC evolved vSwitch test methodology.")
The following graphs show the results for BestN/WorstN and for the single run test. As expected, the measured throughput value for the single run test fell within the BestN/WorstN throughput range for VPP. In contrast, OVS-DPDK showed out-of-range values for the single run test and a large variation of results for lossless throughput. These results show that the OVS-DPDK implementation is currently not optimized for stable lossless Ethernet frame forwarding. Or, viewing things from the other side, one could conclude that old-school RFC 2544 testing does not yield statistically significant values with virtual switches such as OVS-DPDK. The results varied wildly, with IMIX throughput between 3 and 14 Gbit/s across our five test runs.
We suggest using multiple test runs 'N' to determine accurate worst or best performance when OVS-DPDK is used. The exact value of 'N' could subsequently be derived mathematically based on the volatility of results, calculating a confidence interval. (We will save the reader from detailed mathematics here and will follow up with standards bodies.)
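For readers who want the flavor of that calculation anyway: one approach is to compute a confidence interval over the N repeated runs and increase N until the interval is acceptably narrow. A minimal sketch with hypothetical throughput samples (these are not our measured values):

```python
from statistics import mean, stdev

samples = [3.1, 9.8, 13.6, 7.2, 11.4]   # hypothetical IMIX throughput from N=5 runs, in Gbit/s

n = len(samples)
m = mean(samples)
s = stdev(samples)                      # sample standard deviation
t_95 = 2.776                            # Student's t, 95% two-sided, n-1 = 4 degrees of freedom
half_width = t_95 * s / n ** 0.5

print(f"mean {m:.1f} Gbit/s, 95% confidence interval +/- {half_width:.1f} Gbit/s")
# A target interval width then dictates how large N needs to be for a given volatility.
```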
BestN/WorstN test results for full virtual path forwarding performance
Figure 8: Full virtual path Cisco VPP lossless throughput performance on Sandy Bridge
Figure 9: Full virtual path OVS-DPDK lossless throughput performance on Sandy Bridge.
OVS-DPDK showed small minimum and average latency results across all packet sizes; however, the maximum latency for small packets was very high, at 1,400 microseconds (as the graph below shows).
VPP maximum latency was lower, providing more consistent treatment of packets, specifically for small packets and the IMIX packet size mix.
Figure 10: Full virtual path latency performance.
Next page: Virtual switch FIB scalability
Virtual switch FIB scalability
Evaluation overview: In a pure virtual switching scenario, VPP showed no throughput degradation when forwarding to 2,000 IPv4 or Ethernet MAC addresses at 20 Gbit/s, less than 1% reduced throughput towards 20,000 MAC addresses and 23% reduced throughput when forwarding to 20,000 IP addresses. OVS IPv4 and Ethernet forwarding was reduced by 81% when forwarding to 2,000 IPv4 addresses.
Both previous scenarios focused exclusively on Ethernet-layer throughput with a small number of flows, because they both involved virtual machines. In the third scenario, we evaluated pure virtual switching without any actual application. This is, of course, not a realistic application setup; instead, it is a reference test of vSwitch properties that need to be determined independently of VNF performance.
One of the most basic and important scalability figures of an Ethernet switch is its ability to handle many Ethernet flows between different endpoints (associated with MAC addresses) in parallel. In a data center, there is usually much more East-West traffic between servers and services directly connected on the Ethernet segment than there is North-South traffic. vSwitches need to enable virtual services to participate in data center communication and need to be able to connect to many Ethernet destinations in parallel.
Separately, vSwitches obviously need to support many IP addresses in their forwarding information base (FIB) simultaneously when configured for IP forwarding. If traffic is routed towards a virtual firewall or a virtualized packet filter, there are usually tens of thousands of flows involved from thousands of IP addresses.
We verified forwarding performance of the standalone virtual switch with multiple layer 2/layer 3 FIB table sizes.
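The FIB sizes in this test correspond simply to the number of unique destination addresses offered by the load generator (and learned or configured in the vSwitch). As an illustration of how such address sets scale -- the base addresses below are arbitrary placeholders, not the ones used in the test:

```python
import ipaddress

def dest_ipv4_addresses(count, base="10.0.0.1"):
    """Generate `count` consecutive IPv4 destination addresses, starting at `base`."""
    start = ipaddress.IPv4Address(base)
    return [str(start + i) for i in range(count)]

def dest_mac_addresses(count, base=0x020000000001):
    """Generate `count` consecutive, locally administered MAC addresses."""
    macs = []
    for i in range(count):
        octets = (base + i).to_bytes(6, "big")
        macs.append(":".join(f"{b:02x}" for b in octets))
    return macs

print(dest_ipv4_addresses(3))   # ['10.0.0.1', '10.0.0.2', '10.0.0.3']
print(dest_mac_addresses(2))    # ['02:00:00:00:00:01', '02:00:00:00:00:02']
```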
VPP showed its strengths, based on its optimized, vector-based handling of large tables. It achieved very consistent IPv4 forwarding: throughput did not drop at all when forwarding to 2,000 IPv4 addresses compared with the single-address scenario, and dropped by only 23% when forwarding to 20,000 IPv4 addresses. The average forwarding latency was largely unaffected by the larger tables, and the maximum IPv4 forwarding latency was still bearable at 400 microseconds for 2,000 address entries and 1,200 microseconds for 20,000 entries.
In contrast, Open vSwitch seems to use less optimized FIB lookups. Throughput dropped from 20 Gbit/s (IPv4) and 8.8 Gbit/s (Ethernet) for the single-entry case down to around 4 Gbit/s in both cases for 2,000 IPv4 and MAC addresses. For 20,000 FIB entries the OVS-DPDK implementation was not usable in its current version as the throughput dropped to almost zero and maximum latency skyrocketed to 37 milliseconds (not microseconds!). We are not complaining -- after all, it's free software -- and in fact we hope that Cisco will contribute its VPP improvements back to OVS to improve the open source software.
These results indicate much better VPP performance in higher scale network deployments.
Figure 11: vSwitch-only IP forwarding performance.
Figure 12: vSwitch-only Ethernet forwarding performance.
Figure 13: vSwitch-only latency performance.
Next page: Evaluating the Cisco IOS XRv 9000
Evaluating the Cisco IOS XRv 9000
Evaluation overview: Cisco's IOS XRv 9000 virtual router excelled with up to 35 Gbit/s and up to 8.5 Mpps throughput in a feature-rich configuration, using 14 Haswell cores and PCI passthrough configuration. Forwarding latency was in line with expectations in most cases.
As the last of the four testing scenarios, we evaluated the performance of a commercial virtual router implementation provided by Cisco. As the product name "Cisco IOS XRv 9000" indicates, the virtual router that Cisco supplied to our test is based on IOS XR, the routing software Cisco first developed for the CRS-1 and has used for core and aggregation routers ever since.
For this performance test, Cisco used the UCS C240 M4SX (Haswell) hardware platform. 14 CPU cores (of the 16 total) were pinned to the virtual router. This makes sense, as the virtual router is the main virtual service in this case. The test utilized four 10GE interfaces connected directly to the VM via PCI passthrough. In general, EANTC is much in favor of using vSwitches for all applications, but here the purpose of the test was to establish a baseline performance evaluation of the virtual router without being limited by a vSwitch.
The Cisco team told us they chose the XRv 9000 virtual router from the company's portfolio of virtual routers since it uses VPP technology as well and is a full-featured router implementation for service provider environments.
The XRv 9000 was configured with rich features -- ingress ACLs, ingress color-aware hierarchical policing, egress hierarchical QoS with parent shaping, child queuing, packet remarking, and Reverse Path Forwarding ("ipv4 verify unicast source reachable-via any") -- and was loaded with a mix of IPv4 and IPv6 traffic. EANTC validated the 174-line IOS XR configuration in detail.
The XRv 9000 achieved lossless throughput of more than 35 Gbit/s for packet sizes of 512 bytes or more, and up to 8.5 million packets per second (Mpps) with small packets.
This is a very impressive result that is, of course, influenced by the 14 cores being the workhorses for the virtual router, whereas all previous scenarios had been tested with just a single core. Nevertheless it is reassuring to see that the virtual router can handle 62% of line rate with a realistic mix of packet sizes, and 90% of line rate with large packets of 512 bytes or more. There have been higher throughput values touted before, but let's not forget that the XRv 9000 is a full-featured virtual router and we actually used quite a few of those features in our test. Cisco certainly did not go for the low-hanging fruit in this test.
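The percent-of-line-rate figures follow from the four 10GE ports (40 Gbit/s of aggregate line rate) and the per-frame overhead of 20 bytes on the wire. A rough cross-check in Python, assuming the 7:4:1 IMIX mix described on the methodology page (so the exact percentages are approximations, not reported results):

```python
def line_rate_pps(line_rate_bps, avg_frame_bytes):
    """Theoretical packets/second at line rate, including 20 bytes/frame of wire overhead."""
    return line_rate_bps / ((avg_frame_bytes + 20) * 8)

LINE_RATE = 4 * 10e9      # four 10GE ports
IMIX_AVG = 354            # approximate weighted average of the 7:4:1 IMIX mix, in bytes

imix_share = 8.5e6 / line_rate_pps(LINE_RATE, IMIX_AVG)
print(f"IMIX: {imix_share:.0%} of line rate")            # ~64%, near the reported 62%

pps_512 = 35e9 / (512 * 8)                               # pps implied by 35 Gbit/s of 512-byte frames
print(f"512 bytes: {pps_512 / line_rate_pps(LINE_RATE, 512):.0%} of line rate")   # ~91%, the reported ~90%
```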
Figure 14: Cisco IOS XRv 9000 IPv4/IPv6 throughput performance.
Figure 15: Cisco IOS XRv 9000 minimum, average and maximum latency performance.
At around 100 to 1,000 microseconds, the minimum and average latencies were well in line with our expectations for all packet sizes except 1518 bytes. The 1518-byte results, coupled with the fact that the maximum latencies rose to 25-100 milliseconds, indicate an implementation or configuration issue.
Next page: Conclusion of vSwitch test findings
Conclusion of vSwitch test findings
At the end of a full week of in-depth vSwitch testing, including overnight and weekend automated tests (those RFC2544 runs consume a lot of time!), we gathered a tremendous amount of data about Cisco's VPP (both as a pure vSwitch and serving in the XRv 9000 virtual router) and the DPDK-enabled version of Open vSwitch.
vSwitch technology and Cisco's virtual router implementation are definitely getting there. We witnessed a single Haswell or Sandy Bridge core achieve up to 20 Gbit/s Ethernet switching throughput and 2.5 Gbit/s virtual routing throughput. This is really a great step for an industry which, let's not forget, is still in the early stages of a new technology development cycle. It underlines the power of open source development, where many vendors cooperate to progress quickly.
At the same time, the VPP performance results were much more consistent and reliable than those of Open vSwitch. Obviously, the traditional vendor model still has advantages when it comes to quality assurance and when reliable software needs to be bundled for use in mission-critical service provider applications.
The EANTC test is one of the first comprehensive, independent and public vSwitch performance evaluations. The results confirm that the concept of virtual switching and routing in the context of the ETSI NFV virtualization model is feasible.
Cisco's commercial VPP implementation used impressive techniques to get more consistent performance out of the system, while it is clear that the open source solution will soon become usable for large-scale deployments once a few more glitches have been eliminated.
By the beginning of 2016, there should no longer be any good reason to use the "passthrough" direct network hardware access method that breaks virtualization concepts.
With the vSwitch test done, EANTC can confirm that the machine room performs as needed. In Part 2 of this NFVi evaluation report, we will take the next steps and look at the other main service provider pain points -- the manageability and reliability of a virtualized infrastructure.
Next page: Cisco and EANTC evolved vSwitch test methodology
Cisco and EANTC evolved vSwitch test methodology
All throughput measurements based on RFC 2544 use a binary search algorithm. A binary search allows test teams to find the throughput at a specified resolution in a minimum number of traffic runs. Basically, when searching between 0 and 100%, one starts with 100% throughput; if that run yields zero loss, the test is done, otherwise the next run is at 50%. Subsequently, either 25% or 75% is tested, depending on the result of the previous measurement. The search continues, halving the interval at each step, until the specified precision has been reached.
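A minimal Python sketch of that search loop follows. The run_trial() callback stands in for one traffic run on the load generator and returns the observed loss ratio; the function name, defaults and resolution are illustrative, not the exact parameters we used.

```python
def rfc2544_binary_search(run_trial, loss_tolerance=0.0, resolution=0.5):
    """Find the highest offered load (% of line rate) whose loss stays within
    `loss_tolerance`, to within `resolution` percentage points."""
    low, high = 0.0, 100.0
    best = 0.0
    load = 100.0                        # RFC 2544 starts at full line rate
    while True:
        if run_trial(load) <= loss_tolerance:
            best, low = load, load      # passed: continue searching the upper half
        else:
            high = load                 # failed: continue searching the lower half
        if high - low <= resolution:
            return best
        load = (low + high) / 2.0
```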
We noticed, however, that Open vSwitch did not always show reproducible or linear behavior. Some measurements at a certain load yielded zero packet loss. Others at the exact same rate resulted in a very small loss (for example, 10 packets out of 1 million), seemingly an effect of buffer or interrupt management. In other scenarios, Open vSwitch showed small loss at, for example, 12% throughput, but continued to function without loss at 13%, 14% and 15% throughput -- probably due to the same non-deterministic small-loss behavior.
We extended the test methodology, including the standard RFC 2544, as follows:
Single run test
Single Run (SR) tests execute a single run of the RFC 2544 binary search algorithm to measure the throughput for defined frame sizes. We used 64, 256, 512 and 1518 byte frames, as well as IMIX traffic with a distribution of 64 bytes:7, 570 bytes:4, 1518 bytes:1.
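That IMIX mix has a weighted average frame size of roughly 354 bytes, which determines the frame rate at line rate. A quick check (illustrative arithmetic):

```python
imix = {64: 7, 570: 4, 1518: 1}        # frame size in bytes : weight
avg = sum(size * weight for size, weight in imix.items()) / sum(imix.values())
print(f"average IMIX frame size: {avg:.1f} bytes")     # ~353.8 bytes

# Corresponding line-rate frame count on one 10 Gbit/s port (20 bytes/frame wire overhead):
print(f"{10e9 / ((avg + 20) * 8) / 1e6:.2f} Mfps")     # ~3.34 Mfps
```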
Linear loss rate test
This test measured the packet loss ratio resulting from a linear increase of the offered load from 1% to 100% of line rate, in steps of 1% of line rate. IMIX frames were used for the measurement.
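Reusing the run_trial() convention from the binary-search sketch above, the sweep itself is straightforward (a sketch, not our actual test scripts):

```python
def linear_loss_sweep(run_trial, step_percent=1):
    """Offer traffic from 1% to 100% of line rate in fixed steps and record
    the loss ratio at each load; run_trial() performs one IMIX traffic run."""
    return {load: run_trial(load) for load in range(step_percent, 101, step_percent)}
```

Plotted, such a sweep makes the non-deterministic small losses visible as isolated spikes rather than a clean knee in the loss curve.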
BestN/WorstN test
This test uses more samples to drive the binary search and yield statistically more accurate results. It keeps the heart of the RFC 2544 methodology, still relying on a binary search for the throughput at a specified loss tolerance, while providing more useful information about the range of results seen in testing. Instead of using a single traffic run per iteration step, each traffic run is repeated N times, and the success or failure of the iteration step is based on these N traffic runs. We defined two types of revised tests -- Best-of-N and Worst-of-N.
Best-of-N: indicates the highest maximum throughput to be expected for a given packet size and loss tolerance when the test is repeated.
Worst-of-N: indicates the lowest maximum throughput to be expected for a given packet size and loss tolerance when the test is repeated.
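Continuing the run_trial() sketch from above, one plausible reading of the pass/fail rule is that a Worst-of-N iteration step only passes if all N runs stay within the loss tolerance, while a Best-of-N step passes if any single run does (again illustrative, with hypothetical defaults):

```python
def best_worst_of_n(run_trial, n=5, loss_tolerance=0.0, resolution=0.5):
    """Best-of-N / Worst-of-N variants of the RFC 2544 binary search."""
    def step_passes(load, need_all):
        results = [run_trial(load) <= loss_tolerance for _ in range(n)]
        return all(results) if need_all else any(results)

    def search(need_all):
        low, high, best, load = 0.0, 100.0, 0.0, 100.0
        while True:
            if step_passes(load, need_all):
                best, low = load, load
            else:
                high = load
            if high - low <= resolution:
                return best
            load = (low + high) / 2.0

    return {"best_of_n": search(need_all=False), "worst_of_n": search(need_all=True)}
```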