This year's Open Compute Project (OCP) Summit produced some really exciting news including Google joining OCP, Microsoft open-sourcing its cloud switch network operating system software SONiC, and the Open Composable Networks vision unveiled by Mellanox. (See Mellanox Offers 'Lego' Approach to Network Components.)
Following on the heels of the formation of the Telco Infrastructure Project (TIP), this OCP Summit had a heavy telco presence: Communications service providers (CSPs) such as AT&T, Deutsche Telekom and SK Telecom want to roll up their sleeves and attract more vendors to build gear for their next-generation networks. (See Facebook TIPs Telcos Towards Open Source Networks.)
When we talk about telecommunications technologies, one of the trends we can't avoid is NFV. NFV is an essential initiative in the CSP community, designed to enable a leap forward to a more agile and efficient infrastructure that contains cost, improves efficiency and boosts service innovations. CSPs want an infrastructure that can deliver services as fast as the over-the-top (OTT) web services and cloud service providers.
But three and a half years after the term NFV was coined, and in spite of all the buzz, it has yet to see wide deployment. NFV can learn a few lessons from OCP and the hyperscale players, which are already putting OCP solution components into large-scale deployments. Here's why:
1. Hardware matters
The hyperscale web and cloud services giants that started the OCP initiative, and which are playing an instrumental role in driving it forward, are the exact same players that motivated Marc Andreessen to argue that "Software Is Eating The World" back in 2011. As a partner at Andreessen Horowitz, which has invested in Facebook, Groupon, Skype, Twitter, Zynga and Foursquare, among others, Andreessen believed that we were in the middle of a dramatic and broad technological and economic shift in which software companies were poised to take over large swathes of the economy. In order to survive the digital transformation, modern businesses are becoming software-driven, software-managed and even software-defined. Every firm is a software business and every firm is a socially empowered web-centric business today, or it needs to be if it wants to survive.
Fast forward to 2015, and this story had developed into a saga now called "If Software Is Eating The World, Then Open Source Will Chew It Up (And Swallow)." Software-driven businesses are increasingly using open source modules as componentized functions to leverage software economies of scale, avoid unnecessary experimentation and bring services to market in much less time. At OCP Summit, Microsoft further established its open source practice by announcing that it will open source its Azure Cloud Switch software and call it Software for Open Networking in the Cloud (SONiC). (See Microsoft: Open Source Means Meeting Customers Where They Are.)
As a result, proprietary switch system vendor Arista's stock tumbled.
It is not a coincidence that these web services software giants have collaborated to work on infrastructure hardware. OCP's mission is by and large about re-imagining hardware, making it more efficient, flexible and scalable to achieve greater choice, customization, and cost savings. For the hyperscalers, the notion of a white box extends far beyond cheap hardware: It is a way to disaggregate components and flexibly reassemble them into hardware that is optimized for the web services industry as a whole. The telco industry is following suit, and the OCP Telco Infrastructure Project was formally announced during Mobile World Congress as a way to come up with new methods for designing and building telecom network infrastructure.
The NFV community has clearly gone down the software path and has embraced open source through OPNFV and other open source initiatives. What's next? It must be a renewed focus on hardware. What has worked for web services may not be the ideal and optimized choice for network services, which have unique requirements for packet performance (especially small packet performance), latency and features such as packet classification, queueing and encryption/decryption. NFV must find ways to encourage hardware innovation, and to think outside of the x86 box to embrace ARM and network processors to make the ecosystem flourish.
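To make the small-packet requirement concrete, here is a back-of-the-envelope sketch of how many packets per second a link carries at line rate and how little time that leaves to process each one. The 10 Gbit/s link speed is an illustrative assumption, not a figure from any particular deployment; the 20 bytes of per-frame overhead (preamble, start-of-frame delimiter and inter-frame gap) is standard Ethernet framing.

```python
# Back-of-the-envelope packet-rate budget for an Ethernet link at line rate.
# Per-frame overhead on the wire: 7-byte preamble + 1-byte start-of-frame
# delimiter + 12-byte inter-frame gap = 20 bytes.
WIRE_OVERHEAD_BYTES = 20


def packets_per_second(link_bps: float, frame_bytes: int) -> float:
    """Maximum frames per second a link can carry at a given frame size."""
    bits_per_frame = (frame_bytes + WIRE_OVERHEAD_BYTES) * 8
    return link_bps / bits_per_frame


def time_budget_ns(link_bps: float, frame_bytes: int) -> float:
    """Time available to handle each frame at line rate, in nanoseconds."""
    return 1e9 / packets_per_second(link_bps, frame_bytes)


if __name__ == "__main__":
    link = 10e9  # assumed 10 Gbit/s link, for illustration only
    for frame in (64, 1518):
        mpps = packets_per_second(link, frame) / 1e6
        budget = time_budget_ns(link, frame)
        print(f"{frame:>5}-byte frames: {mpps:.2f} Mpps, {budget:.1f} ns per packet")
```

With minimum-size 64-byte frames, the same link has to handle roughly 18 times as many packets per second as it does with full-size 1,518-byte frames, which is why small-packet performance is where general-purpose hardware tends to be stressed first.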
2. APIs are extremely important
One thing remains clear -- software developers don't want to write code that is hardware specific, and in reality, they should not need to do that. In an ideal world, the most efficient solution is for software to be hardware independent, and for hardware to be plug and play.
In NFV, one thing that is often heard is that, because virtual network functions (VNFs) want to be hardware independent, they often end up doing everything in software, including routine packet forwarding. Actually, let's set the record straight -- software has forever been defining how packets are forwarded, but to steal my good friend Jeff Finkelstein's line, last time he checked, software has not been able to move a photon yet.
So, doing packet forwarding in software means doing packet forwarding in x86 hardware, which, in spite of performance enhancement and optimization through initiatives such as DPDK, still may fall short of network processors or multi-core ARM processors in terms of packet forwarding and handling. Still, how can CSPs enjoy different hardware choices to build infrastructure that is most optimized for communication applications, but somehow avoid the nightmare of managing multiple types of hardware and making their VNFs work on them?
The key is the API. In the OCP networking community, this problem was solved beautifully through Switch Abstraction Interface (SAI), a standardized C language API to program various types of switch ASICs.
As Kamala Subramaniam, principal architect at Azure Networking, pointed out in her blog:
- Before SAI, the underlying complexity of the hardware, with its strict coupling of protocol stack software, denied us the freedom to choose the best combination of hardware and software for our networking needs. SAI allows software to program multiple switch chips without any changes, thus making the base router platform simple, consistent, and stable. A standardized API also allows network hardware vendors to develop innovative hardware architectures to achieve great speeds while keeping the programming interface consistent. Additionally, SAI also enables open and easier software development of features, stacks, and applications.
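The idea behind such an abstraction layer can be sketched in a few lines. To be clear, this is not the SAI API, which is a C interface; every class and function name below is hypothetical and exists only to illustrate the pattern of writing control-plane code once against a common interface while each hardware vendor supplies its own implementation.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Tuple


class SwitchBackend(ABC):
    """Hypothetical vendor-neutral interface (illustrative; not the real SAI)."""

    @abstractmethod
    def add_route(self, prefix: str, next_hop: str) -> None:
        """Install or update a route for the given prefix."""

    @abstractmethod
    def routes(self) -> Dict[str, str]:
        """Return the installed routes as {prefix: next_hop}."""


class VendorAChip(SwitchBackend):
    """Made-up backend that keeps its table in a dictionary."""

    def __init__(self) -> None:
        self._table: Dict[str, str] = {}

    def add_route(self, prefix: str, next_hop: str) -> None:
        self._table[prefix] = next_hop

    def routes(self) -> Dict[str, str]:
        return dict(self._table)


class VendorBChip(SwitchBackend):
    """Made-up backend with a different internal layout (list of tuples)."""

    def __init__(self) -> None:
        self._entries: List[Tuple[str, str]] = []

    def add_route(self, prefix: str, next_hop: str) -> None:
        # Drop any existing entry for this prefix, then append the new one.
        self._entries = [e for e in self._entries if e[0] != prefix]
        self._entries.append((prefix, next_hop))

    def routes(self) -> Dict[str, str]:
        return dict(self._entries)


def program_default_routes(switch: SwitchBackend) -> None:
    """Control-plane logic written once, unaware of which chip it runs on."""
    switch.add_route("10.0.0.0/8", "192.0.2.1")
    switch.add_route("0.0.0.0/0", "192.0.2.254")


if __name__ == "__main__":
    for chip in (VendorAChip(), VendorBChip()):
        program_default_routes(chip)
        print(type(chip).__name__, chip.routes())
```

The point of the sketch is that `program_default_routes` never changes when a new backend is added, which mirrors the quote's claim that software can program multiple switch chips without modification.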
No ignoring storage
In his keynote at OCP Summit, Jay Parikh, Facebook's vice president of infrastructure engineering, emphasized the need to push the boundaries on storage and discussed a new class of technologies using non-volatile memory (NVM). As we all intuitively know, it does not make sense to have super-fast storage without investing in fast I/O to access your storage media.
Faster storage needs a faster network, and my co-worker John Kim, the storage guru, has written many good blog posts on this topic, including 50 Shades of Flash – Solutions that Won't Tie up Your Storage and 25 Is the New 10, 50 Is the new 40, 100 Is the New Amazing. This is especially true for some of the hottest technology trends in communications: Mobile Edge Computing (MEC) and NFV.
I recently came back from Light Reading’s Big Communications Event (BCE) held in Austin, Texas, where I heard these two things -- MEC and NFV -- mentioned in almost every session.
MEC is exciting for mobile carriers because it can give them significant differentiation over existing cloud service providers such as Amazon Web Services and Microsoft Azure, which have built massive regional data centers to support their operations and differentiate their cloud offerings. Looking forward, the new generation of applications, such as Internet of Things (IoT), autonomous cars, augmented reality, and optimized local content distribution and data caching, often need to tap into local content and real-time information about local access network conditions to be most efficient. These applications give mobile carriers a huge opportunity to leverage the distributed infrastructure they already have, including base stations (while the fixed-line operators have central offices/local exchanges). With this infrastructure already pretty much in place, mobile carriers can achieve low latency and real-time processing at the mobile network edge without traversing the backbone networks, which would add unnecessary latency.
The other hot topic at this year's BCE was cloud-native NFV. I am thrilled that the industry is going in a direction that was outlined in the Heavy Reading report on Cloud-Native NFV Architecture for Agile Service Creation & Scaling. Some of the key steps to make NFV cloud native are to break down monolithic VNFs into microservices and decouple state from transaction processing (commonly known as stateless applications). Being stateless means that you need to store state in persistent storage and be able to get to it even when the virtual machine (VM) or container instance that is doing transaction processing is no longer running.
This is one of the key properties of cloud-native applications that enables elastic scaling -- you can basically spin up as many VM or container instances as your workload grows and scale out almost infinitely -- and always-on reliability -- if one or more VM or container instances die, other instances can get the state from the persistent storage and pick things right up from there.
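Here is a minimal sketch of that stateless pattern. The class names are hypothetical, and the in-memory dictionary is only a stand-in for persistent storage; a real deployment would use a replicated database or key-value store reached over the network.

```python
from typing import Any, Dict


class StateStore:
    """Stand-in for persistent storage shared by all instances.

    Assumption: an in-memory dict is used here purely for illustration.
    """

    def __init__(self) -> None:
        self._data: Dict[str, Any] = {}

    def get(self, key: str, default: Any = None) -> Any:
        return self._data.get(key, default)

    def put(self, key: str, value: Any) -> None:
        self._data[key] = value


class StatelessWorker:
    """Hypothetical VNF instance that keeps no session state of its own."""

    def __init__(self, name: str, store: StateStore) -> None:
        self.name = name
        self.store = store

    def handle(self, session_id: str) -> int:
        """Process one transaction and return the session's running count."""
        count = self.store.get(session_id, 0) + 1
        self.store.put(session_id, count)
        return count


if __name__ == "__main__":
    store = StateStore()
    worker_a = StatelessWorker("a", store)
    worker_a.handle("session-1")
    worker_a.handle("session-1")
    del worker_a  # simulate the instance dying

    # A fresh instance picks up where the old one left off.
    worker_b = StatelessWorker("b", store)
    print(worker_b.handle("session-1"))  # prints 3
```

Because every transaction reads and writes the external store, the speed of that store, and of the network path to it, sits directly in the critical path of every request.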
To achieve the overall low-latency goal of MEC and cloud-native NFV, it is also critical to look into overall infrastructure design so that the latency to access data and perform data analytics is minimized. As we move from traditional hard disk drives (HDDs) to solid-state drives (SSDs) and now non-volatile memory (NVM), the latency introduced by storage media access has been shortened by several orders of magnitude.
With fast storage, the latency introduced by your physical network and your network protocols becomes more visible and possibly even the bottleneck. For example, NVMe (NVM Express) devices will support up to 2-3 Gbytes/s each with latencies lower than 50 µs (0.05 milliseconds), compared with 7-10 milliseconds for conventional hard disk drives. If you are accessing NVM drives across a network fabric, you want to make sure that the fabric is fast enough; otherwise you are wasting money buying expensive storage and then slowing it down with significant network and protocol delays. Modern Ethernet supports speeds of 100 Gbit/s per link, with latencies of several microseconds, and combined with the hardware-accelerated iSCSI Extensions for RDMA (iSER) and NVMe over Fabrics block protocols, it's perfect for supporting maximum performance on non-volatile memory (NVM), whether today's flash or tomorrow's next-gen solid state storage.
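A quick calculation shows why the network becomes visible. The media latencies below come from the figures above (50 µs for NVMe, and 7 ms as the low end of the hard disk range). The two network latencies are assumptions for illustration: 5 µs stands in for "several microseconds" on a fast fabric, and 100 µs is a made-up figure for a slower network and protocol stack.

```python
def network_share(media_us: float, network_us: float) -> float:
    """Fraction of total access latency contributed by the network."""
    return network_us / (media_us + network_us)


if __name__ == "__main__":
    HDD_US = 7000.0   # 7 ms, low end of the 7-10 ms range cited in the text
    NVME_US = 50.0    # 50 microseconds, as cited in the text
    SLOW_NET_US = 100.0  # assumed slower network + protocol stack (illustrative)
    FAST_NET_US = 5.0    # assumed "several microseconds" fabric (illustrative)

    cases = [
        ("HDD over slow network", HDD_US, SLOW_NET_US),
        ("NVMe over slow network", NVME_US, SLOW_NET_US),
        ("NVMe over fast fabric", NVME_US, FAST_NET_US),
    ]
    for label, media, net in cases:
        share = network_share(media, net)
        print(f"{label}: network is {share:.1%} of total latency")
```

Under these assumptions, the same 100 µs of network delay that is lost in the noise behind a hard disk (about 1.4% of the total) accounts for roughly two thirds of the access time in front of an NVMe device, while a fabric with single-digit-microsecond latency brings the network's share back down to about 9%.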
— Chloe Jian Ma, Senior Director, Cloud Market Development, Mellanox Technologies Ltd. (Nasdaq: MLNX)