Facebook: Operations Comes First
SANTA CLARA, Calif. -- Open Networking Summit -- Normally, it's telecom network operators who talk about operations challenges, but in his keynote address at the Open Networking Summit, Facebook's Omar Baldonado, director of engineering, said today that focusing on operations over features is his "guiding principle" in running the social media company's giant network.
"We have to prioritize operations over features, and that applies to any technology we build, any technology we buy and any technology we partner with," Baldonado said here today.
Over the last three years, Facebook has been rolling out its Wedge top-of-rack switch and its Facebook open switching system software (FBOSS) -- both of which it has contributed to the Open Compute Project. Early sales of that combo started slow but are now ramping very quickly, according to a graph he showed the ONS crowd -- albeit one without specific numbers.
In reality, there has been little change in the Wedge control plane from the left side of the chart when sales were slow to the right side, Baldonado said. "The amount of control plane differentiation was finished pretty early -- what we have been doing since is making sure all the operations were smooth, making sure we are getting all the things we want in terms of better debugging, monitoring and trouble-shooting tools."
Eventually, Facebook had a platform on which it could innovate, he noted, and smoother operations were a big part of both that and the rapid uptick in Wedge/FBOSS deployments.
Of course, there are still some stark contrasts between Facebook and your typical telecom network operator, as Baldonado made clear when he explained how his company has gone from twice-yearly software upgrades to weekly upgrades. In the process, upgrades have gone from being major scheduled events, with all hands on deck, so to speak, to regular occurrences.
"The reason we upgrade every week is to reduce the amount of changes that are introduced," he said. "If you don't upgrade for a year and you are getting a major new version with tens of thousands of changes, all those 10,000 features interact in funny ways and you don't know which one broke."
Upgrading every week requires a great deal of software and a high degree of automation, but "the benefit is that you have reduced the amount of stuff you have to debug," he said. "I hope this is a practice more and more of the industry does."
And then there's the notion of fast-fail, another aspirational practice for many telecom operators that is part of Facebook's basic operations. "We choose fail fast over fail-proof because making a 100% fail-proof network is a tall order," Baldonado said. "What we have focused on is finding the problems quickly and automatically and correcting them through software."
So instead of trying to find the perfect set of components on which to build a system, Facebook tries to build a system that detects failures and tries to remediate them as soon as possible. Part of that is based on an end-to-end probing system that, once a pattern emerges showing a problem exists, can isolate it and, at minimum, detour traffic away from it, he noted. Facebook has open-sourced its code for high-speed pinging, which is designed to avoid creating massive traffic overhead.
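For readers who want a concrete picture of the fail-fast idea, here is a minimal sketch in Python of how probe results might be turned into a detour decision. It is not Facebook's code: the `ProbeMonitor` class, the path names, the window size and the loss threshold are all hypothetical, chosen only to illustrate "wait for a pattern, then steer traffic away."

```python
from collections import defaultdict, deque


class ProbeMonitor:
    """Hypothetical sketch: track recent probe results per path and
    flag paths whose loss rate stays high over a full window."""

    def __init__(self, window=20, loss_threshold=0.5):
        self.window = window                  # number of recent probes kept per path (assumed value)
        self.loss_threshold = loss_threshold  # loss fraction that counts as a failure (assumed value)
        self.results = defaultdict(lambda: deque(maxlen=window))

    def record(self, path, success):
        """Store the outcome of one end-to-end probe on a path."""
        self.results[path].append(bool(success))

    def loss_rate(self, path):
        """Fraction of recent probes on this path that failed."""
        history = self.results.get(path)
        if not history:
            return 0.0
        return history.count(False) / len(history)

    def paths_to_detour(self):
        """Paths with a full window of samples and loss at or above the threshold.

        Requiring a full window means a single lost probe is not enough;
        a pattern has to emerge before traffic is moved.
        """
        return sorted(
            path
            for path, history in self.results.items()
            if len(history) == self.window
            and self.loss_rate(path) >= self.loss_threshold
        )


monitor = ProbeMonitor(window=4, loss_threshold=0.5)
for ok in (True, True, True, True):
    monitor.record("rack1->rack2", ok)
for ok in (False, False, True, False):
    monitor.record("rack1->rack3", ok)
for ok in (False, False):
    monitor.record("rack1->rack4", ok)   # too few samples to judge yet

detour = monitor.paths_to_detour()
```

In this toy run only `rack1->rack3` is flagged: it has a full window of samples and lost three of four probes, while `rack1->rack4` has not yet produced enough data to act on. A production system would also have to localize the fault and push the routing change, which this sketch leaves out.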
The Facebook exec also provided an update on OCP, something he says has picked up considerable momentum in the past year, and in particular called out the use of OCP hardware by the CORD project of ONOS, and two use cases from telecom operators -- one from AT&T Inc. (NYSE: T) and another from SK Telecom (Nasdaq: SKM).
— Carol Wilson, Editor-at-Large, Light Reading