Facebook: Operations Comes First

SANTA CLARA, Calif. -- Open Networking Summit -- Normally, it's telecom network operators who talk about operations challenges, but in his keynote address at the Open Networking Summit, Facebook's Omar Baldonado, director of engineering, said today that focusing on operations over features is his "guiding principle" in running the social media company's giant network.

"We have to prioritize operations over features, and that applies to any technology we build, any technology we buy and any technology we partner with," Baldonado said here today.

Over the last three years, Facebook has been rolling out its Wedge top-of-rack switch and its Facebook open switching system software (FBOSS) -- both of which it has contributed to the Open Compute Project. Early sales of that combo started slow but are now ramping very quickly, according to a graph he showed the ONS crowd -- albeit one without specific numbers.

In reality, there has been little change in the Wedge control plane from the left side of the chart when sales were slow to the right side, Baldonado said. "The amount of control plane differentiation was finished pretty early -- what we have been doing since is making sure all the operations were smooth, making sure we are getting all the things we want in terms of better debugging, monitoring and trouble-shooting tools."

Eventually, Facebook had a platform on which it could innovate, he noted, and smoother operations were a big part of both that and the rapid uptick in Wedge/FBOSS deployments.

Of course, there are still some stark contrasts between Facebook and your typical telecom network operator, as Baldonado made clear when he explained how his company has gone from twice-yearly software upgrades to weekly upgrades. In the process, upgrades have gone from being major scheduled events, with all hands on deck, so to speak, to regular occurrences.

"The reason we upgrade every week is to reduce the amount of changes that are introduced," he said. "If you don't upgrade for a year and you are getting a major new version with tens of thousands of changes, all those 10,000 features interact in funny ways and you don't know which one broke."

Upgrading every week requires a great deal of software and a high degree of automation, but "the benefit is that you have reduced the amount of stuff you have to debug," he said. "I hope this is a practice more and more of the industry does."
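The arithmetic behind that argument can be sketched roughly. The snippet below is only an illustration, not anything Baldonado presented: it takes the 10,000 figure from his quote, assumes (hypothetically) that those changes are spread evenly across 52 weekly releases, and counts how many pairs of changes could interact in each case.

```python
from math import comb


def pairwise_interactions(num_changes: int) -> int:
    """Count the distinct pairs of changes that could interact with each other."""
    return comb(num_changes, 2)


# Assumption: 10,000 changes in a yearly release (the figure in the quote),
# spread evenly across 52 weekly releases. Real release sizes will vary.
yearly_changes = 10_000
weekly_changes = yearly_changes // 52  # 192 changes per weekly release

print(pairwise_interactions(yearly_changes))  # 49995000
print(pairwise_interactions(weekly_changes))  # 18336
```

Under those assumptions, a weekly release has several thousand times fewer potential pairwise interactions to investigate than a yearly one, which is the sense in which frequent upgrades reduce "the amount of stuff you have to debug."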

Omar Baldonado, Facebook's director of engineering

And then there's the notion of fast-fail, another aspirational practice for many telecom operators that is part of Facebook's basic operations. "We choose fail fast over fail-proof because making a 100% fail-proof network is a tall order," Baldonado said. "What we have focused on is finding the problems quickly and automatically and correcting them through software."

So instead of trying to find the perfect set of components on which to build a system, Facebook tries to build a system that detects failures and remediates them as soon as possible. Part of that is based on an end-to-end probing system that, once a pattern emerges showing a problem exists, can isolate the problem and, at minimum, detour traffic away from it, he noted. Facebook has open-sourced its code for high-speed pinging, which is designed not to create massive traffic overhead.
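The general idea of isolating a fault from end-to-end probe results and then steering traffic around it can be sketched as follows. This is a minimal illustration under stated assumptions, not Facebook's open-sourced code: the function names, the link labels and the rule "a link is suspect if it appears on at least two failed probe paths and on no successful one" are all hypothetical.

```python
from collections import Counter


def suspect_links(probes, min_failures=2):
    """Return links seen on at least `min_failures` failed paths and on no healthy path.

    `probes` is a list of (path, ok) pairs, where `path` is a list of link names
    and `ok` says whether the end-to-end probe succeeded. (Hypothetical format.)
    """
    healthy = set()
    failed_counts = Counter()
    for path, ok in probes:
        if ok:
            healthy.update(path)             # every link on a good path is cleared
        else:
            failed_counts.update(set(path))  # count each link once per failed probe
    return sorted(
        link for link, n in failed_counts.items()
        if n >= min_failures and link not in healthy
    )


def detour(candidate_paths, bad_links):
    """Keep only the candidate paths that avoid every suspect link."""
    bad = set(bad_links)
    return [p for p in candidate_paths if bad.isdisjoint(p)]


# Hypothetical probe results between racks (r*) via two fabric switches (fa, fb).
probes = [
    (["r1-fa", "fa-r2"], True),
    (["r1-fb", "fb-r2"], False),
    (["r3-fb", "fb-r2"], False),
    (["r3-fa", "fa-r2"], True),
    (["r1-fb", "fb-r4"], True),
]

bad = suspect_links(probes)
print(bad)  # ['fb-r2']
print(detour([["r1-fa", "fa-r2"], ["r1-fb", "fb-r2"]], bad))  # [['r1-fa', 'fa-r2']]
```

Requiring more than one failed probe before flagging a link reflects the "once a pattern emerges" point: a single lost probe is treated as noise, while repeated failures that share a link trigger the detour.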

The Facebook exec also provided an update on OCP, something he says has picked up considerable momentum in the past year, and in particular called out the use of OCP hardware by the CORD project of ONOS, and two use cases from telecom operators -- one from AT&T Inc. (NYSE: T) and another from SK Telecom (Nasdaq: SKM).

— Carol Wilson, Editor-at-Large, Light Reading

kq4ym 3/27/2016 | 4:10:22 PM
Re: Processes and people I wonder if the once-a-week procedures might also have to do with being a company strongly managed on top by one guy who happens to have a tech background to begin with. A different management style of other companies might not be too comfortable with this style. But in the end operations should come first; without that, the features aren't going to work out well in a fast-changing and growing environment.
danielcawrey 3/20/2016 | 4:15:24 PM
Re: Processes and people One of the benefits Facebook has had is that it was able to set organizational philosophy. For larger companies, this isn't as easy to do. It becomes more of a change management issue, whereas with Facebook it is simply standard operating procedure. Can it be done? Yes, but it takes operational execution, that's for sure. 
Joe Stanganelli 3/20/2016 | 3:22:19 PM
NFLX Netflix, too, adopts the "fail fast" philosophy/architecture for much the same reason -- and the company's engineers credit "fail fast" for Netflix being largely unimpacted by the Great AWS Crash a few years ago.
Gabriel Brown 3/17/2016 | 5:58:40 AM
Re: Processes and people They learnt a few tricks from the telcos -- leave the numbers off the chart. Telco-y to the core.
[email protected] 3/17/2016 | 5:36:10 AM
Processes and people The models are there and the evidence that this sort of process management works.... how can it get assimilated into divisions, teams and individuals in organizations that have never worked like this? And who is going to do the training?