ANAHEIM, Calif. -- OSA Executive Forum -- It's almost a cliche in the days of software-defined networking (SDN), but Facebook does indeed say the network is the biggest problem it's facing.
And that's on top of some other big challenges that Najam Ahmad, Facebook's director of network engineering, described during a keynote at the OSA Executive Forum Monday afternoon.
One big problem, for instance, is millisecond-long "flash floods" of in-cast traffic, where many servers communicate with one server at the same time. Network-monitoring tools report aggregate activity over one second or more and are too slow to catch these incidents, Ahmad said.
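To see why a one-second counter misses such a burst, here is a minimal sketch in Python. The link speed, baseline load and burst size are all made-up numbers chosen for illustration, not figures from Ahmad's talk, and `peak_utilization` is a hypothetical helper rather than part of any monitoring tool.

```python
def peak_utilization(bytes_per_ms, link_bytes_per_ms, window_ms):
    """Average offered load over fixed windows and return the highest
    window average, as a fraction of link capacity."""
    peaks = []
    for start in range(0, len(bytes_per_ms), window_ms):
        window = bytes_per_ms[start:start + window_ms]
        peaks.append(sum(window) / (len(window) * link_bytes_per_ms))
    return max(peaks)


# Assumption: a 10 Gbit/s link, which carries 1,250,000 bytes per millisecond.
LINK_BYTES_PER_MS = 1_250_000

# Assumption: one second of traffic at a steady 10% load...
traffic = [125_000] * 1000
# ...with a 5 ms in-cast burst offering twice what the link can carry.
for ms in range(500, 505):
    traffic[ms] = 2_500_000

coarse = peak_utilization(traffic, LINK_BYTES_PER_MS, window_ms=1000)
fine = peak_utilization(traffic, LINK_BYTES_PER_MS, window_ms=1)

print(f"1-second view: {coarse:.1%} of line rate")
print(f"1-millisecond view: {fine:.0%} of line rate")
```

Under these assumptions the one-second average sits at roughly 11% of line rate, while the millisecond view shows the link oversubscribed two to one for the length of the burst.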
Facebook also has some more commonly heard complaints, such as the fact that every network element needs its own configuration, or the inability to easily change the permutation of servers and storage in a given group of racks.
But the bigger problem is the inability to control how the network works. "One of the things that we want to change is the black box nature of the network," he said.
"The only interaction that the application has with the network is that it puts the packet on the NIC, and it hopes it gets to the other end."
Najam Ahmad wanted to point out one problem in particular.
Put it this way: Facebook conducts a lot of A/B testing in software, trying out different things on fractions of its user population. The company would like to do the same thing with the network.
Ahmad didn't have any immediate suggestions as to how to fix that, but he did present some of Facebook's in-house projects, saying the company would like some help from the optical industry -- probably a nice thing for the vendors in the audience to hear.
The company has tinkered with ways to get more bandwidth out of each cluster of machines in a data center. Immense amounts of traffic come out of these clusters, headed either for other clusters (not necessarily in the same physical building) or for the Internet.
So, it organized its gear in a folded Clos architecture. That did the trick, but it requires lots of cables to connect everything together. What Facebook wants from optical vendors are alternatives that would simplify that spider's web of cables.
There are a lot of ways to look at the Clos fabric, but it doesn't get any simpler.
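A rough way to see where the cables come from: in the simplest folded Clos, a two-tier leaf-spine fabric, every leaf switch connects to every spine switch, so the cable count is the product of the two. The sketch below assumes that two-tier layout, and the switch counts are hypothetical. Ahmad did not give the dimensions of Facebook's fabric.

```python
def leaf_spine_cables(leaves, spines, links_per_pair=1):
    """Cables between tiers in a two-tier folded Clos (leaf-spine) fabric,
    where each leaf has links_per_pair links to each spine.
    Server-to-leaf cabling is not counted."""
    return leaves * spines * links_per_pair


# Hypothetical cluster sizes, for illustration only.
small = leaf_spine_cables(leaves=48, spines=4)
large = leaf_spine_cables(leaves=96, spines=16)

print(small)  # 48 * 4 = 192 cables
print(large)  # 96 * 16 = 1536 cables
```

Doubling the leaves while quadrupling the spines multiplies the inter-switch cabling by eight, which is the kind of growth that makes simpler optical alternatives attractive.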
Facebook is also considering replacing servers with racks of components: storage and processing taken out of the server and placed into the racks independently as needed.
"Disaggregation of computing" is how Ahmad put it.
The idea is to stop throwing away server pieces that are still useful. The servers themselves become obsolete, for Facebook's uses, after less than two years. But pieces like the network interface cards and the flash memory would still be useful.
What's needed from the optical networking side is a way to connect everything in that rack. A rackwide fabric, maybe using optical interconnects made of silicon photonics, could be the answer, Ahmad said.
— Craig Matsumoto, Managing Editor, Light Reading