Whenever a serious computer or network security issue becomes public, one of the first questions IT professionals ask is, "What lessons can we learn?" It's a polite way of phrasing the real question: "How do I keep my company out of the news for something like this?" But what's a professional to do when the best answer might well be that nothing any reasonable company could have done would have stopped the problem? That's a very different question.
2017 has seen a spate of news generated by errant keystrokes, from the "Cloudbleed" vulnerability that exposed millions of pieces of personally identifiable information to the AWS outage that brought large portions of the Internet to its knees. Finding a single keystroke going awry makes the classic "needle in a haystack" analogy insufficient. Finding a needle in a thousand-acre field of haystacks might be more like it -- and that's something that may simply go beyond reasonable.
Bill Curtis is someone who seems well suited to answer questions involving command and software quality -- especially software quality. A founder of the Consortium for IT Software Quality (CISQ), Curtis was the leader of the project that created the Capability Maturity Model (CMM) for both software and people. A long-time university professor, Curtis is now senior vice president and chief scientist at CAST and remains a member of the CISQ board of directors.
In a telephone interview with Light Reading, Curtis was reluctant to criticize the software developers at Cloudflare for the incident that became known as Cloudbleed. "There are things that are humanly possible in terms of testing and detection, and then there are things that are just so far out there, they can happen and it's a tragedy when they do, but it's hard to say that they were negligent in their work because it really would have taken some bizarre thinking into the conditions that could occur," he said.
Curtis said that part of the problem of finding the vulnerability is that it did not, in all likelihood, involve a programming mistake. Instead, it was the result of using a parser built on Ragel (not developed in-house by Cloudflare Inc.) under a very specific set of circumstances. Within those circumstances, a buffer overflow could occur, and personal information could be released.
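To picture how a narrow set of circumstances can turn harmless-looking code into an overflow, consider the toy Python model below. It is an illustration only, not Cloudflare's parser and not Ragel output: the `scan` function, the sample pages and the "neighboring" bytes are all invented for this sketch. It assumes a scanner whose end-of-buffer check tests for equality, plus one rare input shape (a page that ends right after an `=`) that makes the scanner advance two positions at once.

```python
# Toy model of a context-dependent buffer over-read.
# Everything here is hypothetical; it does not reproduce any real parser.

def scan(page: bytes, neighbor: bytes, strict_equality: bool) -> bytes:
    """Walk a simulated memory region and return every byte the scanner read.

    `page` is the buffer the scanner is supposed to stay inside.
    `neighbor` stands in for unrelated data that happens to sit next to it.
    """
    memory = page + neighbor
    end = len(page)              # first position that is out of bounds
    pos = 0
    seen = bytearray()

    while pos < len(memory):     # the simulation itself stops at the end of memory
        if strict_equality:
            # Fragile check: only fires if pos lands exactly on `end`.
            if pos == end:
                break
        else:
            # Defensive check: fires for any position at or past `end`.
            if pos >= end:
                break

        seen.append(memory[pos])

        # Rare context: after '=', the scanner assumes a quote character
        # follows and skips over it, advancing two positions instead of one.
        pos += 2 if memory[pos:pos + 1] == b"=" else 1

    return bytes(seen)


ordinary_page = b"<p>ok</p>"
truncated_page = b"<p>hello</p><img src="   # ends mid-attribute
other_data = b"SECRET-COOKIE"

if __name__ == "__main__":
    for page in (ordinary_page, truncated_page):
        for strict in (True, False):
            print(strict, scan(page, other_data, strict))
```

On the ordinary page, both versions of the check behave identically, so routine testing would show nothing wrong. Only the truncated page, combined with the equality check, lets the position jump over the boundary and keep reading into the neighboring bytes. That is the kind of interaction between code and context that is hard to flag without drowning in false positives.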
The buffer overflow was, according to Curtis, part of what made early detection of a problem so difficult. "Here's the thing about buffer overflows; we don't really do a lot of analysis on buffer overflows and the reason is that there's a zillion false positives -- it just creates havoc," Curtis said. "Some of our competitors go after buffer overflows and they get flooded with false positives."
"For most of these buffer overflows it's really the context that makes that code cause an overflow. And you've got to understand the context, which is not easy. That's a whole 'nother level of analysis and if you read the piece that Cloudflare wrote they listed all the conditions that had to occur," Curtis explained.
"That's a nightmare to go find through static analysis, or even if you're a smart guy," Curtis said, pointing out that there is no reasonable testing regimen that can be expected to find all the issues in complex, modern software systems. "That's the problem we have in software; the incredible complexity that we've gotten into now and the difficulty of detecting these [issues]," Curtis said.
He pointed to a software quality regimen that found an extraordinary number of issues, but went beyond the effort that most organizations can afford -- the detection and testing regimen for the avionics systems on the Space Shuttle. "These guys were at a point where the defects they were detecting were all over ten years old in the code. They weren't generating new defects," Curtis said. "And their analysis, detection, and testing were so thorough, in fact two-thirds of all their effort was in testing."
The professionals in the software development group on Space Shuttle avionics spent much of their time coming up with bizarre scenarios involving anomalies that no one had ever seen, but that were not impossible according to the laws of physics. Commercial developers would have to go into the same sort of imagination exercise to find interactions like the one that led to Cloudbleed. "You'd really have to be thinking, 'What really isn't probably going to happen but possibly could?' If all these different conditions occurred, you'd say that there were all these bizarre little things that had to happen in order for buffer overflow to occur," Curtis explained.
Curtis thinks that the best prospect for avoiding Cloudbleed-like future problems may lie with the computers themselves. "For these things that are context dependent and very tricky, I'm hoping that we can apply machine learning techniques, that maybe the machine learning can go out and begin to understand some of these bizarre contexts and find some of the things that might have been innocent but go on to create some serious problems," he said.
Until machine learning becomes the norm, Curtis believes that rapid response to revealed issues is the practical model for the future, especially since there's no blame to be placed on the development program at Cloudflare. "If this could have occurred frequently then, yeah, they screwed up. But if they couldn't have anticipated the complex set of circumstances required for it to occur then they weren't negligent," he said. "You know, we get a lot of this in complex systems, where people just couldn't have imagined all the interactions that led to the problem. And that's something we're going to live with more and more as these systems get more complex and we have different pieces coming from different vendors."
— Curtis Franklin, Security Editor, Light Reading