A startup called Gremlin, founded by engineers from Netflix, Google, Amazon and other web-scale companies, is looking to help enterprises improve cloud applications' reliability by using "chaos engineering" to build up the system's defenses.
The system takes out components of an Internet application -- for example, individual servers or connections -- on a controlled basis, to test whether the system recovers gracefully. These planned outages help engineers build system resiliency in the face of real, unplanned outages and damage, Kolton Andrus, Gremlin CEO and co-founder, tells Enterprise Cloud News.
Gremlin launched out of stealth and made its service generally available Tuesday, with $8.75 million in funding from Amplify Partners and Index Ventures. Customers include Twilio and Expedia, Andrus says.
Netflix Inc. (Nasdaq: NFLX) is generally credited with developing chaos engineering, starting with a tool it called the "chaos monkey." As described on the Netflix Technology Blog in 2011, chaos monkey is "a tool that randomly disables our production instances to make sure we can survive this type of failure without any customer impact." The tool works as if Netflix were "unleashing a wild monkey" in its data center, breaking things. The goal is to test component failures to be sure they don't bring down the entire service.
"By running Chaos Monkey in the middle of a business day, in a carefully monitored environment with engineers standing by to address any problems, we can still learn the lessons about the weaknesses of our system, and build automatic recovery mechanisms to deal with them," according to the 2011 blog post. "So next time an instance fails at 3 am on a Sunday, we won't even notice."
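The idea described in the blog post can be illustrated with a toy simulation. The sketch below is not Netflix's code; the `Instance`, `Service` and `unleash_monkey` names are hypothetical and exist only for illustration. It assumes a service that stays available as long as at least one instance is healthy, then disables one instance at random and checks that the service survives.

```python
import random


class Instance:
    """Hypothetical stand-in for a single production instance."""

    def __init__(self, name):
        self.name = name
        self.healthy = True


class Service:
    """Toy service that stays up while at least one instance is healthy."""

    def __init__(self, instance_names):
        self.instances = [Instance(n) for n in instance_names]

    def is_available(self):
        return any(i.healthy for i in self.instances)


def unleash_monkey(service, rng):
    """Disable one randomly chosen healthy instance; return it, or None."""
    healthy = [i for i in service.instances if i.healthy]
    if not healthy:
        return None
    victim = rng.choice(healthy)
    victim.healthy = False
    return victim


rng = random.Random(42)  # seeded so a run can be repeated
service = Service(["web-1", "web-2", "web-3"])
victim = unleash_monkey(service, rng)
print(f"disabled {victim.name}; service available: {service.is_available()}")
```

With three redundant instances, losing one leaves the toy service available. With a single instance, the same experiment would expose the weakness immediately, which is the point of running it during the day.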
Netflix developed an entire suite of tools, which it called the "Simian Army," to test for failures such as high latency, to find and shut down instances that don't conform to best practices, and to check for instance health problems and security violations.
Andrus says Amazon.com Inc. (Nasdaq: AMZN), where he worked at the time, was doing the same sort of work at about the same time as Netflix. Prior to that, "a lot of what we were doing was reactive," Andrus says. "It was whack-a-mole. We were getting paged at night. We wanted to be proactive."
Andrus later joined the Netflix team to continue working on failure testing and chaos engineering.
Now, with startup Gremlin, Andrus and his team of 15 are looking to bring chaos engineering to enterprises and other cloud application developers.
The problem is that cloud applications have made reliability more difficult, Andrus says. In the world of monolithic data center applications, many problems could be solved with redundancy. Now, cloud applications require myriad microservices, relying on third parties for infrastructure.
"It's very difficult for an engineer to be able to hold all that in their head, to be able to understand what might go wrong," Andrus says.
Chaos engineering is like a flu shot or vaccine, Andrus says. "It sounds counter-intuitive, but injecting a little harm helps us understand how the system behaves, and helps us build up our defense against the damage."
The service relies on three key principles: safety, security and simplicity. For safety, every change can be rolled back -- a "built-in undo button," Andrus says. If a change causes a system-wide failure, the experiments can be halted and the system reverted to a steady state. Gremlin also limits the "blast radius" of a change -- the amount of damage it can potentially do.
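The safety ideas above, a limited blast radius plus a guaranteed undo, can be sketched in a few lines. This is not Gremlin's API; the `experiment` function, the `hosts` dictionary and the state strings are assumptions made for illustration. The sketch caps how many hosts a fault may touch and restores them when the experiment ends, even if it is halted partway through by an error.

```python
import random
from contextlib import contextmanager


@contextmanager
def experiment(hosts, blast_radius, rng):
    """Degrade at most `blast_radius` (a fraction) of hosts, then always undo.

    `hosts` maps a host name to a state string ("ok" or "degraded").
    """
    limit = int(len(hosts) * blast_radius)  # cap on affected hosts
    targets = rng.sample(sorted(hosts), limit)
    for name in targets:
        hosts[name] = "degraded"
    try:
        yield targets
    finally:
        # The "undo button": runs on normal exit and when the run is aborted.
        for name in targets:
            hosts[name] = "ok"


hosts = {f"host-{n}": "ok" for n in range(10)}
with experiment(hosts, 0.2, random.Random(7)) as targets:
    degraded = [h for h, state in hosts.items() if state == "degraded"]
    print(f"{len(degraded)} of {len(hosts)} hosts degraded")
print("all restored:", all(state == "ok" for state in hosts.values()))
```

Putting the rollback in a `finally` clause is the design choice that matters here: the system returns to its steady state whether the experiment finishes or is stopped early.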
For security, Gremlin only communicates over SSL, and supports precautions such as permission controls, single sign-on, and role-based access controls.
And for simplicity, Gremlin uses intuitive user interfaces to walk people through running experiments, reporting and controlling tests. The service includes an API to integrate with third-party software, as well as a command line interface for advanced users, Andrus says.
Gremlin tests for a variety of types of failures: CPU failures, disk and memory overconsumption, virtual machine failures, container failures, failures to synchronize clocks, network problems such as failures to resolve DNS, AWS S3 failures, and more.
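To make one of these failure types concrete, the sketch below simulates a failure to resolve DNS and checks that a caller degrades gracefully. It does not touch real DNS or any Gremlin interface; `make_resolver`, `lookup_with_cache`, `ResolutionError`, the hostname and the address are all hypothetical. The assumption is that the application keeps a cache of the last known address and falls back to it when resolution fails.

```python
class ResolutionError(Exception):
    """Raised by the fake resolver when a lookup fails."""


def make_resolver(records, fail=False):
    """Return a fake resolver; with fail=True every lookup raises."""

    def resolve(hostname):
        if fail:
            raise ResolutionError(f"cannot resolve {hostname}")
        return records[hostname]

    return resolve


def lookup_with_cache(resolve, cache, hostname):
    """Resolve a hostname, falling back to the last known address."""
    try:
        address = resolve(hostname)
    except ResolutionError:
        if hostname in cache:
            return cache[hostname]
        raise  # nothing cached, so the failure surfaces to the caller
    cache[hostname] = address
    return address


records = {"api.example.internal": "192.0.2.10"}
cache = {}

# Normal operation warms the cache.
lookup_with_cache(make_resolver(records), cache, "api.example.internal")

# Injected fault: resolution now fails, and the cached address is used.
addr = lookup_with_cache(
    make_resolver(records, fail=True), cache, "api.example.internal"
)
print(addr)  # prints 192.0.2.10
```

Running the same lookup with an empty cache raises `ResolutionError`, which is the kind of gap such an experiment is meant to surface before a real outage does.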
"It's a bit like a fire drill," Andrus says. "You want to test these things properly, you want to give people an opportunity to practice it, during the day, when their caffeine has kicked in." That way, when the real failure comes in the middle of the night, IT will be ready.
— Mitch Wagner Editor, Enterprise Cloud News