
Gremlin Looks to Bring 'Chaos Engineering' to the Masses

Mitch Wagner

A startup called Gremlin, founded by engineers from Netflix, Google, Amazon and other web-scale companies, is looking to help enterprises improve cloud applications' reliability by using "chaos engineering" to build up the system's defenses.

The system takes out components of an Internet application -- for example, individual servers or connections -- on a controlled basis, to test whether the system recovers gracefully. These planned outages help engineers build system resiliency in the face of real, unplanned outages and damage, Kolton Andrus, Gremlin CEO and co-founder, tells Enterprise Cloud News.

Gremlin launched out of stealth and made its service generally available Tuesday, with $8.75 million in funding from Amplify Partners and Index Ventures. Customers include Twilio and Expedia, Andrus says.

Netflix Inc. (Nasdaq: NFLX) is generally credited with developing chaos engineering, starting with a tool it called the "chaos monkey." As described on the Netflix Technology Blog in 2011, chaos monkey is "a tool that randomly disables our production instances to make sure we can survive this type of failure without any customer impact." The tool works as if Netflix were "unleashing a wild monkey" in its data center, breaking things. The goal is to test component failures to be sure they don't bring down the entire service.

"By running Chaos Monkey in the middle of a business day, in a carefully monitored environment with engineers standing by to address any problems, we can still learn the lessons about the weaknesses of our system, and build automatic recovery mechanisms to deal with them," according to the 2011 blog post. "So next time an instance fails at 3 am on a Sunday, we won't even notice."
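The idea of randomly disabling an instance and checking that customers still get served can be illustrated with a toy simulation. This is a minimal sketch, not Netflix's Chaos Monkey code: the `Instance`, `Service` and `unleash_monkey` names are hypothetical, and the "instances" are plain in-memory objects rather than real servers.

```python
import random


class Instance:
    """A stand-in for one server; real instances would be VMs or containers."""

    def __init__(self, name):
        self.name = name
        self.healthy = True

    def handle(self, request):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} served {request}"


class Service:
    """Routes each request to the first healthy instance it finds."""

    def __init__(self, instances):
        self.instances = instances

    def handle(self, request):
        for instance in self.instances:
            try:
                return instance.handle(request)
            except ConnectionError:
                continue  # fail over to the next instance
        raise RuntimeError("no healthy instances left")


def unleash_monkey(service, rng):
    """Disable one randomly chosen healthy instance and return it."""
    healthy = [i for i in service.instances if i.healthy]
    victim = rng.choice(healthy)
    victim.healthy = False
    return victim


rng = random.Random(2011)  # seeded so the run is repeatable
service = Service([Instance(f"web-{n}") for n in range(3)])
victim = unleash_monkey(service, rng)
response = service.handle("GET /")  # should succeed via failover
print(f"disabled {victim.name}; {response}")
```

If the request still succeeds after the random failure, the service has survived the experiment. If it raises an error, the experiment has exposed a weakness while engineers are on hand to fix it.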

Netflix developed an entire suite of tools, which it called the "Simian Army," to test for failures such as high latency, to find and shut down instances that don't conform to best practices, and to check for instance health problems and security violations.

Early aviators blamed accidents on mischievous sprites they called 'gremlins.' The stories gained popularity during World War II. Kolton Andrus, CEO and co-founder of the startup named for the creatures, says the stories and artwork are popular around his company.

Andrus says Amazon.com Inc. (Nasdaq: AMZN), where he worked at the time, was doing the same sort of work at about the same time as Netflix. Prior to that, "a lot of what we were doing was reactive," Andrus says. "It was whack-a-mole. We were getting paged at night. We wanted to be proactive."

Andrus later joined the Netflix team to continue working on failure testing and chaos engineering.

Now, with startup Gremlin, Andrus and his team of 15 are looking to bring chaos engineering to enterprises and other cloud application developers.

The problem is that cloud applications have made reliability more difficult, Andrus says. In the world of monolithic data center applications, many problems could be solved with redundancy. Now, cloud applications require myriad microservices, relying on third parties for infrastructure.

"It's very difficult for an engineer to be able to hold all that in their head, to be able to understand what might go wrong," Andrus says.

Chaos engineering is like a flu shot or vaccine, Andrus says. "It sounds counter-intuitive, but injecting a little harm helps us understand how the system behaves, and helps us build up our defense against the damage."

Gremlin supports containers, and is cloud-agnostic, working with Amazon Web Services Inc., Microsoft Azure, Google Cloud Platform and bare metal servers in the data center.

The service relies on three key principles: safety, security and simplicity. For safety, every change can be rolled back -- a "built-in undo button," Andrus says. If a change causes a system-wide failure, the experiments can be halted and the system reverted to a steady state. Gremlin also limits the "blast radius" of a change -- the amount of damage it can potentially do.
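The two safety ideas described here, a capped blast radius and an undo that always runs, can be sketched in a few lines. This is an illustration under stated assumptions, not Gremlin's API: the `Experiment` class, the `run_with_guard` helper and the `blast_radius` parameter are hypothetical names, and hosts are modeled as entries in a dictionary.

```python
import random


class Experiment:
    """Hypothetical experiment: fail some hosts, then undo on demand."""

    def __init__(self, hosts, blast_radius, rng):
        if not 0 < blast_radius <= 1:
            raise ValueError("blast_radius must be in (0, 1]")
        self.hosts = hosts  # dict: host name -> "up" or "down"
        self.blast_radius = blast_radius
        self.rng = rng
        self.impacted = []

    def start(self):
        # Cap the number of impacted hosts at the configured fraction
        # (rounded down, but always at least one host).
        limit = max(1, int(len(self.hosts) * self.blast_radius))
        self.impacted = self.rng.sample(sorted(self.hosts), limit)
        for name in self.impacted:
            self.hosts[name] = "down"

    def halt(self):
        # The "undo button": restore every host this experiment touched.
        for name in self.impacted:
            self.hosts[name] = "up"
        self.impacted = []


def run_with_guard(experiment, is_healthy):
    """Run the experiment and always revert it, even if the check raises."""
    experiment.start()
    try:
        return is_healthy(experiment.hosts)
    finally:
        experiment.halt()


hosts = {f"host-{n}": "up" for n in range(10)}
experiment = Experiment(hosts, blast_radius=0.2, rng=random.Random(0))
survived = run_with_guard(
    experiment,
    is_healthy=lambda h: sum(state == "up" for state in h.values()) >= 8,
)
print(survived, hosts)
```

Putting the revert in a `finally` clause means the system returns to its steady state whether the health check passes, fails or crashes, which is the property the "undo button" is meant to guarantee.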

For security, Gremlin only communicates over SSL, and supports precautions such as permission controls, single sign-on, and role-based access controls.

And for simplicity, Gremlin uses intuitive user interfaces to walk people through running experiments, reporting and controlling tests. The service includes an API to integrate with third-party software, as well as a command line interface for advanced users, Andrus says.

Gremlin tests for a variety of types of failures: CPU failures, disk and memory overconsumption, virtual machine failures, container failures, failures to synchronize clocks, network problems such as failures to resolve DNS, AWS S3 failures, and more.

"It's a bit like a fire drill," Andrus says. "You want to test these things properly, you want to give people an opportunity to practice it, during the day, when their caffeine has kicked in." That way, when the real failure comes in the middle of the night, IT will be ready.


Mitch Wagner, Editor, Enterprise Cloud News
