Sometimes, the only way to identify defects in application infrastructure is to push applications past their limits... strategically when demand is low. Developers and testers have embraced chaos engineering, a form of testing that intentionally shuts down servers, turns off app services or changes server clocks.
Chaos engineering helps ensure apps will hum along when things break in real life or during peak demand. With chaos engineering, cutting-edge enterprises mitigate harm from large-scale disruptions, such as the Amazon Simple Storage Service (S3) outage, as well as from denial-of-service (DoS) attacks and busy holiday sales.
"[Chaos engineering is] by far the most efficient path to achieving resilience, engineering excellence and developing the skills necessary for today's microservice-based systems," Bruce Wong, director of engineering at online shopping service Stitch Fix, said. "Chaos engineering gives [teams] the confidence to be on call and build super-resilient systems that rarely fail. There's a dynamic shift that happens: Engineers start saying 'when' instead of 'if' a component fails."
Both Amazon and Netflix pioneered chaos engineering for their respective application development and testing programs. Netflix even made a variety of internal tools open source. These include Chaos Monkey and Chaos Kong, which are meant to ensure resilience to regional failure, as well as Chaos Automation Platform and Failure Injection Testing, which work at the microservice level.
However, these tools can be complicated to implement effectively. Chaos engineering can adversely affect live business apps, create security vulnerabilities or instigate extensive cascading failures if it is implemented poorly.
Failure as a service
Gremlin offers a failure-as-a-service tool to make chaos engineering easier to deploy. It has built-in safeguards for breaking infrastructure responsibly. The tool is used by IT shops at companies such as Expedia, Twilio and Confluent.
Gremlin was founded by CEO Kolton Andrus and CTO Matthew Fornaciari. The two previously worked on custom chaos engineering tools for Amazon, Netflix and Salesforce.
Chaos engineering is not just about breaking things randomly. "You want to do thoughtful, planned experiments that teach you about your system," Andrus said. "If you break things on purpose, it is an opportunity not just to build better systems, but also to train people."
Gremlin sets up controlled chaos experiences for enterprises that want to dip their toes in the water. Much like a hackathon, these GameDays let developers, testers and ops teams experiment with introducing different kinds of chaos into systems. Teams assess applications' performance and monitor environments, gaining awareness of different kinds of failures. For example, an S3 outage might first show up on the team's dashboard as an app failure, without any indication of the root cause.
Different kinds of failure
Gremlin's service scales up each failure mode gradually, with the option to revert the experiment if things go wrong. The approach works well for failure modes that are easy to undo, such as CPU performance throttling, packet delays or clock adjustments. Other failure modes -- for example, host reboots -- can be more problematic to undo and should be implemented with greater caution.
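The idea of scaling a failure up gradually and reverting when things go wrong can be illustrated with a minimal Python sketch. It is not Gremlin's implementation or API; `LatencyFault`, `run_ramped_experiment` and the 250 ms health threshold are hypothetical names and values chosen for illustration, and the "service" is simulated.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class LatencyFault:
    """Hypothetical reversible fault: adds artificial delay to a service."""
    delay_ms: int = 0

    def apply(self, delay_ms: int) -> None:
        self.delay_ms = delay_ms

    def revert(self) -> None:
        self.delay_ms = 0


def run_ramped_experiment(
    fault: LatencyFault,
    steps_ms: List[int],
    is_healthy: Callable[[], bool],
) -> List[int]:
    """Raise the fault intensity step by step and stop at the first
    unhealthy check. Returns the intensities the system tolerated."""
    tolerated: List[int] = []
    try:
        for delay in steps_ms:
            fault.apply(delay)
            if not is_healthy():
                break  # stop escalating as soon as the system degrades
            tolerated.append(delay)
    finally:
        fault.revert()  # always undo the fault, even if a check raises
    return tolerated


# Simulated service (assumption): healthy while total latency is under 250 ms.
BASE_LATENCY_MS = 100
fault = LatencyFault()
tolerated = run_ramped_experiment(
    fault,
    steps_ms=[50, 100, 200, 400],
    is_healthy=lambda: BASE_LATENCY_MS + fault.delay_ms < 250,
)
print("tolerated delays (ms):", tolerated)
print("fault still active:", fault.delay_ms != 0)
```

Putting the revert in a `finally` block is the design choice that matters here: the fault is removed whether the experiment finishes, aborts early or crashes. A host reboot has no equivalent `revert`, which is why such failure modes call for more caution.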
Nora Jones, senior software engineer, Netflix
The chaos engineering tool's approach of breaking pieces in a complex system can help to identify components of the application stack that can cause cascading failures across enterprise systems. The three main kinds of failure modes are:
- Resource outages or slowdowns in CPU, storage and memory;
- Host containers that are rebooted or clocks that change; and
- Network issues, such as when connections slow, domain name system services go down or an enterprise experiences a DoS.
Even surprisingly simple failure modes can cause large-scale problems. For instance, an application works fine until a server clock drifts or a daylight saving time adjustment rolls across the world and affects clocks differently. Small changes could lead to event scheduling conflicts that cascade out into other app services.
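A small Python sketch shows how a skewed clock could produce such a scheduling conflict. The scenario is invented for illustration: `HostClock`, `is_due`, the two jobs and the 90-second skew are all assumptions, not details from any real system.

```python
from dataclasses import dataclass


@dataclass
class HostClock:
    """Hypothetical host clock: true time plus a configurable skew."""
    skew_s: float = 0.0

    def now(self, true_time_s: float) -> float:
        return true_time_s + self.skew_s


def is_due(due_at_s: float, clock: HostClock, true_time_s: float) -> bool:
    """A host runs a job once its own clock reaches the job's due time."""
    return clock.now(true_time_s) >= due_at_s


# Assumed schedule: the export must finish before the cleanup deletes its input.
EXPORT_DUE_S = 100.0
CLEANUP_DUE_S = 160.0

export_host = HostClock(skew_s=0.0)    # accurate clock
cleanup_host = HostClock(skew_s=90.0)  # clock runs 90 seconds fast

true_time_s = 80.0
export_started = is_due(EXPORT_DUE_S, export_host, true_time_s)
cleanup_started = is_due(CLEANUP_DUE_S, cleanup_host, true_time_s)

# The cleanup fires before the export it depends on has even started.
conflict = cleanup_started and not export_started
print("export started:", export_started)
print("cleanup started:", cleanup_started)
print("ordering conflict:", conflict)
```

Each host behaves correctly by its own clock, so the bug stays invisible until the clocks disagree. Injecting the skew on purpose is one way to surface it before it happens in production.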
Use chaos to reveal problems
Unit testing is good for performing tests on known component behavior. Similarly, integration testing is good for tests on known connections between components. But chaos engineering is more about software engineers and QA personnel finding unknown problems through experimentation.
"Chaos engineering is not meant to replace unit and integration tests," Jones explained. "[The three] are meant to work together in harmony to give you the most availability possible in order to ensure that your customers have a great experience and your business stays up and running."
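To make the contrast with unit and integration tests concrete, here is a hedged Python sketch of a chaos-style experiment: it states a steady-state hypothesis (nearly all requests are served) and then checks whether it still holds while failures are injected into a dependency at random. The service names, the 30% failure rate and the helper functions are hypothetical.

```python
import random
from typing import Callable, List


class DependencyDown(Exception):
    """Raised when the (simulated) downstream dependency is unavailable."""


def recommendations(inject_failure: bool) -> List[str]:
    # Simulated dependency; the failure is injected by the experiment.
    if inject_failure:
        raise DependencyDown("recommendation service unavailable")
    return ["personalized picks"]


def homepage_with_fallback(inject_failure: bool) -> List[str]:
    try:
        return recommendations(inject_failure)
    except DependencyDown:
        return ["popular items"]  # degrade gracefully instead of failing


def homepage_without_fallback(inject_failure: bool) -> List[str]:
    return recommendations(inject_failure)


def success_rate(
    handler: Callable[[bool], List[str]],
    requests: int,
    failure_rate: float,
    seed: int = 0,
) -> float:
    """Send simulated requests, injecting dependency failures at random,
    and return the fraction that were served."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    served = 0
    for _ in range(requests):
        try:
            handler(rng.random() < failure_rate)
            served += 1
        except DependencyDown:
            pass  # the user saw an error
    return served / requests


resilient = success_rate(homepage_with_fallback, 1000, failure_rate=0.3)
fragile = success_rate(homepage_without_fallback, 1000, failure_rate=0.3)
print("with fallback:", resilient)
print("without fallback:", fragile)
```

A unit test would verify that `recommendations` returns the expected list for a known input. The experiment asks a different question: does the page as a whole keep serving users when that dependency misbehaves?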
According to Jones, chaos engineering experiments can take a lot of time to plan and set up -- especially when it comes to choosing good injection points for introducing failure and identifying good failure scenarios. The future lies in crafting algorithms that will make it easier to decide which chaos experiments to run and to perform critical experiments more frequently.
"Everyone can and should be doing this," Jones said. "Chaos doesn't cause problems; it reveals them."