Chaos engineers try to break things in live software environments, but they must exercise great control during the process. In fact, only disciplined QA teams should attempt this method at all.
While chaos engineering practices can lead to fewer surprise breakdowns in production, as well as foreknowledge of how to save the day when failures do occur, not every software organization is prepared to flip the switch. QA teams must master test environment control, fail-safe processes and behavior monitoring before they attempt chaos engineering practices.
Three experts in chaos engineering practices laid out prerequisites for adoption, ways to pitch this unconventional approach to management and often-overlooked chaos engineering methods.
Prerequisites of chaos engineering
Before it adds chaos engineering as a QA practice, an organization should have mature competencies in secure testing, as well as monitoring in place. It should also implement standard practices for test engineering, governance and management.
"A QA team has to build up muscle to be able to carry the weight of failure testing," said Vilas Veeraraghavan, director of engineering for Walmart Labs in Sunnyvale, Calif. "Few people can walk into a gym for the first time and lift 100 pounds."
To evaluate readiness for chaos engineering, an administrator should first assess how well the team handles failure recovery, Veeraraghavan said. The team, for example, should have playbooks that set out procedures for various types of recovery scenarios.
Next, evaluate the organization's analytics capabilities. These analytics should drill down to event-level detail and individual user actions, as well as provide behavioral insights and analysis, said Kolton Andrus, CEO of Gremlin, a failure test services vendor based in San Jose, Calif. Advanced monitoring, often referred to as observability, helps a QA team evaluate the consequences of failures.
Don't underestimate the analytics and behavior intelligence competence needed for failure testing, said Charity Majors, CEO of Honeycomb, a production debugging tool vendor, and co-author of the book Database Reliability Engineering. Without analytical insights into code and user behaviors, testers can't correct course or learn how to fix error-producing flaws.
Only QA teams with mature security processes can mitigate the risk in chaos engineering. "Running these experiments is akin to DDoS-ing [distributed denial-of-service] yourself," Andrus said. "If the wrong person can get hold of your chaos experiment methods or results, they could wreak havoc in your environment."
Pitch chaos engineering
When QA pitches chaos engineering to management, it should first establish evidence of the team's competency. It should also show statistics on the cost of failures to prove the benefit of aggressive approaches to prevent downtime.
Veeraraghavan recommends a different approach, which is to avoid the word chaos altogether. Instead, he focuses on the phrase IT resilience, and he prefers the term resilience engineering. "The word chaos doesn't inspire a lot of confidence for most senior management and executives," he said.
Ignorance is not bliss when it comes to risk of software and system failures. Yet, many business managers are not as wary of flaws in their production systems as they should be, Majors said. It's difficult for QA teams to identify the many catastrophic states that exist in complex modern systems. A pitch should present chaos engineering as a safe means to discover unknown failure points, she said.
Also, inform management that release engineering is a flaw-infested process. "Deployment is a common point of failure, yet businesses commonly underinvest in it," Majors said. Testing failure points can ensure that deployment processes don't introduce problems.
"Is the code you shipped doing what you thought it would? You don't know until you try chaos engineering for production software and systems," Majors said.
Chaos engineering best practices
Luckily, chaos engineering practices don't require a team to reengineer its existing software QA process. Instead, they complement existing methodologies, such as Agile and DevOps, as well as disaster recovery and incident and crisis management. Also, this style of testing extends test-driven development beyond deployment and provides a controlled environment for production testing.
Fight the temptation to initiate large-scale chaos engineering projects. To limit the harm a project could inflict, keep the scope of each fault injection experiment small. Before Majors started Honeycomb, she performed chaos engineering at Facebook and often limited an experiment to 5% to 10% of users in a geographical area. In a small sample, it was simple to automate rollbacks when errors occurred and to build security measures to protect the project.
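As a rough illustration of keeping an experiment small, the Python sketch below places a fixed percentage of users in a fault injection experiment and stops injecting faults once the observed error rate crosses a threshold. It is a minimal sketch under stated assumptions, not the tooling Majors or any vendor describes: the names `in_experiment` and `Experiment`, the salt string and the threshold values are all hypothetical.

```python
import hashlib


def in_experiment(user_id: str, percent: float, salt: str = "experiment-1") -> bool:
    """Deterministically place roughly `percent` percent of users in the experiment.

    Hashing the user ID (with a hypothetical per-experiment salt) keeps the
    same user in or out of the experiment on every request.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # bucket in 0..9999
    return bucket < percent * 100


class Experiment:
    """Hypothetical fault injection experiment with an automatic abort."""

    def __init__(self, percent: float, max_error_rate: float, min_requests: int = 100):
        self.percent = percent                # share of users exposed, e.g. 5.0
        self.max_error_rate = max_error_rate  # abort threshold, e.g. 0.02 for 2%
        self.min_requests = min_requests      # wait for enough data before judging
        self.active = True
        self.requests = 0
        self.errors = 0

    def should_inject(self, user_id: str) -> bool:
        """Inject a fault only while the experiment is active and the user is in scope."""
        return self.active and in_experiment(user_id, self.percent)

    def record(self, ok: bool) -> None:
        """Record one request outcome and abort if the error rate is too high."""
        self.requests += 1
        if not ok:
            self.errors += 1
        enough_data = self.requests >= self.min_requests
        if enough_data and self.errors / self.requests > self.max_error_rate:
            self.active = False  # stand-in for a real rollback step
```

In this sketch, a service would call `should_inject(user_id)` before injecting a fault and `record(ok)` after each affected request. Setting `active` to `False` only stands in for a rollback; a real one would also need to undo whatever the fault changed.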
Developers should be involved through the entire lifecycles of the software they build, Majors said. The developer knows the software's context and can identify problems faster than an ops person, because, for one thing, ops doesn't know what the app's normal state looks like. In Majors' experience, a team can make faster fixes when developers are attached to apps they build.