

Testers, add this new skill to your repertoire: chaos engineering

Software testers just want to focus on the application. But networks, clouds and other infrastructure can break your software too. Gerie Owen explains how chaos engineering can help.

What is chaos engineering, you might ask? Is it something testers should be concerned with? 

Chaos engineering, pioneered by Netflix with its Chaos Monkey tool, is a technique for finding different and often drastic ways to break an application. The goal is to ensure that anything that can go wrong in production is tested and evaluated, usually before application deployment but sometimes in production too.

The problem is that testing today's web applications is far more complex than simply developing tests from requirements and running them in the lab, in a controlled and measured environment. There are so many moving parts to an application that traditional testing can't cover all of the possible use cases.

Further, the deployment environment today is far more complex than it has ever been, often involving multiple cloud data centers, different internet segments and different traffic routings. Throw in IoT devices, and the number of possible combinations in the production environment is practically infinite.

Are you testing all of them? Before production? I didn't think so. And a single point of failure in production can cause a disaster. Testers, welcome to the brave new world of chaos.

Why are we trying to break our applications?

The computing world is a different place than a decade ago. Most applications are web-based, and/or delivered from the cloud. They include many third-party components, such as open source libraries, purchased code, and third-party services such as advertisements. And they frequently change, as agile project updates add new features or address issues.

And network services remain relatively fragile, especially in comparison to the internal data center. Data from dozens of servers and many users travels thousands of miles, across an almost infinite number of possible routes, directed by DNS lookups and routing tables.


Chaos engineering follows a detailed engineering process: define the normal, steady-state behavior of a system; develop an experimental scenario, such as shutting down a server or breaking a network connection; carry out that scenario; and compare the resulting behavior to the baseline. Has performance changed? Have we lost availability? Is the application accessible from different parts of the world? Does the application still work but lack certain essential features?
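The loop above — baseline, fault injection, comparison — can be sketched in a few lines of code. This is a minimal illustration using a simulated latency probe so it runs anywhere; in a real experiment, the fault would be an actual action (terminating an instance, severing a network link) and the probe would measure the live system. The function names, the injected 150 ms delay and the 50 ms tolerance are all assumptions for the sake of the example.

```python
import random
import statistics

def probe(fault_active: bool) -> float:
    """Simulated latency probe, in milliseconds.

    Stands in for a real health check against the system under test;
    the injected fault adds a fixed delay here for illustration.
    """
    base = random.gauss(100, 5)  # normal response time around 100 ms
    return base + (150 if fault_active else 0)

def run_experiment(samples: int = 50, tolerance_ms: float = 50.0):
    # 1. Define normal behavior: measure a steady-state baseline.
    baseline = statistics.mean(probe(False) for _ in range(samples))

    # 2. Carry out the scenario: inject the fault and measure again.
    degraded = statistics.mean(probe(True) for _ in range(samples))

    # 3. Compare the result to the baseline hypothesis: the system
    #    "survives" the experiment if it stays within tolerance.
    deviation = degraded - baseline
    return baseline, degraded, deviation < tolerance_ms

baseline, degraded, survived = run_experiment()
print(f"baseline={baseline:.1f}ms degraded={degraded:.1f}ms survived={survived}")
```

Here the experiment fails (the fault pushes latency well past tolerance), which is exactly the kind of finding a chaos experiment exists to surface before a customer does.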

Building on application testing

Chaos engineering is important because these are not applications and features that can be tested to any reasonable extent in a lab before deployment. It's not possible to accurately replicate worldwide usage, multiple DNS services, and third-party services before deployment. Testers do the best that they can, but it is by necessity limited and ultimately unrepresentative of production use.

What does this have to do with testing? Many of the most competent testers I've met get at least some motivation out of breaking things, and chaos engineering gives them more options to do so. And the problem is far more complex than it was a decade ago.

So yes, chaos engineering is an emerging skill set for testers. Think of it as extreme exploratory testing that involves not only the application but also the operating systems, servers (local or in the cloud), databases and network. Testers have a wide range of tools available to examine not only parts of the application but also the delivery environment.

It is a disciplined approach, but it can also be enjoyable and professionally fulfilling. Testers get the opportunity to help ensure that an application achieves very high uptime and is relatively well protected against attacks and other disruptive events.

Any application can be broken, whether through the code, the network, the cloud provider, or the hardware. The question is what type of event, or combination of events, will make it happen. And when it does, the application doesn't necessarily come back to the lab. Instead, it's diagnosed and at least mitigated in real time, before it starts costing the organization money or reputation.

Testers may argue that organizations have no control over performance and availability on the internet, so this is outside the purview of testing. But they do have control. They can use multiple hosting providers around the world. They can run secondary DNS services. They can do intelligent traffic routing. As guardians of the organization's reputation, it is our responsibility to broaden our own and our organizations' point of view.

Testers are used to focusing on the application and its code in a vacuum. That approach is changing, but not fast enough. Testers need to look at complex interactions between code, services, network, and provider. Future tests have to include the ability to take all of these factors into account, even breaking some of them through the practice of chaos engineering.
