Chaos Testing

We have been practicing what has come to be called “Chaos Testing” on my current project in AWS. Chaos Testing means bringing a little bit of chaos into your life now, when it's expected, to better prepare yourself for when things unexpectedly get chaotic. Real, live chaos is almost never expected, so it is always good to be prepared for when it inevitably rears its mangy head.

Note: This is different from, but related to, Chaos Engineering. Chaos Engineering injects faults at random in production to test fault tolerance. Chaos Testing in this sense is more akin to emergency preparedness drills. Both are useful practices and complement each other nicely.

Breaking things, with meaning

Chaos testing can be boiled down to breaking something and then seeing how people react to things being broken. A break could be anything from removing a line from an IAM policy, to changing the outbound rules on a security group, to removing a record from Route53. What's important is that it is easily reversible.
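To make “easily reversible” concrete, here is a minimal sketch of how you might snapshot and then break a security group's outbound rules with boto3. The group ID and the snapshot file are placeholders I've made up for illustration, not anything from a real project:

```python
# A reversible break: snapshot a security group's egress rules, remove them,
# and keep the snapshot so they can be put back after the exercise.
# The group ID is a placeholder; region/credentials come from the environment.
import json

import boto3

ec2 = boto3.client("ec2")
GROUP_ID = "sg-0123456789abcdef0"  # hypothetical target

# Capture the current outbound rules so the break is easy to undo.
group = ec2.describe_security_groups(GroupIds=[GROUP_ID])["SecurityGroups"][0]
snapshot = group["IpPermissionsEgress"]
with open("sg-egress-snapshot.json", "w") as f:
    json.dump(snapshot, f)

# Break it: drop all outbound rules.
if snapshot:
    ec2.revoke_security_group_egress(GroupId=GROUP_ID, IpPermissions=snapshot)

# To restore after the exercise:
#   ec2.authorize_security_group_egress(GroupId=GROUP_ID, IpPermissions=snapshot)
```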

The broken resource or configuration isn't the important part. In my experience, it is almost impossible to introduce a “realistic” fault in a system. If it is easy to introduce a realistic fault in your system, that's probably just a bug you need to fix. Chaos testing isn't necessarily for discovering bugs in your systems; it's for finding bugs in your human procedures.

Finding bugs in your team's incident response procedures usually happens in a retrospective, after-action report, or finger-pointing session. Whatever you call it, by the time it happens it's too late. What chaos testing gives us is the ability to suss out the defects in these procedures before the moment when it really counts. It gives the team a safe space to explore the system and try to figure out what's going on. An escape room, not a torture chamber.

Making Room

When you want to run some chaos testing, it can help to schedule some time for the team. It takes time away from normal development, so everyone needs to be on board. The main thing to remember when scheduling this meeting is not to book a room or include a Teams meeting or Slack channel or anything like that. Part of the test is to see how the team responds to the issue by pulling people together to help solve it. When DNS fails on production because someone pushed the wrong config, you're not going to have a meeting already scheduled to fix it.

If you've been calling yourselves production-ready, the test should trigger some alarms along the way. Scheduling time for the test ensures that the people who receive those alarms aren't freaking out.

Alarms, Logs, and Monitoring

Think about which alarms you have in place, what logs and dashboards are available, and your general visibility into the platform. Before you conduct your test and break something, what do you expect to see? Should an alarm trigger? Should a dashboard show a spike on a graph? Should your logs get spammed with exception output? If you break your system and your hypotheses are incorrect, that may indicate a poor understanding of the system or a gap in your overall SRE stance.
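One low-tech way to check those hypotheses after the break is to ask CloudWatch which alarms actually fired. A rough sketch, assuming a “prod-” alarm naming convention that is purely illustrative:

```python
# After the break: which alarms actually went into ALARM state?
# The "prod-" prefix is an assumed naming convention, not a real one.
import boto3

cloudwatch = boto3.client("cloudwatch")

response = cloudwatch.describe_alarms(
    AlarmNamePrefix="prod-",
    StateValue="ALARM",
)
for alarm in response["MetricAlarms"]:
    print(alarm["AlarmName"], alarm["StateUpdatedTimestamp"])
```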

An alarm should be the first indication that something is broken. If the first time you hear that something is broken is from a user, that's bad. That means it's been broken for a while. It means that someone got frustrated enough to complain. It means that more people are going to find out that something is broken and you're going to have to explain that it's not a big deal, really. You already have a ticket in the backlog to fix it but it wasn't prioritized, and you have a PR out for it waiting to be merged. And nobody cares, because it's still broken.

Which things should you alarm on? It depends greatly on what you're building, but there are some general things:
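Error rates, latency, and saturation of whatever your system leans on are common starting points. As one sketch, here's what an alarm on elevated 5xx responses from an Application Load Balancer could look like; the load balancer name, threshold, and SNS topic are placeholders I've made up for illustration:

```python
# Illustrative only: alarm on elevated 5xx responses from an Application
# Load Balancer. The load balancer dimension, threshold, and SNS topic
# are made-up placeholders.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="prod-alb-5xx-errors",
    Namespace="AWS/ApplicationELB",
    MetricName="HTTPCode_Target_5XX_Count",
    Dimensions=[{"Name": "LoadBalancer", "Value": "app/prod-alb/0123456789abcdef"}],
    Statistic="Sum",
    Period=60,
    EvaluationPeriods=5,
    Threshold=10,
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:oncall-alerts"],
)
```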

How to Fail Successfully

You've identified the alarms you expect to trigger and the logs you expect to fill. Now you need to inject a fault into the system somewhere. This is heavily dependent on your particular infrastructure, but here are some ideas:
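One example along those lines, sketched with placeholder IDs and assuming the default route points at an internet gateway: pull the default route from a route table so outbound traffic stops, and keep enough state around to put it back afterwards.

```python
# One idea, with placeholder IDs: drop the default route from a route table
# so outbound traffic has nowhere to go, and record the gateway so the route
# can be recreated afterwards. Assumes the default route points at an
# internet gateway rather than a NAT gateway.
import boto3

ec2 = boto3.client("ec2")
ROUTE_TABLE_ID = "rtb-0123456789abcdef0"  # hypothetical

# Record the current default route so the break is reversible.
table = ec2.describe_route_tables(RouteTableIds=[ROUTE_TABLE_ID])["RouteTables"][0]
default_route = next(
    r for r in table["Routes"] if r.get("DestinationCidrBlock") == "0.0.0.0/0"
)
gateway_id = default_route["GatewayId"]

# Break it: external traffic now fails.
ec2.delete_route(RouteTableId=ROUTE_TABLE_ID, DestinationCidrBlock="0.0.0.0/0")

# To restore after the exercise:
#   ec2.create_route(RouteTableId=ROUTE_TABLE_ID,
#                    DestinationCidrBlock="0.0.0.0/0", GatewayId=gateway_id)
```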

It is rarely the case that only a single thing is broken. To ramp it up a notch, try breaking multiple things simultaneously to see how the team diagnoses both issues. There will typically be an overarching issue, like a route table that is no longer allowing external traffic, along with a permissions change that only reveals itself once the network is fixed. The specific scenario may be completely fake and arbitrary, but diagnosing and solving an issue with multiple causes can be trained for.

Conclusion

Chaos testing is a useful practice. Identifying gaps in your knowledge, skills, and infrastructure is better done in a controlled environment. It is good to test the team's response and run drills to improve as a team, before you have those uncomfortable meetings.

Some takeaways I've had from doing this: