Chaos Testing
We have been practicing what has come to be called “Chaos Testing” on my current project in AWS. Chaos Testing means bringing a little bit of chaos into your life now, when it's expected, to better prepare yourself for when things unexpectedly get chaotic. Real, live chaos is almost never expected, so it's always good to be prepared for when it inevitably rears its mangy head.
Note: This is different from, but related to, Chaos Engineering. Chaos Engineering injects faults at random in production to test fault tolerance; Chaos Testing in this sense is more akin to an emergency preparedness drill. Both are useful practices and complement each other nicely.
Breaking things, with meaning
Chaos testing can be boiled down to breaking something and then seeing how people react to it being broken. A break could be anything from removing a line from an IAM policy or changing the outbound rules on a security group to removing a record from Route53. What's important is that it is easily reversible.
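For example, here's a rough sketch of what an easily reversible break could look like with boto3, using a placeholder security group ID. The trick is to capture the current state before you break it, so reverting is a single call:

```python
import boto3

ec2 = boto3.client("ec2")


def break_egress(group_id):
    """Revoke every outbound rule on a security group, returning them so they can be restored."""
    sg = ec2.describe_security_groups(GroupIds=[group_id])["SecurityGroups"][0]
    rules = sg["IpPermissionsEgress"]
    if rules:
        ec2.revoke_security_group_egress(GroupId=group_id, IpPermissions=rules)
    return rules


def restore_egress(group_id, rules):
    """Put the original outbound rules back once the drill is over."""
    if rules:
        ec2.authorize_security_group_egress(GroupId=group_id, IpPermissions=rules)


# Hypothetical usage: break before the drill, keep `saved` somewhere safe, restore after.
saved = break_egress("sg-0123456789abcdef0")
# ... run the drill ...
restore_egress("sg-0123456789abcdef0", saved)
```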
The broken resource or configuration isn't the important part. In my experience, it is almost impossible to introduce a “realistic” fault in a system; if it is easy to introduce a realistic fault, that's probably just a bug you need to fix. Chaos testing isn't necessarily for discovering bugs in your systems, it's for finding bugs in your human procedures.
Finding bugs in your team's incident response procedures usually happens in a retrospective, after-action report, or finger-pointing session. Whatever you call it, by the time it happens it's too late. What chaos testing gives us is the ability to suss out the defects in these procedures before they surface when it really counts. It gives the team a safe space to explore the system and try to figure out what's going on. An escape room, not a torture chamber.
Making Room
When you want to run some chaos testing, it can help to schedule some time for the team. It takes time away from normal development, so everyone needs to be on board. The main thing to remember when scheduling this meeting is not to book a room, set up a Teams meeting, create a Slack channel, or anything like that. Part of the test is to see how the team responds to the issue by pulling people together to help solve it. When DNS fails in production because someone pushed the wrong config, you're not going to have a meeting already scheduled to fix it.
If you've been calling yourself production ready, the test should trigger some alarms along the way. Scheduling time for it ensures that the people who receive those alarms aren't freaking out.
Alarms, Logs, and Monitoring
Think about which alarms you have in place, what logs and dashboards are available, and your general visibility into the platform. Before you conduct your test and break something, ask what you expect to see. Should an alarm trigger? Should a spike show up on a graph? Should your logs get spammed with exception output? If you break your system and your hypotheses turn out to be wrong, that may indicate a poor understanding of the system or a gap in your overall SRE posture.
An alarm should be the first indication that something is broken. If the first time you hear that something is broken is from a user, that's bad. That means it's been broken for a while. It means that someone got frustrated enough to complain. It means that more people are going to find out that something is broken and you're going to have to explain that it's not a big deal, really. You already have a ticket in the backlog to fix it but it wasn't prioritized, and you have a PR out for it waiting to be merged. And nobody cares, because it's still broken.
Which things should you alarm on? It depends greatly on what you're building, but there are some general categories:
- Excessive errors. 5xx's, 4xx's, application exceptions, etc.
  - Set a good baseline for these. You don't want to generate noise when a spammer hits your website on a weekend and generates a load of 404's, but if nearly every request suddenly turns into a 404, that's something worth alarming on (a sample alarm sketch follows this list).
- Integration failures.
  - Your application likely calls out to other APIs or third-party resources. If these calls fail, you want to know about it quickly.
- Health checks.
  - You should regularly check the general health of your system. This could be as simple as having a process call the home page and expecting a 200 response. Health checks can also report on the health of other internal integrations or parts of the system to give a better overall picture. If your system is unhealthy, you want to trigger an alarm.
- Critical sub-process failures.
  - Things that aren't immediately user facing, but still critical: daily reports, analytics, emails, notifications, replication jobs, etc. If any of these fail, you need to know soon, before data goes stale and you end up in even more meetings.
- Platform issues.
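To make the first category concrete, here is a rough sketch of a 5xx alarm on an Application Load Balancer using boto3. The alarm name, load balancer dimension, threshold, and SNS topic are placeholders you'd tune to your own traffic:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm when the load balancer's targets return an unusual number of 5xx responses.
# All names, thresholds, and ARNs below are placeholders.
cloudwatch.put_metric_alarm(
    AlarmName="web-5xx-errors",
    Namespace="AWS/ApplicationELB",
    MetricName="HTTPCode_Target_5XX_Count",
    Dimensions=[{"Name": "LoadBalancer", "Value": "app/my-web-alb/0123456789abcdef"}],
    Statistic="Sum",
    Period=300,                       # evaluate in 5-minute buckets
    EvaluationPeriods=1,
    Threshold=50,                     # baseline this against normal traffic
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",  # no traffic shouldn't page anyone
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:oncall-alerts"],
)
```

The other categories follow the same pattern: a metric (or a metric filter over your logs), a sensible baseline, and an action that actually reaches a human.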
How to Fail Successfully
You've identified the alarms you expect to trigger and the logs you expect to fill. Now you need to inject a fault into the system somewhere. This is heavily dependent on your particular infrastructure, but here are some ideas:
- Network.
  - Change routing tables, subnets, NAT gateways, etc.
  - What you're looking for here is disrupted traffic to other necessary components, internal or external.
- DNS.
  - Change or remove DNS records.
  - This is a major change, but it can be particularly difficult to diagnose if you're not expecting it.
  - What you're looking for here is any gap in external alarming and general downtime alarms, and how the team diagnoses in the absence of logs.
- Permissions.
  - Change or remove an IAM policy or equivalent (a minimal sketch follows this list).
  - What you're looking for here is the ability to diagnose an issue that is specific to the application's function. Perhaps a single Lambda no longer has GetObject rights to an S3 bucket.
- Configuration.
  - Change environment variables, secret values, etc.
  - What alarms get triggered and what logs get written when something is misconfigured?
  - Similar to a permissions issue, this should reveal itself in a general alarm and/or log entries for particular functions of the application. Good root cause analysis skills should lead engineers back to the configuration that changed.
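To make the permissions idea concrete, here is a minimal, reversible sketch with boto3, using a hypothetical role name and policy ARN for the Lambda-to-S3 scenario above:

```python
import boto3

iam = boto3.client("iam")

# Hypothetical names: the Lambda's execution role and the policy granting s3:GetObject.
ROLE_NAME = "orders-lambda-role"
POLICY_ARN = "arn:aws:iam::123456789012:policy/orders-s3-read"


def inject_fault():
    """Detach the S3 read policy from the Lambda's execution role."""
    iam.detach_role_policy(RoleName=ROLE_NAME, PolicyArn=POLICY_ARN)


def revert_fault():
    """Re-attach the policy once the drill is over."""
    iam.attach_role_policy(RoleName=ROLE_NAME, PolicyArn=POLICY_ARN)
```

Keeping the break to a single scripted call also means you can call the whole drill off quickly if a real incident shows up in the middle of it.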
It is rarely the case that a single thing is broken. To ramp it up a notch, try breaking multiple things simultaneously to see how the team diagnoses both issues. There will typically be an overarching issue, like a route table that is no longer allowing external traffic, along with a permissions change that is revealed later when the network is fixed. The specific scenario may be completely fake and arbitrary, but diagnosing and solving an issue with multiple causes can be trained for.
Conclusion
Chaos testing is a useful practice. Identifying gaps in your knowledge, skills, and infrastructure is better done in a controlled environment than during a real incident. It is good to test the team's response and run drills to improve as a team, before you have those uncomfortable meetings.
Some takeaways I've had from doing this:
- The team responds quickly when something is broken (especially when you've scheduled time for it). Some people prefer calls and some prefer just a Slack channel with the relevant people.
- Not everyone needs to be involved in the fix. Bring people in when necessary.
- The alarms you thought would be triggered won't be. It's better to find this out early.
- It may be trickier to correlate logs than you anticipated. You might have to get creative with log queries.
- After a thorough investigation of your logs, you will end up finding more bugs than the ones you intentionally introduced. Some issues just aren't evident until you go looking. Embrace it.
- That pretty dashboard that you built might not be very useful in a shit-hits-the-fan scenario.
- QE engineers often have more insight into the application than the SDEs. Bring in anyone who can help.
- The obvious thing that you broke won't be so obvious to the person looking.
- If you initiated the test, it's hard to stay quiet when you know what the issue is, but staying quiet is important for the exercise.
- It's actually kind of fun.