james at jamesekstrom.com

The Setup

You are in AWS. You cannot allow any traffic from the internet or to the internet. You want a hub-and-spoke architecture, with applications connecting through the hub to access API Gateways on the spokes. The VPC Endpoint that facilitates that connection needs to be locked down with a least-privilege policy, allowing only specific roles through the VPC Endpoint.

Desired state

AWS's example endpoint policies only show a user principal, not a role principal, but using a role is not impossible.

Here is an example policy we want to use:

    {
        "Statement": [
            {
                "Principal": {
                    "AWS": [
                        "arn:aws:iam::<account-id>:role/MyRole"
                    ]
                },
                "Action": [
                    "execute-api:Invoke"
                ],
                "Effect": "Allow",
                "Resource": [
                    "arn:aws:execute-api:*:*:*"
                ]
            }
        ]
    }
So only MyRole is allowed across the VPC Endpoint; all other requests will receive a 403 response from the VPC Endpoint.
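If you want to script attaching the policy, here is a minimal sketch. The account ID, role ARN, and endpoint ID are hypothetical placeholders, and the actual `aws ec2 modify-vpc-endpoint` call is left commented out since it needs real credentials:

```shell
# Write the endpoint policy to a file. The account ID and ARNs
# below are hypothetical placeholders.
cat > policy.json <<'EOF'
{
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::111122223333:role/MyRole" },
      "Action": "execute-api:Invoke",
      "Resource": "arn:aws:execute-api:*:*:*"
    }
  ]
}
EOF

# Attach it to the endpoint (requires AWS credentials and a real endpoint ID):
# aws ec2 modify-vpc-endpoint --vpc-endpoint-id vpce-0123456789abcdef0 --policy-document file://policy.json
echo "wrote policy.json"
```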

If you use the AWS CLI's test-invoke-method command, passing in the API Gateway ID and resource, and a role you have configured, it will work. However, this bypasses the VPC Endpoint entirely. It just calls the API Gateway in the other account directly, using the apigateway service instead of the execute-api service.

What we want is to make an HTTP request to the private DNS name provided by the VPC and traverse the VPC Endpoint to the private API Gateway.

How to call the API with a role

What we need to do is sign the request using the role that is allowed. AWS APIs use a special form of signature they call Signature Version 4. It's a bit complicated, but nothing we can't whip up in a crude bash script.


The components for this ritual are:

  • Access key ID
  • Secret access key
  • Session token

Once you have all three (perhaps from an aws sts assume-role call), you can stir them together using a series of openssl hashing functions.
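The "stirring" is an HMAC-SHA256 chain: the secret key signs the date, the result signs the region, then the service, then the literal string aws4_request. Here is a minimal sketch with openssl; the secret key and date are made-up placeholders:

```shell
# Derive a SigV4 signing key step by step. The secret and date are
# placeholder values, not real credentials.
secret="wJalrXUtnFEMIexamplesecretkey"
datestamp="20200301"

hmac_hex() {
    # $1: openssl -macopt key spec, $2: data to sign; prints a hex digest
    echo -n "$2" | openssl dgst -sha256 -mac HMAC -macopt "$1" | sed 's/^.* //'
}

kdate=$(hmac_hex key:"AWS4${secret}" "$datestamp")
kregion=$(hmac_hex hexkey:"$kdate" "us-west-2")
kservice=$(hmac_hex hexkey:"$kregion" "execute-api")
ksigning=$(hmac_hex hexkey:"$kservice" "aws4_request")
echo "$ksigning"    # 64 hex chars; this key then signs the string-to-sign
```

Note that only the first step uses the raw secret (`key:`); each later step signs with the previous step's hex output (`hexkey:`).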

After you have generated the signatures, you can finally make your curl request against the API:

curl -H "x-amz-date: ${sdate}" -H "x-amz-security-token: ${token}" -H "Authorization: ${authorization}" "https://${apiid}.execute-api.${region}.amazonaws.com${apiresource}"

Here is the script; modify it for your use case.



apiid="<apiid>"
apiresource="<apiresource>"
host="${apiid}.execute-api.us-west-2.amazonaws.com"

rolejson=$(aws sts assume-role --role-arn "<role>" --role-session-name "<sessionname>")
aws_access_key_id=$(echo -en ${rolejson} | grep -Po '"AccessKeyId": "[^",]*' | cut -d':' -f2 | sed 's/[^0-9A-Z]*//g')
aws_secret_access_key=$(echo -en ${rolejson} | grep -Po '"SecretAccessKey": "[^",]*' | cut -d':' -f2 | sed 's/[^0-9A-Za-z/+=]*//g')
token=$(echo -en ${rolejson} | grep -Po '"SessionToken": "[^",]*' | cut -d':' -f2 | sed 's/[^0-9A-Za-z/+=]*//g')

# To use an EC2 instance profile instead, pull credentials from the
# instance metadata service (IMDSv1 endpoint shown):
# metadata="http://169.254.169.254/latest/meta-data/iam/security-credentials"
# instance_profile=$(curl -s "${metadata}/")
# rolejson=$(curl -s "${metadata}/${instance_profile}")
# ...then extract aws_access_key_id, aws_secret_access_key, and token as above.

# SigV4 timestamps must be UTC
sdate=$(date -u '+%Y%m%dT%H%M%SZ')
dateshort=$(date -u '+%Y%m%d')

# SHA-256 hash of the (empty) request payload
payload=$(echo -en "" | openssl dgst -sha256 | awk '{print $2}')

# Canonical request: method, path, query string, headers,
# signed headers, and the payload hash
canonical="GET
${apiresource}

host:${host}
x-amz-date:${sdate}
x-amz-security-token:${token}

host;x-amz-date;x-amz-security-token
${payload}"
canonicalhash=$(echo -en "${canonical}" | openssl dgst -sha256 | awk '{print $2}')

# String to sign: algorithm, timestamp, credential scope, canonical hash
stringtosign="AWS4-HMAC-SHA256
${sdate}
${dateshort}/us-west-2/execute-api/aws4_request
${canonicalhash}"

function hmac_sha256 {
    key="$1"
    data="$2"
    echo -n "$data" | openssl dgst -sha256 -mac HMAC -macopt "$key" | sed 's/^.* //'
}

# Derive the signing key: secret -> date -> region -> service -> request
dateKey=$(hmac_sha256 key:"AWS4${aws_secret_access_key}" "${dateshort}")
dateRegionKey=$(hmac_sha256 hexkey:"${dateKey}" "us-west-2")
dateRegionServiceKey=$(hmac_sha256 hexkey:"${dateRegionKey}" "execute-api")
signaturekey=$(hmac_sha256 hexkey:"${dateRegionServiceKey}" "aws4_request")

signature=$(echo -en "${stringtosign}" | openssl dgst -sha256 -mac HMAC -macopt hexkey:"${signaturekey}" | awk '{print $2}')

authorization="AWS4-HMAC-SHA256 Credential=${aws_access_key_id}/${dateshort}/us-west-2/execute-api/aws4_request, SignedHeaders=host;x-amz-date;x-amz-security-token, Signature=${signature}"

curl -H "x-amz-date: ${sdate}" -H "x-amz-security-token: ${token}" -H "Authorization: ${authorization}" "https://${host}${apiresource}"

Could there be improvements made to this script? No, it is perfect the way it is.

This took me a few days to figure out, so I hope this can save someone else some time.

The shoulders of giants

I am not a bash wizard; here are some people I copied from:

These tools might help you more than I can.

A donut is just a pastry. Nuance reveals that it's holy.

Strong opinions, weakly held

Nuance is becoming a rare sight on social media. People have strong beliefs, and cutting words with which to defend them. Those insecure with their ideals seem to fight the hardest to defend them. Scared that they may be wrong, they hide from diverse opinions and anything that may sway their mind. No sin is too small; they deal in absolutes; there is no grey area. You're with us or you're against us. No apology is enough to prevent being cancelled.

I believe the earth is flat. What was your first thought after reading that sentence? Challenging your beliefs is uncomfortable. The first reaction is always indignation. I don't really believe in a pancake earth theory, but it is easy to cast the argument of those who do aside, saying “Obviously, Earth is not flat! Didn't you listen in school?” If we take a deeper look and reflect on why someone would think this, it can reinforce our beliefs, but it can also cast light onto shortcomings of our reasoning and opinions we hold. Instead of jumping straight to anger and deriding someone for spreading fake news or conspiracy theories, examine the details. Rebut the false notions with evidence. Find hidden kernels of truth, as uncomfortable as it may feel. These fringe ideas are the very things that the principle of free speech protects, as popular ideas need no protection. Innovation dies in a selfsame sphere of thought.

Ideological proxies

Don't take refuge in the false security of consensus. – Christopher Hitchens

A core tenet of the principle of free speech is the marketplace of ideas. A lively discussion. A give-and-take. Ideas are proposed, refuted, refined, and abandoned in light of new evidence and better thoughts. Abhorrent, baseless claims are quickly shot down by clear reasoning and irrefutable fact.

I still believe this largely to be true, but room for nuance in online discussions is seemingly decreasing. Instead, people communicate with short, pithy comments meant to signal to others which side they're on. It is a lazy form of communication; so void of reason that it can be reduced to 280 or fewer characters. It is easy to join the mob and shout in favor of burning the witch, much harder to ask the mob how they know that she is a witch.

One of the qualities of the human species that sets it apart from other animals is the extent to which we socialize. Humans are inherently social from the moment we are born. A human child depends on its caregivers much longer than any other animal. Part of this dependence is an evolutionary driver for forming tribes and larger social structures. A need to belong is instinctual.

If you want to find happiness, surround yourself with like-minded people. If you have trouble making friends, simply change your ideology. Adherence to normative behavior usually occurs accidentally and gradually. Having your beliefs constantly challenged is a hint that you may be wrong. However, it can also lead people to surround themselves with only similarly wrong-minded people. The real cruelty is that it is impossible to tell from within.

If you don't have strong beliefs, find an institution and use it as a proxy. There are many of them, and you likely already belong to one or several. Ideological proxies can include:

  • Religious groups or cults
  • Political parties
  • Activist clubs and groups
  • Fraternities/Sororities

How do you know you're in an ideological proxy organization? Common patterns include: rituals where you “reaffirm your belief” by chanting words, symbols are considered sacred, outsiders are treated with suspicion or outright hate, and to bring up ideas counter to the prevailing thought would be anathema.

Critical thinking is hard. It is easy to say “I'm a democrat” or “I'm protestant” when confronted with questions on your beliefs. It is useful as a starting point for a deeper conversation, but more and more we're not seeing that deeper conversation online. People stop at group-think, the ideas they're supposed to believe in. This leads to people saying “Black Lives Matter” not because they believe in it, but because they have to chant the words of their group. People fear that if they do not say the correct words, they will be labeled racist.

A lack of context

If all you're doing is casting stones, you're probably not going to get very far – Barack Obama

Too often people read only the headlines of articles, ignoring any context. The art of designing a headline to attract attention is refined more each day. With decreased iteration times, A/B testing, and recommendation algorithms, it is easier than ever to design the optimum clickbait. Indeed, we are so inundated with new information that the marketplace for attention spans is in a race to the bottom. With most users viewing content on their mobile device, increasingly their only terminal to the network, there is no incentive to write long articles for an audience of short screens. Content creators instead hone their trade in titillating titles for users to scroll endlessly through.

Of course, the headline is meant to be catchy and exciting, but most users don't actually read the article after the headline catches their eye; some just scroll through the comments. When you read only the comments, you miss the context and get a distorted view of the content. In the worst case, you subject yourself to malicious propaganda.

Context is everything when it comes to meaning. Trying to interpret words without context is like drinking a virgin long island. It's sweet, but you're not going to have a good time. Near the boundaries of context lie the extreme positions. Short slogans that define movements, a motto for a team, a battle cry. Just enough meaning to incite action, but not enough to provide nuance. Rallying behind the rebel yell of a marching protest is a great way to get change, but implementation requires a closer look at the issues.

An easy trap to fall into online is responding to the catch phrase instead of the actual position. And the easiest rebuttal to a catch phrase is your own. The thread continues down a spiral of diminishing meaning into more homogenized responses ending in an atomic structure of emojis and retweets, gotchas and ad hominem. Like an overly compressed jpeg, all the detail is lost. Nobody is happy, no minds have been changed. Participants recede back into the comfort of their respective tribe, distilling more extreme views aimed at fomenting rage in the opposition.

Using technology to increase diversity of ideas

So, what can be done to help provide nuanced content? How can one escape an ideological proxy organization and increase critical, independent thought? The answer is obvious, of course: adhere to my ideology.

While the internet makes it easy to find like-minded people, it is similarly easy to find people you disagree with. Exposing yourself to ideas that you find unattractive is just as important as your freedom to express your own ideas. Find these ideas and think about why someone would think such a thing.

Some explicit ways you can use technology to hear new ideas:

  • Subscribe to the RSS feeds of myriad sources (blogs, news, podcasts)
    • RSS is not dead! I use inoreader, but there are many such apps.
  • Write a blog expressing your own nuanced ideas
    • The web may have become too much of a walled garden, but the walls are permeable. The freedom to self-publish your own thoughts cannot be overstated.
  • Many books can be downloaded and read
    • It may be surprising, but most issues in the world are complex enough to write entire books about.
  • Use sites like https://newscompare.com to identify bias in the news

Sometimes, it is the technology that is hiding nuance. Think carefully about the sites you are using today. Do they increase your exposure to a diversity of ideas? Do they shelter you from adverse opinions? If the latter, consider changing subscriptions or abandoning the platform, or recognize that the site exists to increase your feeling of belonging and use caution when taking it as input to your own opinions.

Keep thinking

Challenge your beliefs often. Think of something you believe but would not feel comfortable saying out loud. Why do you believe that? We have the benefit of viewing those before us through the lens of history. A context that was built from the actions of those living it. Hindsight being what it is, what views do we hold today that will be viewed as heretical or unfashionable in the future? What modern ideas will survive when the pendulum of social doctrine swings back?

As the first generation of internet citizens, it is necessary for a free thinking future that we be good stewards.

A bill in the US Congress called the EARN IT Act is making the rounds right now. It attempts to provide a commission with tools to fight child sexual abuse online. Unfortunately, some of those tools are unconstitutional, violating the First and Fourth Amendments. The primary tool it gives authorities is the legal power to scan and report on encrypted data, rendering end-to-end encryption de facto illegal in many cases.

[National Center for Missing and Exploited Children] believes online services should be made to screen their messages for material that NCMEC considers abusive; use screening technology approved by NCMEC and law enforcement; report what they find in the messages to NCMEC; and be held legally responsible for the content of messages sent by others.

This is not the first time, nor will it be the last time a government wishes to make encryption illegal. If it's not the children, then it's the terrorists. The argument usually sounds something like this

We just want to ensure that terrorists do not have a safe space in which to communicate

That was UK Prime Minister David Cameron in 2015.

... but the dangers are real, maintaining law and order in a civilized society is important, protecting our kids is important, so I'd caution against an absolutist view on this.

Obama, SXSW 2016 on end-to-end encryption

If you’re dealing with drug lords, if you’re dealing with terrorists, and if you’re dealing with murderers, I don’t care. We have to find out what’s going on.

Trump, 2020

These types of arguments (appealing to the safety of us or our children) are either disingenuous or show a severe lack of understanding of encryption.

Secret Communications

Most communication people partake in is intended for only the people in the conversation. When you are talking to a friend or significant other, you expect them and only them to listen. You can control who is part of the conversation.

When you later find out the walls in the room were a bit thin, you feel like your privacy was violated. There was a third party in your conversation, unbeknownst to you. Perhaps you talk quieter, or find a new room. Even if you aren't discussing sensitive secrets, whoever is in that other room is not invited to the conversation. Imagine if every time you sent a text message, it was a group chat that included an anonymous third party.

End to end encryption takes the in-person whispering or the finding of a new room into the digital realm. Just as it should not be illegal to whisper, it should not be illegal to encrypt your messages. Nobody should be implicitly invited to a private conversation.

If you make encryption illegal, only criminals can have private conversations

Making something illegal does not stop it from happening. Look at drugs or prostitution. If you make it illegal to have end-to-end chats, then the honest, law abiding citizens will have only public conversations (public in the sense that a third party is listening).

Terrorists and criminals will continue to use encrypted messages. Further, anyone who now wishes to communicate in private is labeled a criminal. This is not only unethical but it does not solve the problem the law is intended to.

There is no way to make backdoors only the good guys can use

One solution that is often touted is to generate multiple sets of keys. You hold on to one set, and the police hold the other. Both sets can access your secret data. When a judge serves a warrant, the police can use their key to search your encrypted communications.

It shouldn't be necessary to state, but the police don't always act in the public's best interest. Local police forces have been guilty of using Stingray devices that mimic cell phone towers for warrantless surveillance. The FBI, NSA, and CIA have long lost any trust towards the American public with warrantless wiretapping, drag nets, and mass surveillance and collection of both metadata and personal records.

Let there be no mistake: if the police hold the keys, there is no private communication. They will be listening, passively collecting data. They will defend it in the name of safety, using the next tragedy as a selling point for further eliminating our fourth amendment protections.

Not to mention the keys to your data are now entrusted to a faceless government agency. If the keys happen to leak, or there is a data breach, then all bets are off. And it would make a mighty juicy target for hackers. A key tenet of cybersecurity is minimizing the attack surface area. Having multiple sets of keys increases the risk of unintended access.

If you have nothing to hide, you have nothing to fear

Why should an honest, law-abiding citizen be worried about law enforcement reading their messages? They have followed all the rules, they have nothing to hide. In fact, it may even work in their favor once the powers-that-be realize what an upstanding citizen they are!

Having nothing to hide is a function of both time and interest. If a federal prosecutor is interested enough in you, they will find something. And when they don't, they can get you for lying to a federal agent or obstruction of justice. Even if they can't find anything on you, just wait for the rules to change. Suddenly, you have something to hide, but it is not hidden. It did not use to be illegal to sell heroin, but try advertising that now. It is not currently illegal to construct a firearm in the US, but perhaps if you wait long enough it will be. You have nothing to hide right now.

Arguing that you don't care about the right to privacy because you have nothing to hide is no different than saying you don't care about free speech because you have nothing to say. – Edward Snowden

For these same reasons, it is common advice to never speak to a police officer any more than necessary. Anything you say can and will be used against you. If your thoughts are in the open, it is much easier to implicate you in a crime whether or not you are guilty.

Freedom of speech

In 2013, Ladar Levison was served a National Security Letter (gag order) and forced to shut down his email service, Lavabit, for refusing to give up the keys to his customers' data. He was tried in a secret court, with dubious representation, and denied his First Amendment right to speak about it.

In 2016, the FBI wanted Apple to create a backdoor in their software to unlock the San Bernardino shooter's iPhone. Apple refused (though the FBI later got access anyway and retrieved worthless data).

Political and societal thoughts are like chaotic winds in a storm. One day they're blowing one direction, the next, there's a 2x4 flying through your window. The principle of free speech only protects the unpopular speech. Laws that restrict government overreach are a very powerful tool for the people, but an equally powerful and complementary tool is encryption.

Encryption cannot be applied on an as-needed basis. It must be ubiquitous, or the very presence of encrypted bits reveals information. Therefore, it is necessary to encrypt as much communication as possible.

Encrypted speech is free speech

Encryption is a mathematical operation on a series of bits: numbers. When we write text on a computer, it is represented as a series of 1s and 0s. This is a fundamental property of information. Some complex math operations are run on these numbers, operations that are "easy" to run in one direction but incredibly hard to run in reverse, like multiplying two large primes versus factoring their product. This allows us to send information without regard to who can read it, as long as the intended recipient has the correct numbers to decipher the true meaning. This is known as asymmetric encryption.
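As a toy demonstration of the asymmetric idea, here is a sketch using openssl with a throwaway RSA keypair (the file names and message are arbitrary): anyone can encrypt with the public key, but only the private-key holder can decrypt.

```shell
# Generate a throwaway RSA keypair, encrypt with the public key,
# decrypt with the private key. Anyone may see msg.enc; only the
# private-key holder recovers the message.
openssl genpkey -algorithm RSA -pkeyopt rsa_keygen_bits:2048 -out priv.pem 2>/dev/null
openssl pkey -in priv.pem -pubout -out pub.pem

echo "meet at noon" > msg.txt
openssl pkeyutl -encrypt -pubin -inkey pub.pem -in msg.txt -out msg.enc
openssl pkeyutl -decrypt -inkey priv.pem -in msg.enc
```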

That's pretty neat!

Should it be illegal to communicate certain types of factorization? Or, in fact, any type of math? Of course not. Math is a fundamental property of the universe, or at least a clever way of representing its logic. That didn't stop the US government from trying to restrict the export of cryptographic math under Cold War-era munitions controls. The restriction was shown to be ridiculous by the printing of t-shirts bearing the banned math.

Chilling effects

An unfortunate side effect of the constant barrage on encryption is a chilling effect on the creation of new privacy- and security-focused services, specifically in the US. A primary selling point of ProtonMail, for instance, is that it is hosted in Switzerland, a neutral nation known for strong privacy laws.

A more dangerous chilling effect would be a stigma towards using general purpose secure messaging applications. I have a hard enough time getting people to use Signal, a messenger that uses end-to-end encryption by default. “What do I have to hide?” is commonly asked. This would get even worse if there was a concerted effort to equate using encryption to being a terrorist or pedophile. Thankfully that tactic has not been employed yet, or has not been effective.


There is a threat to private communication. It is cloaked in the good intentions of catching terrorists and pedophiles. Trading a little bit of our liberty for a promise of security. But we must be vigilant against this promise, because it cannot be fulfilled: the premise itself is flawed.

By making it illegal to use end-to-end encryption, we harm only the honest and law-abiding.

By storing copies of our keys, we provide juicy targets and a larger attack surface area.

By creating backdoors, we make front doors for the bad guys.

By limiting encryption we infringe on the fundamental human right to free speech.

Private communication between people cannot be eliminated; mathematics cannot be prohibited, and any attempt to do either is foolish.

Clear Written Communication

With everyone working from home now, written communication has become much more important. Effective and efficient communication is key to working with a team, and it becomes especially important when physical presence is limited or eliminated.

One method I have been using, particularly when writing emails, is to flip everything upside down. Start with the actions that need to be taken, or the results of a test or experiment. Then follow with the current status, next steps, and further details that support the conclusion. This lets readers parse the content and immediately get the context; the ideas flow from your brain into theirs more quickly. I was first introduced to this concept by an article from Gopal Kapur: “I'm OK; The Bull is Dead”.

A friend of mine recently added another bit to the concept: add deadlines. When describing the action items or results, add a time period in which they should be done. Even if it is a rough estimate, it can put a sense of urgency into the message. Not all messages are urgent, but those that are need to have some action taken. This is like the principles of alerting. An urgent email is like a medium or high priority alert, it needs a human to take action. Providing a timeline for the action to take place can help reinforce the impact.

Strengths of Written Communication

I have always been a big proponent of emails over meetings, and chat rooms in general. There are many benefits to communicating over text to others on the team. It is not a panacea of conversation, of course, there are many reasons to talk in person as well. But by recognizing the strengths of writing, we can leverage them for our advantage.


Talking to a person, using your meat flaps to make sound waves, is an active process by all parties. The listeners need to actively listen and understand what ideas you are presenting, in real time. If there is missing context, the speaker needs to listen to questions. There is a back and forth as ideas and questions, arguments and counters, proposals and rejections are volleyed over the net of cognition in the conversation.

By writing your ideas down, and displaying them to interested parties, you free up the rest of your time to engage in another activity, like implementing other ideas. The interested parties can take their own time to read the text and attempt to understand it. They can set aside time that was otherwise not used. Communicating with text optimizes for total throughput instead of latency or instantaneous bandwidth.


By writing everything down, you have a record of what was said. Documentation. If you use the proper tools and a system of tagging, this can be made even more useful. But even if you don't, searching through your email or Slack channels for what was said can be quite helpful. Forwarding email chains to new members of the team to get them caught up can be quicker than arranging meetings.

A written history can also be useful for liability purposes. You have a paper trail. Something to point to when shit hits the fan and people are playing the blame game. You can prove that the executive signed off on things, or that consensus was reached among the team of engineers, not a hapless individual. Text logs can even be brought up in court for legal ramifications, for better or worse. Depending on the side of the coin you're on, this could be a good or bad thing, granted.


It is much easier to formulate complete, rational, well-thought-out concepts by writing them down instead of speaking them. This is why important speeches are typically read from a teleprompter, or at least well prepared in written form beforehand. A text can be read, edited, and rewritten before sending to its audience. You can discover inconsistencies or contradictions in your reasoning before introducing them to others. You can argue against yourself and write defenses first, saving the time of potential debaters. And the biggest time saver of all: you may find your reasoning fundamentally flawed, and never subject anyone else to it.

A well written argument or proposal can be easier to understand than a simple conversation. There is time to organize the points and optimize the flow of information. Oftentimes in speech we jump ahead, and have to backtrack to fill in details and context. This can easily confuse people and lead to miscommunication. Spending more time crafting a well written document can minimize this form of miscommunication.

Chat Rooms

Chat rooms serve a different purpose than email. They are still asynchronous, written communication, but there is an implied shortened time to respond. It's a middle ground between long form writing and speaking directly to someone. Someone in a chat room may be AFK for now, but will respond with a sentence or two when they return 5 minutes later.

Importantly, chat rooms still have the benefits listed above, if perhaps diminished a little in potency. One advantage that chat rooms often have over email, however, is that there can be hundreds or thousands of listeners (yes, I know mailing lists exist). A small conversation going on in a thread in slack may invoke questions by the silent bystanders, or provide answers to questions they didn't even know to ask.

Most people in chat rooms are lurkers. According to the 1% rule of internet culture, only around 1% of users add content or engage in the conversation. This doesn't mean the other 99% are worthless leeches; even with only 1% contributing, the whole group benefits. The text is written. It can be read, re-read, expanded upon, and added to documentation.


One of the big advantages of chat rooms is bots. A chat bot can be useful, like alerting the room when a deployment was kicked off or a build failed. Or it can be light hearted, like automatically finding gifs on the internet related to a phrase. Both of these types of bots are popular for most slack teams.
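The first kind of bot can be as simple as a shell snippet at the end of a CI job posting to a chat room's incoming webhook. A minimal sketch, assuming a Slack-style webhook; the webhook URL is a made-up placeholder:

```shell
# Build a Slack-style webhook payload announcing a build result.
# SLACK_WEBHOOK_URL is a hypothetical placeholder, not a real hook.
SLACK_WEBHOOK_URL="https://hooks.slack.com/services/T000/B000/XXXX"
build_status="success"

payload=$(printf '{"text": "Build finished: %s"}' "$build_status")
echo "$payload"

# Uncomment to actually post to the room:
# curl -s -X POST -H 'Content-Type: application/json' -d "$payload" "$SLACK_WEBHOOK_URL"
```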

A bot can also be interactive. Perhaps instead of just receiving information that a build was started, you want to actually start the build from the chat room. This is also quite useful for some teams. This and a lot of other techniques are covered in ChatOps, which can allow teams to collaborate in a conversation to orchestrate workflows.

Rhyme Bot

Less useful, maybe, are bots that respond in rhyme. Imagine a fun, harmless chat bot that responds with a line from song lyrics based on the last word said (if prompted). For example:

[Bob]: rhyme these chips have a lot of flavor

[RhymeBot]: A rhythm recipe that you'll savor

Pretty simple, can be fun trying to figure out which track the rhymed lyrics come from.
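A crude version of the lookup can be sketched in a few lines of shell: take the last word of the prompt and find a lyric line that shares its final letters. The "corpus" here is invented placeholder lyrics, not a real song database:

```shell
# Toy rhyme lookup: match on the last three letters of the final word.
lyrics="A rhythm recipe that you will savor
An unrelated line about the weather
Another line that mentions a ledger"

prompt="rhyme these chips have a lot of flavor"
last=$(echo "$prompt" | awk '{print $NF}')      # last word of the prompt
suffix=$(echo "$last" | tail -c 4)              # its last three letters

echo "$lyrics" | grep -i "${suffix}$" | head -n 1
```

A real bot would want an actual rhyming dictionary (suffix matching thinks "through" and "rough" rhyme), which is where the trouble described below starts: the quality of the output is entirely a function of the corpus you feed it.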

I don't have to imagine. I wrote this bot a few years ago, naively, and didn't see any issues with it. It was fun for a little bit. But it turns out that if you feed the bot a lot of rap lyrics and then rhyme with 'bigger' or 'trigger', for example, the results might not be exactly work-friendly. They're going to be exactly what you think they are. You don't even have to prompt the bot to rhyme with '-igger'; there are plenty of songs with NSFW lyrics (depending on where you work). It may be fine to listen to them, but having them written out in your chat room doesn't look great.


I took the bot down after a couple days (and a talking-to from my boss). Chat etiquette is a little different from speech. I think it is easier to get away with some stuff when talking because it is ephemeral. Sometimes you say something and it seems to “hang” in the air when nobody really knows how to respond. With text, that is amplified. The words are right there, and everyone keeps reading them over again.

Some of my personal thoughts on general etiquette, in a professional setting:

  • If you don't have anything to add to the conversation, don't write anything.
  • If you're in a chat with a bunch of people you don't know well personally, try not to offend.
  • Keep more involved conversations in separate channels or threads.
  • If you find something funny on the internet, and you really have to share, only share it in the correct chat.
  • Encourage writing fun bots, but maybe vet them before unleashing them on your entire team.


Yes, I didn't apply the upside-down email logic to this blog post. That is intentional. I am sharing a story, a train of thought. I'm not trying to convince anyone of any particular set of actions. It is prudent to gauge your audience when crafting any form of communication.

If there is one thing to be gained from more remote work, it is better reasoned, clearer, and more concise written thoughts. This will serve us now and in the future when we can all return to the office. The practices engaged in when writing to each other transfer easily to documentation, speech writing, and the general public conversation over the internet and social media.

Learning effective communication on a remote team is crucial. It is imperative that each member of the team talk, and the first tools for that job are text chat and email. Each is often the best tool available, especially when used properly.

And most importantly: bots might not see color, but can lead to uncomfortable conversations with HR.

Serverless is an architecture pattern that allows the user to not worry about servers. Of course servers still exist in a data center somewhere, but they do not concern the customer. There are many benefits to using this pattern.


“All problems in computer science can be solved by another level of indirection... Except for the problem of too many layers of indirection.” — David Wheeler

Serverless architecture abstracts the hardware from the software. In the same way that an operating system can provide a framework and environment without the user having to know where to put things in memory, the cloud allows you to write software without having to worry about how it is actually running.

Abstractions are useful because they allow us to think at a higher level. We can think in the solution space rather than in unnecessary details like config file management and patching regimens. Many people have wasted many hours tracking down that one config setting in their load balancer, or rolling out patches gradually so as not to interrupt service.

These things, and many more like them, must still happen, but they are done by experts, or at least people dedicated to doing that as their job. In Domain Driven Design we would classify this as a generic sub-domain: something that is required, but not core to the business goals. By abstracting the generic management work away, you are left with only the problem core to the business, which allows faster iteration on the solution that will generate income.

There are a lot of benefits to abstracting the server infrastructure out of your stack. If you are able to leverage the cloud completely to reach your business goal or solve your problem, you can significantly reduce cognitive overhead and maintenance, and increase development velocity.

Cognitive Overhead

When running a traditional infrastructure, there is a significant need to have a decently accurate mental model. Most production infrastructures are complex enough that it is nearly impossible to know every bit of it in sufficient detail. Some of this doesn't go away with serverless. Here are some of the benefits, in my mind:

In a traditional infrastructure you need to be able to predict and provision resources ahead of time so you don't go down. Serverless infrastructure is immutable by default and since you don't provision servers, and you don't have configuration files to manage, there are fewer moving parts. A traditional infrastructure could be set up in an immutable fashion, but you must go out of your way to set that up and to maintain it. Instead of thinking how many servers you need, you think in terms of what resources and services help you solve your problem.

A large part of the cognitive overhead of running a traditional infrastructure is maintenance. You need to be able to keep your servers up, running, and patched.

Maintenance

Maintenance is a big part of any robust infrastructure. With serverless you can minimize it. There are no servers to keep patched and updated, there are no application configuration files you need to store in puppet or chef.

There is still maintenance you need to perform for serverless. You may need to optimize functions and queries, manage log tracing and analytics, and prune unused resources occasionally.

Even with serverless, there is still the issue of the cloud provider messing up. Google and Amazon engineers are not infallible, and have made configuration errors that resulted in downtime. Examples: the Google Cloud Storage outage and the AWS S3 outage.

Most cloud resources have a lot of 9's of reliability, however. AWS S3, for instance, offers 99.99% availability and 99.999999999% durability over a year. This means there should be no more than about 52 minutes of downtime per year, and if you store 10,000,000 objects you can expect to lose one about every 10,000 years. The experts at the cloud provider work hard to keep the backend up, so you don't have to, and your company doesn't need as much staff on hand to keep things running.
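As a back-of-the-envelope check of those numbers (the SLA figures are the ones quoted above; this is simple expected-value arithmetic, not a guarantee):

```python
# Downtime implied by a 99.99% availability SLA.
minutes_per_year = 365 * 24 * 60              # 525,600 minutes
availability = 0.9999
max_downtime = minutes_per_year * (1 - availability)
print(round(max_downtime, 1))                 # about 52.6 minutes per year

# Object loss implied by 99.999999999% ("eleven nines") durability.
durability = 0.99999999999
objects_stored = 10_000_000
expected_losses_per_year = objects_stored * (1 - durability)
print(round(1 / expected_losses_per_year))    # one object per ~10,000 years
```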

Scale to Zero

Scale to Zero is the concept of having no provisioned resources when you have no traffic. This has a large benefit of saving a ton of money, and not being wasteful. However, a downside is that there can be a lag between when a request comes in and resources are available to handle the request.

In a traditional infrastructure you would have to provision enough resources to handle the peak load. This means that (taking the 80-20 rule for sake of argument), you are not using your resources to their capacity 80% of the time. Yet you must pay to run these servers. Electricity must be used, coal burned, turbines spun, and the only output is unnecessary heat.
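As a toy illustration of that 80-20 framing (all numbers here are hypothetical, including the price per unit-hour):

```python
# Provisioning for peak vs. paying only for what you use.
peak_load = 100             # units of capacity needed at peak (hypothetical)
average_load = 20           # average utilization, per the 80-20 framing
hours_per_year = 24 * 365
price_per_unit_hour = 0.01  # hypothetical price

provisioned_cost = peak_load * hours_per_year * price_per_unit_hour
pay_per_use_cost = average_load * hours_per_year * price_per_unit_hour
print(provisioned_cost)     # 8760.0 -- paying for idle capacity most of the time
print(pay_per_use_cost)     # 1752.0 -- paying only for actual usage
```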

In aggregate, at cloud scale, serverless resources can be provisioned dynamically and packed much more densely, using nearly 100% of the available capacity. Of course, in reality the datacenter must keep a lot of hardware on standby, ready to take up load; however, this is minimized in the case of serverless.

Here is a very detailed graph of this concept, as an example: [a pretty graph showing provisioned resources as a constant line, while serverless resources follow the curve of user traffic].

I don't think everyone should rush to implement serverless architectures to try to save the planet – there are better reasons to go serverless – but the optimization in cost should be considered. Because you only pay for what you use, your cost structure can be related more directly to your usage patterns. If you are just starting out and have few users, your costs will be low. Your costs scale with you, and can be factored in to the pricing model, ideally.

Serverless scales to your credit card, and this allows you to focus on what really matters to the business.

Don't reinvent the wheel, focus on business value

There is a lot of technology involved in large enterprise infrastructure projects. A lot of networking, load balancing, firewalls, general logic, and things otherwise unrelated to generating income for your business, or unrelated to the problem you are solving.

So many of these parts of the infrastructure are not necessary to provision and manage with a serverless architecture. They have been done for you, there is no need to waste time and reinvent the wheel. This immediately gives businesses that adopt serverless from the start a leg-up. The engineers can focus on core value.

With a good software architecture and corresponding business architecture (Conway's Law), your development velocity should be much higher. There are fewer barriers between writing the software and delivering it to your customers.

Speaking of Conway's Law, a serverless pattern leans heavily into microservices by default. To implement a well architected solution requires a business organization that is complementary. This can have many benefits such as faster iterations with smaller teams, and fewer bugs since changesets are smaller.

Future Impact

There is no current standard for serverless computing. The International Data Center Authority mentions it in their AE360 paper (the compute layer), but this is not a robust standard. Because there is no standard, you have to decide between the three major cloud providers, or roll your own. The OpenFaaS project can help with functions, and you could pair it with other SaaS products, but you would need to write a lot of integration code.

There is a real concern of vendor lock-in, which limits the ability to migrate and introduces a bit of unpredictability. If serverless architecture is going to be the future, I expect more standardization around common patterns will be necessary. Another huge benefit of standardization would be consolidation of tooling. Terraform providers and each cloud's specific tools are great, but they lack features and can be an obstacle to deploying infrastructure as code. This is also how cloud providers could benefit from standardization.

Things I hope to see standardized:

  • Compute
  • API
  • Storage
  • Caching
  • CDN
  • Messaging/Eventing
  • Monitoring
  • Analytics

Outside of standardization, I see more layers of abstraction being built up. For instance, in AWS, why should the user decide between RDS Aurora, DynamoDB, or Redshift? If there was a standard for serverless storage, a user could identify their access patterns and requirements, and have the proper infrastructure provided automatically.

Another method of abstraction would be to remove the notion of regions. Infrastructure could be moved around seamlessly to serve patterns of traffic automatically. Automatically provisioning instances of compute and replicating data to storage that is closer to the user.

Regardless of standardization or further abstractions, serverless is set to have a huge impact on computing. The industry is still growing rapidly, and more and more companies are moving to the cloud and going serverless.

Arguments Against

There are a lot of reasons why you should not go serverless: vendor lock-in, cost, knowledge, CapEx vs. OpEx, or it may just not provide the functionality required for your problem or business.

These are all valid concerns and need to be considered when planning your next architecture or migration.

Vendor Lock In

This is the fear of putting all your eggs in one basket. Now that all of your infrastructure is in AWS, what happens when AWS raises their prices or removes a product that you are depending on? How quickly can you migrate away from your 14 TB RDS instance?

This fear is alleviated somewhat with more cloud providers. Competition provides a bit of incentive for prices to remain low and predictable, because it would be easy to undercut for similar or equivalent products.

With IaC, it becomes easier to replicate infrastructure in a competing cloud environment while still running in the existing one. Migrating your data is a trickier issue, but with careful planning and services like AWS Snowball, it is definitely possible.

There are also hybrid solutions. Perhaps you balance your infrastructure across multiple cloud environments. This is tricky to set up, and maintenance heavy, but if vendor lock in is a major concern, this is one alternative.

Cost

Cost can make or break a business just starting out, or one with tight margins. If you have a new business with unknown usage patterns, however, serverless can be a perfect fit. Since you can scale to zero, you save a lot of money when your customers aren't using your platform.

If you are a business with tight margins, you probably have well-known and predictable access patterns. This is one case where serverless may not be the solution. If you can provision ahead of time the exact amount of resources required, then you can save a lot of money.

However, this allows no room for growth, at least not rapidly. You could find yourself in a position where you just can't handle the load the users are putting on your system. In that case, automatically scaling with cloud infrastructure could keep your business alive and making money.

A hybrid solution in that case may be the perfect fit. You have provisioned resources for the base load, but can burst into the cloud when necessary to handle surges of traffic. This requires some careful planning, but is definitely possible, though this strays away from serverless a bit.

Knowledge

Knowledge can easily be a barrier to entry to new technology. New things take time and persistence to learn. You may be hesitant to jump into serverless because there is just too much going on, and things are working just fine right now, thank you very much.

Fortunately, there are a lot of resources for learning this technology. From acloud.guru to linuxacademy, and all of the certificate training courses, you have not only the teaching material, but also the accreditation to show that you learned it.

One of the huge benefits of the computer software industry is that anyone can learn it if they put in the time and effort. It is a great equalizer. But it does require the effort.

Capex vs Opex

Your company may be unwilling to migrate to serverless, or the cloud in general, because it is not a capital expense. The company does not own anything for equity purposes; it is merely renting hardware. I am not a financial advisor or an accountant, but this seems like a silly argument. Any money saved by clever accounting for tax purposes could be overshadowed by the money saved by having a more efficient infrastructure. Your company makes its money from software, and that software needs to run on something.

You can pay people to build and maintain that infrastructure yourself, or you can pay a third party to do it for you. There is a good chance that the staff of infrastructure, network, and software engineers needed to build and maintain it will be more expensive.

Missing functionality

If you have a niche business or scientific experiment, you may not find the features you require in the cloud. There may be special requirements for privacy, security, or strange computations that just aren't there. In some cases this is absolutely true.

For instance, I don't see a quantum computer class of EC2 instances coming to AWS any time soon. In a lot of other cases, however, reframing the problem to use the provided tools is possible, and may even lead to new, more efficient solutions.

Perhaps you do not need to calculate everything at once in a large simulation, and an eventually consistent event-driven architecture would get the job done as well. This is more of a case-by-case thing, but it is worth considering alternative approaches because they could turn out to be more cost effective and efficient.


Serverless is a great form of computing, and it is here to stay. The industry is still growing rapidly, and for good reason. Companies can save a lot of money and time by using only the resources they need, and cutting out self-hosted infrastructure solutions.

Money can be saved not only in total run time costs, but also in manpower. Staffing requirements can be lessened, or people can be moved onto solving the business goals that actually generate income.

Serverless is a great abstraction on infrastructure in general. It provides the tools to think in the solution space, instead of all the steps required to set it up.

An integral part of running a Kubernetes cluster in production is security, including how human users authenticate with the API server (the public interface of the cluster). A properly configured cluster should allow only those with proper credentials access. And those with access should only have the ability to interact with the objects necessary to complete their jobs. This is the principle of least privilege. I may go into RBAC in Kubernetes later, if I feel like I have anything interesting to say on the topic. For now, let's see how authentication can work with OIDC.

The Kubernetes documentation has a very good article on Authenticating which I suggest you read for more in-depth knowledge on this topic. I am focusing on the OpenID Connect Tokens section. I am going to assume some basic understanding of OAuth2 and OIDC flows. If you are not familiar, here is a good illustrated guide I saw recently come up on Hacker News.

Configuring the Kubernetes API server for OIDC

The first step in this process is configuring the API server for OpenID Connect. There are several arguments that need to be passed into the server on startup to get this all wired in. You can see the full list here, but I am going to mention one:

--oidc-username-claim

This argument defaults to the sub value, which may work for you. I changed this to the email field for it to be a little more user friendly when assigning UserRoleBindings later on (it makes it clear who is gaining access to what, especially if you are using infrastructure as code practices).

Once you have configured the client id, URL, and CA for your OIDC provider (idp), you can get a token and attempt to connect.
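As a sketch, the relevant API server flags might look like the following (the issuer URL, client id, and CA path are placeholders for your own idp):

```
kube-apiserver \
  --oidc-issuer-url=https://idp.example.com/identity \
  --oidc-client-id=kubernetes \
  --oidc-ca-file=/etc/kubernetes/pki/idp-ca.crt \
  --oidc-username-claim=email
```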

Getting the token from your identity provider

To retrieve the token from your identity provider, you will need a few things:

  • client id
  • client secret
  • grant type
  • scopes requested
  • username
  • password

Let's do this with some PowerShell

$creds = Get-Credential -Message "Provide your $identityServerUrl credentials."

This generates a prompt for the user to enter their username and password, and stores the password in a System.Security.SecureString object. This helps ensure that the data is removed from memory when no longer needed (a clever user with access to the system could probably still get at the data if they really wanted, but it signals that this is sensitive information).

Now let's create the body for the request

$body = @{
  "scope" = "kubernetes email_only offline_access";
  "grant_type" = "password";
  "client_id" = "$clientId";
  "client_secret" = $secret;
  "username" = $creds.UserName;
  "password" = $($creds.GetNetworkCredential().Password);
}

You will have to fill in the details with your own idp. You can also use different scopes if you wish. The kubernetes scope was one I added to handle specific information for Kubernetes based authentication.

$identityServerTokenEndpoint = "$identityServerUrl/identity/connect/token"
$tokenString = Invoke-RestMethod -Uri $identityServerTokenEndpoint -Method Post -Body $body

Now we can call the token endpoint to get our token from the identity provider. The response should contain an access token and a refresh token (it may contain an id_token instead, depending on the identity provider). We can use the access token as the identity token in our Kubernetes configuration file. We set up our .kube config using kubectl, and then use the new context to authenticate with our cluster:

$accessToken = $tokenString.access_token
$refreshToken = $tokenString.refresh_token

$clusterName = "kubernetes-idp"
$contextName = "$clusterName"

& kubectl config set-cluster $clusterName --server=$k8sServerUrl --certificate-authority=$k8sCaCrt

& kubectl config set-context $contextName --cluster=$clusterName --user="$($creds.UserName)"

& kubectl config set-credentials "$($creds.UserName)" --auth-provider=oidc --auth-provider-arg=idp-issuer-url="$identityServerUrl/identity" --auth-provider-arg=client-id=$clientId --auth-provider-arg=client-secret=$secret --auth-provider-arg=refresh-token=$refreshToken --auth-provider-arg=idp-certificate-authority=$identityCaCrt --auth-provider-arg=id-token=$accessToken

& kubectl config use-context $contextName

Where $identityCaCrt is the identity provider's CA for its public domain certificate. The other variables should be self-explanatory. Just fill in the details of your cluster.

The token returned by the identity provider is a JWT. One downside of JWTs is that there is no mechanism to invalidate them directly (the server verifies the signature statelessly, so there is no central session to revoke). We can alleviate this concern by setting a short TTL on the tokens, but that leads to another issue: it is annoying to keep fetching new tokens. This is what the refresh token is for.
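To make the structure concrete, here is a sketch (in Python, with toy values) of how a JWT is put together, and why anyone holding one can read its claims and expiry without being able to invalidate it:

```python
import base64
import json

def b64url(data: bytes) -> str:
    # JWT segments are base64url-encoded with the padding stripped.
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

# A toy, unsigned token for illustration; real tokens are signed by the
# identity provider and the signature is verified by the API server.
header = {"alg": "RS256", "typ": "JWT"}
claims = {"email": "jane@example.com", "exp": 1561999999}
token = ".".join([
    b64url(json.dumps(header).encode()),
    b64url(json.dumps(claims).encode()),
    "signature-goes-here",
])

# Decoding requires no keys at all: restore the padding and parse.
segment = token.split(".")[1]
decoded = json.loads(base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4)))
print(decoded["email"], decoded["exp"])
```

The server only checks the signature and the exp claim; there is no session to revoke, which is why a short TTL plus a refresh token is the usual compromise.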

Setting the refresh token should keep generating valid credentials while the token is in active use. However, if you go an extended period without using your token, it will be invalid the next time you try to use it. One way to help with this is a scheduled task or cronjob that continually refreshes the credentials and updates your Kubernetes config file.

Refresh the token

To refresh the token, we make a refresh token request to the identity provider, using the refresh token we got initially.

$body = @{
  "grant_type" = "refresh_token";
  "client_id" = "$clientId";
  "client_secret" = $secret;
  "refresh_token" = $refreshToken;
}

$identityServerTokenEndpoint = "$identityServerUrl/identity/connect/token"
$tokenString = Invoke-RestMethod -Uri $identityServerTokenEndpoint -Method Post -Body $body

As an example to register a scheduled task on Windows that runs every 50 minutes (assuming the TTL on the token is 60 minutes):

$trigger = New-JobTrigger -Once -At ((Get-Date).AddMinutes(50)) -RepetitionInterval (New-TimeSpan -Minutes 50) -RepeatIndefinitely
Register-ScheduledJob -Name "K8s-OIDC-Refresh" -Trigger $trigger -ScriptBlock {
    param($clientId, $secret, $refreshToken, $identityServerUrl, $identityCaCrt, $userName)

    $body = @{
        "grant_type" = "refresh_token";
        "client_id" = "$clientId";
        "client_secret" = $secret;
        "refresh_token" = $refreshToken;
    }

    Write-Host "Refreshing token from $identityServerUrl..."
    $identityServerTokenEndpoint = "$identityServerUrl/identity/connect/token"
    try {
        $tokenString = Invoke-RestMethod -Uri $identityServerTokenEndpoint -Method Post -Body $body
    } catch {
        Write-Host -ForegroundColor Red "Error getting tokens"
        Write-Host -ForegroundColor Red "$($PSItem.Exception)"
        exit 1
    }

    $accessToken = $tokenString.access_token
    $refreshToken = $tokenString.refresh_token
    try {
        & kubectl config set-credentials "$userName" --auth-provider=oidc --auth-provider-arg=idp-issuer-url="$identityServerUrl/identity" --auth-provider-arg=client-id=$clientId --auth-provider-arg=client-secret=$secret --auth-provider-arg=refresh-token=$refreshToken --auth-provider-arg=idp-certificate-authority=$identityCaCrt --auth-provider-arg=id-token=$accessToken
    } catch {
        Write-Host -ForegroundColor Red "Error writing config"
        Write-Host -ForegroundColor Red "$($PSItem.Exception)"
        exit 1
    }

    # Re-register the job arguments so the next run uses the new refresh token.
    Write-Host "Update job to include new parameters"
    $oidcJob = Get-ScheduledJob | Where-Object {$_.Name -like "K8s-OIDC-Refresh"}
    Set-ScheduledJob -InputObject $oidcJob -ArgumentList ($clientId, $secret, $refreshToken, $identityServerUrl, $identityCaCrt, $userName)
} -ArgumentList ($clientId, $secret, $refreshToken, $identityServerUrl, $identityCaCrt, $creds.UserName)

Now, as long as the computer is on, the scheduled task should keep your Kubernetes configuration file up to date with a fresh token (assuming your user continues to have access). There are obvious downsides to this approach: if your computer is off over the weekend, for example, you will need to generate a new token first thing Monday morning. But that is a minor annoyance in exchange for good authentication and security.

If you haven't heard, Twitter, Facebook, YouTube, et al. have recently been “deplatforming” certain people.

Freedom of Speech

Whether or not you followed these people, or even listened to them, is not relevant. They have all said unpopular things and offended people. But the only speech worth defending is unpopular speech; nobody needs to defend the things everyone already agrees with. And this is a free speech issue, even though it is not the government doing the censoring. The freedom of expression exists outside of the U.S. Constitution, as a concept and a practice.

Yes, these corporations are free to censor whom they choose, just as you are free to demand that someone you disagree with leave your house; that doesn't make it not censorship, though. In the case of these corporations, they control so much of public discourse (at least in the United States) that they must be held to a higher standard (if not by regulation, then by the people directly).

To censor the words of those you disagree with necessarily tends toward an echo chamber. The next leaders of government come from the people in this echo chamber. Therefore it is necessary to hear all the vitriol and hate spewed by the community, and the rational arguments against such speech. To do otherwise is to eventually give the power of government to the corporations that can control the speech of the population.

The people have the power

The Internet is mankind's greatest invention (so far). It has allowed unprecedented communication, innovation, and education. It has prevented wars, and helped cause genocide. It is a powerful tool that can be utilized for good or for evil.

It is a common tactic among governments that wish to go to war to dehumanize the other. We see this with the terms Japs, Huns, Jerry, Kraut, Gooks, Ragheads, etc. We see it with offensive caricatures and cartoons, propaganda depicting the people as dumb or barbaric. Speech is a powerful tool, and utilized by religion, government, or corporations can be used to drive thought in a certain direction. On the other hand, one of the best ways to prevent conflict is to talk.

If it was possible for the people of Nazi Germany, Imperial Japan, Soviet Russia, and the other allied powers to talk freely and instantly, there would be a lot of doubt cast upon this propaganda. For the powers of the state ultimately come from the people. The people build the tanks, pay the tax, and fight the battles. To influence what people can say is to influence what people do.

A technical solution

You probably have not noticed (but props if you have!) that this blog is built on Write Freely, an open source, federated blogging tool. Each post is published at @james@jamesekstrom.com, which you can follow from other federated services that speak ActivityPub, such as Mastodon (no, not the metal band).

ActivityPub is a decentralized social networking protocol. As of writing it is a W3C Recommendation. It allows server-to-server and client-to-server communication through “inboxes” and “outboxes”. Anyone can listen on someone's outbox to get the messages they post, and people can post messages to someone's inbox for them to receive. It really is quite simple in concept, but extremely powerful.
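As a rough sketch of the idea (the JSON shape follows the ActivityStreams vocabulary, but the actor URL and content here are made up, and a real delivery also involves HTTP signatures and content negotiation):

```python
import json

# What a server might POST to a follower's inbox when a new post is published.
activity = {
    "@context": "https://www.w3.org/ns/activitystreams",
    "type": "Create",
    "actor": "https://example.com/users/james",  # hypothetical actor URL
    "to": ["https://www.w3.org/ns/activitystreams#Public"],
    "object": {
        "type": "Note",
        "content": "A new blog post!",
        "attributedTo": "https://example.com/users/james",
    },
}
print(json.dumps(activity, indent=2))
```

Any server that understands the protocol can receive and interpret this, which is how a reply on one service can show up as a comment on another.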

To learn more about how ActivityPub works, check out this great blog post by Eugen Rochko on the Mastodon team.

For instance, I said you can receive these blog posts on Mastodon, a micro-blogging tool that is built and maintained by a large group of open source enthusiasts. I got this “exposure” (nobody listens to what I have to say...) for free, simply by using this protocol. Other services such as PeerTube and PixelFed also integrate over ActivityPub, and help form the Fediverse. You can post a video to PeerTube, get a notification on Mastodon, respond to the video post on Mastodon, and the message shows up as a comment on the video. Did you catch that? Let me rephrase. The social part of the social network is not in either PeerTube or Mastodon, but in a level above both: the protocol itself!

Anyone can host an instance of PeerTube, PixelFed, Mastodon, or, in my case, Write Freely. You could even write your own tool or app that integrates with the entire Fediverse by following a simple protocol. This makes it nearly impossible to censor anything. It gives the power of the Internet and of free communication back to the people; no corporation or government can (easily) censor it (I know there are ways, such as DNS blocking or rubber-hose interrogation). This is a huge step in the right direction, away from centralized management and control of speech.

Why it won't catch on

Unfortunately, we must come back to reality. ActivityPub and the greater Fediverse won't catch on for the same reason it exists in the first place: it is not centrally managed. The network effect actively works against decentralization. Everyone and all your friends are already on <platform>, so that is where you go for content. And to make things worse, these platforms will not implement ActivityPub, because it would destroy their user base and revenue stream.

There is also the question of monetization. One of the reasons Twitter is successful is because of advertising money. There is a lot of traffic on Twitter, so advertisers pay good money to get more eyeballs on their ads. If anyone could host their own version of Twitter and receive all the same tweets with no ads, why would anyone visit Twitter? That's not to say it is impossible to make money if you're using ActivityPub, but it dis-incentivizes making money from the social aspect alone.

Hope for the future

I see two possible, although equally unlikely, scenarios in which the Fediverse gains mainstream popularity (say, as popular as Snapchat):

  • A highly used non-social media based app adds social media features using ActivityPub.
  • A highly visible and liked public figure gets deplatformed and brings a large group over to Federated services.

A highly used non-social media based app adds social media features using ActivityPub

The most popular applications I can think of are browsers. Chrome would probably not implement ActivityPub, since it would not interact with YouTube or any other Google social media platform, for the reasons stated above. Firefox could possibly, if Mozilla wanted to incorporate some sort of federated social networking features into the app, but it doesn't have nearly the same market share that Chrome has.

There is also the possibility apps like Dropbox or Spotify or some other widely-used application would want to add social networking capabilities. I don't think this is likely, again because of the network effect. Nobody would use these features because their friends are not using them.

A highly visible and liked public figure gets deplatformed and brings a large group over to Federated services

This user would have to have tens of millions of followers. Someone like Kevin Hart, maybe: almost 100 million Instagram followers, about 35 million Twitter followers, a comedian who has been in trouble with the media before. But even if he got banned from Instagram and Twitter and went to Mastodon, would his followers, ahem, follow him there? Some might; more likely his voice would just no longer be heard on those platforms.

Part of the problem with this outcome is how little publicity the Fediverse has (it is pretty new). There are very few tech people with large followings. Elon Musk is probably the most popular tech person on social media (at least in the USA), and I doubt he knows or really cares.

What can we do now?

If, like me, you feel there is a problem with so much centralization on the Internet, then what can we do to help?

Write about it, talk to your friends, talk at your local meetups. Get a conversation going about why it is important, and the dangers of centralized platforms.

Host instances of existing Fediverse projects, and connect to all your followers (this only really works if you already have people following you).

Contribute code. Most Fediverse projects are open source, so that is one avenue. You could also propose adding social aspects via ActivityPub to existing, non-social projects. Take a tool that you use frequently that could benefit from publishing notifications to a group of people (like RSS), and expand on that a little bit.

Create new social media applications. Make a federated Snapchat or Meetup application. Make a new social media thing that only exists within the Fediverse.


Centralized, walled garden variety social networks are dangerous to society. To put so much trust in so few organizations creates fragile systems. Decentralizing speech can help by making it anti-fragile. Free speech must be protected, and we have the tools to help, and with ActivityPub and the Fediverse we can bring the Internet back to the people.

Kubernetes 1.14 brought support for running native Windows containers.

The first step to running a Windows container in your Kubernetes cluster is setting up your node. This Microsoft guide is an excellent resource to get started! Windows Server 2019 brought with it support for container networking (with Flannel), so I'm going to be using that. I'm going to assume a PowerShell interface for all Windows node and pod setup and debugging, though for some debugging it may be easier to connect to the container via a PowerShell session from an RDP instance on the node.

I'm not going to cover setting up Flannel CNI or your Linux master nodes. The Microsoft guide above is a great starting point though.

Throughout this post, I'll try to describe the areas where I had trouble and what I did to resolve it.

Update Windows

There is no easy way to update Windows in a remote PowerShell session (it requires special permissions that are not inherited from the user), so we set up a scheduled task that runs as SYSTEM, set to run once, and then restart the computer.

This could take several minutes, but we want to be on version 1903 or better for container overlay networking support.

$action = New-ScheduledTaskAction -Execute "powershell.exe" -Argument "-Command ipmo PSWindowsUpdate; Get-WUInstall -AcceptAll -Download -Install | Out-File C:\PSWindowsUpdate.log"
$trigger = New-ScheduledTaskTrigger -Once -At (Get-Date)
$principal = New-ScheduledTaskPrincipal "SYSTEM"
$settings = New-ScheduledTaskSettingsSet
$task = New-ScheduledTask -Action $action -Principal $principal -Trigger $trigger -Settings $settings

Register-ScheduledTask t1 -InputObject $task

Start-ScheduledTask t1


Install Docker

Of course, we need Docker installed:

Install-Module -Name DockerMsftProvider -Repository PSGallery -Force -AcceptLicense -SkipPublisherCheck

Install-Package -Name Docker -ProviderName DockerMsftProvider
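Installing the Docker package this way generally requires a reboot before the engine will run. A quick sanity check after the node comes back up (the reboot step is an assumption based on a typical install; skip it if Docker is already running):

```powershell
# A reboot is usually required after installing the Docker package
Restart-Computer -Force

# After the node comes back up, make sure the engine is running
Start-Service docker
docker version
```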

Install the Docker images

These commands set up the “pause” image that Kubernetes uses to hold each pod's IP address and network namespace. This is a great post on pause containers.

docker pull mcr.microsoft.com/windows/nanoserver:1809

docker tag mcr.microsoft.com/windows/nanoserver:1809 microsoft/nanoserver:latest

docker run microsoft/nanoserver:latest

Prepare for Kubernetes

The guide mentioned in the intro puts everything in a C:\k directory, and I think that is a good convention. We also need to copy over the Kubernetes certificate.

On the node:

mkdir C:\k

On your machine:

$configPath = "C:\path\to\kubernetes\config"
$session = New-PsSession -ComputerName $windowsK8sNode

Copy-Item -path $configPath -Destination C:\k\config -ToSession $session

Downloading and installing the Kubernetes binaries

We need to download kubectl, kube-proxy, and kubelet binaries. You can find the released binaries in the releases for Kubernetes on Github.

We will need 7zip to extract the contents from the .tar.gz file (as of writing, PowerShell has no means to expand this type of archive).

Note: Please verify for yourself all links and binaries before running them!

Invoke-WebRequest https://www.7-zip.org/a/7z1900-x64.msi -outFile 7zip.msi

msiexec.exe /i 7zip.msi

$env:Path += ";C:\Program Files\7-Zip\"

Now download and install the Kubernetes binaries.

Invoke-WebRequest "https://dl.k8s.io/v1.14.0/kubernetes-node-windows-amd64.tar.gz" -outFile kubernetes.tar.gz

7z x .\kubernetes.tar.gz
7z x .\kubernetes.tar

Get-ChildItem kubernetes\node\bin\*.exe |% { Copy-Item $_.FullName C:\k }

rm .\kubernetes.tar 
rm .\kubernetes.tar.gz
rm .\7zip.msi

Verify that the binaries installed correctly and that the node can access the cluster. Set the PATH and KUBECONFIG environment variables so everything knows where to look.

$env:Path += ";C:\k"
[Environment]::SetEnvironmentVariable("Path", $env:Path + ";C:\k", [EnvironmentVariableTarget]::Machine)

[Environment]::SetEnvironmentVariable("KUBECONFIG", "C:\k\config", [EnvironmentVariableTarget]::User)

kubectl version

You should not see any errors.

Joining the node to the cluster (using Flannel)

Thankfully Microsoft provides a bootstrapping script that we can take advantage of.

Set-Location C:\k

Invoke-WebRequest https://raw.githubusercontent.com/Microsoft/SDN/master/Kubernetes/flannel/start.ps1 -OutFile start.ps1

# Make sure Docker is running
Start-Service docker

.\start.ps1 -ManagementIP <Windows Node IP> -NetworkMode <network mode> -ClusterCIDR <Cluster CIDR> -ServiceCIDR <Service CIDR> -KubeDnsServiceIP <Kube-dns Service IP> -LogDir <Log directory>
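The placeholder values depend on your cluster. As an illustration only, here is what the call might look like with common kubeadm/Flannel defaults (the management IP and log directory are assumptions; substitute your own):

```powershell
# Illustrative values: overlay mode with common kubeadm/Flannel defaults
.\start.ps1 -ManagementIP 10.0.0.20 `
    -NetworkMode overlay `
    -ClusterCIDR 10.244.0.0/16 `
    -ServiceCIDR 10.96.0.0/12 `
    -KubeDnsServiceIP 10.96.0.10 `
    -LogDir C:\k\logs
```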

This takes a moment or two. Afterwards, you should be able to run kubectl get nodes and see your Windows node in the cluster.

Tainting the node

To ensure that only Windows containers are scheduled on your Windows node, you can taint your node, and then apply tolerations in your pod specs.

kubectl taint nodes <node name> windows=true:NoSchedule

And then in your pod.yaml or deployment.yaml, under the pod spec:

tolerations:
  - key: "windows"
    value: "true"
    effect: "NoSchedule"

You could also use nodeSelector, but I found some Linux pods, such as kube-proxy, were sometimes still assigned to the Windows node. Tainting the node prevents this without having to select specific nodes in every pod spec.
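Putting it together, a minimal Windows pod spec with the taint's toleration might look like this (the pod name and image here are illustrative, not from the cluster above):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: sample-windows-pod        # illustrative name
spec:
  nodeSelector:
    kubernetes.io/os: windows     # beta.kubernetes.io/os on older clusters
  tolerations:
  - key: "windows"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"
  containers:
  - name: servercore
    image: mcr.microsoft.com/windows/servercore:ltsc2019   # illustrative image
```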

Other notes and considerations

Running as a service

You can set up kubelet, kube-proxy, and flanneld to run as Windows services, so a node restart does not require manual intervention. This worked for me:

Set-Location C:\k

Start-BitsTransfer https://nssm.cc/release/nssm-2.24.zip

Expand-Archive .\nssm-2.24.zip nssm-2.24/

Copy-Item nssm-2.24\nssm-2.24\win64\nssm.exe .

Invoke-WebRequest https://raw.githubusercontent.com/Microsoft/SDN/master/Kubernetes/flannel/register-svc.ps1 -outFile register-svc.ps1

.\register-svc.ps1 -NetworkMode <Network mode> -ManagementIP <Windows Node IP> -ClusterCIDR <Cluster subnet> -KubeDnsServiceIP <Kube-dns Service IP> -LogDir <Directory to place logs>


See the Microsoft guide in the intro for more troubleshooting; here are some issues I ran into.

My Windows node is NotReady

Make sure that the kubelet process is running on the node: Get-Process -Name kubelet. If you set it up as a Windows service, check it with Get-Service kubelet and start it if necessary: Start-Service kubelet.

My Windows pod cannot communicate with the Internet

Make sure that the pod can hit the kubelet on port 10250. Check Windows firewall on the node and open this port if necessary.
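If the port turns out to be closed, a rule along these lines opens it (the display name is arbitrary):

```powershell
# Allow inbound kubelet traffic on TCP 10250
New-NetFirewallRule -DisplayName "kubelet" `
    -Direction Inbound -Protocol TCP -LocalPort 10250 -Action Allow
```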

Also make sure that the kube-proxy process is running: Get-Process -Name kube-proxy. If it is not, either run the start script again, or start the service.

Setting up the .NET Framework Container

In the next post, we'll go over how to set up the Windows Dockerfile to run an ASP.NET Framework application in IIS, and some of the issues that came up with that.

Presenting at Twin Ports Web Pros in Duluth!