"A wild Demogorgon just wrecked your Kubernetes cluster"

A fun way of training for production incidents

Dungeon master: “You get an email notification from Stackdriver telling you that the number of 504 errors in the Client Eastwood service is high and rising.”
Team: “View the logs of Client Eastwood to check for anomalies.”
Dungeon master: “There have been no Client Eastwood logs for the past 3 minutes.”
Team: 😱

We just spent an entire day in a war room playing Dungeons and Dragons. Not with Demogorgons and mythical creatures, but with misconfigured Kubernetes secrets, rogue firewalls and missing database records.

I am a part of the team at Q42 that’s responsible for the Philips Hue Cloud infrastructure. Our goal is to have the lights respond to your requests as quickly as possible when you interact with them via Alexa or Google Home.

A new team

Recently there have been a lot of team changes. Two of the most experienced developers have moved on to different teams. That leaves quite a knowledge gap that we’re trying to close. In addition to the usual pair programming, PR reviews and ‘Getting to know Hue’ sessions, we introduced two new ways to get the new team up to speed: tabletop role-playing and deliberately breaking our acceptance environment.

We loosely based this on a talk by Franklin Hu at the Lead Developer Conference London this year about how they approach training new team members at Stripe. He in turn got his game day idea from a blog post by Etsy on purposefully breaking things in production.

Tabletop role-playing

The idea is that we hypothesize a scenario in which something on production breaks. The dungeon master starts off by notifying the team that there’s a problem. An alert triggers, a product owner calls, we’re deploying a new feature, something of that sort. The team can ask the dungeon master questions to get a better understanding of what is going on. The goal is to determine what the problem is and to provide a solution. In the process, the whole team learns about the best way to approach these kinds of issues.

Dungeon master: “To get inspiration for the scenarios we could throw at the team, we looked through our post-mortem write-ups and selected three interesting ones. These scenarios had manifested themselves when none of the new team members were on board yet, so they didn’t know (a lot) about these encounters. We had the metric data of those events in Datadog (our monitoring dashboard), so we could show these to the team if they asked to see them.”

Dungeon master: “The product owner calls and says that a significant number of clients disconnected simultaneously.”
Team: “View the status dashboard.”
Dungeon master: “All systems are fully operational, there’s just a huge number of clients missing.”
…after hypothetically viewing lots of dashboards, logs & databases…
Dungeon master: “You look at the data of some of the missing clients. Suddenly you see something suspicious: their IP addresses are in the same range…”
Team: “Ah! It’s an outage with an Internet Service Provider!”
Dungeon master: “Bingo! Half of Germany doesn’t have any internets…”

Team: “This was very insightful, and good to practice without a real production outage and people breathing down our necks. There's no place like production, but this came pretty close!”

Herman role-playing Google Priority Support: “How can I help you?”

Dungeon master: “It was fun to see the team struggle for a while trying to get a sense of what was going on, and the relief when they found a breakthrough.”

“Timeline of events of the virtual production outage”

Game scenarios, or Deliberately breaking acceptance

There are also situations that can be reproduced more easily in an acceptance environment. Even the product owner was in on it: after lunch he called to say there were complaints from testers who couldn't connect to their Philips Hue bridge on the acceptance environment. Team, figure it out! This turned out to be an NGINX server that was misconfigured while routing traffic into our cluster. It ran outside our cluster and we had no control over it, so providing a quick solution was a challenge.
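To give an idea of what that setup looks like: a reverse proxy sits in front of the cluster’s ingress, and a single wrong directive there is enough to break bridge connections while everything inside the cluster still reports healthy. Below is a minimal sketch of such an NGINX server block, not our actual configuration; the hostname, certificate paths and upstream IP are all placeholders.

    # Hypothetical reverse proxy in front of a Kubernetes ingress (sketch, not our real config).
    server {
        listen 443 ssl;
        server_name acceptance.example.com;           # placeholder hostname

        ssl_certificate     /etc/nginx/tls/acc.crt;   # placeholder certificate paths
        ssl_certificate_key /etc/nginx/tls/acc.key;

        location / {
            # A typo here, or a missing Host header, silently breaks clients
            # while the cluster behind it still looks fully operational.
            proxy_pass https://10.0.0.10;             # placeholder: cluster ingress / load balancer IP
            proxy_set_header Host $host;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        }
    }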

Another fun one: the Kubernetes command-line tool (kubectl) could no longer successfully send commands to our cluster. This turned out to be caused by a rogue firewall rule that didn't allow the Kubernetes master to access the nodes themselves.
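On Google Cloud, where we run, that kind of rule typically breaks only the commands the master has to proxy through to a node, which makes the symptom confusingly partial. Here is a hedged sketch of how you might recognise and undo it; every name, tag and IP range below is a placeholder, not our real setup.

    # Sketch, assuming a GKE-style cluster; all names, tags and CIDRs are placeholders.
    kubectl get nodes        # still works: only talks to the API server
    kubectl logs some-pod    # hangs: the master can no longer reach the kubelet (tcp/10250)

    # One possible fix: re-allow the master's IP range to reach the kubelet port on the nodes.
    gcloud compute firewall-rules create allow-master-to-kubelet \
      --network=my-network \
      --direction=INGRESS \
      --allow=tcp:10250 \
      --source-ranges=172.16.0.0/28 \
      --target-tags=gke-my-cluster-node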

Wrap-up

At the end of the day the new team members had touched parts of the code that they hadn’t known about until then. This significantly improved their knowledge of how to approach production issues. We also had a lot of fun along the way. Getting as close to the real thing as we could really helped the thinking process, as well as getting comfortable searching for unknown problems. It’s also a great way to get some hands-on experience with parts of the architecture you may not have seen yet.

It really felt like playing actual DnD! But with our code! Which was on fire!
~ Team member Roan

Try it yourself!

This kind of training is a great way to pass on knowledge, and it's fun for everyone involved. If you want your team to feel more comfortable jumping into production issues, it lets you build that experience without having to wait for production to break.

If you’re looking for inspiration on tabletop scenarios, check out this Twitter account that posts tabletop scenarios they play at Stripe.


Always wanted to play DnD at work? Check our job vacancy (in Dutch) at https://werkenbij.q42.nl! :)