The complexity of modern IT systems makes it impossible to predict how they will respond to every potential failure or interruption.
Chaos engineering helps organisations explore possible adverse events by designing and running controlled experiments that involve introducing a failure, then analysing the impact and response to that failure.
Chaos engineering experiments present any organisation with certain risks: introducing an error into any IT system could have unintended consequences, so it’s essential to plan such experiments carefully. If the experiments are to be part of a focussed event, such as a Chaos Day, make sure that:
- You have appropriate time to plan the experiments, and people required to plan and execute the experiments are available at the scheduled time.
- Your proposed date does not clash with major business events or planned changes to your existing IT systems.
- The existing systems are stable, and will be able to recover from introduced failures or errors.
These six questions are designed to help you work out the best time to hold a Chaos Day in your organisation:
Does the business consider a Chaos Day to be important at the current time?
Chaos Days generally provide a good return on investment for IT organisations, but the benefits might not be as obvious in the short-term as the benefit of investing in new capabilities. This can cause organisations to ‘put off’ Chaos Days, or not hold them at all.
If this is the case, we recommend starting with a smaller investment, such as a time-boxed risk assessment using a FAIR (Factor Analysis of Information Risk) approach. This type of exercise provides an opportunity to explore what failures could happen, how often they might occur, and the potential impact they would have. The results can provide meaningful data for building a business case for Chaos Days, which might persuade stakeholders to prioritise resilience over, or alongside, new capabilities.
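To make the idea of weighing frequency against impact concrete, here is a minimal sketch in Python. It is a simplified, FAIR-inspired calculation (estimated events per year multiplied by estimated loss per event), not the full FAIR model, which works with ranges and distributions. The scenario names and all figures are hypothetical and only illustrate how results might be ranked for a business case.

```python
from dataclasses import dataclass


@dataclass
class FailureScenario:
    """A hypothetical failure scenario with rough estimates."""
    name: str
    events_per_year: float  # estimated frequency of the failure
    loss_per_event: float   # estimated cost of one occurrence, in any currency

    def annualised_loss(self) -> float:
        # Simplification: expected yearly loss = frequency x impact.
        return self.events_per_year * self.loss_per_event


def rank_scenarios(scenarios):
    """Return scenarios ordered from highest to lowest annualised loss."""
    return sorted(scenarios, key=lambda s: s.annualised_loss(), reverse=True)


# Illustrative numbers only; replace with your own estimates.
scenarios = [
    FailureScenario("Message queue backlog", events_per_year=4, loss_per_event=10_000),
    FailureScenario("Database failover fails", events_per_year=0.5, loss_per_event=200_000),
    FailureScenario("Expired TLS certificate", events_per_year=1, loss_per_event=30_000),
]

ranked = rank_scenarios(scenarios)
for scenario in ranked:
    print(f"{scenario.name}: {scenario.annualised_loss():,.0f} per year")
```

Scenarios at the top of the ranking are natural candidates for the first Chaos Day experiments, and the figures give stakeholders something tangible to compare against the cost of running the event.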
Do we have enough time to plan, execute and review a Chaos Day?
It’s vital that Chaos Days are carefully scheduled to ensure that the organisation has the time and resources available to fully engage with the experiment. The timeline for a Chaos Day must allow sufficient time for planning. We recommend that you arrange a planning session with the whole team at least two weeks before the Chaos Day.
Ideally, the Chaos Day should be scheduled for a date that won’t impact key business events. For example, if the experiments could leave the target environment severely degraded, check that this won’t delay any production releases that need to pass through it around that date. You will also need to allow time after the Chaos Day for teams to hold review meetings, capture learning and make recommendations.
You’ll need to schedule time to discuss and plan experiments, and decide when they should be run. You’ll need a couple of weeks after planning to allow engineering teams enough time to design, implement and test the experiments. Don’t forget to check whether other teams have any key dependencies on the target environment and services over the Chaos Day, so that your experiments don’t cause problems for them.
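The timeline above can be sketched as a small date calculation. This is only an illustration: the two-week planning lead time comes from the guidance above, while the one-week review window, the three-day clash buffer, the function names and the example dates are all assumptions chosen for the sketch.

```python
from datetime import date, timedelta


def chaos_day_timeline(chaos_day: date, prep_weeks: int = 2, review_days: int = 7) -> dict:
    """Work backwards and forwards from the Chaos Day to key dates.

    prep_weeks reflects the 'at least two weeks' planning lead time;
    review_days is an assumed window for holding review meetings.
    """
    return {
        "planning_session_by": chaos_day - timedelta(weeks=prep_weeks),
        "chaos_day": chaos_day,
        "review_by": chaos_day + timedelta(days=review_days),
    }


def find_clashes(chaos_day: date, busy_dates: list, buffer_days: int = 3) -> list:
    """Return business dates that fall within buffer_days of the Chaos Day."""
    return [d for d in busy_dates if abs((d - chaos_day).days) <= buffer_days]


# Hypothetical example: a proposed Chaos Day and two known business dates.
proposed = date(2025, 3, 20)
timeline = chaos_day_timeline(proposed)
clashes = find_clashes(proposed, [date(2025, 3, 21), date(2025, 4, 15)])

print(timeline)
print("Clashes:", clashes)
```

If the clash list is not empty, either move the proposed date or agree with the affected teams how the release or event will be protected.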
Do we have any major changes planned?
In general, the best time to schedule a Chaos Day is far enough away from business changes to avoid disruption, but close enough to planned changes to provide useful insight, with time left to resolve any problems that could impact those changes. For example, an online retailer could hold a Chaos Day two months before a seasonal peak in traffic, while a manufacturer could organise a Chaos Day to test its supply chain management before implementing a new shipping process.
It’s important to allow time between the Chaos Day and the business change, so that learning can be distilled and improvements applied. At one of our clients with a very large platform (1,000+ microservices processing 1 billion requests on peak days), we found that 2-3 Chaos Days per year was ideal.
Is our system currently stable?
Chaos Days offer important benefits, but if your systems are currently unstable and there are regular incidents and failures, you might already have enough chaos to contend with! In this instance, we advise focusing on improving post-incident reviews to bring about system stability. Once you’ve had a few months free of issues, then you could try a small Chaos Day in pre-production to test that stability.
Are the right people available to support our Chaos Day?
To get the best results from a Chaos Day, you’ll need experienced engineers from across the relevant teams to design and execute experiments. Ensure that these people are not committed to other tasks at the time you will need them for your Chaos Day.
How much time do we need for a Chaos Day?
To identify the best time to run a Chaos Day, first decide whether you need just a few hours, a whole day, or whether the event should be spread over several days.
Running a Chaos Day over a single day results in a more intensive but shorter event, which can add stress. Our experience is that this intensity also improves team dynamics and leads to a more memorable event. The disadvantage of a single Chaos Day is that it can be hard to maintain an element of surprise, especially in a pre-production environment. Teams need to be informed to treat failures in this environment as though production were on fire, so it’s likely they’ll be paying close attention.
Spreading a Chaos Day over a number of days makes it easier to spring surprises on the team, because people won’t know exactly when experiments will be run during the chosen period. It also allows for adjustments to be made to experiments, using learning from early experiments to improve later ones.
For more insight into the benefits of running a Chaos Day, along with expert guidance on how and when to organise Chaos Days for the maximum benefit, check out the playbook online.