Let there be light! Use these IT fault analysis techniques to see systems differently

Chaos Days are a great way to learn about and improve an IT system’s resilience, by deliberately invoking failures in a controlled manner and then observing their impact. 

This article describes a set of practices for identifying which failures to explore. These practices have origins in process engineering, but are equally valid in the IT domain and can be used at any point in the development lifecycle.

How does this help?

  • Predicting error conditions can help design mitigations. In most IT systems this is about improving the user experience of a failure, but in safety-critical applications it can extend to avoiding injury or even death!
  • Improving knowledge of a system’s behaviour can inform where to get most value from automated and manual testing.
  • Knowing which failures are predictable can be used to improve monitoring and alerting.
  • It produces a list of expected behaviour under failure conditions that can help prioritise which failures can be safely explored during a Chaos Day, and which should be avoided due to their impact being too severe.

Fault Analysis Techniques

The four techniques are:

  • Functional Failure Analysis
  • Failure Mode and Effects Analysis (FMEA)
  • Fault Tree Analysis
  • Hazard and Operability Study (HAZOPs, for once an “x-Ops” that isn’t just adding operations!)

Each of these is a team exercise involving stakeholders and subject matter experts (and a formal HAZOPs requires particular participants to meet regulatory requirements). Whilst in the process engineering world we would create some documentation about the safety of a chemical plant (for example), it’s the process of thinking about the system from various viewpoints that is of most interest to me.

Let’s take a brief look at each one.

Functional Failure Analysis

This can be done very early on, without even an architecture defined, because what we are considering is the effect on the system of a failure of one or more of the functions of that system.

For example, when thinking about a car as a system, you could consider the functions of “steering”, “braking”, “acceleration” and so on, all without any idea of how those functions are implemented. Likewise, a retail website would have functions of “displaying goods”, “searching for products”, “placing orders” and so on – relatively easy to enumerate, even whilst the design is no more than a handful of boxes and lines.

So the failure analysis part is to ask of each function – how could this function fail, and what are the consequences of that failure?  As a prompt for this question we need to ensure that we consider a) loss of function, b) function provided when not required, and c) incorrect function.
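As a rough sketch of how the three prompts can drive a session, the snippet below generates a blank worksheet with one row per function and prompt. The `WorksheetRow` and `build_worksheet` names are my own invention for illustration, and the functions are the ones from the retail website example; the consequences are left for the team to fill in.

```python
from dataclasses import dataclass
from itertools import product

# The three prompts named in the text above.
FAILURE_PROMPTS = (
    "loss of function",
    "function provided when not required",
    "incorrect function",
)


@dataclass
class WorksheetRow:
    """One line of a (hypothetical) functional failure analysis worksheet."""
    function: str
    prompt: str
    consequence: str = "TBD"  # agreed by the team during the session


def build_worksheet(functions):
    """Return one blank row for every (function, prompt) pair."""
    return [WorksheetRow(f, p) for f, p in product(functions, FAILURE_PROMPTS)]


# Functions taken from the retail website example.
worksheet = build_worksheet(
    ["displaying goods", "searching for products", "placing orders"]
)

for row in worksheet:
    print(f"{row.function} | {row.prompt} | {row.consequence}")
```

Generating every pairing up front is only a way of making sure no prompt is skipped for any function; the value is still in the conversation each row provokes.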

This practice has largely been replaced by FMEA (and is sometimes referred to as a functional FMEA) but is still a useful technique for examining the system in a generic fashion because we can apply it early in the lifecycle, or with a group of stakeholders who don’t know or care about the system detail.

Failure Mode and Effects Analysis

In order to do this analysis, we need to have the architecture defined – at least to the point where we have a separation of the components of the system.  It can be done at different levels of detail, and the more detailed the examination the finer grained the effects can be.

Consider the system as a combination of components, then take each component and detail what the effect would be on the system if there were a failure of that component. In this case, we have some detail about the component so we can consider the realistic ways in which it can fail.

For instance, in a car, if the power steering component fails – the steering overall does not fail but it can be more difficult for the driver to steer, so we should indicate a power steering failure to them. Whereas, if the steering arm broke on one side then the steering would be significantly impaired; in this case a regular MOT checks the condition of the part.

In a retail website made of microservices, if the payment service failed then you wouldn’t be able to take any orders; if it was the address lookup service that failed, it would be a minor inconvenience to customers and could lead to misaddressed orders.
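The retail example can be captured as a small table of components, failure modes and effects. This is a minimal sketch rather than a formal FMEA worksheet: the `FmeaEntry` fields and the `blocks_orders` flag are assumptions I have made for illustration, used here to pick out failures that might be tolerable enough to explore on a Chaos Day.

```python
from dataclasses import dataclass


@dataclass
class FmeaEntry:
    component: str
    failure_mode: str
    system_effect: str
    blocks_orders: bool  # illustrative flag, not part of any FMEA standard


# Entries based on the retail website example above.
entries = [
    FmeaEntry(
        component="payment service",
        failure_mode="unavailable",
        system_effect="no orders can be taken",
        blocks_orders=True,
    ),
    FmeaEntry(
        component="address lookup service",
        failure_mode="unavailable",
        system_effect="minor inconvenience; risk of misaddressed orders",
        blocks_orders=False,
    ),
]


def chaos_day_candidates(fmea_entries):
    """Return failures whose effect does not stop orders being taken."""
    return [e for e in fmea_entries if not e.blocks_orders]


for entry in chaos_day_candidates(entries):
    print(f"{entry.component}: {entry.system_effect}")
```

In practice the team would decide what counts as too severe to explore; a single boolean is just the simplest stand-in for that judgement.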

Fault Tree Analysis

In fault tree analysis we flip the whole thing upside-down and start instead with the possible externally detectable ways the system can fail. Then, for each fault, we follow a process a bit like the “5 whys” – but instead of asking “why did this happen?” we ask “what could cause this?”.  We aren’t looking for a root cause, but to build a tree of things that could happen in the system that could contribute to a fault.

Importantly, some of those things will be systems doing what they should be doing but in combination with something else not working properly.

This is not a simple tree of causes: each node is connected to its branches by a logical condition (for example AND or OR). The tree can then be used to identify points where it can be disrupted, so that a particular combination of conditions does not result in a fault.
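To make the idea of logical conditions concrete, here is a minimal sketch of a fault tree with AND and OR nodes. The `Node` class and the events in the tree (including the primary and replica databases) are hypothetical, chosen only to show how a combination of conditions leads to the top-level fault.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Node:
    name: str
    gate: str = "BASIC"  # "AND", "OR", or "BASIC" for a leaf event
    children: List["Node"] = field(default_factory=list)

    def occurs(self, active: Set[str]) -> bool:
        """Return True if this event occurs given the active leaf events."""
        if self.gate == "BASIC":
            return self.name in active
        results = [child.occurs(active) for child in self.children]
        return all(results) if self.gate == "AND" else any(results)


# Hypothetical tree: the fault occurs if the payment service is down,
# OR if both the primary AND the replica database are down.
top = Node("customer cannot place order", "OR", [
    Node("payment service down"),
    Node("order database unavailable", "AND", [
        Node("primary database down"),
        Node("replica database down"),
    ]),
])

# Losing only the primary database does not cause the fault...
print(top.occurs({"primary database down"}))
# ...but losing the primary and the replica together does.
print(top.occurs({"primary database down", "replica database down"}))
```

The AND node is where the tree can be disrupted: keeping either database available is enough to stop that branch contributing to the fault.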

HAZOPs

Last, but certainly not least, is the HAZOP process.

There have been various attempts to codify this to apply to software specifically – but those have been formal, safety-based, regulatory-compliance drives.  If the process is considered more for the value of the conversation it stimulates, then it’s easier to apply it.

In simple terms, the HAZOP process examines the flow of material through the system (clearly in process engineering, that’s the real flow of chemicals etc. – in software it would be data). At each point where chemicals are used or data is processed, we use a bunch of keywords to guide our thinking on what might happen: 

  • “No or not”
  • “More”
  • “Less”
  • “As well as”
  • “Part of”
  • “Reverse (of intent)”
  • “Other than”
  • “Early”
  • “Late”
  • “Before”
  • “After”

It’s fairly easy to see what each means in process engineering – but each can be readily applied to data flows as well.

If you think of an API – what happens when the client receives data “other than” what it was expecting? What if an asynchronous process receives data “Late”, or “After” some other data? Each keyword can be applied with a bit of creativity to software data flows.

In each case you would consider the causes of the condition, the consequences for the system and the action that might need to be taken.
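One way to prepare for such a session is to pair every point in the data flow with every keyword, leaving the cause, consequence and action columns blank for the team to discuss. The sketch below assumes two hypothetical data flow points based on the API and asynchronous examples above; the `HazopRow` and `hazop_rows` names are my own.

```python
from dataclasses import dataclass
from itertools import product

# The keywords listed above.
GUIDE_WORDS = [
    "No or not", "Other than", "More", "Early", "Less", "Late",
    "As well as", "Before", "Part of", "After", "Reverse (of intent)",
]


@dataclass
class HazopRow:
    node: str        # a point in the data flow
    guide_word: str
    cause: str = ""          # filled in during the session
    consequence: str = ""    # filled in during the session
    action: str = ""         # filled in during the session


def hazop_rows(nodes):
    """Return one blank row for every (data flow point, keyword) pair."""
    return [HazopRow(n, g) for n, g in product(nodes, GUIDE_WORDS)]


# Hypothetical data flow points for illustration.
rows = hazop_rows([
    "client receives API response",
    "asynchronous worker consumes message",
])

for row in rows:
    print(f"{row.node}: what if the data is '{row.guide_word}'?")
```

Not every pairing will make sense, and that is fine: discarding a row after a moment's thought is still a useful part of the conversation.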

Summary

Whilst I’ve only really scratched the surface of these techniques, I hope I’ve highlighted some of the advantages of examining your system from the variety of perspectives they encourage, and illustrated how you can use them.

Find out more

Many of the texts on these topics are focused on process engineering. Some are 40 years old (though still relevant), and many are expensive, costing three figures.

Some good ones are suggested reading on these course modules from the Safety Critical Masters degree course at the University of York:

For more details of the techniques without delving too deep into your expense budgets, the Wikipedia pages are fairly detailed:

Finally, our open-source Chaos Day playbook provides lots of useful guidance on why, how and when to run a Chaos Day.