Our Thinking Wed 31st January, 2024
Setting up in the off-season for support success
If you want to run You Build It You Run It (YBIYRI) support during seasonal traffic events, set yourself up for success by building operational muscle memory with office-hours YBIYRI year-round, rehearsing incident response, and proactively investigating failures.
What is seasonal service traffic?
Not all digital services experience a consistent and steady demand from their users. Some services may have seasonal peak traffic events such as Self Assessment, Black Friday/Cyber Monday or Valentine’s Day. During these seasonal events, traffic to the service significantly increases compared to the rest of the year.
When seasonal traffic is vastly different from off-season traffic, product managers may find it hard to justify out-of-hours (OOH) support during the off-season, which can lead to year-round underinvestment in operability and incident response. By taking the steps described here, you can build muscle memory during the off-season while keeping your future support options flexible, so that service metrics like Mean Time To Recover (MTTR) stay consistently low year-round.
What is YBIYRI?
“You Build It You Run It” (YBIYRI) is an operating model in which product teams build, deploy, operate, and support their own digital services. Many organisations adopt a YBIYRI support model with their product teams to lower MTTR, achieve higher deployment throughput, and improve service reliability. This approach also fosters a culture of continuous learning.
Depending on the impact of service incidents (financial, legal, regulatory, or reputational), a product manager may choose a YBIYRI support model for their services as insurance against reduced availability, minimising the impact on their product’s users.
During peak traffic seasons, it’s often easier to justify OOH YBIYRI support. During the off-season, however, it may be hard to financially justify that same OOH support, because that level of availability protection is only needed during office hours. And that’s perfectly okay.
Setting up for success in the off-season
By taking the steps below in your off-season, you’ll be able to enjoy the benefits of YBIYRI during seasonal traffic events.
Keep tooling and incident response processes the same between office hours and out-of-hours
Use the same synchronous incident alerting tooling (PagerDuty, VictorOps, etc.) to notify incident responders year-round, so that responders never have to learn a different tool when the season changes.
During office hours, continue to use a YBIYRI support model
Product teams will be continually incentivised to invest in availability and operability all year round, resulting in less of a last-minute scramble to implement operational improvements for a lower MTTR during seasonal events.
For out-of-hours support, either:
- Where traffic is still meaningfully present, shift support to an operations team using an Ops Run It support model for triage and response, with an appropriate handover between the product team and the operations team at the start and end of office hours.
- Where traffic is minimal in the off-season and incident impact is low, delay triage and response until office hours, when the owning product team can pick the incident up.
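The routing choices above can be sketched as a small decision function. This is only an illustration, not a real alerting integration: the office hours, the traffic threshold, the `Alert` fields and the returned route names are all assumptions made for the example. The source also does not say what to do out of hours when traffic is minimal but impact is high, so the sketch assumes that case goes to the operations team.

```python
from dataclasses import dataclass
from datetime import datetime, time

# Assumed office hours (Monday to Friday); adjust to your organisation.
OFFICE_START = time(9, 0)
OFFICE_END = time(17, 30)


@dataclass
class Alert:
    """Hypothetical alert payload used only for this sketch."""
    service: str
    raised_at: datetime
    requests_per_minute: float
    high_impact: bool


def in_office_hours(ts: datetime) -> bool:
    """True on weekdays between the assumed office start and end times."""
    return ts.weekday() < 5 and OFFICE_START <= ts.time() < OFFICE_END


def route(alert: Alert, ooh_traffic_threshold: float = 50.0) -> str:
    """Pick a responder for an alert.

    Office hours: the owning product team (YBIYRI).
    Out of hours with meaningful traffic or high impact: operations team.
    Out of hours with minimal traffic and low impact: wait for office hours.
    """
    if in_office_hours(alert.raised_at):
        return "page-product-team"
    if alert.requests_per_minute >= ooh_traffic_threshold or alert.high_impact:
        return "page-operations-team"
    return "defer-to-office-hours"
```

Whichever route is chosen, the alert would still be raised through the same alerting tool; only the responder changes.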
Rehearse and practice incident response
Providing a safe environment for product teams to practise incident response processes and use tooling to debug production-like issues is key to surfacing areas for continual improvement and achieving a consistently lower MTTR. Consider running rehearsal sessions with a small, purpose-built digital service that has subtle ways of breaking. A facilitator presents the incident to the team, breaks the service further as the incident evolves, and resolves the factors causing failure once the team has addressed them. This gives product teams the chance to practise the incident response process on more in-depth incidents, without the prior knowledge and context they have of a service they’re intimately involved in building, and it surfaces gaps in the process along the way.
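As a rough sketch of what such a rehearsal service might look like, the example below models a service whose facilitator introduces faults one at a time from a prepared script and clears them as the team fixes them. The class name, fault names and status codes are all hypothetical, chosen only to illustrate the idea of an incident that evolves in stages.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RehearsalService:
    """Toy service for incident rehearsals (illustrative only)."""
    script: list[str]                      # ordered faults the facilitator will introduce
    active: set[str] = field(default_factory=set)
    next_index: int = 0

    def escalate(self) -> Optional[str]:
        """Introduce the next scripted fault; return None when the script is exhausted."""
        if self.next_index >= len(self.script):
            return None
        fault = self.script[self.next_index]
        self.active.add(fault)
        self.next_index += 1
        return fault

    def resolve(self, fault: str) -> None:
        """Clear a fault once the team has addressed its cause."""
        self.active.discard(fault)

    def handle_request(self) -> int:
        """Return an HTTP-style status code reflecting the currently active faults."""
        if "database_down" in self.active:
            return 503
        if "slow_dependency" in self.active:
            return 504
        return 200


# Example session: the incident worsens, then the team works back to healthy.
service = RehearsalService(script=["slow_dependency", "database_down"])
service.escalate()                 # timeouts begin
service.escalate()                 # the service now fails outright
service.resolve("database_down")   # the team restores the database
service.resolve("slow_dependency") # the team fixes the slow dependency
```

Because the faults are scripted rather than random, the same rehearsal can be repeated with different teams and the results compared.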
Proactively investigate failures
Failures in a service can happen at any time. Modern digital services operate in an ever-changing environment regardless of seasonal traffic patterns. To be prepared for such incidents, it’s important to proactively investigate potential failures by adopting practices such as chaos engineering and chaos days. This can help you establish effective mitigation strategies and document the best steps for triaging and responding to a failure ahead of time, lowering your MTTR.
By keeping tooling choices and incident response processes the same, product team members can build and maintain incident response muscle memory all year round, reducing the time it takes to onboard new team members or operationally prepare ahead of seasonal events. Additionally, ad-hoc YBIYRI support is still possible in the off-season.
Final thoughts
Organisations and products needing YBIYRI during seasonal traffic events can set up for success in the off-season by practising office-hours YBIYRI to keep their operational muscle memory strong, as well as keeping tooling and incident response processes the same, rehearsing and practising incident response, and proactively investigating failures.