Mon 10th June, 2024
Why unplanned tech work is a silent killer
I speak with customers across the Equal Experts network to understand their scaling problems. I’m sometimes asked ‘why don’t my teams have enough capacity to deliver?’ or ‘what’s the business benefit of standardization?’, and unplanned tech work is usually involved in both. It has a proven link to poor technical quality, and if you don’t control it, it’ll silently kill team capacity.
I once visited a British telco where 20 autonomous teams were building digital services on GCP. A delivery manager said “our teams do You Build It You Run It, they frequently deploy, but features are slow to reach customers and we don’t know why”. When I suggested measuring team capacity as value-adding and non-value-adding work, the manager predicted an 80/20 split in favour of value-adding work. They were horrified when an analysis revealed the reverse, a 20/80 split, with teams averaging 60% of their time on unplanned tech work!
What is unplanned tech work, why was it silently killing team capacity at this telco, and how did they start to turn the situation around?
The impact of unplanned tech work
Team capacity can be divided into product work and tech work, and tech work is either planned or unplanned. Product work is about delivering value-add, to satisfy business demand and user needs. Planned tech work is proactive tech initiatives and routine BAU maintenance. Unplanned tech work is also known as break/fix or rework, and consists of reactive fixes, patching, and emergency maintenance. It includes configuration errors, defects, deployment failures, environment issues, security vulnerabilities, and test failures.
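To make the categories above concrete, here’s a minimal Python sketch of a capacity split. The `CapacitySplit` class and its field names are my own illustration, not a standard model. The telco figures assume that value-adding work maps to product work, and that the 20% of non-value-adding time not spent on unplanned tech work was planned tech work. The analysis didn’t state that breakdown, so treat it as an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CapacitySplit:
    """Share of a team's time, as percentages that must sum to 100."""

    product: float         # value-adding product work
    planned_tech: float    # proactive initiatives and routine BAU maintenance
    unplanned_tech: float  # break/fix, rework, emergency maintenance

    def __post_init__(self) -> None:
        total = self.product + self.planned_tech + self.unplanned_tech
        if abs(total - 100) > 1e-6:
            raise ValueError(f"percentages sum to {total}, expected 100")

    @property
    def tech(self) -> float:
        """All tech work, planned and unplanned."""
        return self.planned_tech + self.unplanned_tech


# Telco example: 20% value-adding, 60% unplanned tech work.
# The 20% planned tech figure is inferred, not reported.
telco = CapacitySplit(product=20, planned_tech=20, unplanned_tech=60)
print(f"product {telco.product}% / tech {telco.tech}%")
```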
Unplanned tech work harms technical quality and delivery speed. More unplanned tech work means more quality problems and slower feature delivery. This was shown by Dr. Nicole Forsgren et al. in Accelerate, which found that high-performing organizations spend 29% less time on unplanned tech work. It also found that continuous delivery practices like frequent deployments are a predictor of low unplanned tech work.
At the British telco, their high levels of unplanned tech work were due to a lack of technology alignment. Without guidelines from senior leadership, the 20 teams had created 20 unique stacks in Go, Java, and Python hosted on App Engine, Cloud Run, and Kubernetes. There was no standardization at all, and the majority of deployments were solely configuration and infrastructure fixes. Teams were running just to stand still, but why hadn’t this been noticed before?
Why unplanned tech work is a silent killer
Unplanned tech work silently consumes team capacity because it’s:
- Unmeasured. Teams tend to measure quality as outputs, like code coverage or defect count. They don’t look at outcomes, like the rate of break/fix work and its impact on delivery speed
- Unautomatable. There are many sources of break/fix work, and it’s hard to create automated measurements of toil because break/fix work occurs inside everyday tasks
- Unquestioned. Break/fix work becomes accepted as the norm over time. Teams overlook recurring problems, leading to the normalization of deviance where the abnormal becomes the normal
Unplanned tech work can be measured as rework rate percentage, which comes from Accelerate. Trying to automate it by analyzing Jira tickets and/or Git commits is time-consuming, and misses a lot of break/fix work. It’s more cost-effective to ask engineers to estimate what percentage of their time is spent on unplanned tech work. That estimate should be understood as a high-level percentage, not a detailed breakdown. Micro-management is counterproductive, and there’s too much variability for accurate forecasting.
At the British telco, a weekly Google Form survey was introduced to ask engineers a single question, ‘what % of your time last week was spent on reactive fixes and emergency maintenance?’, with a picklist of the worst offenders. The survey highlighted teams with the most rework, identified Kubernetes as a regular break/fix source, and produced actionable insights such as one tech lead reporting “30% of my time every week is spent reconfiguring Kubernetes, it’s not our core competency”.
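If you export survey answers like these, a few lines of Python can surface the teams with the most rework. This is a rough sketch, not the telco’s actual tooling. The `rework_rate_by_team` function, the team names, and the percentages are all hypothetical, and each response is assumed to be a (team, percentage) pair.

```python
from collections import defaultdict
from statistics import mean


def rework_rate_by_team(responses):
    """Average the self-reported rework percentage per team, highest first.

    `responses` is an iterable of (team, percent) pairs, one per engineer
    per week. The result is a high-level estimate, not a precise measure.
    """
    by_team = defaultdict(list)
    for team, percent in responses:
        if not 0 <= percent <= 100:
            raise ValueError(f"invalid percentage for {team}: {percent}")
        by_team[team].append(percent)
    averages = [(team, mean(values)) for team, values in by_team.items()]
    return sorted(averages, key=lambda item: item[1], reverse=True)


# Hypothetical answers from one week of the survey.
responses = [
    ("payments", 30),
    ("payments", 50),
    ("search", 10),
    ("search", 20),
]
for team, rate in rework_rate_by_team(responses):
    print(f"{team}: {rate}% of time on unplanned tech work")
```

Averaging per team keeps the result at the level of a trend. It deliberately avoids reporting on individual engineers, which would drift into the micro-management mentioned earlier.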
How to control unplanned tech work
You can’t eliminate unplanned tech work, but it can be controlled:
- Identify major sources of unplanned tech work. Use team feedback to observe trends in unplanned tech work, and prioritize the jobs to be done
- Create technical alignment. Ask your senior leaders to set contextual technology guidelines. This might include standardization on tech stacks, ways of working, or measures of success
- Implement technical alignment as paved roads. Empower a platform engineering team to create friction-free, self-service capabilities that are fast and fault-free with zero maintenance for teams
- Prioritize technical quality alongside functionality. Encourage teams to treat quality as the foundation of a great user experience, and build quality into digital services from the outset
As always, we recommend a test-and-learn approach focused on outcomes. Don’t spend months building an unwanted platform. Don’t standardize everything for the sake of it. Run short experiments to eliminate the largest source of break/fix work. When teams report improvements, move on to the next largest source, and continue until teams achieve a consistently low level of unplanned tech work.
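Choosing the next experiment can be as simple as counting how often each source is reported. Here’s a minimal sketch, assuming team feedback arrives as a flat list of source names. The `next_experiment` function and the sample picks are hypothetical, and a raw count is only one way to rank sources. You might prefer to weight by time lost.

```python
from collections import Counter


def next_experiment(picks, resolved=()):
    """Return the most frequently reported break/fix source not yet resolved.

    `picks` is a list of source names reported by engineers.
    `resolved` holds sources that earlier experiments have already dealt with.
    Returns None when every reported source has been resolved.
    """
    remaining = [source for source in picks if source not in resolved]
    if not remaining:
        return None
    return Counter(remaining).most_common(1)[0][0]


# Hypothetical picklist answers gathered over several weeks.
picks = [
    "Kubernetes",
    "GCP config",
    "Kubernetes",
    "test failures",
    "Kubernetes",
    "GCP config",
]
first = next_experiment(picks)
second = next_experiment(picks, resolved={first})
print(first, second)
```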
At the British telco, the major break/fix sources were Kubernetes underprovisioning and GCP misconfiguration. A platform engineering team was formed to create migration routes from App Engine and Kubernetes onto Cloud Run, and a self-service Cloud Run deployment pipeline. There was a substantial drop in break/fix work, and the artificially high deployment frequency fell, because fewer deployments were solely configuration and infrastructure fixes. This resulted in more capacity for teams to focus on value delivery.
Conclusion
When teams don’t have enough delivery capacity, it’s possible they’re suffering from too much unplanned tech work. This means reactive fixes and emergency maintenance work, and it’s usually unmeasured. Asking teams to regularly estimate the percentage of their time spent on unplanned tech work can quickly yield valuable insights.
Don’t build an unwanted platform, and don’t standardize for the sake of it. Focus on the desired outcome, unlock opportunities to eliminate sources of break/fix work, and standardize where appropriate to make incremental improvements. This will drive up technical quality, and allow teams to work towards engineering excellence.