Escape ticketing hell with a shift to self-service platforms

Platform engineering creates user-centric capabilities that enable teams to achieve their business outcomes faster than ever before. At Equal Experts, we’ve been doing platform engineering for over a decade, and we know it can be an effective solution to many scaling problems.

Unfortunately, it’s easy to get platform engineering wrong. In this series, I’m covering some of its pitfalls. First, it was the power tools problem, then the technology anarchy problem, and today it’s the ticketing hell problem.

Fewer handoffs, more speed

A platform engineering team aims to accelerate technology outcomes for all their teams, so they can deliver business outcomes faster than ever. I’ve explained before how those technology outcomes map onto the DORA metrics:

  • Speed (deploy lead time and deploy frequency)
  • Quality (unplanned tech work)
  • Reliability (deploy fail rate and time to restore)
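To make that mapping concrete, here’s a minimal sketch of how the speed and reliability measures could be derived from deployment records. The field names and the 30-day window are assumptions for illustration; quality (unplanned tech work) usually comes from ticket or survey data rather than deployment records, so it isn’t derived here.

```python
# Minimal sketch: deriving the speed and reliability measures from deployment records.
# The Deployment fields and the reporting window are illustrative assumptions.
from dataclasses import dataclass
from datetime import datetime
from statistics import median
from typing import Optional

@dataclass
class Deployment:
    merged_at: datetime                     # commit reaches the main branch
    deployed_at: datetime                   # commit reaches production
    failed: bool = False                    # deployment caused a production failure
    restored_at: Optional[datetime] = None  # service restored after a failure

def dora_summary(deploys: list[Deployment], window_days: int = 30) -> dict:
    lead_times = [(d.deployed_at - d.merged_at).total_seconds() / 3600 for d in deploys]
    failures = [d for d in deploys if d.failed]
    restores = [(d.restored_at - d.deployed_at).total_seconds() / 3600
                for d in failures if d.restored_at]
    return {
        "deploy_lead_time_hours_p50": median(lead_times),                      # speed
        "deploys_per_day": len(deploys) / window_days,                         # speed
        "deploy_fail_rate": len(failures) / len(deploys),                      # reliability
        "time_to_restore_hours_p50": median(restores) if restores else None,   # reliability
    }
```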

At Equal Experts, we define engineering excellence as achieving high standards in all these metrics. One of the main blockers to excellence is handoffs. They occur when a team depends on another to complete a task, such as needing DBAs for schema creation, change managers for approvals, or your operations team for deployments. If you have handoffs in platform engineering, your teams will be stuck in ticketing hell.

The nightmare of ticketing hell

Ticket-driven platform capabilities can seem appealing. Your platform team can reuse existing workflows, and teams can simply file a ticket when they need something. However, this approach causes delays, as tasks get stuck in different queues, e.g. triage, prioritized, blocked. A task to create, deploy, or restart a service might take minutes to complete, but the queue beforehand can take days or weeks.

You can measure a platform capability in internal customer value, internal customer costs, and platform costs. Here’s a v1 platform with ticket-driven capabilities, and it’s a long way from our high value, low cost ideal:

  • Internal customer value is low. Technology outcomes can’t be significantly accelerated for teams, because there are so many handoffs and queue delays
  • Internal customer costs are low. Fortunately, teams don’t have much unplanned tech work, because everything is handled by the platform team
  • Platform costs are high. The platform team has to manage a lot of incoming tickets from teams, on top of actual build and run work for the platform

This pitfall happens when your organization has a long-term, ITIL-driven culture of centralized workflow management. Platform team workload will grow uncontrollably as you increase teams and platform capabilities. Ticket queues, conflicting priorities, and strained relationships will damage speed, quality, and reliability for your teams. 

For example, at a Dutch bank we saw a new employee wait 11 weeks for access to code repositories, because their request was stuck in the platform team’s queue. The employee felt unproductive, but they were reassured by their team it was standard practice.

And at an Australian telco, any deployment to any environment requires a platform ticket, which creates a combinatorial explosion. The platform team can’t keep up with demand, which results in blocked deployments, prioritization clashes, and platform engineer burnout. 

Embracing the power of self-service 

The wrong answer to ticketing hell is to embed platform engineers into teams, as it’s unsustainable and creates its own problems. The right answer is to standardize on self-service workflows with automated guard rails and audit trails. Here’s how your platform team can get started:

  1. Measure the end-to-end time for all v1 ticket workflows.
  2. Prioritize a task that’s frequently used, and frequently slow for teams.
  3. Create a v2 self-service, fully automated pipeline that consistently performs the task to a high standard, and logs an entry in an audit trail afterwards.
  4. Visualize the audit trail in a platform portal, so teams and their stakeholders can understand the impact of their actions on user behaviors.
  5. Migrate teams from the v1 workflow to the v2 solution for that single task, then delete the v1 workflow.
  6. Move on to the next priority workflow.
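As an illustration of step 3, here’s a minimal sketch of a self-service action with an automated guard rail and an audit trail. The restart_service() stub, the guard-rail rule, and the JSON Lines audit file are assumptions for the example, not a specific platform’s API.

```python
# Minimal sketch: a self-service action with an automated guard rail and an audit
# trail. The restart_service() stub and the guard-rail rule are hypothetical.
import json
from datetime import datetime, timezone

ALLOWED_ENVIRONMENTS = {"dev", "test", "prod"}

def restart_service(service: str, environment: str) -> None:
    # Placeholder for the real automation, e.g. a call to your orchestrator's API.
    print(f"restarting {service} in {environment}")

def self_service_restart(requester: str, team: str, service: str, environment: str,
                         audit_log: str = "audit.jsonl") -> None:
    # Guard rail: validate the request before doing anything.
    if environment not in ALLOWED_ENVIRONMENTS:
        raise ValueError(f"unknown environment: {environment}")
    restart_service(service, environment)
    # Audit trail: append a structured entry that a platform portal can visualize.
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "requester": requester,
        "team": team,
        "action": "restart",
        "service": service,
        "environment": environment,
    }
    with open(audit_log, "a") as f:
        f.write(json.dumps(entry) + "\n")

self_service_restart("asha", "payments-team", "payments-api", "test")
```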

Here’s v2 of that imaginary platform. There’s a much higher internal customer value for teams when they’re able to move quickly and meet business demand. Your platform engineering efforts can be shown to have accelerated outcomes.

Conclusion

Ticketing hell can cripple your platform engineering efforts by creating bottlenecks and frustration. By transitioning to self-service capabilities, you can unlock higher internal customer value and empower your teams to deliver business outcomes.

I’m going to be sharing more platform engineering insights in my talk “Three ways you’re screwing up platform engineering and how to fix it” at the Enterprise Technology Leadership Summit Las Vegas on 20 August 2024. If you’re attending, I’d love to connect and discuss platform engineering challenges and solutions.

Platform engineering means creating user-centric capabilities that enable teams to achieve their business outcomes faster than ever before. At Equal Experts, we’ve been doing platform engineering for over a decade, and we know it can be an effective solution to many scaling problems.

Unfortunately, it’s easy to get platform engineering wrong. In this series, I’m covering some of its pitfalls. Last time it was the power tools problem, and today it’s the technology anarchy problem.

Alignment and autonomy are needed

Teams need alignment and autonomy in product and technology to succeed at scale. Alignment connects vision, strategy, and execution, while autonomy empowers teams to act independently. Increasing alignment means better decisions, and increasing autonomy means faster decisions. Here’s a 2×2 from our alignment and autonomy 101.

A two by two grid with autonomy on the x-axis and alignment on the y-axis. The grid contains sections for Autocracy (High Alignment, Low Autonomy), Aligned Autonomy (High Alignment, High Autonomy), Apathy (Low Alignment, Low Autonomy), and Anarchy (Low Alignment, High Autonomy).

Platform capabilities are a great way to share technical alignment with your teams. Baking alignment into capabilities such as deployment pipelines and observability dashboards makes engineering tasks much easier for teams. You might know this as an opinionated platform, paved roads, or golden paths. But when alignment is absent the magic doesn’t happen and your teams are in anarchy.

The long-term costs of technology anarchy 

When your platform capabilities offer autonomy without alignment, your teams can quickly make technology decisions, but without guidance. In the short term, this allows teams to use familiar tech stacks to rapidly build services and deliver them to customers. However, in the long term, it creates a fragmented ecosystem full of inefficiencies, staffing challenges, and a maintenance mountain.

I’ve previously described how to measure capabilities in internal customer value, internal customer costs, and platform costs. Here’s a v1 platform that lacks technical alignment with three teams building their own pipelines in different tech stacks. There’s some internal customer value, and low platform costs as the platform team doesn’t have to build much. But internal customer costs will skyrocket over time.

Different teams have different looping paths, each with a different tech stack. On the right, the impact of this is shown in a horizontal bar chart: low platform costs, a medium level of internal customer value, and very high internal customer costs.

This pitfall occurs when these interconnected, incorrect beliefs exist in your organization:

  • Command and control is the only method of alignment
  • Alignment and autonomy are opposites
  • Teams must be in autocracy or anarchy
  • A platform team must impose strict rules, or have no rules at all

When your platform team chooses to build capabilities without technical alignment, it’s anarchy. Every team builds custom solutions, instead of leveraging shared capabilities. When a pipeline breaks, only one team can fix it. When a pipeline is enhanced, only one team benefits. When a team needs changing, people won’t want to move. When a team needs downsizing for maintenance mode, other teams won’t want to manage their services. There are no economies of scale. 

For example, an American telco invited me to visit their 20 teams, who had adopted You Build It You Run It for all their digital services on Google Cloud Platform (GCP). Team boards didn’t show much progress, and there were blank looks when I asked about their platform team. It didn’t exist, so 20 teams were using GCP in 20 different ways. When we measured the teams on unplanned tech work, we learned 40-60% of team time was GCP work. 

Similarly, a British retailer had 10 teams with 10 different RabbitMQ messaging solutions, until a single Pub/Sub solution was mandated for consistency. This meant 10 subtly different migrations, and there was a big dent in productivity and morale. More upfront technical alignment could have prevented all that unplanned tech work.

Achieving aligned autonomy

You escape the technology anarchy pitfall by replacing low alignment capabilities with aligned autonomy capabilities. This produces paved roads, which supply the friction-free guidance that teams need to make independent technology decisions in the same organizational direction.  

In the above 2×2 grid, you’ll see aligned autonomy is in the top right. It’s possible to have high alignment and high autonomy when you implement technical alignment as contextual guidelines, not top-down rules. Here’s how your platform team can make it happen:

  • Declare low alignment capabilities as v1, and restrict them to old services.
  • Capture guidelines on tech stack, architecture, etc. from engineering leadership in Architectural Decision Records (ADRs).
  • Ask teams to build services with the decision records built in.
  • Rebuild v1 capabilities with decision records built in, and declare them as v2.
  • Host new services on v2.
  • Migrate old services to v2.
  • Delete v1.
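As a sketch of how decision records can act as guard rails rather than top-down rules, the snippet below checks a service descriptor against a couple of hypothetical ADR guidelines. The ADR contents and descriptor fields are illustrative.

```python
# Minimal sketch: ADR guidelines captured as data, plus a check that a service
# descriptor follows them. The specific guidelines and fields are hypothetical.
ADR_GUIDELINES = {
    "ADR-001": {"field": "language", "allowed": {"python"}},
    "ADR-002": {"field": "messaging", "allowed": {"pubsub"}},
}

def check_service(descriptor: dict) -> list[str]:
    """Return a list of ADR deviations for a service descriptor."""
    deviations = []
    for adr_id, rule in ADR_GUIDELINES.items():
        value = descriptor.get(rule["field"])
        if value not in rule["allowed"]:
            deviations.append(
                f"{adr_id}: {rule['field']}={value!r} is outside the guideline "
                f"{sorted(rule['allowed'])}"
            )
    return deviations

# Example: a team's service descriptor, e.g. parsed from a file in their repo.
service = {"name": "orders", "language": "python", "messaging": "rabbitmq"}
for deviation in check_service(service):
    print(deviation)  # surface as guidance in the pipeline, not as a hard block
```

Surfacing deviations as guidance keeps teams autonomous while nudging them in the agreed organizational direction.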

Your platform team needs to stay user-centered, and focus on a great platform experience. It might mean a customizable pipeline template for a single tech stack, a four golden signals observability dashboard, or an automated ServiceNow workflow. There’s a higher platform cost, but you minimize internal customer costs and boost internal customer value. That’s a good trade-off. Here’s v2 of that imaginary platform, showing the same three teams with one pipeline template for their common Python tech stack.
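For instance, a paved-road pipeline template for that Python tech stack might look something like the sketch below: the standard stages are baked in, and teams customize only a few parameters. The stage names and the platform-security-scan and platform-deploy commands are hypothetical.

```python
# Minimal sketch: a paved-road pipeline template for a single (Python) tech stack.
# Standard stages are baked in; teams customize only a few parameters.
# The "platform-security-scan" and "platform-deploy" commands are hypothetical.
def pipeline_template(service_name: str, python_version: str = "3.12",
                      run_integration_tests: bool = True) -> list[dict]:
    stages = [
        {"stage": "build", "image": f"python:{python_version}", "run": "pip install ."},
        {"stage": "test", "run": "pytest"},
        {"stage": "scan", "run": "platform-security-scan"},
        {"stage": "deploy", "run": f"platform-deploy {service_name}"},
    ]
    if run_integration_tests:
        # Slot integration tests in before the deploy stage.
        stages.insert(3, {"stage": "integration-test", "run": "pytest tests/integration"})
    return stages

for stage in pipeline_template("orders"):
    print(stage)
```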

Multiple different pipelines with different technology stacks have been replaced by clear paved roads built on one technology stack. The impact of this is shown in a comparison horizontal bar chart: an increase in internal customer value and platform costs, and a decrease in internal customer costs.

Conclusion

Technology anarchy is a dangerous pitfall with painful, long-term consequences for your organization. If you reject the false dichotomy of alignment and autonomy as opposites, your platform team can create aligned autonomy in platform capabilities and help your teams to achieve engineering excellence.

I’ll share more platform engineering insights in my talk “Three ways you’re screwing up platform engineering and how to fix it” at the Enterprise Technology Leadership Summit Las Vegas on 20 August 2024. If you’re attending, I’d love to connect and hear about your platform engineering challenges and solutions.

Platform engineering means creating user-centric capabilities that enable teams to achieve their business outcomes faster than ever before. At Equal Experts, we’ve been doing platform engineering for a decade, and we know it can be an effective solution to many scaling problems. 

Unfortunately, it’s easy to get platform engineering wrong. There are plenty of pitfalls, which can contaminate your engineering culture and prevent you from sustainably scaling your teams up and down. In this series, I’ll cover some of those pitfalls, starting with the power tools problem.

How to measure a platform capability

A platform capability mixes people, processes, and tools (SaaS, COTS, and/or custom code) to provide one or more enabling functions to your teams. In order to stay user-centered and focused on your mission, you need to measure a capability in terms of:

  • Internal customer value. How much it improves speed, reliability, and quality for your teams. The higher this is, the faster your teams will deliver.
  • Internal customer costs. How much unplanned tech work it creates for your teams. The lower this is, the more capacity your teams will have.
  • Platform costs. How much build and run work it creates for your platform team. The lower this is, the fewer platform engineers you’ll need.
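Here’s a rough sketch of how those three measures could be recorded per capability; the 1-5 scale and the example scores are assumptions for illustration.

```python
# Minimal sketch: recording the three measures for each platform capability.
# The 1-5 scale and the example scores are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class CapabilityScore:
    name: str
    internal_customer_value: int   # higher is better: speed, quality, reliability gains
    internal_customer_costs: int   # lower is better: unplanned tech work for teams
    platform_costs: int            # lower is better: build and run work for the platform team

scores = [
    CapabilityScore("container orchestration v1", internal_customer_value=4,
                    internal_customer_costs=4, platform_costs=5),
    CapabilityScore("container orchestration v2", internal_customer_value=4,
                    internal_customer_costs=2, platform_costs=2),
]
for score in scores:
    print(score)
```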

Whether it’s data engineering or a microservices architecture, it’s all too easy for your well-intentioned platform team to make the wrong trade-offs, and succumb to a pitfall. Here’s one of those tough situations. 

The hidden costs of power tools

Implementing core platform capabilities with power tools like Kubernetes, Kafka, and/or Istio is one of the biggest pitfalls we regularly see in enterprise organizations. Power tools are exciting and offer a lot of useful features, but unless your service needs are complex and your platform team knocks it out of the park, those tools will require a lot more effort and engineers than you’d expect. 

Here’s a v1 internal developer platform, which uses Kubernetes for container orchestration, Kafka for messaging, and Istio for service mesh. A high level of internal customer value is possible, but there are also high internal customer costs and a high platform cost. It’s time-consuming to build and maintain services on this platform.

Version 1 of an internal developer platform: a large, heavy weight containing Kubernetes, Istio, and Kafka capabilities. On the right is a horizontal bar chart showing the high levels of internal customer value, internal customer costs, and platform costs of heavyweight power tools.

This pitfall happens when your platform team prioritizes the tools they want over the capabilities your teams need. Teams will lack capacity for planned product work, because they have to regularly maintain Kubernetes, Kafka, and/or Istio configurations beyond their core competencies. And your platform team will require numerous engineers with specialized knowledge to build and manage those tools. Those costs aren’t usually measured, and they slowly build up until it’s too late.

For example, we worked with a Dutch broadcaster whose teams argued over tools for months. The platform team wanted Kubernetes, but the other teams were mindful of deadlines and wanted something simpler. Kubernetes was eventually implemented, without a clear business justification. 

Similarly, a German retailer used Istio as their service mesh. The platform team was nervous about upgrades, and they waited each time for a French company to go first. There was no business relationship, but the German retailer had a documented dependency on the French company’s technology blog.

Transitioning from heavyweight to lightweight tools

You escape the power tools pitfall by replacing your heavyweight capabilities with lightweight alternatives. Simpler tools can deliver similar levels of internal customer value, with much lower costs. For example, transitioning from Kubernetes to ECS can reduce internal customer costs as teams need to know less and do less, and also lower your platform costs as fewer platform engineers are required. 

Here’s a simple recipe to replace a power tool with something simpler and lower cost. For each high-cost capability, use the standard lift and shift pattern:

  • Declare it as v1, and restrict it to old services
  • Rebuild v1 with lightweight tools, and declare that as v2
  • Host new services on v2
  • Lift and shift old services to v2
  • Delete v1

As with any migration, resist the temptation to put new services onto v1, and design v2 interfaces so migration costs are minimized. Here’s v2 of the imaginary developer platform, with Fargate, Kinesis, and App Mesh replacing Kubernetes, Kafka, and Istio. Capability value remains high, and costs are much lower.

The heavy weight containing platform capabilities in version 1 has been replaced by lightweight platform capabilities in v2, shown as App Mesh, Kinesis, and Fargate in bubbles. The impact of this is shown in a horizontal bar chart comparing the high internal customer and platform costs of the heavyweight capabilities with the lower costs of the lightweight system.
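One way to resist the temptation to put new services onto v1 is a simple check over your service registry that flags any service created after the v2 cutover but still hosted on v1. A minimal sketch, assuming a hypothetical registry format and cutover date:

```python
# Minimal sketch: flag any service created after the v2 cutover that is still on v1.
# The registry entries and the cutover date are illustrative assumptions.
from datetime import date

V2_CUTOVER = date(2024, 1, 1)  # hypothetical date when v2 became the default

services = [
    {"name": "orders", "platform": "v1", "created": date(2022, 6, 1)},
    {"name": "payments", "platform": "v1", "created": date(2024, 3, 1)},  # should be on v2
    {"name": "catalogue", "platform": "v2", "created": date(2024, 2, 1)},
]

violations = [s["name"] for s in services
              if s["platform"] == "v1" and s["created"] >= V2_CUTOVER]
if violations:
    print("New services found on v1, expected on v2:", ", ".join(violations))
```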

Conclusion

Power tools are a regular pitfall in platform engineering. Unless your platform team can build and run them to a high standard, they’ll lead to a spiral of increasing costs and operational headaches. Transitioning to lighter, more manageable solutions means you can achieve a high level of internal customer value as well as low costs.

A good thought experiment here is “how many engineers want to build and run Kubernetes, Kafka, or Istio a second time?”. In my experience, not many, and that’s taking managed services like EKS and Confluent into account.

I’ll share more platform engineering insights in my talk “Three ways you’re screwing up platform engineering and how to fix it” at the Enterprise Technology Leadership Summit Las Vegas on 20 August 2024. If you’re attending, I’d love to connect and hear about your platform engineering challenges and solutions.

DORA and Accelerate (Forsgren et al.) define “Lead Time for Change” as “the amount of time it takes a commit to get into production.”

By being specific about how and when you take the measurements, you can create a Deployment Lead Time metric that can help your platform team identify improvements to reduce Lead Time for Change across multiple teams.

Change = Deployments || Releases, but Deployments != Releases

Software changes can happen as organisational events such as releases, or frequently throughout the day as deployments. Releases often require collaboration with enabling teams such as marketing, legal, and customer operations to ensure a successful outcome; they are organisational change events. Deployments are technical change events that don’t require the same level of collaboration across the organisation. With sufficient preparation, such as feature flagging capabilities, they can happen frequently throughout the day without causing incidents that impact availability or the user experience.

Release lead time, or cycle time, will vary significantly depending on how the organisation has optimised for the flow of work, and significantly reducing it can be outside the scope of the interactions between a platform team and the product teams it works with. Deployment lead time, however, can be optimised through the interactions between a platform team and the stream-aligned product teams it works with.

Measuring deployment lead time provides information on the common path to production across teams, whilst measuring cycle time informs on a team’s ways of working and the other organisational activities that have to happen to release change to users.

Deployment lead time is good for comparing across teams without getting too involved in specific teams’ ways of working

If your platform team aims to optimise the path to production across many teams, prefer starting the Deployment Lead Time measurement from when the commit hits the main branch until that commit is deployed to production (Commit D in the diagram).

By measuring from when the work is ready to go to production, we gain accurate data on the process and pipelines required for the path to production, which is easily comparable across teams, and we reduce bias towards specific teams’ ways of working, such as branching strategies, peer-review approach, and testing strategy.
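Here’s a minimal sketch of that measurement, computing deployment lead time from the moment a commit reaches the main branch to the moment it’s deployed to production. The timestamp fields are illustrative, not a specific tool’s schema.

```python
# Minimal sketch: deployment lead time measured from the commit reaching the main
# branch to that commit being deployed to production. Field names are illustrative.
from datetime import datetime

changes = [
    {"sha": "a1b2c3", "merged_to_main": datetime(2024, 6, 3, 9, 15),
     "deployed_to_production": datetime(2024, 6, 3, 10, 5)},
    {"sha": "d4e5f6", "merged_to_main": datetime(2024, 6, 3, 11, 0),
     "deployed_to_production": datetime(2024, 6, 4, 16, 30)},
]

for change in changes:
    hours = (change["deployed_to_production"] - change["merged_to_main"]).total_seconds() / 3600
    print(f"{change['sha']}: {hours:.1f} hours from main to production")
```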

When measuring from the first commit of the branch (Commit A in the diagram), we’ll produce the team’s cycle time measurement. It’ll include the time it takes to produce the work, integrate that work with others, and peer-review it (if a separate stage). 

The timing of the first commit can easily be gamed by engineers. Still, by not measuring deployment lead time from the first commit, we leave individual team preferences and ways of working alone, and measure from when the work from that team is ready for production.

Mean averages are good but do pay attention to your 50th, 90th, and 95th percentiles

Watch the median (50th percentile) to understand how you’re doing with most of your changes getting to production, and the long-tail percentiles (90th, 95th) to understand what happens when things are weirder than usual on their journey to production.

When changes to production happen quickly and safely, and the team has a good understanding of how that software operates, you’ll find your long tail moves towards your median.
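A small sketch of that analysis, assuming you already have a list of deployment lead times in hours:

```python
# Minimal sketch: median and long-tail percentiles of deployment lead time.
# The lead_times_hours values are made up for illustration.
import math

def percentile(sorted_values: list[float], p: float) -> float:
    # Nearest-rank percentile; enough for a quick read of the distribution.
    rank = math.ceil(p / 100 * len(sorted_values))
    return sorted_values[min(len(sorted_values) - 1, max(0, rank - 1))]

lead_times_hours = sorted([0.8, 1.2, 1.5, 2.0, 2.5, 3.0, 4.5, 8.0, 26.0, 72.0])
for p in (50, 90, 95):
    print(f"p{p}: {percentile(lead_times_hours, p):.1f} hours")
```

Tracking how far the 90th and 95th percentiles sit from the median over time shows whether that long tail is coming in.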

How to measure Lead Time for Change

There are many potential points in a typical developer’s workflow that you can use to measure how long it takes for a commit to get into production; be wary of accidentally measuring cycle time, pipeline time, or time to create value instead.

Instead, measure Deployment Lead Time to ensure your platform team can take action on the metric’s results and meaningfully impact it by changing the user experience of the product teams.