Our Thinking | Thu 27th May, 2021
Six ‘must have’ key principles of data pipeline projects
Facing an ever-growing set of new tools and technologies, high-functioning analytics teams have come to rely increasingly on data engineers. Building and managing production data engineering pipelines is an inherently complex process, which can prove hard to scale without a systematic approach.
To help navigate this complexity, we have compiled our top advice for building successful data pipelines. Here we examine the key guiding principles that help data engineers (of all experience levels) build and manage pipelines effectively. These principles draw on the collective experience of the data engineers at Equal Experts, who recommend adopting them to lay the foundation for sustainable and enduring pipelines.
About this series
This is part three in our six part series on the data pipeline, taken from our latest playbook. First we looked at the basics, in What is a data pipeline. Next we looked at the six main benefits of an effective data pipeline. Now we consider the “must have” key principles of data pipeline projects. Before we get into the details, we just want to cover off what’s coming in the rest of the series. In part four, we look at the six key practices needed for a data pipeline. In part five we investigate more of those practices, and in part six we look at the many pitfalls you can encounter in a data pipeline project.
The growing need for good data engineering
If I have learned anything from my years working as a data engineer, it is that practically every data pipeline fails at some point. Broken connections, broken dependencies, data arriving too late, or unreachable external systems or APIs. There are many reasons. But, regardless of the cause, we can do a lot to mitigate the impact of a data pipeline’s failure. These ‘must have’ principles have been built up over the years to help ensure that projects are successful. They are based on my knowledge, and the Equal Experts team’s collective experience, gained across many data pipeline engagements.
Data pipelines are products
Pipelines bring data from important business sources. In many cases, they feed reports and analyses that endure for a long time. Unless your business never changes how it operates and its low-level processes are never amended, data pipelines will always need to adapt to changes in the fundamental processes, new IT systems, or the data itself. As something that should respond to and embrace regular change, pipelines should be treated as products rather than projects.
This means that there should be multi-year funding to monitor and maintain the existing pipelines, with headroom to add new ones and to support the analysis and retirement of old ones. Pipelines need product managers who understand the pipelines’ current status and operability, and who can prioritise the work. (See this Forbes article for a wider description of working in product-mode over project-mode.)
Find ways to make common use of the data
The data collected for a given problem or piece of analysis will nearly always be useful in answering other questions. When creating pipelines, we try to architect them in a way that allows reuse, whilst also remaining lean in our implementation choices.
In many cases there are simple ways of achieving this. For example, there are usually a variety of places where data is stored in the pipeline. Raw ingested data might be useful for unanticipated purposes. And it can often be made available to skilled users by providing them access to the landing zone.
Appropriate identity and access technologies, such as role-based access, can support reuse while permitting strict adherence to data-protection policies and regulations. The fundamental architecture can stay the same, with access being provided by adding or amending access roles and permissions to data buckets, databases or data warehouses.
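As a concrete example, on AWS this can amount to nothing more than a bucket policy that grants an analyst role read-only access to the raw landing zone. The sketch below is purely illustrative: the bucket name, account ID, and role are hypothetical placeholders, not a prescription.

```python
import json
import boto3  # AWS SDK for Python

# Hypothetical names: replace with your own landing-zone bucket and analyst role.
LANDING_ZONE_BUCKET = "example-landing-zone"
ANALYST_ROLE_ARN = "arn:aws:iam::123456789012:role/data-analyst"

# Read-only policy so analysts can explore raw ingested data
# without being able to modify or delete it.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowAnalystReadOnly",
            "Effect": "Allow",
            "Principal": {"AWS": ANALYST_ROLE_ARN},
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                f"arn:aws:s3:::{LANDING_ZONE_BUCKET}",
                f"arn:aws:s3:::{LANDING_ZONE_BUCKET}/*",
            ],
        }
    ],
}

s3 = boto3.client("s3")
s3.put_bucket_policy(Bucket=LANDING_ZONE_BUCKET, Policy=json.dumps(policy))
```

The point is that reuse is granted by adding a role and a policy; the pipeline itself does not need to change.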
A pipeline should operate as a well-defined unit of work
Pipelines have a cadence driven by the need for decision-making and limited by the availability of source data. The developers and users of a pipeline should understand and recognise this as a well-defined unit of work – whether every few seconds, hourly, daily, monthly or event-driven.
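To make this concrete, here is a minimal sketch of a daily pipeline expressed as a single Apache Airflow DAG (one common orchestrator; any scheduler with the same properties will do). The DAG id, task names, and schedule are hypothetical.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical extract/transform/load steps for a daily pipeline.
def extract():
    ...

def transform():
    ...

def load():
    ...

# The DAG is the pipeline's well-defined unit of work: it runs as a whole,
# on a daily cadence, and retries when a step fails.
with DAG(
    dag_id="orders_daily",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=10)},
) as dag:
    extracted = PythonOperator(task_id="extract", python_callable=extract)
    transformed = PythonOperator(task_id="transform", python_callable=transform)
    loaded = PythonOperator(task_id="load", python_callable=load)

    extracted >> transformed >> loaded
```

Whatever the cadence, everyone involved should be able to point at a single definition like this and say: that is one run of the pipeline.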
Pipelines should be built around use cases
In general, we recommend building pipelines around the use case rather than the data source. This will help ensure that business value is achieved early. In some cases, the same data source might be important to several use cases, each with different cadences and access rights. Understanding when to reuse parts of pipelines and when to create new ones is an important consideration. For example, faster pipelines can always be used for slower cadences, but they typically require more effort to maintain and adapt. It might be simpler to create a new batch pipeline for a less latency-sensitive use case that is not expected to change substantially, rather than upgrading a fast streaming pipeline to meet the new requirements.
Continuously deliver your pipelines
We want to be able to amend our data pipelines in an agile fashion as the data environment and needs of the business change. So, just like any other piece of working software, continuous delivery practices should be adopted to enable continuous updates of data pipelines in production. Adopting this mindset and these practices is essential to support continuous improvement and create feedback loops that rapidly expose problems and address user feedback.
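In practice this means pipeline code lives in version control and every change runs through automated tests before it is deployed. The sketch below shows the kind of unit test we might run against a transformation step on every commit; the function and data are hypothetical examples, not taken from a real engagement.

```python
import pandas as pd

def add_order_value(orders: pd.DataFrame) -> pd.DataFrame:
    """Example transformation step: derive a total value per order line."""
    result = orders.copy()
    result["order_value"] = result["quantity"] * result["unit_price"]
    return result

def test_add_order_value_computes_totals():
    # A small, fast check that runs in CI before the pipeline is redeployed.
    orders = pd.DataFrame({"quantity": [2, 5], "unit_price": [10.0, 3.0]})
    result = add_order_value(orders)
    assert list(result["order_value"]) == [20.0, 15.0]
```

Running tests like this on every change, and automating the deployment of the pipeline code itself, is what turns a pipeline into continuously deliverable software.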
Consider how you name and partition your data
Data pipelines are a mix of code and infrastructure that can become confusing as they grow if care is not taken with the naming. Pipelines will include at least a set of databases, tables, attributes, buckets, roles, etc., and they should be named in a consistent way to facilitate understanding and maintenance of the pipelines, as well as make the data meaningful to the end-users.
In many architectures, naming will directly affect how your data is partitioned, which in turn affects the speed of the search and retrieval of data. Consider what will be the most frequent queries when specifying bucket names, table partitions, shards, and so on.
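For example, if most queries filter by date, a Hive-style year/month/day partition layout is a natural fit. The following sketch writes a hypothetical events table into that layout using pandas and pyarrow; the column names and output path are illustrative only, and the same layout applies equally to an object-storage bucket.

```python
import pandas as pd

# Hypothetical events table; columns and values are illustrative only.
events = pd.DataFrame(
    {
        "event_id": [1, 2, 3],
        "event_type": ["click", "view", "click"],
        "year": [2021, 2021, 2021],
        "month": [5, 5, 5],
        "day": [26, 27, 27],
    }
)

# Writing Hive-style partitions (year=2021/month=5/day=27/...) means queries
# that filter on date only scan the matching directories.
events.to_parquet(
    "data/events/",
    partition_cols=["year", "month", "day"],
    engine="pyarrow",
)
```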
Want to know more?
These guiding principles have been born out of our engineers’ experience, each with 10+ years of data engineering for end-to-end machine learning solutions. We are sure there are lots of other principles, so please do let us know of any approaches you have found effective in managing data pipelines.
In our next blog post in this series we will start laying out some of the key practices of data pipelines. Until then, for more information on data pipelines in general, take a look at our Data Pipeline Playbook.
Contact us!
If you’d like us to share our experience of data pipelines with you, get in touch using the form below.