Managing the flow of information from a source to a destination system is an integral part of every enterprise looking to generate value from its data.
Data and analytics are critical to business operations, so it’s important to engineer and deploy strong, maintainable data pipelines by following some essential practices.
This means there’s never been a better time to be a data engineer. According to DICE’s 2020 Tech Job Report, Data Engineer was the fastest-growing tech job of 2019, growing by 50% year on year. Data Scientist was also high on the list, growing by 32% year on year.
But the parameters of the job are changing. Engineers now provide guidance on data strategy and pipeline optimisation, and as the sources and types of data become more complex, they must keep up with the latest practices to support profitability and growth.
In our data pipeline playbook we have identified eleven practices to follow when creating a data pipeline. We covered six of these practices in our last blog post. Now we turn to the remaining five, including iteratively creating your data models and observing the pipeline. Applying these practices will allow you to integrate new data sources faster and at a higher quality.
About this series
This is part five in our six-part series on the data pipeline, taken from our latest playbook. First we looked at the basics in What is a data pipeline. Then we examined the six main benefits of an effective data pipeline, considered the “must have” key principles of data pipeline projects, and, in part four, covered the first six key practices needed for a data pipeline. Now we go into detail on more of those practices, before finishing the series in part six with a look at the many pitfalls you can encounter in a data pipeline project.
Practice Seven: Observe the pipeline
Data sources can suddenly stop functioning for many reasons – an unexpected change to the format of the input data, an unanticipated rotation of secrets or change to access rights, or a failure in the middle of the pipeline that silently drops data. This should be expected, and means of observing the health of data flows should be implemented. Monitoring the data flowing through the pipelines will help you detect failures when they occur and prevent adverse impacts. Useful tactics to apply include:
- Measuring counts or other statistics of data going in and coming out at various points in the pipeline.
- Implementing thresholds or anomaly detection on data volumes, with alarms when they are triggered (see the sketch after this list).
- Viewing log graphs – use the shapes to tell you when data volumes have dropped unexpectedly.
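To make this concrete, here is a minimal sketch in Python of a count-and-threshold check between pipeline stages. The stage names, expected minimums, and alerting behaviour are all assumptions for illustration – in a real pipeline the thresholds would come from configuration or an anomaly-detection model, and the alarm would go to your monitoring system rather than a log.

```python
import logging

logger = logging.getLogger("pipeline.observability")

# Hypothetical minimum expected record counts per stage, e.g. derived from
# historical volumes. In practice these might come from config or a simple
# anomaly-detection model rather than being hard-coded.
EXPECTED_MINIMUMS = {
    "ingest": 10_000,
    "transform": 9_500,
    "load": 9_500,
}

def check_stage_volume(stage: str, record_count: int) -> None:
    """Record the count for a stage and raise an alarm if it is suspiciously low."""
    logger.info("stage=%s record_count=%d", stage, record_count)
    minimum = EXPECTED_MINIMUMS.get(stage)
    if minimum is not None and record_count < minimum:
        # In a real pipeline this would page someone or post to a monitoring
        # system instead of only logging a warning.
        logger.warning(
            "Volume alarm: stage=%s count=%d below expected minimum %d",
            stage, record_count, minimum,
        )

# Example usage between pipeline stages (extract_from_source is hypothetical):
# records = extract_from_source()
# check_stage_volume("ingest", len(records))
```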
Practice Eight: Data models are important and should be addressed iteratively
For data to be valuable to the end users (BI teams or data scientists), it has to be understandable at the point of use. In addition, analytics will almost always require the ability to merge data from multiple sources. In our experience, many organisations do not suffer from big data as much as complex data – with many sources reporting similar or linked data – and a key challenge is to conform the data as a step before merging and aggregating it.
All these challenges require a shared understanding of data entities and fields – and need some kind of data model to resolve to. If you ignore this data model at the start of the pipeline, you will have to address these needs later on.
However, we do not recommend the development of an enterprise data model before data can be ingested into the system. Rather, starting with the needs of the data users in the initial use cases will lead you to a useful data model that can be iterated and developed over time.
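As a small illustration of letting the initial use case drive the model, the sketch below (in Python, with invented sources and field names) conforms order data from two hypothetical systems into a minimal shared model that covers only what the first consumers need; further fields can be added iteratively as later use cases demand them.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Order:
    """Minimal shared data model for the first use case: revenue by day.

    Only the fields the initial consumers need are modelled; more can be
    added iteratively as new use cases arrive.
    """
    order_id: str
    order_date: date
    amount: float
    currency: str

def from_web_shop(row: dict) -> Order:
    # Hypothetical web-shop export: ISO timestamps, amounts in major units.
    return Order(
        order_id=row["id"],
        order_date=date.fromisoformat(row["created_at"][:10]),
        amount=float(row["total"]),
        currency=row["currency"],
    )

def from_legacy_erp(row: dict) -> Order:
    # Hypothetical legacy ERP export: different field names, amounts in pence.
    return Order(
        order_id=str(row["ORDER_NO"]),
        order_date=date.fromisoformat(row["ORDER_DATE"]),
        amount=row["TOTAL_PENCE"] / 100,
        currency=row.get("CCY", "GBP"),
    )
```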
Practice Nine: Apply master data/reference data pragmatically to support merging
Most pipelines require data to be conformed not just to the schema but also against known entities such as organisational units, product lists, currencies, people, companies, and so forth. Ignoring this master data on ingestion will make it harder to merge data later on. However, master data management often becomes overwhelming and starts to seem as if the whole enterprise needs modelling. To avoid data analysis paralysis, we recommend starting from the initial use cases and iteratively building reference data and master data into the pipelines as they are needed.
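As a hedged example of this pragmatic approach, the sketch below applies a deliberately small currency reference table at ingestion time so that source-specific labels are conformed before merging. The labels and mapping are invented for illustration; the point is that the reference data grows only as the use cases need it.

```python
# A deliberately small reference table, built up iteratively as the initial
# use cases require it, rather than a full enterprise model.
CURRENCY_REFERENCE = {
    "GBP": "GBP",
    "UKP": "GBP",              # legacy label seen in one source (hypothetical)
    "POUND STERLING": "GBP",
    "EUR": "EUR",
    "EURO": "EUR",
}

def conform_currency(raw_value: str) -> str:
    """Map a source-specific currency label to the canonical code.

    Unknown values are surfaced rather than silently passed through, so the
    reference data can be extended when new sources arrive.
    """
    key = raw_value.strip().upper()
    try:
        return CURRENCY_REFERENCE[key]
    except KeyError:
        raise ValueError(f"Unrecognised currency label: {raw_value!r}")
```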
Practice Ten: Use orchestration and workflow tools
Pipelines typically support complex data flows composed of several tasks. For all but the simplest pipelines, it is good practice to separate the data flow from the code for the individual tasks. There are many tools that support this separation – usually in the form of Directed Acyclic Graphs (DAGs). In addition to supporting a clear isolate-and-reuse approach and enabling continuous development by providing version control of the data flow, DAGs usually offer a simple means of showing the data dependencies in a clear form, which is often useful for identifying bugs and optimising flows.
Depending on the environment and the nature and purpose of the pipeline, some tools we have found useful are listed below (a minimal DAG sketch follows the list):
- Apache Airflow
- dbt
- Argo Workflows
- DVC
- Dagster
- AWS Glue
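As an illustration of separating the flow definition from the task code, here is a minimal sketch of an Airflow 2.x DAG; the DAG name, schedule and task functions are hypothetical, and the equivalent structure exists in the other tools above.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical task functions -- in a real pipeline these would live in their
# own modules, keeping the flow definition separate from the task code.
def extract(**context):
    print("extract raw data from the source")

def transform(**context):
    print("conform and merge the data")

def load(**context):
    print("load into the warehouse")

with DAG(
    dag_id="example_pipeline",          # hypothetical name
    schedule_interval="@daily",         # hypothetical schedule
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)

    # The DAG makes the data dependencies explicit and easy to visualise.
    extract_task >> transform_task >> load_task
```

Because the flow itself lives in code, it can be version controlled and reviewed like the rest of the pipeline, which is what enables the continuous development mentioned above.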
Practice Eleven: Continuous testing
As with any continuous delivery development, a data pipeline needs to be continuously tested. However, data pipelines do face additional challenges such as:
- There are typically many more dependencies such as databases, data stores and data transfers from external sources, all of which make pipelines more fragile than application software – the pipes can break in many places. Many of these dependencies are complex in themselves and difficult to mock out.
- Even individual stages of a data pipeline can take a long time to process – anything with big data may well take hours to run. Feedback time and iteration cycles can be substantially longer.
- In pipelines with Personally Identifiable Information (PII), PII data will only be available in the production environment. So how do you do your tests in development? You can use sample data which is PII-clean for development purposes. However, this will miss errors caused by unexpected data that is not in the development dataset, so you will also need to test within production environments – which can feel uncomfortable for many continuous delivery practitioners.
- In a big data environment, it will not be possible to test everything – volumes of data can be so large that you cannot expect to test against all of it.
We have used a variety of testing practices to overcome these challenges:
- The extensive use of integration tests – providing mock-ups of critical interfaces or using smaller-scale databases with known data to give quick feedback on schemas, dependencies and data validation (see the sketch after this list).
- Implementing ‘development’ pipelines in the production environment with isolated ‘development’ clusters and namespaces. This brings the testing to the production data, avoiding both PII issues and the need for sophisticated data replication/emulation across environments.
- Statistics-based testing against sampled production data for smaller feedback loops on data quality checks.
- Using infrastructure-as-code testing tools to test whether critical resources are in place and correct (see https://www.equalexperts.com/blog/our-thinking/testing-infrastructure-as-code-3-lessons-learnt/ for a discussion of some existing tools).
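As one small example of the first of these practices, the pytest-style sketch below runs a transformation against a tiny, known, PII-clean dataset and checks the output schema and row counts. The transformation and expected values are invented for illustration – in a real pipeline the function under test would be imported from the pipeline codebase rather than defined inline.

```python
# Hypothetical transformation under test -- imagine it is imported from the
# pipeline codebase, e.g. pipeline.transform.conform_orders.
def conform_orders(rows):
    return [
        {"order_id": str(r["id"]), "amount": float(r["total"]), "currency": r["ccy"].upper()}
        for r in rows
        if r.get("total") is not None
    ]

# A small, known, PII-clean fixture gives quick feedback on schemas,
# dependencies and data validation without touching production data.
KNOWN_INPUT = [
    {"id": 1, "total": "10.50", "ccy": "gbp"},
    {"id": 2, "total": None, "ccy": "eur"},      # invalid row, should be dropped
    {"id": 3, "total": "7.00", "ccy": "EUR"},
]

def test_output_schema_and_counts():
    result = conform_orders(KNOWN_INPUT)
    # The row with a missing amount is filtered out.
    assert len(result) == 2
    # Every output row matches the agreed schema.
    for row in result:
        assert set(row) == {"order_id", "amount", "currency"}
        assert isinstance(row["amount"], float)
        assert row["currency"] in {"GBP", "EUR"}
```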
Hopefully this gives a clearer overview of some of the essential practices needed to create an effective data pipeline. In the next and final post in this series, we look at the many pitfalls you can encounter in a data pipeline project. Until then, for more information on data pipelines in general, take a look at our Data Pipeline Playbook.
Contact us!
If you’d like us to share our experience of data pipelines with you, get in touch using the form below.