20th September, 2021
Bringing TDD to the Data Warehouse
As a follow-up to Language Agnostic Data Pipelines, the following post focuses on the use of dbt (data build tool).
dbt is a command-line tool that enables us to transform the data inside a data warehouse by writing SQL select statements that represent the models. There is also a paid version with a web interface, dbt Cloud, but for this article let’s consider just the command-line tool.
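To give an idea of what a model looks like, here is a minimal, hypothetical example (the model and column names are made up for illustration; ref() is how dbt resolves dependencies between models):

-- models/orders_per_customer.sql (hypothetical example)
select
    customer_id,
    count(*) as number_of_orders
from {{ ref('stg_orders') }}  -- stg_orders is an upstream model
group by customer_id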
The intent of this article is not to be a tutorial about dbt – one already exists here – nor a tutorial about TDD; the goal is to illustrate how one of our software development practices, test-driven development, can be used to develop dbt models.
Testing strategies in dbt
dbt has two types of tests:
- Schema tests: defined in YAML, these return the number of records that do not pass an assertion (for example, built-in not_null or unique checks on a column); when this number is zero, all records pass and therefore the test passes.
- Data tests: specific SQL queries that must return zero records to pass.
Both tests can be used against staging/production data to detect data quality issues.
The second type of test gives us more freedom to write data quality tests. These tests run against a data warehouse loaded with data. They can run on production, on staging, or for instance against a test environment where a sample of data was loaded. These tests can be tied to a data pipeline so they can continuously test the ingested and transformed data.
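For instance, a data test is just a select statement saved in the project’s tests directory, and it passes when the query returns no rows. A minimal sketch, with hypothetical table and column names:

-- tests/assert_no_negative_order_totals.sql (hypothetical data test)
-- Fails if any order has a negative total amount.
select *
from {{ ref('orders') }}
where total_amount < 0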
Using dbt data tests to compare model results
With a little bit of SQL creativity, the data tests (SQL selects) can be naively* used to test model transformations, comparing the result of a model with a set of expectations:
with expectations as (
    select 'value' as column1
    union all
    select 'value 2' as column1
)
select * from expectations
except
select * from analytics.a_model
The query returns rows when the expectations are not met, and in that case dbt reports a test failure. However, this methodology isn’t effective for testing models, for the following reasons:
- The test input is shared among all the tests (this could be overcome by running the data setup and dbt test once per test, but that isn’t practical and hurts the clarity and maintainability of the test suites).
- The test input does not live inside the test itself, so it is neither friendly to write nor easy to understand the goal of each test.
- The dbt test output doesn’t show the differences between the expectations and the actual values, which slows down the development.
- For each test, we need a boilerplate query in the format above (with expectations as …).
Considering these drawbacks, data tests alone don’t seem like the right tool for testing model transformations.
A strategy to introduce a kind of ‘data unit tests’
It’s possible, and common, to combine SQL with the Jinja templating engine (https://docs.getdbt.com/docs/building-a-dbt-project/jinja-macros). It’s also possible to define macros that extend dbt’s functionality. With that in mind, let’s introduce the following macro:
unit_test(table_name, input, expectations)
The macro receives:
- A table name (or a view name).
- An input value that contains a set of inserts.
- A table of expectations, expressed as a select statement.
To illustrate the usage of the macro, here is our last test case refactored:
{% set table_name = ref('a_model') %}

{% set input %}
insert into a_table(column1) values
('value'),
('value 2');
{% endset %}

{% set expectations %}
select 'value' as column1
union all
select 'value 2' as column1
{% endset %}

{{ unit_test(table_name, input, expectations) }}
There is some boilerplate when using Jinja to declare the variables passed to the unit test macro. Still, it seems a fair trade-off, because this strategy enables us to:
- Simplify the test query boilerplate.
- Set up input data in plain SQL, in the same file.
- Set up expectations in plain SQL, in the same file.
- Run each test in isolation from the other tests.
- Show differences when a test fails.
To illustrate the usage of this approach, here is a demo video:
The previous macro will be available in the repo published with the Language Agnostic Data Pipelines post.
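In the meantime, here is a minimal, simplified sketch of how such a macro could be implemented; the version in the repo may differ. It relies on dbt’s run_query macro and execute flag, and assumes the model under test is materialised as a view, so the inserted input data is immediately visible to it:

{% macro unit_test(table_name, input, expectations) %}

    {# Execute the setup inserts so the input data is in place
       before the comparison query below is run. #}
    {% if execute %}
        {% do run_query(input) %}
    {% endif %}

    {# Rows that were expected but are missing from the model,
       plus rows in the model that were not expected.
       (The duplicate-row caveat of EXCEPT still applies.) #}
    with expected as (
        {{ expectations }}
    ),
    actual as (
        select * from {{ table_name }}
    )
    select 'missing' as difference, missing_rows.* from (
        select * from expected except select * from actual
    ) as missing_rows
    union all
    select 'unexpected' as difference, unexpected_rows.* from (
        select * from actual except select * from expected
    ) as unexpected_rows

{% endmacro %}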
*Naively, because using EXCEPT between the two tables fails to detect duplicate rows. This could be fixed easily, but for illustrative purposes we preferred to keep the example as simple as possible.
Bringing software engineering practices to the data world
With dbt it is also easy to apply other standard software development practices, such as integration with a CI/CD environment. This is one of the advantages of using it over transforming data inside ETL tools that rely on a visual programming approach.
Wrapping up, we advocate that data-oriented projects should always use well-known software engineering best practices. We hope this article shows how you can apply TDD with the emerging dbt data transformation tool.
Pedro Sousa paired with me on this journey. He is making the move from software engineering to data engineering on our current project, and he helped with this blog post.
Contact us!
For more information on data pipelines in general, take a look at our Data Pipeline Playbook. And if you’d like us to share our experience of data pipelines with you, get in touch using the form below.