Our Thinking Mon 7th March, 2022
Testing and deploying UDFs with dbt
Edit: You might be interested in our updated solution to this problem: UDF Nirvana? A new way to integrate UDFs into the dbt ecosystem
Inspired by our previous post on unit testing dbt and discussion in the dbt forums, I looked into how to get dbt to help me test and deploy UDFs. Although dbt doesn’t directly support UDFs, it turns out it’s actually fairly straightforward to define, test and reference them in your models. I’ll be using BigQuery in this post, but I think the techniques will apply to other warehouses that support UDFs like Snowflake, Redshift and Spark, too.
What’s a UDF and what does it have to do with dbt?
A UDF is a User-Defined Function. Like any other type of function, they’re really helpful to capture and name chunks of logic that you can apply in queries, and I talked about them before, starting from Tip 4 in our Data Warehouse Tips post. dbt takes care of deploying the tables and views that you might use your UDFs in, so robustly referencing a function from your models is important. When I say ‘robustly’, I mean ensuring that the function exists and that you’re calling the function from the appropriate location – for example, avoiding hard-coding your production functions everywhere. There’s no native support for referencing a UDF like there is with `source` or `ref`, nor any native way to deploy a function in a project. This post explains how you can solve those problems.
The example
I have a BigQuery dataset that I use for my personal development work. This is referenced in my own dbt profile. When I run dbt commands on my machine, they take effect here.
There are test and production datasets that my CI system has profiles for. I cannot run ad-hoc dbt commands in these datasets. When I push my changes, the CI system will run any dbt commands necessary to deploy my updates and run tests. I’ll not be talking about that in this post, but it’s important that I’m referencing the UDF that is deployed in the same environment that I’m working in – otherwise I can’t iterate safely, relying on my tests.
The example UDF I’ll use is a simple predicate function, located in `my_schema`, from string to bool that returns true if and only if the string value can be interpreted as a positive integer:
CREATE OR REPLACE FUNCTION my_schema.is_positive_int(a_string STRING) RETURNS BOOLEAN AS ( REGEXP_CONTAINS(a_string, r'^[0-9]+$') );
Defining a UDF in dbt
We need to write a dbt “macro” – a Jinja template – that parameterises where the UDF should go. In `macros/is_positive_int_udf.sql`:
{% macro is_positive_int_udf() %} CREATE OR REPLACE FUNCTION {{ target.schema }}.is_positive_int(a_string STRING) RETURNS BOOLEAN AS ( REGEXP_CONTAINS(a_string, r'^[0-9]+$') ); {% endmacro %}
Note the macro name matches the file and that `target.schema` is templated out.
To have dbt actually deploy this for us, we add a hook in our `dbt_project.yml`:
on-run-start: - '{{ is_positive_int_udf() }}'
Now, when we `dbt run`, the UDF will be deployed.
11:11:52 Running 1 on-run-start hooks 11:11:52 1 of 1 START hook: my_project.on-run-start.0.................... [RUN] 11:11:52 1 of 1 OK hook: my_project.on-run-start.0....................... [OK in 0.49s]
Kind of yucky in the console output with just list indices to identify what’s what – but not the end of the world.
Testing a UDF in dbt
Now our UDF is schema-aware and is being deployed by dbt, we can write tests for it with minimal ceremony. In `tests/is_positive_int.sql`:
WITH examples AS ( SELECT 'foo' the_string, FALSE expected UNION ALL SELECT '1111', TRUE UNION ALL SELECT '5 not this', FALSE UNION ALL SELECT 'still not this 5', FALSE UNION ALL SELECT NULL, NULL ) , test AS ( SELECT *, {{ target.schema }}.is_positive_int(the_string) actual FROM examples ) SELECT * FROM test WHERE TO_JSON_STRING(actual) != TO_JSON_STRING(expected)
Running `dbt test` runs the pre-hooks and then the tests, perfect.
... 11:25:22 Running 1 on-run-start hooks 11:25:22 1 of 1 START hook: my_project.on-run-start.0.................... [RUN] 11:25:23 1 of 1 OK hook: my_project.on-run-start.0....................... [OK in 0.58s] ... 11:25:24 1 of 1 START test is_positive_int............................... [RUN] 11:25:25 1 of 1 PASS is_positive_int..................................... [PASS in 1.02s] 11:25:26 Finished running 1 tests, 1 hooks in 1.46s.
Using the UDF in a Model
We’ve actually already seen how to do this, in the test.
In `models/my_model.sql`:
SELECT {{ target.schema }}.is_positive_int(maybe_positive_int_column) FROM {{ ref(...) }}
Summary
There you have it. In summary, to incorporate UDFs neatly into your dbt models:
- Write a macro for your UDF, templating the target schema
- Add a call to macro to `on-run-start` in your `dbt_project.yaml`
- Write some tests, of course
- Use the UDF in your models, templating the target schema