Imagine a world where UDFs (user-defined functions) seamlessly integrate with your dbt workflow, automatically deploying, testing, and documenting themselves alongside your models. That’s been on my wishlist since I started using dbt three years ago.
To proceed with my recent work on vulnerability management behaviour in the Python ecosystem, I needed a solution for matching package versions to vulnerability reports. A reliable pure SQL solution didn’t exist, so I wrote my own, using dbt and UDFs in BigQuery. I wanted a better integrated way of managing this network of 16 or so interrelated UDFs and their tests than my previous solution – a big lump of SQL in an on-run-start hook.
Enter the lightbulb moment: treat UDFs as dbt models with custom materialisation. This dea unlocks a treasure trove of benefits:
- Effortless Deployment: No more on-run-start hook hacks. UDFs automatically deploy alongside your models when you run dbt commands like run and build.
- Testing Confidence: Sleep soundly knowing your UDF tests are well-integrated into the dbt build and test process.
- Seamless Referencing: Forget complex workarounds. UDFs become first-class citizens in your models, referenced like any other dbt object.
- Clear Documentation: No more mystery functions. UDFs are included in dbt documentation, making their purpose and usage clear.
- Concurrent Efficiency: UDF deployment doesn’t hold back your workflow. It happens alongside other model operations, optimizing your development time.
- Enhanced Metadata: Add descriptions to your UDFs, and they’ll magically appear in the data warehouse, providing valuable context for users.
- Granular Control: Need to exclude a specific UDF for a particular task? No problem! dbt grants you the flexibility to select and exclude UDFs as needed.
Creating a custom materialization is pretty straightforward for UDFs, because they are quite simple database objects. My minimal BigQuery implementation is the 25 lines of macro here. More importantly than that is how you use this new materialization. Here’s a real-life example from my public pypi-vulnerabilities repository:
{{ config(
materialized='udf',
parameter_list='specs ARRAY<STRING>, version STRING',
returns='BOOL',
description='True when the version string matches any of an array of specs, where each spec may have an upper and lower bound. Example: 1.1.1 matches >1.1.0,<1.1.2, does not match >1.1.0,<1.1.1"'
) }}
EXISTS (SELECT 1 FROM UNNEST(specs) spec WHERE {{ ref('matches_compound_spec') }}(spec, version))
That macro is called matches_multi_spec. `materialized=’udf’` is where I tell DBT this is a UDF model, and `{{ ref(‘matches_compound_spec’) }}` is a standard dbt ref to another UDF – no more target.schema hacks.
To give you an idea of how well this works – if I run `dbt build -s +matches_multi_spec`, asking dbt to compile, deploy and test the macro above and everything it depends on, dbt correctly works out the 14 UDFs that have to be built and tested.
08:50:05 26 of 26 START test test_udf_matches_multi_spec ............... [RUN]
08:50:05 25 of 26 PASS test_udf_matches_compound_spec ............... [PASS in 1.54s]
08:50:06 26 of 26 PASS test_udf_matches_multi_spec ........................ [PASS in 1.60s]
08:50:06
08:50:06 Finished running 14 udf models, 12 tests in 0 hours 0 minutes and 12.48 seconds (12.48s).
08:50:06
08:50:06 Completed successfully
08:50:06
08:50:06 Done. PASS=26 WARN=0 ERROR=0 SKIP=0 TOTAL=26
For a more in-depth explanation of the idea, how it works, and the implications, see the technical walkthrough post on tempered.works. There’s an experimental dbt package containing the UDF materialisation on GitHub. Finally, the pypi-vulnerabilities GitHub repo shows this technique in automated action as part of a real-world data pipeline,
This blog post is more than just a technical solution; it’s a glimpse into a future where UDFs become integral members of the dbt ecosystem, empowering data professionals to work smarter and more efficiently. So, buckle up and join the ride towards UDF nirvana!