After a few conversations about what a data engineer is these days, I’ve found that there isn’t a shared understanding of the role. I also noticed the majority of the data engineers I spoke to were experienced software engineers.
Based on this experience, I decided to create a new blog post series that consists of interviews with our data engineers. I believe this will help to demystify data engineering and it might encourage more software engineers to become data engineers.
This week the data engineer is Paul Brabban.
What is data engineering for you and how does it overlap with software engineering?
I’m not convinced data engineering as a specialism is really much more than a product of immature and inadequate tooling. A decade ago, we were largely stuck with vertically scalable, on-prem data warehouses that operated very much like a traditional relational database, albeit perhaps with column-oriented storage. Compute and storage limited how much you could really do there. In that world, specialised data engineers were needed to persuade tools like Airflow to move data to and around your warehouse, or to fiddle with indexes and configuration in the warehouse to make the required queries feasible.
Today, we have “modern data warehouses” like BigQuery and Snowflake that can directly consume a far greater variety of data than before. Once the data has landed in the warehouse, you can use “SQL pipelines” to express new concepts as views – with much more direct visibility of costs and much simpler setup. Some of these new warehouses even integrate things like machine learning, reducing the need for data engineering skills and helping folks like analysts, data scientists and decision makers get things done.
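As a minimal sketch of the “SQL pipeline” idea, the snippet below expresses a new concept (spend per customer) as a view over a landed table. It uses Python’s built-in sqlite3 module as a stand-in for a warehouse like BigQuery or Snowflake, and the table, view and column names are hypothetical, chosen only for illustration.

```python
import sqlite3


def build_pipeline(conn: sqlite3.Connection) -> None:
    """Land some raw data, then derive a new concept from it as a view."""
    conn.executescript(
        """
        -- Raw data as it might land in the warehouse (hypothetical schema).
        CREATE TABLE orders (
            order_id     INTEGER PRIMARY KEY,
            customer     TEXT NOT NULL,
            amount_pence INTEGER NOT NULL
        );
        INSERT INTO orders (customer, amount_pence) VALUES
            ('alice', 1200),
            ('bob', 450),
            ('alice', 800);

        -- The "pipeline step": a view, so nothing is copied or scheduled.
        CREATE VIEW customer_spend AS
        SELECT customer,
               COUNT(*)          AS order_count,
               SUM(amount_pence) AS total_pence
        FROM orders
        GROUP BY customer;
        """
    )


def customer_spend(conn: sqlite3.Connection) -> list:
    """Query the derived view like any other table."""
    return conn.execute(
        "SELECT customer, order_count, total_pence "
        "FROM customer_spend ORDER BY customer"
    ).fetchall()


conn = sqlite3.connect(":memory:")  # in-memory database, assumption for the demo
build_pipeline(conn)
rows = customer_spend(conn)
```

Because the view is evaluated at query time, rows added to the underlying table show up in it without any separate job to move data around. A real warehouse adds concerns this sketch ignores, such as cost per query and materialisation.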
I think there’s a big overlap with software engineering. I’ll leave someone else to speak on practices, and say that in the real world, very few software engineers can avoid data! Every team I’ve worked with has had some need to move data around – right now, it’s a team running a search system for a client’s website and apps. Somehow the data ends up in the search index – although these folks aren’t “data engineers” and what they’ve built doesn’t use “data engineering tools”, it’s a data pipeline and no mistake. I’d argue that some data engineering experience could have smoothed that journey and made good use of existing tools, saving them from writing some of that code, but the job got done and that’s what matters!
In a nutshell – I see data engineering as helping others make data available and effectively use the data they need. I was about to point to specialisms like optimising larger datasets for performance, but then I remembered that software engineers do that too!
How did you get involved in data engineering?
I was a software developer for about a decade. I am still a software developer – I’ve just finished writing an application to rotate credentials for a data system – using all the linting, testing and code organisation skills I’ve gained over the years. I found myself drawn to the challenges of dealing with larger datasets, making slow things fast and the allure of the incredible insights that data can be hiding. I guess I’ve always been drawn to evidence-based methods, too. Perhaps that explains why I spend so much time wrestling with A/B tests!
What are the skills a data engineer must have that a software engineer usually doesn’t have?
Knowledge of the tooling and ecosystem is one thing – it’s very easy to end up building your own thing unless you know a little about the terminology data folks use. I think experience would be up there – as a software engineer you’ll likely spend most of your time writing software and only occasionally tackle a data problem. The last thing that springs to mind is…SQL. Nowadays, it’s quite possible for a software engineer to go for long periods of time without coming into contact with SQL – but SQL is the foundation, and with SQL pipelines becoming more of a thing, being able to read and write it is only going to become more important.
What data trends are you keeping an eye on?
I’m watching and prodding the Data Mesh approach with interest. I think it’s a really promising way to solve two crucial problems with traditional data engineering – skills shortage and centralisation. There’s a lot in there, but organising into data products and decentralising responsibility seem like sensible ideas – particularly as it’s basically what we’ve already seen work with “normal” products. As I said, there is not so much difference between data and non-data engineering!
Do you have any recommendations for software engineers who want to be data engineers?
There’s a good chance you are already doing data engineering. Have a think about what you’ve done in the recent past and see if you’ve been involved with moving data around. I bet you have. Have a look around for blog posts and the like on how others approached those problems. You might get some ideas and jumping-off points into how data engineering techniques might have saved you time or improved the product.