After a few conversations about what a data engineer is these days, I’ve found that there isn’t a shared understanding of the role. I also noticed that the majority of the data engineers I spoke to were experienced software engineers.
Based on this, I decided to create a blog post series consisting of interviews with our data engineers. I believe this will help to demystify data engineering, and it might encourage more software engineers to make the move themselves.
This week the interview is with Austin Poulton.
What is data engineering for you and how does it overlap with software engineering?
They are really the same fundamental discipline in my view. The principles of functional decomposition, repeatability, testing, observability and monitoring apply to both software and data engineering. Data engineering is a specialisation whose practitioners are fluent in large-scale data processing technologies, patterns and architectures over and above good software engineering practices. Data engineering is maturing to embody well-established software engineering practices. It’s not merely about wrangling data for ad-hoc analysis.
How did you get involved in data engineering?
In the early days of my career I worked on pricing and provisioning analysis for telco networks. We relied on large volumes of training data and on generating synthetic data, so reproducible transformation pipelines were essential even in analytical settings. Later, my experience of trade processing for a risk engine at an investment bank honed concepts of stream processing, eventual consistency, denormalised representations, data provenance (lineage) and so on. I’ve always been deeply interested in how data is structured and modelled for analytical and decision-making applications.
What are the skills a data engineer must have that a software engineer usually doesn’t have?
Data engineers need to be more deeply aware of how data should be organised for analytical and transformation workloads. Understanding how large datasets should be partitioned so that compute clusters can transform the data correctly and efficiently is critical.
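To make the partitioning point concrete, here is a minimal sketch (plain Python, not any particular engine; the record fields are illustrative) of hash partitioning by key, the idea underlying shuffles in distributed engines: records that share a key land in the same partition, so a per-key aggregation can run on each partition independently.

```python
from collections import defaultdict

def hash_partition(records, key, num_partitions):
    """Assign each record to a partition by hashing its key, so that
    all records sharing a key land in the same partition."""
    partitions = defaultdict(list)
    for record in records:
        partitions[hash(record[key]) % num_partitions].append(record)
    return partitions

events = [
    {"user_id": "a", "amount": 10},
    {"user_id": "b", "amount": 5},
    {"user_id": "a", "amount": 7},
]
parts = hash_partition(events, "user_id", 4)
# Both of user "a"'s events end up in the same partition, so a
# per-user sum can be computed without moving data between workers.
```

A poor choice of partition key (e.g. one with heavily skewed values) defeats this: one partition ends up with most of the data while the rest sit idle.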
SQL fluency is essential: it is the lingua franca not only of relational databases but of big-data technologies too.
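As an illustration of how those SQL skills transfer, the same aggregation idiom works whether the engine is a relational database or a big-data query engine. Here an in-memory SQLite database stands in for either; the table and column names are made up for the example.

```python
import sqlite3

# An in-memory database: the GROUP BY pattern below is identical in
# Postgres, Hive, Spark SQL, BigQuery, etc.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", 30.0), ("bob", 12.5), ("alice", 7.5)],
)
rows = conn.execute(
    "SELECT customer, SUM(amount) AS total "
    "FROM orders GROUP BY customer ORDER BY customer"
).fetchall()
# rows == [("alice", 37.5), ("bob", 12.5)]
```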
Data quality is a massive concern for data-driven products. Understanding and working with tools that identify data issues, not only at the individual-record level but also across distributions over time, is really valuable.
What data trends are you keeping an eye on?
The data science and engineering space is evolving continuously. That said, mature approaches are emerging from the ferment. Data mesh architecture aligns well with building data products and with domain-driven design and organisation, as opposed to a central analytics lake or platform. MLOps tooling and patterns are crystallising such that models have a ready and repeatable path to production and are not consigned to static analysis in notebooks.
On the AI front, lots of interesting things are happening with natural language processing, such as the advent of GPT-3. We are generating so much data that there is a world of opportunity in using AI tooling for structuring, tagging and linking semi-structured and unstructured data.
Do you have any recommendations for software engineers who want to be data engineers?
The transition is easier than you think, and it’s a really interesting specialisation! If you haven’t already, I suggest you read Martin Kleppmann’s Designing Data-Intensive Applications, as it distils many of the problems and approaches you will likely encounter in your data engineering journey.