After a few conversations about what a data engineer is these days, I’ve found that there isn’t a shared understanding of the role. I also noticed the majority of the data engineers I spoke to were experienced software engineers.
Based on this, I decided to create a blog post series consisting of interviews with our data engineers. I believe this will help to demystify data engineering, and it might encourage more software engineers to become data engineers themselves.
This week the interview is with Yogendra Valani.
What is data engineering for you and how does it overlap with software engineering?
Across the data engineering projects I have worked on, I've seen a common, fundamental requirement: providing reporting and data-driven insight that illustrates user behaviour and its impact on commercial goals. This is often an afterthought for the very teams that produce features or products. The shift towards microservices has meant that data is now stored in a variety of systems and formats. The main challenge data engineers face is collating these multiple sources of data into a single place (e.g. a data warehouse or data lake) so that the data can easily be queried and cross-referenced.
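To make that concrete, here is a minimal, hypothetical sketch of the kind of cross-referencing a warehouse enables once data from, say, an orders microservice and a customer microservice has been copied into it. All dataset, table and column names below are invented for illustration, not taken from any real system.

```sql
-- Hypothetical warehouse query joining copies of two microservices' data,
-- answering a question neither source system could answer on its own.
SELECT
  c.signup_channel,
  COUNT(DISTINCT o.order_id) AS orders,
  SUM(o.total_amount)        AS revenue
FROM orders_service.orders      AS o
JOIN customer_service.customers AS c
  ON c.customer_id = o.customer_id
GROUP BY c.signup_channel;
```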
Traditionally, many of the problems solved by data engineers required creativity when working with limited server processing power and storage capacity. Because of this, most senior data engineering roles required experience maintaining database servers, optimising SQL scripts, and building search-optimised indexes. A finely balanced trade-off had to be made between granularity and aggregation, and changing that trade-off meant refactoring tightly coupled scripts with multiple dependencies and backfilling data, which could be a slow and expensive process.
The introduction of cloud-based solutions such as Google BigQuery or Amazon Athena has enabled a new data engineering paradigm known as ELT (Extract, Load, and Transform), as opposed to ETL (Extract, Transform, and Load). With these tools, source systems are copied in their entirety, helping data analysts and scientists work with the raw data. Data structures become much easier to change, and the need to backfill from source systems is eliminated.
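As a rough illustration of the "T" in ELT, the transformation can live as a simple query over the raw copy rather than in a pipeline upstream of the warehouse. Again, the table and column names here are hypothetical.

```sql
-- The raw orders table is loaded untouched from the source system;
-- the reporting shape is derived afterwards and is cheap to change.
CREATE OR REPLACE VIEW reporting.daily_orders AS
SELECT
  DATE(created_at)  AS order_date,
  restaurant_id,
  COUNT(*)          AS orders,
  SUM(total_amount) AS gross_revenue
FROM raw.orders
GROUP BY 1, 2;
```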
The role of a data engineer is evolving to be closer to that of a software engineer. We see demand from our users to build tools to interrogate data, whilst also adopting contemporary software engineering practices such as automated tests, CI/CD pipelines, alerting and monitoring. A great example of this is DBT (https://www.getdbt.com/), a tool used to make the SQL scripts in the transformation stage smaller, and easier to read, maintain and test.
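For readers who haven't seen it, a DBT model is just a SQL select statement in its own file. The sketch below is a hypothetical model showing how ref() lets DBT stitch small, testable pieces together; the model names are made up, and column tests such as not_null or unique would be declared alongside it in a YAML file.

```sql
-- models/daily_orders.sql (hypothetical): one small, self-contained model.
-- {{ ref('stg_orders') }} tells DBT this model depends on another model,
-- so it can build the dependency graph and run everything in the right order.
SELECT
  DATE(created_at) AS order_date,
  restaurant_id,
  COUNT(*)         AS orders
FROM {{ ref('stg_orders') }}
GROUP BY 1, 2
```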
How did you get involved in data engineering?
I have always wanted to combine my maths education with software development. After a hackathon project, I joined the data engineering team at Just Eat. The team had started migrating from a Redshift database to Google BigQuery, but they were overwhelmed by constant firefighting and considerable resistance from data analysts to migrating all of their reports to yet another system.
We changed the migration strategy from a big-bang switchover to working on a report-by-report basis, pairing with analysts to solve a complex reporting problem around delivery logistics. Trust in our data and platform grew, resulting in more users being onboarded. Our backlog quickly changed from migrating a list of source-system tables and associated reports to use-case-driven feature requests.
What are the skills a data engineer must have that a software engineer usually doesn’t have?
Data isn't always structured in formats that are easy to query. Most software engineers would find it easy to write code that handles, for example, a hierarchical structure requiring tree traversal. Trying to work with such a structure in SQL, perhaps with loops or recursive queries, requires a different thought process.
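As a sketch of that different thought process, here is how a tree stored as (id, parent_id) rows might be traversed with a recursive common table expression. The employees table and its columns are hypothetical.

```sql
-- Walk a management hierarchy with a recursive CTE instead of a loop in code.
WITH RECURSIVE org_tree AS (
  -- anchor: start from the root nodes (no manager)
  SELECT id, name, manager_id, 1 AS depth
  FROM employees
  WHERE manager_id IS NULL

  UNION ALL

  -- recursive step: attach each employee to the row for their manager
  SELECT e.id, e.name, e.manager_id, t.depth + 1
  FROM employees AS e
  JOIN org_tree AS t ON e.manager_id = t.id
)
SELECT * FROM org_tree ORDER BY depth;
```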
Another key area to research and understand is the technical challenges and trade-offs between streaming and batch processing. Batch processing tends to be much easier and cheaper than streaming, and the majority of requests can be satisfied with a batch solution.
What data trends are you keeping an eye on?
As more software engineers move into data engineering, I'm looking out for tools that improve the development experience. I have been working with DBT and Airflow on Cloud Composer. One of the most exciting libraries I have seen is a unit testing framework for DBT (created by EE developers). It has been an absolute game-changer for my development experience, as I have been able to use test-driven development to write SQL scripts!
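For a flavour of what that looks like, the sketch below is roughly how a unit test reads with the dbt-unit-testing package as I remember it: mock the inputs of a model, state the expected output, and run it like any other DBT test. Treat the macro names, and the daily_orders model it exercises, as illustrative rather than authoritative.

```sql
-- Hypothetical unit test for the daily_orders model sketched earlier.
-- Macro names follow my recollection of the dbt-unit-testing package and may differ.
{% call dbt_unit_testing.test('daily_orders', 'counts one row per restaurant per day') %}

  {% call dbt_unit_testing.mock_ref('stg_orders') %}
    select 1 as restaurant_id, timestamp '2021-01-01 10:00:00' as created_at
    union all
    select 1 as restaurant_id, timestamp '2021-01-01 11:00:00' as created_at
  {% endcall %}

  {% call dbt_unit_testing.expect() %}
    select date '2021-01-01' as order_date, 1 as restaurant_id, 2 as orders
  {% endcall %}

{% endcall %}
```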
Do you have any recommendations for software engineers who want to be data engineers?
Join a multidisciplinary team consisting of both software and data engineers. Database technology has evolved and many of the traditional approaches are no longer valid, so it's important to challenge the status quo.