After a few conversations about what a data engineer is these days, I’ve found that there isn’t a shared understanding of the role. I also noticed the majority of the data engineers I spoke to were experienced software engineers.
Based on this experience, I decided to create a new blog post series that consists of interviews with our data engineers. I believe this will help to demystify data engineering and it might encourage more software engineers to become data engineers.
This week the interview is with Lewis Crawford.
What is data engineering for you and how does it overlap with software engineering?
I think the evocative term here is engineering – precise, reliable, secure and planned, as opposed to ‘hacking’ – a distinction that applies to both data and software. Most data engineering has an aspect of software engineering for transforming the data into some more useful state, and most software has the ability to store, retrieve and modify data. The key difference is the scale of the data involved.
For me, the analogy to engineering in data engineering is best exemplified by a bridge. It is easy to visualise the roles of architects, structural engineers and builders that allow traffic to safely and securely move from one domain to another. Data engineers provide this bridge for data flows.
How did you get involved in data engineering?
I started with distributed computing for my MSc, processing satellite images to create drainage networks. There was always more data than would fit on a single ‘computer’. We used PVM, with Condor scheduling the parallel processing jobs, which usually sat in long queues behind physics and engineering simulations; the benefits of unit testing and code review are never more apparent than when you wait two days for your job to run, only to find a simple spelling mistake and get put at the back of the queue again. I ended up in roles at various companies that all had a distributed compute element, so it was natural that I gravitated towards ‘big data’ processing around 2009. Along the way, I picked up a lot of experience around patterns and architectures, lineage and governance, data types and storage formats, in addition to the old problem of orchestrating multiple computers to perform a single task.
What are the skills a data engineer must have that a software engineer usually doesn’t have?
I would say concepts rather than skills, and an appreciation of scale. Understanding why choosing the right storage format or partitioning strategy can significantly improve the ability to perform analytics at scale. Understanding that processing data comes with a responsibility not just for ensuring quality and consistency but also for custodianship, data security, retention and governance. You don’t have to be a GDPR expert, but you do need to know that someone on the team has to ask the questions – what data are we using, for what purpose, for how long, where did it come from, and who is the end-user?
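To make the storage-format and partitioning point concrete, here is a minimal PySpark sketch (the paths, column names and partition key are hypothetical): writing events as columnar Parquet partitioned by date means a later query that filters on that date only has to read a small fraction of the files.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitioning-example").getOrCreate()

# Hypothetical raw event data with an 'event_date' column.
events = spark.read.json("/data/raw/events")

# Write as columnar Parquet, partitioned by date: each date lands in its own
# directory, so queries that filter on event_date can skip most of the files.
events.write.mode("overwrite").partitionBy("event_date").parquet("/data/curated/events")

# A downstream analytics query only touches the partitions it needs
# (partition pruning) instead of scanning the whole dataset.
daily = (
    spark.read.parquet("/data/curated/events")
         .where("event_date = '2021-06-01'")
         .groupBy("user_id")
         .count()
)
daily.show()
```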
At some point, scale catches everyone out. Realising that your program that takes 1 second to process 1,000 records is going to take over 11 days to process 1bn records – and that is if you don’t hit all kinds of limits you didn’t know were there (temp space, logging directories, etc.).
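The back-of-the-envelope extrapolation behind that number is simple (and it assumes nothing else goes wrong along the way):

```python
records_per_second = 1_000            # 1 second to process 1,000 records
total_records = 1_000_000_000         # 1bn records

seconds = total_records / records_per_second   # 1,000,000 seconds
days = seconds / (60 * 60 * 24)                # ~11.6 days
print(f"{days:.1f} days")
```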
What data trends are you keeping an eye on?
I am fascinated by machine learning and the incredible advances in AI generally. So I tend to focus on the distributed architectures that support it in the data preparation stage, such as Ray and Dask, as well as obvious platforms such as Spark. I am also interested in accelerated compute on GPUs, both for deep learning and transfer learning and through frameworks like RAPIDS (rapids.ai) that enable existing workloads to move to the GPU with minimal code change.
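As an illustration of the “minimal code change” point, this is roughly what moving a pandas-style workload onto the GPU with RAPIDS cuDF can look like. It is only a sketch: it assumes a machine with a supported NVIDIA GPU and RAPIDS installed, and the file and column names are made up.

```python
# import pandas as pd            # CPU version
import cudf as pd                # GPU-backed drop-in for much of the pandas API

# The rest of the workload is unchanged: read, filter, group, aggregate.
df = pd.read_csv("transactions.csv")
summary = (
    df[df["amount"] > 0]
      .groupby("customer_id")["amount"]
      .sum()
      .sort_values(ascending=False)
)
print(summary.head())
```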
Do you have any recommendations for software engineers who want to be data engineers?
Go for it! But it may be worth reading about, chatting about, or trying out some of these concepts – metadata, governance, lineage, OLTP/OLAP architecture, partitioning, columnar storage, eventual consistency, quality, and most of all SCALE!