After a few conversations about what a data engineer is these days, I’ve found that there isn’t a shared understanding of the role. I also noticed the majority of the data engineers I spoke to were experienced software engineers.
Based on this, I decided to create a blog post series that consists of interviews with our data engineers. I believe this will help to demystify data engineering, and it might encourage more software engineers to become data engineers.
This week the interview is with Jennifer Stark.
What is data engineering for you and how does it overlap with software engineering?
Data science has evolved so much in the past several years. For me, data engineering is what was (and perhaps still is) considered the first 80% of data science – data sourcing, data evaluation/profiling, cleaning, normalising, and calculating or creating new fields. What’s different now, with the evolution of data engineering as its own distinct field, is that data is not necessarily processed for a specific end goal or question, but in a way where the output can serve several end users or stakeholders and answer several questions. The output can be used more flexibly. This has been achieved in part through the productionisation of data engineering: formalising it using software engineering practices such as working in teams rather than as siloed individuals, code reviews, pair coding, mobbing, retrospectives, testing, continuous integration and deployment, and reporting/monitoring. Software engineering principles have improved the broader world of data science, and I believe they can go a long way in improving how data is handled in academia.
How did you get involved in data engineering?
I wanted to leave academia, but wasn’t sure what I’d do instead. I knew I enjoyed several aspects of academia – research, experiment planning, analysis using R, data visualisation also using R, and creating presentations (I enjoy using reveal.js because you can build slide decks using HTML/CSS or Markdown, meaning you can embed custom animations, agent-based models, video, etc.). I was not so keen on writing papers, relying on external funding, or having to move every two to three years.
I took a 9-5 research assistant position while completing a part-time master’s in information visualisation, which consisted of coding, statistics, and graphic/web design, among other things. I wanted more coding and stats, and data science was just starting to take off, so I then did a part-time bootcamp in data science with Python. I really enjoyed that, and out of it I got an 18-month postdoc position in computational journalism and an article published in The Washington Post where I used my new Python skills.
After that I explored data science roles in industry and got a role as a data engineer. It appears to be a rather common thing, where a company wants to become data-led and do data science, but they have no pipelines and their data is everywhere! My role was to establish some pipelines and then I’d become the data scientist, and the engineer role was to be backfilled. Unfortunately priorities changed and I moved to another company. I’ve now been a data engineer for three and a half years.
What are the skills a data engineer must have that a software engineer usually doesn’t have?
I’ve never been a software engineer, so I might be wrong with some of these. But I’d say familiarity with how data will be used, and with the impact certain data cleaning or processing steps might have on how the data is used by the consumer. I try to consider how the data might be misused unintentionally, and what I can do as an engineer to mitigate that, including tests, documentation, data dictionaries or other supporting metadata.
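As a rough illustration of how a data dictionary and a test can work together, here is a minimal Python sketch. Everything in it is an assumption made for the example: the `FieldSpec` class, the `DATA_DICTIONARY` contents, the field names and the `validate` helper are hypothetical and do not come from any particular project or library.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class FieldSpec:
    """One data dictionary entry: what a field means and what values it allows."""
    description: str
    dtype: type
    nullable: bool = False


# Hypothetical data dictionary for an imaginary weather data set.
DATA_DICTIONARY: Dict[str, FieldSpec] = {
    "station_id": FieldSpec("Identifier of the weather station", str),
    "temperature_c": FieldSpec(
        "Air temperature in degrees Celsius; null when the sensor was offline",
        float,
        nullable=True,
    ),
}


def validate(record: Dict[str, Any], dictionary: Dict[str, FieldSpec]) -> List[str]:
    """Return a list of human-readable problems; an empty list means the record passes."""
    problems = []
    for name, spec in dictionary.items():
        if name not in record:
            problems.append(f"{name}: missing field")
            continue
        value = record[name]
        if value is None:
            if not spec.nullable:
                problems.append(f"{name}: null not allowed")
        elif not isinstance(value, spec.dtype):
            problems.append(
                f"{name}: expected {spec.dtype.__name__}, got {type(value).__name__}"
            )
    return problems


good = {"station_id": "A1", "temperature_c": None}
bad = {"station_id": None, "temperature_c": "warm"}
print(validate(good, DATA_DICTIONARY))
print(validate(bad, DATA_DICTIONARY))
```

The same dictionary serves as documentation for consumers (the descriptions say what a null means) and as the source of truth for an automated check in the pipeline.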
For example, how best to deal with missing values might depend on what the data represents (discrete values, time series data, categories), how the API was designed, or on how the data will be or could be used. Is it best to fill the missing value with the average of the values on either side? Is filling a missing value with a null or a zero better? It all depends. Just being aware of these issues means that you can be proactive and seek advice from the domain experts for that particular data set – be it end users or the data providers – in order to select the right approach.
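To make the trade-off concrete, here is a small Python sketch comparing the three options on the same readings. The data and the helper names (`fill_with_zero`, `fill_with_neighbour_mean`, `mean_ignoring_missing`) are made up for illustration, and the neighbour-average fill only handles an isolated gap with a known value on each side.

```python
from statistics import mean
from typing import List, Optional

Reading = Optional[float]


def fill_with_zero(values: List[Reading]) -> List[Reading]:
    """Replace every missing value with 0.0."""
    return [0.0 if v is None else v for v in values]


def fill_with_neighbour_mean(values: List[Reading]) -> List[Reading]:
    """Replace an isolated gap with the average of the values on either side.

    Gaps at the edges, or next to another gap, are left as None.
    """
    filled = list(values)
    for i, v in enumerate(values):
        if v is None and 0 < i < len(values) - 1:
            left, right = values[i - 1], values[i + 1]
            if left is not None and right is not None:
                filled[i] = (left + right) / 2
    return filled


def mean_ignoring_missing(values: List[Reading]) -> Optional[float]:
    """Average of the values that are present, or None if there are none."""
    present = [v for v in values if v is not None]
    return mean(present) if present else None


# Hypothetical time series with one missing reading.
readings: List[Reading] = [10.0, None, 14.0, 16.0]

print(mean_ignoring_missing(readings))                            # kept as null
print(mean_ignoring_missing(fill_with_zero(readings)))            # filled with zero
print(mean_ignoring_missing(fill_with_neighbour_mean(readings)))  # filled with neighbour average
```

With these readings the average is about 13.3 when the gap stays null, 10.0 when it is filled with zero, and 13.0 when it is filled from its neighbours. Each is a different answer to the same downstream question, which is why the choice is worth checking with the domain experts.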
What data trends are you keeping an eye on?
I’m always a bit cautious when something is “trending”, especially when it is presented as “the way we should be doing X now”, as I think it usually depends on the application context. It’s not one-size-fits-all.
Having said that, I am keeping an eye on MLOps, which is a facet of data engineering that is maturing into its own speciality. It’s a very fluid space at the moment, with the tech itself and the principles developing as we all try to figure out how to do it, which is quite exciting!
Do you have any recommendations for software engineers who want to be data engineers?
I think this recommendation is valid no matter what your background, but I’d say lean on your teammates and ask questions, sense check your ideas, etc. Also, I love mapping things out in Miro, but maybe that’s an answer to a different question.
As someone who has hired data engineers at junior and mid-levels with job ads citing software engineer experience as relevant, I believe software engineers can move into an equivalent position level-wise (e.g. mid-level software engineer into a mid-level data engineer). As with any role, I’d look for a team that’s collaborative. In this way, you’d gain expertise in data engineering that’s not covered by your software engineering experience, while you upskill the rest of the team in their software engineering game.
Other folks in this series have said SQL. Yes! True also for anyone who works with data as an analyst, scientist, engineer, etc. I’d love it if SQL was more of a business-wide skill, like Excel, but that is probably just wishful thinking 😉