After a few conversations about what a data engineer is these days, I’ve found that there isn’t a shared understanding of the role. I also noticed the majority of the data engineers I spoke to were experienced software engineers.
Based on this experience, I decided to create a new blog post series that consists of interviews with our data engineers. I believe this will help to demystify data engineering and it might encourage more software engineers to become data engineers.
This week the interview is with Will Faithfull.
What is data engineering for you and how does it overlap with software engineering?
Personally, I don’t see them as particularly separate disciplines, although the area of focus differs in a few ways. Firstly, it’s expected now that a good software engineer has a decent understanding of the domain in which they work, and I see data engineering as an extension of this principle.
Secondly, you’re still building, testing, packaging and deploying applications at the end of the day; it’s the nature of the applications and the domain you’re working with that differs. Probably the biggest change is that the code you’re writing isn’t (necessarily) concerned with handling requests! I’d say it’s the perfect switch for a software engineer who feels a bit compartmentalised and would like to have a chance to be involved in everything technical.
How did you get involved in data engineering?
You might ask how I got involved in software engineering in the first place – I was doing a PhD, teaching at university, and tired of being destitute on a lab demonstrator salary, so I started my own company.
I was working in software engineering and tech leadership, and didn’t really do any data engineering until 2020. It was actually a silver lining moment. In March 2020, at the onset of the COVID pandemic, the project I was working on as a software engineer was shelved in the midst of the crisis. I was offered the opportunity to make a sideways move into data engineering. I was enthusiastic about it, not because of the circumstances, but because I had always had a lingering interest in data and data science, and a sense that I didn’t often get to use the skills I learned during my PhD.
I now split my time between data engineering and data science, so I’ve been able to carve out the exact role I wanted.
What are the skills a data engineer must have that a software engineer usually doesn’t have?
When I made the transition to data engineering, I was a bit in the dark about exactly what data engineering entailed, but my academic background gave me the confidence that I would be able to fill in the gaps. As it turns out, I think the inverse was true: it was actually software engineering fundamentals that I leant on, such as testing, application packaging, deployments, infrastructure, SQL, Linux and bash skills. 95% of it is the same principles, but the details of the applications you’re building and deploying differ.
That’s not to say that there aren’t skills and specialisms within data engineering that are less common in the average software engineer. Advantages in data disciplines include:
- Infrastructure familiarity and infrastructure as code. I’d say that for data problems, applications tend to be shallower and the infrastructure deeper than equivalent software engineering work.
- Breadth as opposed to depth of knowledge. In data applications there tend to be fewer cookie-cutter solutions and you have the opportunity to solve things in creative ways if you have the perspective.
- Advanced SQL. When you’re working with huge volumes of data it is often more efficient to solve problems in SQL than in application code where possible.
- Familiarity with storage paradigms, memory layouts, distributed computing frameworks and multiprocessing/parallelisation. Whether it’s optimisation or a bug, you will probably run into some problem that involves these sooner rather than later.
- Familiarity with how data scientists approach problems and want to use data.
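To make the point about SQL versus application code concrete, here is a minimal sketch using Python's built-in sqlite3 module. The `orders` table, its columns and the sample rows are invented purely for illustration and do not come from the interview. Both functions return the same totals, but the SQL version lets the database do the aggregation, so only one row per customer crosses into application code.

```python
import sqlite3
from collections import defaultdict


def build_db():
    """Create a small in-memory database (hypothetical schema and data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [("alice", 10.0), ("bob", 5.0), ("alice", 2.5), ("bob", 1.0), ("carol", 7.0)],
    )
    return conn


def totals_in_python(conn):
    """Pull every row into the application and aggregate there."""
    totals = defaultdict(float)
    for customer, amount in conn.execute("SELECT customer, amount FROM orders"):
        totals[customer] += amount
    return dict(totals)


def totals_in_sql(conn):
    """Let the database aggregate; only one row per customer is returned."""
    rows = conn.execute(
        "SELECT customer, SUM(amount) FROM orders GROUP BY customer"
    )
    return dict(rows)
```

With five rows the difference is invisible, but at large volumes the second approach typically avoids moving most of the data out of the database at all.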
What data trends are you keeping an eye on?
I think graph databases are extremely underutilised, but people have caught on to the potential. There’s also a trend towards incremental processing and point-in-time analysis. Both of these relate to fundamentally the same point: we have a tendency to flatten and aggregate data, whether that is over time (depth) or across associations (breadth). But the tools exist now to make sense of subgraphs and connections in the data without having to lose any information by flattening it, even at a large scale. The tools have some maturing to do, but the capability is there.
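As a rough illustration of what flattening over time loses, the sketch below keeps a change history for a single record and answers a point-in-time question against it. The `history` data and the function names are hypothetical, and this is only one simple way to model the idea, not a description of any particular tool.

```python
from bisect import bisect_right

# Hypothetical change history for one customer: (day, plan), sorted by day.
history = [(1, "free"), (10, "pro"), (25, "enterprise")]


def latest(history):
    """The flattened view: only the most recent value survives."""
    return history[-1][1]


def as_of(history, day):
    """Point-in-time view: the value that was in effect on the given day."""
    days = [d for d, _ in history]
    i = bisect_right(days, day)  # number of changes on or before `day`
    return history[i - 1][1] if i else None
```

The flattened view can only ever say "enterprise", whereas `as_of(history, 12)` can still recover that the customer was on the "pro" plan at that point.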
Do you have any recommendations for software engineers who want to be data engineers?
Don’t be afraid to make the jump. If you have an interest or curiosity in the data domain and analytics side of things, even if you know next to nothing about them already, that will stand you in great stead. 95% of the skills will be second nature to you, so you only have to focus on that remaining 5%. You really have a chance to carve out the exact role you want under the auspices of the “Data Engineer” title.