After a few conversations about what a data engineer is these days, I’ve found that there isn’t a shared understanding of the role. I also noticed the majority of the data engineers I spoke to were experienced software engineers.
Based on this experience, I decided to create a new blog post series that consists of interviews with our data engineers. I believe this will help to demystify data engineering and it might encourage more software engineers to become data engineers.
This week the interview is with me. 🙂
What is data engineering for you and how does it overlap with software engineering?
To make my opinion clear, I’ll first describe the goal of data in organisations and then move into the specifics of data engineering.
Organisations have multiple systems generating data: transactional data, usage data, etc. This data is valuable because it could and should be used to support decision making within an organisation. Sometimes this decision making is a manual process, where people look at dashboards or reports generated by data analysts. Other times it’s done by machine learning algorithms, created by data scientists, that make the decisions themselves.
Making this data available and usable is the main work of a data engineer. From a high-level perspective, the work consists of collecting, modelling, cleaning, and transforming the data into datasets or data streams ready to be consumed. It sounds simple, but as you can imagine there are a lot of complications, such as data security, schema changes, very big datasets, data quality, etc.
So, how do data engineers solve their tasks? Most of the time it’s by doing software engineering.
That being said, why do we have two different roles? Because the data area is so broad, and it requires specific knowledge and a specific mindset to work with (the product is the data, the data needs to be tested separately from the software, etc.).
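To give an idea of what testing the data apart from the software can look like, here is a minimal sketch in Python. The `Order` record, the `check_orders` function, and the three rules are hypothetical, chosen only for illustration. The point is that the checks run against the rows themselves, not against the code that produced them.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Order:
    # Hypothetical record shape, used only for this illustration.
    order_id: int
    customer_id: Optional[int]
    amount: float


def check_orders(rows: List[Order]) -> List[str]:
    """Return a list of data quality problems found in rows (empty if none)."""
    problems = []
    ids = [r.order_id for r in rows]
    # Uniqueness: each order should appear only once.
    if len(ids) != len(set(ids)):
        problems.append("duplicate order_id")
    # Completeness: every order should belong to a customer.
    if any(r.customer_id is None for r in rows):
        problems.append("missing customer_id")
    # Validity: amounts should never be negative.
    if any(r.amount < 0 for r in rows):
        problems.append("negative amount")
    return problems


good = [Order(1, 10, 25.0), Order(2, 11, 7.5)]
bad = [Order(1, 10, 25.0), Order(1, None, -3.0)]

print(check_orders(good))
print(check_orders(bad))
```

The software can be bug-free and the second batch would still fail these checks, which is why the data needs its own tests.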
I tend to see data engineering as a layer of skills and knowledge on top of software engineering skills. Also, I believe some software engineers have been working on data engineering tasks without being called ‘data engineers.’
How did you get involved in data engineering?
I started to work on data during my master’s thesis (and the research that followed), which focused on natural language processing for Portuguese. I know it sounds more like data science, and it was. After this research phase, I worked for a couple of years doing software engineering, and I ended up in a project where the client asked my team to build an ML model to predict user behaviour. I worked on the data science part and also developed an ML model pipeline and the infrastructure to run A/B tests on models. These days, we tend to call this last part ML engineering, although I see it as part of data engineering. Since that project, I’ve been working on data projects, mostly on data pipelines.
What are the skills a data engineer must have that a software engineer usually doesn’t have?
Starting with some generic topics:
- Strong SQL skills are a must-have.
- Knowing the details of different types of data storage (transactional databases, data warehouses, distributed file systems) is also very important.
- Data modeling is an important part of the job.
- It’s good to know how data scientists and data analysts work because they’re usually the clients of the data.
- GDPR and data security speak for themselves.
Stream processing and big data processing are also very good skills to have because a good share of projects require them. It’s also good to know the data landscape, which is huge and has no single preferred stack that is widely used.
What data trends are you keeping an eye on?
I’m closely following the trend of leveraging modern cloud data warehouses that separate storage from compute, like BigQuery or Snowflake, to build SQL data pipelines in an ELT fashion. It’s a game-changer, in my opinion. It reduces the need for tools like Spark and for specialised engineers and infrastructure. Having the data pipelines in SQL also allows data analysts to participate in the transformation part, and a new role is emerging from that: analytics engineering.
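As a rough illustration of the ELT idea, here is a minimal sketch in Python. It uses SQLite from the standard library as a stand-in for a cloud warehouse, which is an assumption made only to keep the example runnable (BigQuery or Snowflake would be reached through their own clients). The table and column names are hypothetical. The raw data is loaded as-is, and the cleaning and aggregation happen afterwards in SQL, inside the database.

```python
import sqlite3

# SQLite stands in for the warehouse here (assumption for illustration).
conn = sqlite3.connect(":memory:")

# Extract + Load: land the raw rows untouched, messy values and all.
conn.execute(
    "CREATE TABLE raw_orders (order_id INTEGER, country TEXT, amount TEXT)"
)
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [
        (1, "pt", "10.50"),
        (2, "PT", "4.50"),
        (3, "de", "20.00"),
        (3, "de", "20.00"),  # duplicate delivery of the same order
    ],
)

# Transform: deduplicate, normalise, and aggregate with plain SQL.
conn.execute(
    """
    CREATE TABLE revenue_by_country AS
    SELECT UPPER(country) AS country_code,
           SUM(CAST(amount AS REAL)) AS revenue
    FROM (SELECT DISTINCT order_id, country, amount FROM raw_orders)
    GROUP BY UPPER(country)
    """
)

rows = conn.execute(
    "SELECT country_code, revenue FROM revenue_by_country ORDER BY country_code"
).fetchall()
print(rows)
```

Because the transformation step is just a SQL statement, anyone comfortable with SQL can read and change it without touching the loading code.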
I’m keeping an eye on data mesh, with a focus on treating data as a product and having a central team that facilitates the work on data for other teams.
I’m also interested in AutoML. As mentioned, I also have some experience as a data scientist. It’s not much, but it’s enough to believe that some of that work can be automated. I do believe that AutoML can help data scientists, but I don’t believe it can replace them.
Do you have any recommendations for software engineers who want to be data engineers?
It should not be hard if you already have software engineering skills. The field is broad, so you might want to choose one area that you would like to work in, data pipelines for instance, and start to study it from a practical perspective. If you are into books, I recommend Designing Data-Intensive Applications as a general data book. Also, if you are part of a project where you can pair on data engineering tasks, give it a try.