At Equal Experts, we have a lot of experience in helping our customers to integrate data science into their business capabilities. This requires data scientists and data engineers to work closely together, because models are usually dependent on data availability. But how many data engineers do you need? Our recommendation is one data engineer per data scientist.
Data science is dependent on data availability
Given a dataset, a data scientist can often come up with a good first pass model in a matter of days. Data scientists are great at maths, statistics and machine learning (ML) algorithms, and they’re blessed with great tooling and libraries. They use tools like Jupyter notebooks to quickly analyse data, prototype algorithms and share their results with clear annotations.
But data scientists need relevant data. They need access to historical data for the inputs and the outputs of the activity they’re modelling, and it’s typically locked away in multiple transactional systems. That causes two problems for data scientists:
- They usually don’t have the engineering skills or inclination to work with transactional systems. More than once, a data scientist has said to me, ‘I just want to work with the algorithms.’
- They don’t have approvals to access the data. At one financial services company, data scientists weren’t trusted to access data in a transactional database, for fear that their queries would interrupt daily transactions.
Data science progress is dependent on data availability. Data science impact is dependent on integration with the business. Both are dependent on engineers who are sympathetic to data science needs and ways of working. You need a cross-functional team of data scientists, data engineers, and others working together to integrate models into business operations. (You can read more about this in our free ML Ops playbook.)
The rule of thumb is…
The Equal Experts recommendation – one that I know is shared by many other data leaders – is one data engineer per data scientist. It’s a rule of thumb that depends on the complexity of the models and the nature of the transactional systems involved. For a complex ecosystem with tricky data pipelines and a high-impact model, the ratio can be even higher, with multiple engineers per data scientist. But time and again, this recommendation has helped our customers to achieve their business outcomes.
When you don’t have one data engineer per data scientist, the frustration is obvious. Data scientists spend their time waiting for the data they need to build their models, or they reluctantly do the integrations themselves. Data engineers are overworked as they try to continually provide access to data sources and maintain integrations with business operations. And business users don’t see the benefits they were promised from investing in data science in the first place.