Tech Focus Wed 9th August, 2023
Avoiding conflict between Data Engineers and Scientists
Every company wants to extract the most value from data, but many still find it difficult to deploy the insights gained from data science. This can be caused by a lack of understanding about the roles and skill sets of data scientists and engineers.
At first glance, it seems that data scientists and data engineers work with the same medium: data. However, the difference lies in the output each group creates. A data scientist’s key outputs are insights, while data engineers output products.
Understanding how scientists and engineers differ, and making use of those differences, is key to publishing data solutions efficiently and effectively. In this post, we’ll discuss some common pitfalls that companies encounter when trying to productionise data solutions, and strategies that data scientists can employ to create a good foundation for engineers to build upon, without having to develop an entire system themselves.
Antipatterns to Avoid
Treating the Value of Notebooks as the Code, Not the Insights
Data scientists often develop solutions using notebooks such as Jupyter. While notebooks are fantastic for quickly exploring data, hypothesis testing, and prototyping, they should be treated as methods of getting quick insights, rather than enduring code.
Typically, notebooks record all of your explorations, including the dead-ends, visualisations and tangents you followed along the way. This is unnecessary and confusing in a production code base. One advantage of notebooks, the ability to run cells out of order, becomes a weakness in production: the results may depend on an execution order that was never written down, so the code will probably not run correctly from top to bottom. Furthermore, notebooks don’t easily support testing, which is critical to developing robust and maintainable software. Essentially, once you have produced insights, you should be prepared to throw away your notebook code and start again.
It can be tempting for data scientists to hand notebooks directly to the engineers to productionise. However, this causes slowdown, headaches and tension for data engineers, who now have to separate the relevant information from the irrelevant and understand the complex analysis, on top of their actual role of integrating the method into the existing ecosystem. The data engineer often needs to painstakingly rewrite the entire notebook.
Getting the Data Scientists to Deploy Their Own Products
Some companies have data scientists build the initial product. Data scientists are typically skilled in mathematics, statistics, and algorithm development, but most are not equipped to build a fully automated data platform with CI/CD. Products built this way can lack the maturity expected of an automated system, and take significantly longer to build.
Strategies to Employ
Convert Your Insights into a Proof of Concept
As a data scientist, once you have valuable insights from your notebook that you wish to turn into a product, consider transforming the notebook into a proof of concept (POC) that showcases the value of the insights. This quickly turns the notebook into simple, self-contained code for the engineers, and also allows users to start using the product immediately and provide valuable feedback.
Writing tests for the POC helps engineers quickly understand the code and refactor it if needed. Focusing your testing on areas that fall outside the engineers’ typical experience (such as model testing) gives teams more confidence to focus on building an end-to-end product. You want to ensure that the core of the analytics product is well tested, while peripheral aspects don’t require this.
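As a rough sketch of what this might look like, suppose the core insight from a notebook is a simple rule for flagging unusual readings. The function name `flag_anomalies`, the z-score rule, and the default threshold are all hypothetical and chosen purely for illustration. The point is only that the core logic sits in a plain function with a test next to it, rather than being spread across notebook cells.

```python
from statistics import mean, stdev


def flag_anomalies(values, threshold=2.0):
    """Return the indices of values lying more than `threshold` sample
    standard deviations from the mean.

    Hypothetical core logic lifted out of a notebook cell.
    """
    if len(values) < 2:
        return []  # not enough data to estimate the spread
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        return []  # all values identical, so nothing stands out
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]


def test_flag_anomalies():
    # One obvious outlier among otherwise steady readings.
    readings = [10, 11, 9, 10, 10, 11, 9, 10, 50]
    assert flag_anomalies(readings) == [8]
    # Degenerate inputs should return an empty result, not raise.
    assert flag_anomalies([5, 5, 5]) == []
    assert flag_anomalies([]) == []
```

A test like this can be run with a test runner such as pytest, and it doubles as documentation: an engineer can see the intended behaviour on edge cases without having to read the original analysis.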
Involve Data Engineering Early
As soon as you want to move forward with a data solution, engage with data engineers. Setting out requirements to the engineering team early helps establish context for the larger ecosystem you are deploying into, and allows the engineering team to give feedback on how the transition should look. Data scientists should work with engineers to understand the development and operations lifecycle and discuss things such as:
- How should errors and other info be captured and logged?
- How do you want configuration variables to be managed?
- How do you pass in parameters?
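To make these questions concrete, here is one possible set of answers sketched in Python using only the standard library. Everything specific in it is an assumption for illustration: the job name `scoring_job`, the environment variables `LOG_LEVEL` and `OUTPUT_DIR`, and the `--input` and `--threshold` parameters are hypothetical, and your engineering team may well prefer different conventions.

```python
import argparse
import logging
import os

logger = logging.getLogger("scoring_job")  # hypothetical job name


def load_config(env=os.environ):
    """Configuration that varies by environment is read from environment
    variables, with defaults. Variable names are illustrative only."""
    return {
        "log_level": env.get("LOG_LEVEL", "INFO"),
        "output_dir": env.get("OUTPUT_DIR", "./output"),
    }


def parse_args(argv=None):
    """Parameters that change from run to run are passed on the command line."""
    parser = argparse.ArgumentParser(description="Run a hypothetical scoring job.")
    parser.add_argument("--input", required=True, help="path to the input data")
    parser.add_argument("--threshold", type=float, default=2.0)
    return parser.parse_args(argv)


def main(argv=None, env=os.environ):
    """Run the job and return a process exit code (0 on success)."""
    config = load_config(env)
    args = parse_args(argv)
    logging.basicConfig(
        level=config["log_level"],
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
    )
    logger.info("Starting run: input=%s threshold=%s", args.input, args.threshold)
    try:
        if not os.path.exists(args.input):
            raise FileNotFoundError(args.input)
        # The analytics code itself would be called here.
    except Exception:
        # Log the full traceback so failures can be diagnosed after the fact.
        logger.exception("Run failed")
        return 1
    logger.info("Run finished; results would go to %s", config["output_dir"])
    return 0
```

Calling `main(["--input", "data.csv", "--threshold", "3"])` would run the job against a (hypothetical) `data.csv`. Keeping logging, configuration, and parameters at the edges like this leaves the analytics code free of environment-specific detail, which makes it easier for engineers to swap in their own tooling later.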
The relationship between data scientists and engineers can be complex, but with the right planning at the start of a project or engagement, it’s possible to avoid many of the most common pitfalls.