Adam Fletcher

Data Scientist
Tech Focus

August 2, 2023

How synthetic data can speed up development

Access, governance and security are paramount when building data products. Mishandled data can lead to costly breaches and reputational damage. However, putting these controls in place takes a large amount of time, which slows development. By using synthetic data, companies can speed up development safely.

Using synthetic data as an alternative to production data means that developers can build tools in parallel with governance and security work, without putting production data at risk. This significantly reduces lead time because development no longer proceeds in a waterfall manner. It also helps to improve security testing, because synthetic data can be generated at the same scale as production data, greatly reducing project risk.

Alternative Approaches

Traditional approaches to these problems usually rely on either a small amount of manually created data or an anonymised production dataset. Neither is ideal: a small dataset won’t reflect the load demands of production, and there are proven ways to de-anonymise data and reveal sensitive information.

Synthetic data doesn’t contain any PII and can’t be de-anonymised, so it can safely be shared with teams in other territories, who can develop tools without ever needing access to production data.

Use Cases for Synthetic Data

Reduce risk for data migrations

During a data migration, synthetic data lets you test with realistic data loads, greatly reducing the risk of data leaks or errors. You can also generate more synthetic data than you currently hold to check whether the system will be able to meet future requirements.

Single customer view

If you are combining data from multiple sources, records may contain slightly different versions of the truth. An address might have changed, names could be spelled differently, or login locations could change. How do you connect these disparate data sources accurately?

Synthetic data allows you to make custom datasets containing deliberate mismatches, along with unique identifiers that act as ground truth for validating matches. This means the product’s matching capability can be tested and measured with accuracy statistics, improving its credibility.
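As a minimal sketch of this idea, the Python below builds two synthetic sources that describe the same customers, injects spelling mismatches into one of them, and scores a deliberately naive matcher against the ground-truth identifier. Everything here is an illustrative assumption rather than a real product: the field names, the name lists, the `typo_rate` parameter and the exact-match rule are all hypothetical.

```python
import random
import string

# Hypothetical name lists; a real generator would use far larger ones.
FIRST_NAMES = ["Alice", "Robert", "Catherine", "Mohammed", "Elizabeth"]
LAST_NAMES = ["Smith", "Jones", "Taylor", "Brown", "Patel"]


def drop_character(rng, text):
    """Simulate a spelling mismatch by deleting one character."""
    i = rng.randrange(len(text))
    return text[:i] + text[i + 1:]


def build_sources(n, typo_rate, seed):
    """Create two synthetic sources describing the same n customers.

    truth_id is the ground truth: it labels records so matches can be
    scored, but it is never used to decide whether two records match.
    """
    rng = random.Random(seed)
    source_a, source_b = [], []
    for truth_id in range(n):
        name = f"{rng.choice(FIRST_NAMES)} {rng.choice(LAST_NAMES)}"
        postcode = "".join(rng.choices(string.ascii_uppercase, k=2)) + str(rng.randint(1, 99))
        source_a.append({"truth_id": truth_id, "name": name, "postcode": postcode})
        # Source B holds the same customer, sometimes with a misspelled name.
        b_name = drop_character(rng, name) if rng.random() < typo_rate else name
        source_b.append({"truth_id": truth_id, "name": b_name, "postcode": postcode})
    return source_a, source_b


def match_exact(source_a, source_b):
    """Naive matcher: link records whose name and postcode are identical."""
    index = {}
    for rec in source_a:
        index.setdefault((rec["name"], rec["postcode"]), []).append(rec["truth_id"])
    pairs = set()
    for rec in source_b:
        for a_id in index.get((rec["name"], rec["postcode"]), []):
            pairs.add((a_id, rec["truth_id"]))
    return pairs


def score(pairs, n):
    """Return (precision, recall) of predicted pairs against the ground truth."""
    true_pairs = {(i, i) for i in range(n)}
    true_positives = len(pairs & true_pairs)
    # With no predictions at all, precision is reported as 0.0 by convention here.
    precision = true_positives / len(pairs) if pairs else 0.0
    recall = true_positives / n
    return precision, recall


if __name__ == "__main__":
    a, b = build_sources(n=1000, typo_rate=0.2, seed=7)
    precision, recall = score(match_exact(a, b), n=1000)
    print(f"precision={precision:.3f} recall={recall:.3f}")
```

Swapping `match_exact` for a fuzzy matcher and comparing the two scores on the same synthetic sources is one way to put numbers on a matching capability before any production data is involved.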

Offshore workers

In many cases, production data can’t leave specific territories. For example, GDPR concerns may prevent data from leaving the EU. This massively reduces the available worker pool for projects. If an offshore team can develop against synthetic data, organisations stay within their data compliance obligations while still working with representative data. The end product can then be deployed directly into production with maximum compatibility.

Machine learning and analytics

Most synthetic data reflects production data in that data types will match, as will data limits such as max / min values. However, if your workflow involves analytics or machine learning, you will need the synthetic data to follow the same distributions and correlations as the production data. Using synthetic data in this way means that data scientists can develop models that would typically require highly sensitive data in lower-security environments.

However, a machine learning model that generates synthetic data must first access and learn from the production dataset. This can be a critical blocker for some projects. Another is properly validating that the model is truly generating new values and not copying the production data.
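To make "same distributions and correlations" concrete, here is a deliberately simple Python sketch. It is not a machine learning generator: it summarises two numeric columns by their means, standard deviations and Pearson correlation, then samples new rows from a bivariate normal distribution with those statistics. That assumes the columns are roughly normal and only linearly related, which real data often is not. The function names are hypothetical, and the final check for exact copies is a weak sanity test, not proof of privacy. Note that the summary statistics still have to be computed from production data by someone with access.

```python
import math
import random


def summarise(rows):
    """Return (mean_x, mean_y, std_x, std_y, correlation) for (x, y) rows."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x, _ in rows) / n)
    std_y = math.sqrt(sum((y - mean_y) ** 2 for _, y in rows) / n)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in rows) / n
    return mean_x, mean_y, std_x, std_y, cov / (std_x * std_y)


def sample_like(stats, n, seed):
    """Draw n synthetic (x, y) rows from a bivariate normal with these stats."""
    mean_x, mean_y, std_x, std_y, rho = stats
    rng = random.Random(seed)
    # Clamp to guard against tiny negative values from floating-point error.
    residual = math.sqrt(max(0.0, 1.0 - rho ** 2))
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = mean_x + std_x * z1
        # Mixing z1 into y gives x and y the requested correlation.
        y = mean_y + std_y * (rho * z1 + residual * z2)
        rows.append((x, y))
    return rows


def copied_rows(synthetic, production):
    """Count synthetic rows that exactly duplicate a production row."""
    seen = set(production)
    return sum(1 for row in synthetic if row in seen)


if __name__ == "__main__":
    # Stand-in for a production sample (age, income); entirely made up.
    production = sample_like((40.0, 30000.0, 10.0, 8000.0, 0.6), 5000, seed=1)
    synthetic = sample_like(summarise(production), 20000, seed=2)
    print("production correlation:", round(summarise(production)[4], 3))
    print("synthetic correlation: ", round(summarise(synthetic)[4], 3))
    print("exact copies:", copied_rows(synthetic, production))
```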

Making Synthetic Data

Generating typical synthetic data doesn’t require scanning or seeing the production data, with the exception of machine learning-generated data. It does, however, need some descriptive information such as table names, column names and high-level information about the type of data and any limits or patterns.

For example, you might need to set an upper and lower limit for numbers and dates that reflects production data. When it comes to text information, it is important to know if the column contains things like names or addresses, and the pattern of any IDs – for example, 2 letters followed by 4 numbers.
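The kind of descriptive information listed above can be captured in a small schema and fed to a generator. The Python below is a sketch under stated assumptions, not a reference implementation: the `SCHEMA` layout, the column names, the limits and the list of first names are all hypothetical and would come from whoever describes the production tables.

```python
import random
import string
from datetime import date, timedelta

# Hypothetical column descriptions; no production data is read to build this.
SCHEMA = {
    "customer_id": {"type": "pattern", "letters": 2, "digits": 4},
    "balance": {"type": "int", "min": 0, "max": 50000},
    "opened_on": {"type": "date", "min": date(2015, 1, 1), "max": date(2023, 8, 2)},
    "first_name": {"type": "choice", "values": ["Alice", "Robert", "Priya", "Chen"]},
}


def generate_value(rng, spec):
    """Generate one value that respects the limits or pattern in spec."""
    kind = spec["type"]
    if kind == "int":
        return rng.randint(spec["min"], spec["max"])  # inclusive at both ends
    if kind == "date":
        span_days = (spec["max"] - spec["min"]).days
        return spec["min"] + timedelta(days=rng.randint(0, span_days))
    if kind == "pattern":
        # For example, 2 letters followed by 4 numbers.
        letters = "".join(rng.choices(string.ascii_uppercase, k=spec["letters"]))
        digits = "".join(rng.choices(string.digits, k=spec["digits"]))
        return letters + digits
    if kind == "choice":
        return rng.choice(spec["values"])
    raise ValueError(f"unknown column type: {kind}")


def generate_rows(schema, n, seed=0):
    """Generate n rows as dictionaries keyed by column name."""
    rng = random.Random(seed)
    return [
        {column: generate_value(rng, spec) for column, spec in schema.items()}
        for _ in range(n)
    ]


if __name__ == "__main__":
    for row in generate_rows(SCHEMA, 3, seed=42):
        print(row)
```

Because the generator is seeded, the same schema and seed reproduce the same dataset, and raising `n` gives production-scale volumes for load testing.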

Conclusion

Synthetic data is a vital tool for product development: it helps teams get up and running faster and build more safely, with proper load testing. If you’d like to know more about applications of synthetic data, get in touch with someone on our team.
