Data is hugely important to our clients, and at the same time the technology and tooling around it change rapidly. The landscape available to data practitioners is constantly shifting. We ask ourselves: What architectures
should we use? What ways of working should we adopt? What new tools have we found useful? What looks great
but actually gets in the way of making data useful?
These are our recommendations and insights into what we think you should use, explore or avoid when it comes to all things data.
Organisations struggle to use their data to drive decision making. Some common scenarios we see include:
Lack of trust in the data. Either information is out of date due to
persistent failures in the flow of data or the data does not reliably represent what is actually
happening in the organisation.
Inability to change the data. Existing data pipelines or reports cannot
be changed for fear of breaking the existing information, so insights fail to keep pace with the
changing organisation.
Inability to access the data. Data is locked away or held hostage in
current operational systems and cannot be accessed to provide insight and to aid decision making.
Recommendation
Adopt and use modern engineering practices when building data pipelines to manage the flow of data, so
that pipelines are reliable and can be changed safely and frequently.
Data pipelines are no different from any other kind of software and benefit from the same practices that have been proven to accelerate software delivery in other areas: infrastructure as code; configuration management of the pipeline; continuous integration and deployment; working in small batches; test-driven development; and monitoring and observation of the pipeline at various stages.
These practices have been shown to increase the ability to deliver software and to reduce rework. But in
the data world, pipelines are often not developed and supported using these best practices, leading to
failures in data pipelines, long lead times to make data accessible to end users and ultimately loss of
trust in the data. Adopting modern engineering practices in a data space, such as DataOps, means that
you can make data flow to people who need it more reliably, and create new flows more quickly, improving
your ability to understand your business operations and customers.
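To make this concrete, here is a minimal sketch of what test-driven development looks like for a single pipeline step, written in Python and runnable with pytest. The clean_orders transform and its field names are illustrative assumptions rather than part of any specific pipeline; the point is that transformation logic lives in plain code that a CI pipeline can exercise on every change.

    # A minimal sketch of test-driven development for one pipeline step.
    # The transform and field names are illustrative assumptions, not part
    # of any specific pipeline; run the test with pytest.
    from datetime import date

    def clean_orders(rows):
        """Drop rows without an order id and parse the order date."""
        cleaned = []
        for row in rows:
            if not row.get("order_id"):
                continue  # unidentifiable records cannot be joined downstream
            cleaned.append({
                "order_id": row["order_id"],
                "order_date": date.fromisoformat(row["order_date"]),
            })
        return cleaned

    def test_rows_without_an_order_id_are_dropped():
        rows = [
            {"order_id": "", "order_date": "2023-01-05"},
            {"order_id": "A1", "order_date": "2023-01-05"},
        ]
        assert [r["order_id"] for r in clean_orders(rows)] == ["A1"]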
Organisations have invested in machine learning but are not seeing tangible benefits. Some scenarios we
see are:
You can't adapt ML models fast enough or your ML models don't perform in
production as they do in 'lab' environments.
There are many Machine Learning Proofs of Concept but they take too much
time and effort to keep going or to operate at scale.
The organisation has a regulatory need to explain why a decision has been made - how do you make sure you can explain with confidence why a particular loan, medical diagnosis or risk evaluation was made?
Recommendation
Bring modern engineering practices to bear on your organisation's machine learning capabilities, so that ML-based services can be produced and operated reliably.
Data scientists are great at developing algorithms and generating insight but do not always have the software engineering skills to deploy a model to production quickly and reliably. Applying established DevOps techniques - such as Infrastructure as Code for management and deployment, monitoring and alerting of model performance, and a versioned model repository - enables rapid delivery, observability and experimentation with ML within an organisation.
Starting out with a steel thread to prove the usefulness of ML allows for quick assessment of delivered
business value without heavy upfront costs.
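As a sketch of the monitoring and alerting idea, the snippet below compares a deployed model's recent accuracy against an agreed threshold and flags it for attention when performance degrades. The metric, the threshold and the model version label are illustrative assumptions; in practice the alert would feed whatever paging or incident tooling you already use.

    # An illustrative sketch of model performance monitoring: the metric,
    # threshold and alerting mechanism are assumptions, not a prescribed stack.
    import logging

    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger("model-monitor")

    ACCURACY_THRESHOLD = 0.85  # assumed service-level objective for the model

    def accuracy(predictions, actuals):
        """Fraction of predictions that matched the observed outcome."""
        correct = sum(1 for p, a in zip(predictions, actuals) if p == a)
        return correct / len(predictions)

    def check_model_performance(model_version, predictions, actuals):
        """Log the metric and flag the model version if it drops below the SLO."""
        score = accuracy(predictions, actuals)
        logger.info("model %s accuracy=%.3f", model_version, score)
        if score < ACCURACY_THRESHOLD:
            # In a real deployment this would page a team or open an incident.
            logger.warning("model %s below threshold %.2f - consider rollback",
                           model_version, ACCURACY_THRESHOLD)
        return score

    check_model_performance("fraud-model:v12", [1, 0, 1, 1], [1, 0, 0, 1])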
Organisations have invested in data platforms/warehouses etc. but they are not being used by analysts,
data scientists, or other business users.
Typical scenarios are:
We cannot drive quality into the data for our data scientists, analysts and operations managers, and we cannot get the data into a format that works for them.
I have built a one-size-fits-all data platform but the data users don't
like it because it's not the data they want; they don't believe the quality of the data and they
don't like the way it's presented.
Recommendation
Adopt a Domain-Driven approach to the creation of data pipelines in which one team owns the quality and
the delivery of the data for that domain.
Domain-Driven Design has been a tried and tested technique in software development for many years. Its
approach generates components that have an isolated bounded context, making it easier to provide product
ownership, organisational understanding and focused delivery with fewer cross-team dependencies and
increased user engagement.
This approach has been slower to cross over into data processing, but it is one we have seen succeed for the same reasons - what works for a microservice works for a data pipeline. As experts in the domain of that data, the caretakers of the pipeline can ensure data quality, ease of use and clarity of domain
representation. The power of this approach multiplies as more domain driven data is created in the
organisation, as it is more easily shared and aggregated, sparking the creation of new business
insights.
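To illustrate the bounded-context idea in code, the sketch below shows a domain team publishing its data behind a small, explicit interface and owning the quality checks applied to it. The 'orders' domain and its required fields are invented for illustration.

    # Illustrative sketch of a domain team owning both the data it publishes
    # and the quality checks applied to it. Domain and field names are invented.

    REQUIRED_FIELDS = ("order_id", "customer_id", "order_date")

    def passes_quality_checks(record):
        """The orders domain team defines what 'good' means for its own data."""
        return all(record.get(field) for field in REQUIRED_FIELDS)

    def publish_orders(raw_records):
        """Only records meeting the domain's quality bar leave the bounded context."""
        good = [r for r in raw_records if passes_quality_checks(r)]
        rejected = len(raw_records) - len(good)
        if rejected:
            print(f"orders domain: {rejected} records held back for remediation")
        return good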
Organisations do not know whether their models are adding value or which models they should use. How can they move beyond gut instinct or HiPPO (highest paid person's opinion) decision making? Typical scenarios are:
My teams are building lots of models. How do I know if they are making a
difference to my business goals?
My data science team has built a number of models for the same purpose.
Which model should I use?
Recommendation
Use multi-variant testing to run multiple models at the same time and evaluate how they affect your
business metrics.
A/B testing combines the common-sense approach of trying out different variations at the same time, with
statistical analysis to run experiments and measure the effect on business outcomes. There are powerful
tools to support this, which help set up experiments and collect the measurements. In many cases it is
possible to run multiple models at the same time, and to move business operations to the 'winning' model
during the experiment - gaining business benefit whilst the experiment is still in flight.
Often the end result (increased revenue, reduction in fraud etc.) is too difficult or takes too long to
evaluate to be of use experimentally. Instead, identify an intermediate metric which is correlated with
the required end result.
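For example, one simple way to decide whether a difference in an intermediate metric between two candidate models is more than noise is a two-proportion z-test, as sketched below. The traffic counts, conversion numbers and the 1.96 significance threshold are illustrative assumptions.

    # Illustrative comparison of two candidate models on an intermediate metric
    # (conversion rate) using a two-proportion z-test. Counts and the 1.96
    # threshold (roughly 5% significance) are assumptions.
    from math import sqrt

    def two_proportion_z(successes_a, total_a, successes_b, total_b):
        p_a = successes_a / total_a
        p_b = successes_b / total_b
        pooled = (successes_a + successes_b) / (total_a + total_b)
        se = sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
        return (p_b - p_a) / se

    # Model A served 10,000 sessions with 420 conversions; model B 10,000 with 505.
    z = two_proportion_z(420, 10_000, 505, 10_000)
    print(f"z = {z:.2f}")  # |z| > 1.96 suggests a real difference at roughly the 5% level
    print("prefer model B" if z > 1.96 else "keep experimenting")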
Data technology architectures have become very complex.
What used to be a simple database running SQL queries is now a complex technology stack to process data at scale. I need to manage and hire for these technologies, but managing the stack and getting insights from the data is becoming complex and expensive.
Organisations have very complex data analytics architectures, which put the support of data pipelines
out of the hands of the data scientists and analysts who want to use them and who understand the data
best. How can we reduce some of the complexity of these architectures?
Recommendation
Utilise hyper-scale cloud data warehouses and the power of SQL to implement key parts of the pipeline in
a language that data analysts and scientists can use.
Modern cloud data warehouses like Snowflake or Google BigQuery allow compute and storage to scale independently, enabling data analysis over petabytes of data. We no longer need to worry about initial sizing or cluster management, because this is managed automagically by the warehouse.
Using these technologies, data processing and data analytics can be undertaken wholly in SQL. SQL is a
language well known to data analysts and data scientists - making it much easier for them to contribute
to the development, operation and improvement of the data pipelines, as well as reducing the dependency
on data engineering functions and thereby allowing organisations to move faster with their data.
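As a sketch of what 'wholly in SQL' can look like in practice, the snippet below pushes an aggregation down to BigQuery from Python; the project, dataset and table names are placeholders, and Snowflake supports the same pattern through its own connector.

    # A sketch of pushing the work down to the warehouse: the aggregation runs
    # entirely in SQL inside BigQuery. Project, dataset and table names are
    # placeholders; requires the google-cloud-bigquery package and credentials.
    from google.cloud import bigquery

    client = bigquery.Client()  # picks up project and credentials from the environment

    sql = """
        SELECT customer_region,
               COUNT(*)         AS orders,
               SUM(order_total) AS revenue
        FROM `my-project.sales.orders`
        WHERE order_date >= '2023-01-01'
        GROUP BY customer_region
        ORDER BY revenue DESC
    """

    for row in client.query(sql).result():
        print(row.customer_region, row.orders, row.revenue)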
Organisations have created architectures to bring data together in a repository but are now unable to
meet the needs of data users to access the data or to make data available.
Some common scenarios are:
It is difficult for data users to get access to new data sets.
It is difficult for data owners to easily provide data to other data
users.
There are quality or usability issues in data from data platforms.
Recommendation
Explore implementing data mesh concepts and patterns in your organisation.
The data mesh architecture focuses on creating a domain driven data platform in place of centralised
monolithic data stores. This approach embeds the learnings of digital platforms and applies them to
data, to change the view of data from a hoarded commodity to a product in its own right.
It brings a common data storage pattern (the mesh) together with Domain-Driven Pipelines and Data Product Owners to power the end-to-end delivery of self-contained data products that can be accessed the same way from analytics tools, microservices and data processing tools across an organisation's estate.
They are self-serve platforms with an underpinning Data Infrastructure as a Platform team that enables
the domain teams, and provides key common services such as security, data quality and data
discoverability.
These all combine to create an architecture where data teams are self-reliant, have clear boundaries and
direct interactions with data producers and end users, resulting in faster delivery of better quality
data products across the organisation.
Organisations know there is valuable data in their business, but it remains unusable or accessible only
to very few people or no-one at all.
Valuable data from business processes is ignored, trapped or treated
only as useless exhaust. Organisations see data “held hostage in on-prem systems”.
Data is not usable because of gaps in the data or poorly formed or structured data. For example, phone numbers and addresses are not consistently formatted for use, while data codes and categories are poorly understood.
The same or similar data sets are being created many times in the
organisation.
Recommendation
Explore thinking about your data as products, with end users who should be treated like any other end users. Have dedicated people whose job is to make sure the data can be used by the people who need it.
Data as a product is a domain-bounded isolated data-set that has value to data users - such as database
table(s) or an API. The users of a data product interact with it in an ad-hoc manner that isn't guided
by a specific set of user interactions. This is what gives a data product the power to enable a
data-driven organisation. For example, a single data product may be:
Surfaced through a business intelligence tool for user generated
reporting.
Joined by data scientists with other data products to enable Machine
Learning.
Brought into an operational data store for real-time usage by a microservice.
Leveraged by data engineers in a data pipeline to create new data
products.
Data products need to be valuable to their users - they must be useful data; and they need to be
trusted. Like the product owner role in application development, data product owners are accountable for
making sure the data is successful and meets these needs. They make sure the data is accessible to the
analysts, data scientists or business users who need it in a form that is right for them to use, and
that it is of the quality required by them.
Thinking about data as a product with customers and assigning data product owners allows organisations to innovate and move faster, because data governance is performed by someone who is accountable for this specific data domain and the value created by it.
Organisations create data architectures and platforms but these get stuck in complex architectures that
impede their ability to make data available to data analysts and scientists and other data users.
Recommendation
Explore implementing a paved road for the creation of data pipelines.
The paved road approach has been successful in accelerating the delivery of digital services. We recommend applying it to data pipelines. Create a 'Hello World' base repo with a simple pipeline from ingestion through to end-user access, which includes observability, testing and a readme/wiki, so that delivery teams can rapidly put together new pipelines.
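What goes into that base repo will vary by organisation, but the skeleton below gives a feel for the minimum: an ingest step, a transform step and a publish step, with logging standing in for observability, ready for a team to copy and fill in. All the step contents here are placeholders.

    # Illustrative skeleton for a 'Hello World' pipeline in a paved-road base repo:
    # ingest -> transform -> publish, with logging as a stand-in for observability.
    # Step contents are placeholders for teams to replace.
    import logging

    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger("hello-world-pipeline")

    def ingest():
        logger.info("ingest: reading from source")
        return [{"id": 1, "value": "hello"}, {"id": 2, "value": "world"}]

    def transform(records):
        logger.info("transform: processing %d records", len(records))
        return [{**r, "value": r["value"].upper()} for r in records]

    def publish(records):
        logger.info("publish: writing %d records to the serving layer", len(records))
        for record in records:
            print(record)

    if __name__ == "__main__":
        publish(transform(ingest()))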
With the exponential growth of data, and data work being spread all over an organisation in different
teams, there is a need to have some kind of uniformity and shared practices. Paved Road for Data
empowers teams to work on data by leveraging a set of self-serve tools and best practices. It means that
teams can work easily with data, without losing the uniformity around tools and architectures inside an
organisation. It fits the Data Mesh architecture, which advises having a Data Infrastructure as a Platform team that is domain agnostic and focuses just on creating the paved road. This is different from centralising data engineering; the team sits horizontally across the organisation but acts as a facilitator to other teams in a self-serve approach.
Organisations want to find insights about their business but the data is spread over many systems.
Recommendation
Explore the latest tools for making federated queries over many data sources.
Organisations always struggle with getting the right data to the right users at the right speed. One key reason for this is that the data needed by users is spread over multiple sources. Traditional ETL and
data warehousing models worked well for reporting purposes, but with the growth of advanced analytics
use cases and near real-time data needs, these models are proving unsatisfactory.
Data virtualisation products can remove the ETL requirements by enabling federated queries across your data estate. Products in this growing space include AWS Athena, Denodo, Dremio and Trino. The ability to create a catalog of available curated data, de-couple users and sources, and provide a single rich interface enables you to share data efficiently and safely, supporting faster reporting and analytics.
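As an illustration of the shape of a federated query, the sketch below joins a table in a Hive catalog with one in a PostgreSQL catalog in a single statement using the Trino Python client; the host, catalogs, schemas and table names are placeholders for your own estate.

    # A sketch of a federated query with the trino package: one SQL statement
    # joins a table in a Hive catalog with one in a PostgreSQL catalog.
    # Host, catalogs, schemas and table names are placeholders.
    import trino

    conn = trino.dbapi.connect(host="trino.example.internal", port=8080,
                               user="analyst", catalog="hive", schema="sales")
    cur = conn.cursor()
    cur.execute("""
        SELECT o.customer_id, c.segment, SUM(o.order_total) AS revenue
        FROM hive.sales.orders AS o
        JOIN postgresql.crm.customers AS c ON c.id = o.customer_id
        GROUP BY o.customer_id, c.segment
    """)
    for customer_id, segment, revenue in cur.fetchall():
        print(customer_id, segment, revenue)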
Inability to rapidly find the data needed for analysis or insight
generation.
Data pipelines re-implemented many times over.
Recommendation
Explore the use of data catalogues/data discovery tools - find the right approach for your organisation.
One way to address this is by creating documentation for the data. Some organisations use tools like spreadsheets or wikis, while others use fully manual tools made for the purpose. However, such documentation tends to be avoided, forgotten and consequently outdated. A new type of centralised data catalog is emerging which uses automation to generate the catalog by looking into data lineage and usage patterns. Being able to know which datasets are available to explore,
to know who owns the datasets and how they were generated (data lineage) and used, empowers data
scientists and data analysts with a self-serve way to discover and explore the full value of data.
The ease and speed with which data can be explored and new products built is key in the data landscape of data-driven organisations, so a data catalog which relies on automation, and not only on human-made documentation, is a must-have to empower data discoverability.
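The sketch below shows the flavour of that automation: crawl the tables a database already knows about and emit catalog entries, rather than asking people to keep a wiki up to date. It uses an in-memory SQLite database purely so the example is self-contained; a real catalog would crawl your warehouses and attach ownership, lineage and usage information.

    # Illustrative automation for a data catalog: crawl table and column metadata
    # rather than relying on hand-written documentation. SQLite is used here only
    # to keep the sketch self-contained and runnable.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE orders (order_id TEXT, customer_id TEXT, order_total REAL);
        CREATE TABLE customers (customer_id TEXT, segment TEXT);
    """)

    catalog = []
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'").fetchall()
    for (table_name,) in tables:
        columns = [row[1] for row in conn.execute(f"PRAGMA table_info({table_name})")]
        catalog.append({"table": table_name, "columns": columns})

    for entry in catalog:
        print(entry)  # a real catalog would also record owners, lineage and usage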
Enterprise Data Models aim to provide a model of the key data items passing through an organisation.
Often they feature as activities early on in data programmes as a critical artifact to inform the design
and implementation of data architectures. This is a laudable aim, but adopting this approach leads to
these sorts of scenarios:
Long delays in delivering value to users:
An overemphasis on developing a monolithic EDM leads to it becoming an IT deliverable. This incurs a long implementation period, leading to loss of business engagement and a slowdown in innovation. In the worst case the model is never finished.
The model is not adopted by developers:
The EDM takes too long to develop, so we also see the EDM being out of sync with the actual
implementation as teams work around this slow monolithic process and adopt their own niche models to
meet their needs. The model becomes a zombie artifact.
Recommendation
Avoid the development of an enterprise data model before data can be ingested into the system.
Rather, starting with the needs of the data users in the initial use cases will lead you to a useful data model that can be iterated and developed over time. An overemphasis on developing a monolithic EDM can often lead to it becoming an IT deliverable that incurs a long implementation period, leading to loss of business engagement and a slowdown in innovation.
Instead focus on building a Contributory Data Catalog that is built up as data of business value is
ingested and utilised. This will grow into a vibrant and high value EDM that provides value to the
business and enables faster innovation through increased trust in the data.
Centralising all data engineering in one function or team increases the distance from the users,
increases time to value and leads to these sorts of scenarios:
Data engineering is seen as a blocker rather than as an enabler.
A focus on technology rather than business needs because the central
team lacks the domain context of how the data should be used.
Exhausted or demotivated engineering teams - the weight of an ever-growing backlog and the feeling that they are never meeting the needs of the business leads to frustration and demotivation in the centralised team.
Work seems to be prioritised according to data platform needs rather
than the data users.
Centralised teams make the organisation less responsive to change and are at a distance from the concerns and needs of the data users. They usually lack the context to fully understand the data being processed, so they are less likely to spot issues in the data and tend to have a simplistic view of data quality (e.g. 'the field must not be empty') rather than understanding what good quality actually looks like for an element of data.
Recommendation
Avoid having a central function whose job is to service all needs for data users.
Instead, find ways of moving the work of providing data access closer to the end users - ideally ones
that enable self-service of data provision by engineers attached to data product teams. In the Team Topologies world this might be an engineer in a stream-aligned team which is creating a data product using services provided by a data platform team.
Too many organisations have fallen foul of the problem where IT has bought a product to solve a problem
and it hasn't been adopted. The data area is no different. There is no shortage of vendors promising one
stop solutions for data platforms or promising that you can manage all your dashboards and data in one
place. But when they are implemented they fail to deliver the expected benefits.
The products have been well researched and are rated, for example, as “best of breed” or “Technology Leaders” by respected technology evaluators - so why haven't they been embraced by the organisation? Meeting business needs, and so responding to market conditions or user needs, is put on the back burner while the technology initiative is implemented. Time and time again we see these initiatives take longer and cost more than expected because the focus is not on meeting a business need. Instead they get tied up in meeting technical milestones unconnected with users before even considering a valuable use case.
Recommendation
Strategies should start by understanding the business problems you want to address with your data, and they should consider people and processes as well as the technology. Understand how the business wants to work (process) and how people will use the data (people); you can then ensure the products (technology) being chosen will meet the business needs. Of course, there are many great off-the-shelf products out there, which can be very beneficial. But they will not be the full solution. Before you commit to one, understand its boundaries and what skills (people - again!) you will need to make it work. We really recommend doing some technical spikes to get a feel for what the product can do and how you can include it in your development lifecycle (it's easy to forget about this point) before making your choice.
Environments in which the data pipelines can be constructed using visual programming approaches, such as
drag and drop of components onto a canvas, have made it easier for non-coders to create data pipelines
and democratise data.
Why should you avoid it?
Whilst we applaud the goals of these platforms to improve self-service of data pipelines, they often
create challenges longer-term. The Continuous Delivery movement has identified a number of drivers for
accelerating delivery and operation of software - practices such as Test Driven Development and Test
Automation for deployment, working in small batches, and infrastructure as code.
Most WYSIWYG platforms are not developed with these approaches in mind. They are difficult to integrate
into CI/CD/TDD approaches and infrastructure. For example, they are typically not provisioned with an
ability to create unit-tests, making continuous deployment and low-risk upgrade or maintenance
difficult. They can be difficult to place under version control and when they are, it is typically not
possible to see the changes between commits. In some cases monitoring and alerting are not easy to
integrate with the platforms.
These challenges make it difficult to maintain trustworthy pipelines. In our experience it is almost
impossible to maintain the generated code that these tools produce. Whilst a simplistic pipeline is easy
to demonstrate, it is remarkable how many projects using these tools still require specialised
consultants to be involved long after the initial setup period.
Recommendation
We prefer to create our pipelines in code as this allows us to benefit from all the Continuous Delivery
development techniques that software engineers have discovered are the most efficient ways to create
high quality, trustworthy software. However, if you do choose to go down the WYSIWYG route, we have
found that these tools require more Quality Assurance time and effort as testing is shifted right. You
typically cannot version control to the same degree as you can with a code solution, but try to find
opportunities to apply it where you can. For example, we have found that Infrastructure as Code using
tools like Terraform can be applied fruitfully and tools like Liquibase can be used to manage the SQL
(which you will almost certainly need to work with) and automate some of the QA testing.
Data Lakes are highly scalable data storage areas which hold data in many different formats - unstructured, semi-structured (e.g. JSON files) or structured (e.g. Parquet files). Because they are simple storage areas it can be very easy to ingest data, which is attractive, but it can be tempting to ingest lots of data with an ‘if we collect it they will come’ mentality. A data lake will often also lack key features which make data usable, such as discoverability or appropriate partitioning - all of which leads to the dismissal of Data Lakes as ‘Data Swamps.’
Data is the lifeblood of business. A data lake or data warehouse is a way of storing some of that
business data but the focus needs to be on the business and its requirements, not the building of the
store or the ability to swallow massive amounts of data.
Recommendation
Focus on the business requirements and use them in conjunction with curated pipelines or distributed SQL
tools.
Many of the failings of data lakes are not about the chosen technology or the designs that have been implemented; it's that they are built with a focus on ingesting large amounts of data rather than on providing data for end users. So instead of focusing on building a data lake, focus on delivering a domain-focused data platform, with use cases for the data that meet the needs of the business and can be built out.
We are not saying never use Data Lakes - we have seen them successfully used as part of an ELT (extract, load, transform) architecture as the landing zone for raw data from source systems - the so-called Lakehouse architecture. They can also be a great choice if the data has a consistent structure and you can apply tools like Presto or AWS Athena to provide querying and discoverability services.