- May 12, 2022
- 6 minutes
- Michal Rachtan
Having high-quality and easily accessible data is increasingly a necessity for organizations to stay competitive and create business value. We rely on data to drive many important decisions: from optimizing live production processes, to forecasting demand in the manufacturing supply chain, and ultimately to the way goods are ordered and distributed to consumers. With recent exogenous shocks such as pandemics, war, and supply chain disruptions, timely and accurate information is more important than ever before.
This is not a simple problem. Poor data quality can have catastrophic consequences. A striking example is the British National Health Service (NHS), which lost thousands of COVID test results in 2020, setting back the UK’s fight against COVID-19. This disastrous error came about because the NHS used a legacy software solution (Excel) as part of its data pipeline – a tool ill-suited to large-scale data problems.
Such data errors and outages can severely impact the quality of decisions made. Yet we still see many organizations, including large ones, struggling with three aspects of data operations:
(i) inadequate and/or incomplete data architecture
As the NHS example clearly shows, many organizations lack the complete toolset needed to build reliable data pipelines. Many data lake/platform products still focus on providing only basic storage, compute, and scheduling functionality, neglecting vital aspects of maintaining high-quality data assets.
(ii) insufficient in-house skills
The level of experience required to operate a modern data stack is not negligible, and many organizations struggle to bridge this talent gap. Cloud-native platforms are composed of many separate services working together in a coordinated fashion to provide complete platform functionality – authentication, access management, job scheduling, data quality monitoring, and so on. This complexity requires a strong engineering team to manage and make sense of it.
(iii) lack of a standard data operating model and data governance
Most organizations approach data analytics opportunistically – build now and prove the value first. While this approach makes sense as a starting point, as more teams get involved in building data products there is a growing need to unify best practices and set ground rules across the organization.
The tech industry is naturally evolving to address these challenges by building better tools and practices, such as DataOps, that scale data infrastructure to this new size and complexity.
DataOps to the rescue
Data Operations, also known as DataOps, is an extensive collection of practices and tools that aims to ensure that every step of data delivery is reliable and consistent – from collection and ingestion into a data processing system, through transformations, to the moment the data is exposed and finally consumed. These practices mean that every data consumer can reliably discover (e.g. through a data catalog), access, understand, and use the data. To achieve this, good DataOps practices cover four key aspects: discoverability, access, quality, and observability.
Data discoverability and access
Discoverability means that data can be cataloged, then searched and found by users. This need is usually addressed by data catalogs – products such as Collibra, Alation, or Atlan – which provide Google-like search capabilities and allow you to collect metadata and curate information about the datasets available in your organization. Some tools go even further and manage access to resources by letting users trigger an access request. The request includes the purpose of use, a justification, and the time interval for which access is required. Once submitted, it goes through an appropriate escalation and approval workflow, ensuring proper security measures and access controls are respected. This process of adding the desired data assets to your “basket” is sometimes described as data shopping.
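As a purely illustrative sketch (not any particular vendor’s API), such a time-boxed access request might carry fields like the following – the dataset, requester, and field names are all assumptions:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class AccessRequest:
    """A simplified 'data shopping' request as a data catalog might capture it."""
    dataset: str          # e.g. "sales.orders" – illustrative dataset name
    requester: str
    purpose: str          # what the data will be used for
    justification: str    # business justification reviewed in the approval workflow
    access_from: date
    access_until: date    # access is time-boxed rather than granted permanently

request = AccessRequest(
    dataset="sales.orders",
    requester="analyst@example.com",
    purpose="Demand forecasting for the next quarter",
    justification="Input data for the supply-chain forecasting model",
    access_from=date(2022, 6, 1),
    access_until=date(2022, 9, 1),
)
# A catalog would route this request through escalation and approval
# before granting time-limited access to the dataset.
```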
Data quality (DQ)
Detecting quality issues in data is both essential and tedious to implement. Good DataOps practice suggests that any data update and any logic change in a pipeline should be covered by relevant tests, data quality checks, and health checks – much as you would test any code being merged back into the production codebase. Key things to monitor are data freshness, value distribution (e.g. range of values, number of NULL values), schema validity, and data integrity – things like primary key uniqueness (also called entity integrity) and referential integrity of foreign keys.
Unit8’s six pillars of data quality engineering
These may seem like simple or common-sense steps that could easily be skipped, but applied consistently across all pipeline steps and runs they can catch a large portion of data errors and outages. Fortunately, declarative approaches to data quality exist, and open-source projects such as Great Expectations make them easy to apply – a minimal sketch of such checks follows below.
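To make this concrete, here is a minimal, hand-rolled sketch of the checks described above, run against a hypothetical orders table (the column names and thresholds are assumptions); tools like Great Expectations let you declare equivalent rules instead of coding them by hand:

```python
import pandas as pd

def run_basic_quality_checks(df: pd.DataFrame, max_age_hours: int = 24) -> list[str]:
    """Return human-readable data quality failures for a hypothetical orders table."""
    failures = []

    # Freshness: the newest record should not be older than the allowed threshold
    latest = pd.to_datetime(df["updated_at"], utc=True).max()
    if pd.Timestamp.now(tz="UTC") - latest > pd.Timedelta(hours=max_age_hours):
        failures.append(f"Stale data: latest record is from {latest}")

    # Completeness: key business columns should not contain NULLs
    null_counts = df[["order_id", "customer_id", "amount"]].isna().sum()
    for column, count in null_counts[null_counts > 0].items():
        failures.append(f"{count} NULL values in column '{column}'")

    # Entity integrity: the primary key must be unique
    if df["order_id"].duplicated().any():
        failures.append("Duplicate values in primary key 'order_id'")

    # Value distribution: amounts should fall within a plausible range
    if not df["amount"].between(0, 1_000_000).all():
        failures.append("Column 'amount' has values outside the expected range")

    return failures
```

In practice such checks would run after every pipeline step, failing the run or raising an alert whenever the returned list is non-empty.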
Data observability
Upstream dataset dependencies can easily get lost in a complex data pipeline, making failure points hard to find and diagnose. This lack of transparency slows down problem-solving and leads to long data outages, because we are unsure what processing steps even happened or how the data was sourced. Data lineage lets you discover all downstream and upstream dependencies of a particular dataset and immediately find out how a table was built, what input data were used to build it, and what downstream consumers could be impacted by any problems with it. This is one of those core capabilities you absolutely need to integrate into your data stack in order to understand and troubleshoot the root cause of data quality issues.
Monocle, a data lineage tool of Palantir Foundry
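Under the hood, lineage is essentially a directed graph of datasets and the jobs that connect them. A minimal sketch of the upstream/downstream queries described above, using networkx and made-up dataset names:

```python
import networkx as nx

# Directed graph: an edge A -> B means dataset B is built from dataset A.
# All dataset names are illustrative.
lineage = nx.DiGraph()
lineage.add_edges_from([
    ("raw_orders", "clean_orders"),
    ("raw_customers", "clean_customers"),
    ("clean_orders", "sales_report"),
    ("clean_customers", "sales_report"),
    ("sales_report", "executive_dashboard"),
])

# Upstream dependencies: everything 'sales_report' was built from
print(nx.ancestors(lineage, "sales_report"))
# e.g. {'raw_orders', 'raw_customers', 'clean_orders', 'clean_customers'}

# Downstream impact: every consumer affected if 'clean_orders' breaks
print(nx.descendants(lineage, "clean_orders"))
# e.g. {'sales_report', 'executive_dashboard'}
```

Lineage tools such as the one pictured above maintain and visualize this kind of graph automatically across the platform, so nobody has to curate it by hand.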
Data observability practices and tools ensure that important information about the state of the data pipeline is collected and available before a critical error even occurs – for example, data provenance, pipeline run time statistics, logs, and traces. Combined with automated monitoring and alerting, this approach can dramatically reduce troubleshooting time and, in turn, “data downtime”.
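A minimal sketch of that idea – recording simple run statistics for every pipeline step and alerting on failures or suspiciously small outputs. The metric names, thresholds, and logging-based “alerting” are assumptions for illustration only:

```python
import logging
import time
from dataclasses import dataclass
from typing import Callable

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("pipeline_observability")

@dataclass
class RunStats:
    step: str
    duration_seconds: float
    rows_written: int
    succeeded: bool

def observe(step_name: str, transform: Callable[[], int], expected_min_rows: int = 1) -> RunStats:
    """Run a transform, record basic run statistics, and alert on anomalies."""
    start = time.monotonic()
    try:
        rows_written = transform()
    except Exception:
        # Record the failed run before propagating the error
        logger.exception("ALERT: step %s failed", step_name)
        logger.info("run stats: %s", RunStats(step_name, time.monotonic() - start, 0, False))
        raise

    stats = RunStats(step_name, time.monotonic() - start, rows_written, True)
    logger.info("run stats: %s", stats)
    if stats.rows_written < expected_min_rows:
        logger.warning("ALERT: step %s wrote only %d rows", step_name, stats.rows_written)
    return stats

# Usage: observe("clean_orders", lambda: 42_000, expected_min_rows=10_000)
```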
Technology market for DataOps
As a natural evolution of the DataOps trend, the technology market has moved to address these new needs. We observe two key ways in which the market is responding:
(i) “all-in-one” unified analytics platforms
These software products aim to “do it all” – from data prep to machine learning, MLOps, visualization, and DataOps. Having everything bundled into one platform democratizes data use and makes collaboration easy. These well-integrated ecosystems, such as Palantir Foundry and Dataiku, enable different groups of users, technical and business-oriented, to work together on the same platform. These catch-all products are easy to set up and ready to perform the variety of tasks expected of an enterprise-grade data analytics stack. Key DataOps functionalities like data lineage, data quality & health checks, and data protection controls are well integrated and baked into these systems. However, such platforms can be harder to extend, so in a way you trade some freedom and flexibility (e.g. choice of programming language, data processing framework, database type) for the simplicity of an integrated data ecosystem.
(ii) unbundled solutions
This group of products aims to cover the different stages of the data journey function by function – from extraction (Fivetran, Airbyte) to data warehousing (Snowflake, Databricks) to business intelligence (Metabase, Lightdash) to DataOps (dbt, Monte Carlo Data, Bigeye, Databand). Together they offer the market a coherent set of solutions that integrate with one another. In theory, this means you can use more specialized tools and expand freely without being tied down to a single closed-box solution. However, the engineering effort required to integrate and operate such solutions is not negligible. We also see a lot of acquisitions and consolidation happening in this space, with bigger players eager to spend to complement their bundled solutions – Databricks acquired Redash, Snowflake acquired Streamlit. The dynamism of this product space and the level of expertise required for integration might make this path less suitable for a big enterprise customer.
Technology is not enough!
But having the correct technology package isn’t enough. You need to combine it with the right skills and practices to deliver data insights truly reliably. One way to make this possible is to approach data operations as a software problem, that is to:
- automate or eliminate everything that is time-consuming and repetitive
- use specialist knowledge to analyze data pipelines and proactively eliminate weaknesses and potential failure points
- define a set of practices leading to higher reliability rather than focusing only on what’s strictly necessary – for example, platform-wide standards for data quality & pipeline health checks (see the sketch after this list)
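One way such a platform-wide standard could look in practice is a small, declarative contract that every dataset must satisfy, enforced automatically at build or CI time. The check names and configuration below are hypothetical, purely to illustrate the idea:

```python
# A hypothetical platform-wide standard: every dataset must declare at least
# these health checks, which the platform then applies on every pipeline run.
PLATFORM_DQ_STANDARD = {
    "required_checks": ["freshness", "primary_key_unique", "non_null_key_columns"],
    "max_staleness_hours": 24,
    "on_failure": "block_downstream_and_alert",
}

def validate_dataset_config(dataset_config: dict) -> list[str]:
    """Flag datasets that opt out of the mandatory, platform-wide checks."""
    declared = set(dataset_config.get("checks", []))
    return [
        f"dataset '{dataset_config['name']}' is missing mandatory check '{check}'"
        for check in PLATFORM_DQ_STANDARD["required_checks"]
        if check not in declared
    ]

# Run in CI, so a pipeline that skips the standard checks never ships:
print(validate_dataset_config({"name": "sales_report", "checks": ["freshness"]}))
```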
Is this a task for your production support team? Not quite. The modern data stack is an inherently complex system composed of tens of microservices and distributed processing frameworks. Your team needs the deep skills and the time necessary for extensive triaging and problem-solving, while at the same time building and introducing common good practices across the organization and its product teams. This new type of team is sometimes called Data Reliability Engineering (DRE). It is a sister discipline to Site Reliability Engineering (SRE) – a term coined by Google to describe its approach to infrastructure operations – applied to the world of data.
The layout of capacity allocation for DRE teams
DREs usually split their time 50:50 between operations activities, such as on-call duty, and project work aimed at reducing operational overhead, also called “toil”. The project work is about finding and fixing root issues, developing better ways of coping with failures, and building the tooling and automation that supports all of this. As a result, the DataOps processes become more and more refined and, in turn, more reliable over time.
A correct balance between operations and project work is essential for DRE teams. It makes all the difference between a healthy team and unsustainable engineering with burned-out crews.
So now you know how to ensure you have reliable data at the right time and can make the right decisions even in times of crisis. You need the right technology, skilled staff, and a crystal-clear strategy for your data operations. But it can feel like a zoo out there, with all of these software options, team roles, and systems. Assembling it yourself can be challenging and can take a long time: setting up the right technology package, hiring skilled staff, and implementing good DataOps practices. Finding a competent and reliable technology partner might be just the right move to bootstrap and accelerate your efforts to make your data more dependable.