
What Is Data Quality?

Data quality measures how well data serves its intended purpose. High-quality data is accurate, complete, consistent, and timely enough to support the decisions, analytics, and AI models that depend on it. Low-quality data produces wrong answers, broken dashboards, and AI hallucinations — regardless of how sophisticated the tools that consume it are.

Data quality is not a binary state. A dataset can be perfectly adequate for trend analysis but fatally flawed for regulatory reporting. Quality is always measured relative to a specific use case, which is why data quality management is inseparable from broader data governance: you need clear ownership, defined standards, and agreed-upon business definitions before you can meaningfully measure whether data meets them.

TL;DR

Data quality measures how well data serves its intended purpose across six dimensions: accuracy, completeness, consistency, timeliness, uniqueness, and validity. It sits under the data governance umbrella and is increasingly becoming a native capability of data platforms like Databricks, Microsoft Fabric, and Snowflake. Data catalogs are evolving to integrate with these platform-native DQ capabilities rather than replacing them.

The Six Dimensions of Data Quality

Data quality is typically evaluated across six core dimensions. Each dimension captures a different aspect of what makes data fit for use, and different business contexts weight these dimensions differently.

[Figure: Six dimensions of data quality — Accuracy (does the data reflect the real world?), Completeness (are all required values present?), Consistency (do values agree across systems?), Timeliness (is the data current enough?), Uniqueness (are there no unwanted duplicates?), Validity (does data conform to defined formats and rules?)]

Accuracy

Accuracy measures whether data values correctly represent the real-world entities or events they describe. A customer address that still lists a street that was renamed three years ago is inaccurate. A product price that is off by one decimal place is inaccurate. Accuracy is the most intuitive dimension, but also one of the hardest to measure at scale because it requires comparison against a trusted source of truth.
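Because accuracy requires a trusted source of truth, it is often computed as a match rate between observed values and a verified reference set. A minimal sketch, with illustrative record keys, fields, and data:

```python
# Hedged sketch: measuring accuracy by comparing observed values against a
# trusted reference source. Keys, fields, and values are illustrative.

reference = {  # trusted source of truth, e.g. verified master data
    "C001": {"city": "Brno", "price": 19.99},
    "C002": {"city": "Prague", "price": 4.50},
}

observed = {  # values as they appear in the system under test
    "C001": {"city": "Brno", "price": 199.9},   # decimal shifted -> inaccurate
    "C002": {"city": "Prague", "price": 4.50},
}

def accuracy_score(observed, reference):
    """Fraction of field values that match the trusted reference."""
    checked = correct = 0
    for key, ref_record in reference.items():
        for field, ref_value in ref_record.items():
            checked += 1
            if observed.get(key, {}).get(field) == ref_value:
                correct += 1
    return correct / checked if checked else 1.0

print(accuracy_score(observed, reference))  # 3 of 4 fields match -> 0.75
```

The hard part in practice is not this comparison but obtaining and maintaining the reference set itself.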

Completeness

Completeness measures whether all required data values are present. A customer record missing an email address is incomplete. A financial transaction without a timestamp is incomplete. Completeness thresholds depend on context: a 95% fill rate for optional fields might be acceptable, while a 99.9% fill rate for mandatory regulatory fields might not be enough.
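Completeness is typically monitored as a per-field fill rate compared against a context-dependent threshold. A minimal sketch, with illustrative fields and thresholds:

```python
# Hedged sketch: per-field fill rates checked against context-dependent
# thresholds. Field names and threshold values are illustrative.

records = [
    {"customer_id": "C1", "email": "a@example.com", "phone": None},
    {"customer_id": "C2", "email": None,            "phone": "555-0100"},
    {"customer_id": "C3", "email": "c@example.com", "phone": "555-0101"},
    {"customer_id": "C4", "email": "d@example.com", "phone": None},
]

def fill_rate(records, field):
    """Fraction of records where the field is present and non-empty."""
    filled = sum(1 for r in records if r.get(field) not in (None, ""))
    return filled / len(records)

# Mandatory fields demand a higher bar than optional ones
thresholds = {"customer_id": 1.0, "email": 0.95, "phone": 0.50}
for field, minimum in thresholds.items():
    rate = fill_rate(records, field)
    status = "OK" if rate >= minimum else "BELOW THRESHOLD"
    print(f"{field}: {rate:.0%} (min {minimum:.0%}) {status}")
```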

Consistency

Consistency measures whether the same data values agree across different systems, databases, and reports. When the CRM says a customer has 12 active licenses and the billing system says 14, one of them is inconsistent. Consistency problems are among the most common data quality issues in enterprises with multiple source systems, and they erode trust faster than almost any other quality failure.
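The CRM-versus-billing example above can be sketched as a cross-system reconciliation that flags disagreeing values. System names and figures are illustrative:

```python
# Hedged sketch: flagging records whose values disagree across two systems,
# mirroring the CRM-vs-billing example in the text. Data is illustrative.

crm     = {"C001": 12, "C002": 7, "C003": 3}  # active licenses per customer
billing = {"C001": 14, "C002": 7, "C003": 3}

def inconsistencies(a, b):
    """Keys present in both systems whose values disagree."""
    return {k: (a[k], b[k]) for k in a.keys() & b.keys() if a[k] != b[k]}

print(inconsistencies(crm, billing))  # {'C001': (12, 14)}
```

Note that a check like this can only tell you *that* the systems disagree, not which one is right; resolving that requires an agreed source of truth.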

Timeliness

Timeliness measures whether data is available and current enough for its intended use. A stock price that is 15 minutes old is timely for trend analysis but stale for trading. A customer churn prediction model that runs on monthly data might miss customers who are leaving right now. Timeliness requirements vary dramatically by use case.
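A freshness check compares a record's last-update timestamp against a staleness budget that differs per use case. A minimal sketch, with illustrative budgets matching the stock-price example:

```python
# Hedged sketch: a freshness check against a use-case-specific staleness
# budget. The budgets mirror the stock-price example: fine for trend
# analysis, stale for trading.
from datetime import datetime, timedelta, timezone

def is_timely(last_updated, max_age, now=None):
    """True if the data is younger than the allowed staleness budget."""
    now = now or datetime.now(timezone.utc)
    return now - last_updated <= max_age

now = datetime(2025, 6, 1, 12, 0, tzinfo=timezone.utc)
price_updated = now - timedelta(minutes=15)

print(is_timely(price_updated, timedelta(hours=1), now))    # True
print(is_timely(price_updated, timedelta(seconds=5), now))  # False
```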

Uniqueness

Uniqueness measures whether each entity is represented only once in a dataset. Duplicate customer records, duplicate transactions, and duplicate product entries cause inflated metrics, redundant communications, and incorrect aggregations. Entity resolution — the process of identifying and merging duplicate records — is one of the most technically challenging aspects of data quality management.
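The simplest uniqueness check groups records on a business key and flags values that occur more than once; real entity resolution goes much further with fuzzy matching across names, addresses, and identifiers. A minimal exact-key sketch with illustrative data:

```python
# Hedged sketch: exact-key duplicate detection, the simplest uniqueness
# check. Real entity resolution uses fuzzy matching; this is the base case.
from collections import Counter

transactions = [
    {"txn_id": "T1", "amount": 100},
    {"txn_id": "T2", "amount": 250},
    {"txn_id": "T1", "amount": 100},  # duplicate entry inflates totals
]

def duplicate_keys(records, key):
    """Business-key values that occur more than once."""
    counts = Counter(r[key] for r in records)
    return {k for k, n in counts.items() if n > 1}

print(duplicate_keys(transactions, "txn_id"))  # {'T1'}
```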

Validity

Validity measures whether data conforms to defined formats, types, and business rules. A date field containing "32/13/2025" is invalid. An age field with a negative number is invalid. A country code that does not exist in ISO 3166 is invalid. Validity checks are the most automatable dimension of data quality because they can be expressed as deterministic rules.

Why Data Quality Matters

Every downstream system inherits the quality of its input data. Poor data quality does not just create wrong numbers; it creates wrong decisions, broken trust, and wasted resources.

Poor data quality costs organizations an average of $12.9 million per year. The largest component is labor wasted on finding, correcting, and reconciling data that should have been governed at the source.

— Gartner, How to Improve Your Data Quality

AI and machine learning

AI models are only as good as the data they learn from. A recommendation engine trained on product data with inconsistent categories will produce nonsensical suggestions. A fraud detection model trained on data with duplicate transactions will learn false patterns. Data quality is among the strongest predictors of AI project success or failure — organizations that invest in data quality before investing in AI consistently outperform those that do not.

Analytics and reporting

When business leaders cannot trust the numbers in their dashboards, they either make decisions on gut instinct (defeating the purpose of analytics) or waste hours manually verifying data (defeating the purpose of automation). Trust in analytics is built on data quality, and once lost, it takes significant effort to rebuild.

Regulatory compliance

Regulations like GDPR, BCBS 239, and SOX impose specific requirements on data accuracy, completeness, and lineage. Submitting inaccurate regulatory reports carries financial penalties and reputational damage. Data quality management provides the evidence trail that compliance requires.

Operational efficiency

Data quality issues propagate through systems and workflows. A single incorrect product code can cascade through inventory management, order fulfillment, financial reconciliation, and customer communication. The cost of fixing data quality at the source is a fraction of the cost of fixing its downstream effects.

Data Quality Under the Data Governance Umbrella

Data quality does not exist in isolation. It is one of the core pillars of data governance — the organizational framework of policies, roles, and standards that makes data trustworthy. Without governance, data quality efforts are tactical: teams fix problems as they find them, without addressing the systemic causes.

[Figure: Data quality within data governance — the governance framework (ownership, policies, business glossary, compliance) encompasses data quality, metadata management, and lineage. Data quality depends on governance context: who owns the data, what standards apply, and what "correct" means.]

Governance provides the context that makes data quality actionable:

  • Ownership — governance defines who is accountable for the quality of each data domain. Without clear ownership, quality issues become everyone's problem and nobody's responsibility.
  • Business definitions — a business glossary establishes what each term means, which determines what "accurate" and "consistent" look like. You cannot measure consistency between two systems if you have not agreed on what a "customer" or "revenue" is.
  • Data lineage — when a quality issue is found, lineage tells you where the problem originated, what systems are affected, and where to fix it. Without lineage, quality remediation is guesswork.
  • Policies and standards — governance defines the acceptable quality thresholds for different data classes. Financial data might require 99.99% accuracy; marketing analytics might tolerate 95%.

This is why organizations that treat data quality as a standalone technical initiative — deploying quality tools without governance — consistently struggle with adoption and sustainability. Quality tools can detect problems, but only governance can prevent them.

The Platform-Native DQ Shift

A significant trend is reshaping the data quality landscape: leading data platforms are building data quality directly into their core functionality. Rather than treating data quality as an external concern handled by separate tools, platforms like Databricks, Microsoft Fabric, and Snowflake now offer native DQ capabilities as part of their data processing and storage engines.

Databricks

Databricks has introduced Lakehouse Monitoring and Delta Live Tables expectations, enabling teams to define data quality rules directly within their data pipelines. Quality checks run as part of the data processing workflow — not as a separate post-processing step. When a quality rule fails, the pipeline can quarantine bad records, alert data owners, or halt processing entirely. This approach treats data quality as a first-class concern within the lakehouse architecture.
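The expectation pattern described above — each record either passes, is quarantined, or trips a strict rule that halts the pipeline — can be sketched in plain Python. This is an illustration of the pattern, not the actual Delta Live Tables API; rule names and records are hypothetical:

```python
# Hedged sketch of the in-pipeline expectation pattern, in plain Python
# rather than the Delta Live Tables API. Failing records are quarantined;
# a failing *strict* rule halts processing entirely.

def run_with_expectations(records, expectations, strict=()):
    """Split records into passed/quarantined; raise if a strict rule fails."""
    passed, quarantined = [], []
    for record in records:
        failed = [name for name, rule in expectations.items() if not rule(record)]
        if any(name in strict for name in failed):
            raise ValueError(f"strict expectation failed: {failed} on {record}")
        (quarantined if failed else passed).append(record)
    return passed, quarantined

expectations = {  # illustrative rules
    "positive_amount": lambda r: r["amount"] > 0,
    "has_customer":    lambda r: bool(r.get("customer_id")),
}

records = [
    {"customer_id": "C1", "amount": 50},
    {"customer_id": "C2", "amount": -5},   # quarantined, pipeline continues
]
passed, quarantined = run_with_expectations(records, expectations)
print(len(passed), len(quarantined))  # 1 1
```

The key design point is that quality enforcement happens inside the processing step, so bad records never reach downstream tables unnoticed.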

Microsoft Fabric

Microsoft Fabric integrates data quality capabilities across its unified analytics platform. Data quality rules in Fabric can be defined at the source level and enforced automatically as data flows through lakehouses, warehouses, and semantic models. Fabric's integration with Microsoft Purview adds governance context — quality scores, business definitions, and lineage — directly within the tools that data teams already use daily.

Snowflake

Snowflake has expanded its platform with data quality monitoring functions and data metric functions that allow teams to define, schedule, and track quality metrics natively within the Snowflake environment. Quality monitoring operates on live data without requiring data movement to external tools, reducing latency and simplifying architecture.

What this shift means

Platform-native DQ capabilities change the economics and architecture of data quality management. Instead of extracting data samples, sending them to a separate quality tool, and then acting on the results, teams can now embed quality checks directly where data is stored and processed. This reduces integration complexity, lowers latency, and makes data quality a continuous part of the data lifecycle rather than a periodic audit.

For organizations building their data quality strategy today, the implication is clear: the platform where your data lives is increasingly where your quality checks should run too.

How Data Catalogs Approach Data Quality

As data quality becomes a native capability of data platforms, data catalogs have evolved their approach. Two distinct strategies have emerged in the market, each reflecting a different philosophy about where data quality responsibility should sit.

[Figure: Two data catalog approaches to data quality — Open DQ metadata ("integrate, don't replace"): the catalog ingests DQ scores and rules from platform-native tools or specialized DQ engines (Great Expectations, Soda, etc.), giving a unified view of quality across platforms without duplicating DQ execution; examples: Dawiso, Alation. Built-in DQ engine ("own the quality stack"): the catalog includes its own DQ profiling, rule definition, monitoring, and remediation — a single vendor for governance plus quality, but it may overlap with platform DQ; examples: Collibra, Informatica.]

Open DQ metadata: integrate, don't replace

Catalogs like Dawiso and Alation take an open DQ metadata approach. Rather than building their own data quality engine, they integrate with the quality capabilities that already exist — whether those are platform-native functions in Databricks, Fabric, or Snowflake, or specialized DQ tools like Great Expectations, Soda, or Monte Carlo.

The catalog ingests quality scores, rule results, and anomaly alerts from these sources and presents them alongside other metadata: business definitions, ownership, lineage, and usage patterns. This gives data consumers a unified view of data quality across platforms without duplicating the quality execution logic. When a data steward opens an asset in the catalog, they see not just what the data is and where it comes from, but how healthy it is — regardless of which platform or tool produced that assessment.
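The aggregation step described above can be sketched as follows. The source names, signal format, and pass-rate metric are illustrative, not any vendor's API:

```python
# Hedged sketch of the open DQ metadata approach: the catalog ingests
# quality results from multiple engines and consolidates them per asset.
# Signal format, source names, and the pass-rate metric are illustrative.

signals = [
    {"asset": "sales.orders", "source": "snowflake_dmf",      "check": "row_count",  "passed": True},
    {"asset": "sales.orders", "source": "great_expectations", "check": "valid_date", "passed": False},
    {"asset": "crm.contacts", "source": "soda",               "check": "not_null",   "passed": True},
]

def asset_health(signals):
    """Per-asset pass rate across all contributing DQ sources."""
    by_asset = {}
    for s in signals:
        by_asset.setdefault(s["asset"], []).append(s["passed"])
    return {asset: sum(results) / len(results) for asset, results in by_asset.items()}

print(asset_health(signals))  # {'sales.orders': 0.5, 'crm.contacts': 1.0}
```

The catalog's value here is not running the checks but normalizing heterogeneous results into one health view per asset, next to ownership, definitions, and lineage.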

This approach aligns with the platform-native DQ trend. As Databricks, Fabric, and Snowflake invest heavily in their own quality capabilities, it makes more sense for catalogs to integrate with those investments than to compete against them.

Built-in DQ engine: own the quality stack

Catalogs like Collibra and Informatica take the opposite approach: they include their own data quality profiling, rule definition, monitoring, and remediation capabilities as part of the platform. Organizations using these tools define and execute quality rules within the catalog itself, creating a tightly integrated governance-and-quality experience from a single vendor.

This approach offers simplicity — one platform for both governance and quality — but it can create overlap with platform-native DQ capabilities. An organization running Databricks with Delta Live Tables expectations and Collibra's DQ engine may end up with two separate systems checking quality on the same data, each with its own rules, scores, and alert mechanisms.

Where the industry is heading

The trend favors integration over replacement. As data platforms mature their native DQ functions, the value proposition of a standalone DQ engine inside a catalog diminishes. What remains uniquely valuable is the catalog's ability to aggregate and contextualize quality signals from multiple sources — showing data consumers the full picture of an asset's health alongside its business meaning, lineage, and usage.

The emerging pattern is a three-layer architecture: data platforms handle DQ execution (rules, checks, monitoring), specialized DQ tools handle advanced use cases (entity resolution, ML-based anomaly detection), and data catalogs serve as the integration and presentation layer that ties quality signals to governance context.

Common Data Quality Challenges

Even organizations with mature governance programs encounter recurring data quality challenges. Understanding these patterns helps teams prioritize their quality investments.

Quality degrades over time

Data quality is not a one-time achievement. Source systems change, business rules evolve, new data sources are onboarded, and manual entry errors accumulate. Without continuous monitoring, a dataset that was clean at launch can degrade significantly within months. Data quality must be treated as a continuous process, not a project with a completion date.

Root cause is often upstream

Quality problems are usually symptoms of upstream issues: missing validation in source applications, poorly defined integration mappings, or unclear ownership of shared data domains. Fixing data quality at the point of consumption — cleaning data in reports or dashboards — is expensive and unsustainable. Effective quality management addresses issues at their source.

Measuring quality requires business context

You cannot write a quality rule without knowing what "correct" looks like, and "correct" is a business decision, not a technical one. Is a customer address valid if the postal code is correct but the street name is misspelled? The answer depends on whether the address is used for shipping, marketing, or compliance. This is why data quality and data governance are inseparable: governance provides the business context that quality measurement requires.

Tool sprawl

Organizations often accumulate multiple overlapping quality tools: one embedded in the ETL platform, another in the data warehouse, a third in the BI layer, and a fourth purchased as a standalone DQ product. Each tool produces its own scores, uses its own rule syntax, and reports to its own dashboard. The result is fragmented quality visibility and duplicated effort. A data catalog with open DQ metadata integration can consolidate these signals into a single view.

Conclusion

Data quality is the foundation on which analytics, AI, and business decisions are built. Without it, even the most sophisticated tools produce unreliable results. As a core pillar of data governance, data quality management is most effective when it has clear ownership, agreed-upon business definitions, and systematic monitoring.

The data quality landscape is shifting. Leading data platforms are embedding quality capabilities directly into their processing engines, making DQ a native function rather than an afterthought. Data catalogs are adapting by integrating with these platform-native capabilities — aggregating quality signals from multiple sources and presenting them alongside governance context. The result is a more efficient architecture where quality checks run where data lives, and catalogs provide the unified view that data consumers need to trust what they see.

© Dawiso s.r.o. All rights reserved