What Is Data Integrity?
Data integrity is the property that data remains accurate, consistent, and unaltered across its lifecycle — from creation through storage, transformation, and consumption. It is the assurance that the data a downstream consumer reads matches the truth at the source, has not been corrupted, has not been silently changed by unauthorized actors, and continues to satisfy the structural and semantic rules that make it usable.
Data integrity is one of the three pillars of the classic CIA triad from information security (Confidentiality, Integrity, Availability), but it has an equally deep history in database theory. Edgar F. Codd's 1970 paper defining the relational model already introduced primary keys and foreign keys and discussed the consistency of stored data; the entity integrity and referential integrity rules built on those keys were formalized in his later work. The concept has expanded steadily as data infrastructure has grown more complex: today data integrity covers everything from database constraint enforcement to cryptographic verification of records to lineage-based validation of analytical outputs.
In brief, four classic types sit at the database layer: entity integrity (unique primary keys), referential integrity (valid foreign keys), domain integrity (values within allowed ranges), and user-defined integrity (custom business rules). Modern data ecosystems extend these with cryptographic, transactional, and pipeline-level integrity guarantees. Data integrity is distinct from data quality, which is a broader fitness-for-use concept. It is preserved by constraint enforcement, ACID transactions, lineage validation, and audit trails, and it is a hard requirement under regulations like BCBS 239, FDA 21 CFR Part 11, and SOX.
Data Integrity Defined
The crisp definition: data integrity is the maintenance of, and assurance of, the accuracy and consistency of data over its entire lifecycle. Two words in that definition carry weight: accuracy and consistency.
- Accuracy — The data reflects the real-world fact it is supposed to represent. A customer record with a misspelled name has impaired accuracy.
- Consistency — The data is structurally and logically coherent, both within itself (no internal contradictions) and against the rules that define valid data in its context. A customer record with a birthdate after the order date violates consistency.
Critically, data integrity is concerned with preservation, not initial correctness. Data can have integrity even if the source data was wrong — what integrity ensures is that the original value (whether correct or not) is faithfully preserved through every storage, transmission, and transformation step. Integrity tells you "this is what was recorded"; it does not tell you "what was recorded is true." That second concern is closer to data accuracy and data quality as broader disciplines.
Four Types of Data Integrity
Relational database theory identifies four classical types of integrity, each enforced by specific mechanisms.
1. Entity integrity
Every row in a table can be uniquely identified. Implemented through primary keys with NOT NULL and UNIQUE constraints. Entity integrity guarantees that "the customer with ID 12345" refers to exactly one record, not zero and not several. Without it, downstream joins produce duplicates and aggregations produce wrong totals.
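As a minimal sketch of entity integrity, the snippet below uses Python's built-in `sqlite3` module with an in-memory database. The `customers` table and its columns are hypothetical names chosen for illustration; the point is only that a primary key makes the database itself refuse a second row with the same identifier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: customer_id is the primary key, so it must be unique.
conn.execute(
    "CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
)
conn.execute("INSERT INTO customers VALUES (12345, 'Ada Lovelace')")

try:
    # A second row with the same key violates entity integrity.
    conn.execute("INSERT INTO customers VALUES (12345, 'Duplicate Ada')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# "The customer with ID 12345" still refers to exactly one record.
row_count = conn.execute(
    "SELECT COUNT(*) FROM customers WHERE customer_id = 12345"
).fetchone()[0]
print(duplicate_rejected, row_count)
```

Other engines report the violation with their own error types, but the behavior sketched here (reject the write rather than store an ambiguous row) is the general idea.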
2. Referential integrity
Relationships between tables are valid. Implemented through foreign key constraints that ensure a reference to another table actually points to a row that exists. Referential integrity guarantees that order line 789 references customer 12345 only if customer 12345 actually exists. Without it, orphan records accumulate and joined queries silently lose data.
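A similar sketch for referential integrity, again using `sqlite3` and hypothetical table names (`customers`, `order_lines`). One SQLite-specific assumption matters here: foreign key enforcement is off by default and has to be switched on per connection with `PRAGMA foreign_keys = ON`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
conn.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY)")
conn.execute(
    """
    CREATE TABLE order_lines (
        order_line_id INTEGER PRIMARY KEY,
        customer_id   INTEGER NOT NULL REFERENCES customers (customer_id)
    )
    """
)
conn.execute("INSERT INTO customers VALUES (12345)")
conn.execute("INSERT INTO order_lines VALUES (789, 12345)")  # parent exists: accepted


def is_rejected(sql):
    """Return True if the statement is refused with an integrity error."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True


# An order line pointing at a customer that does not exist would be an orphan.
orphan_insert_rejected = is_rejected("INSERT INTO order_lines VALUES (790, 99999)")
# Deleting a customer that order lines still reference would create orphans too.
parent_delete_rejected = is_rejected("DELETE FROM customers WHERE customer_id = 12345")
print(orphan_insert_rejected, parent_delete_rejected)
```

The constraint guards both directions: it blocks child rows without a parent and blocks removing a parent that still has children.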
3. Domain integrity
Column values fall within the allowed domain — type, range, format, enumeration. Implemented through column types, CHECK constraints, NOT NULL constraints, and validation rules. Domain integrity guarantees that an "age" column doesn't hold negative numbers, an "email" column doesn't hold "N/A", and a "status" column doesn't hold values outside the defined set. Without it, downstream code accumulates defensive checks for cases that should never have entered the database.
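The three examples above (age, email, status) can be sketched as CHECK constraints. The `people` table, the allowed status values, and the deliberately loose email pattern are all assumptions made for illustration; a real system would likely use a stricter email rule.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE people (
        person_id INTEGER PRIMARY KEY,
        age       INTEGER NOT NULL CHECK (age >= 0),
        email     TEXT    NOT NULL CHECK (email LIKE '%_@_%'),
        status    TEXT    NOT NULL CHECK (status IN ('active', 'suspended', 'closed'))
    )
    """
)


def accepts(age, email, status):
    """Return True if the row satisfies every column-level rule."""
    try:
        conn.execute(
            "INSERT INTO people (age, email, status) VALUES (?, ?, ?)",
            (age, email, status),
        )
        return True
    except sqlite3.IntegrityError:
        return False


valid_row = accepts(34, "ada@example.com", "active")
negative_age = accepts(-1, "ada@example.com", "active")
placeholder_email = accepts(34, "N/A", "active")
unknown_status = accepts(34, "ada@example.com", "archived")
print(valid_row, negative_age, placeholder_email, unknown_status)
```

Only the first row is stored; the other three are refused at write time, which is what spares downstream code its defensive checks.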
4. User-defined integrity
Business rules that can't be expressed through standard constraints. "An employee's salary cannot exceed their manager's salary by more than 20%." "A return cannot be processed more than 90 days after the original order." Enforced by triggers, application logic, stored procedures, or external validation layers. The most flexible category and also the easiest to lose track of as systems evolve.
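One way to enforce the 90-day return rule is a trigger, sketched below with `sqlite3`. The `orders` and `returns` tables, the trigger name, and the choice to store dates as ISO 8601 text are assumptions for the example; the same rule could equally live in application code or a validation layer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (
        order_id   INTEGER PRIMARY KEY,
        order_date TEXT NOT NULL
    );
    CREATE TABLE returns (
        return_id   INTEGER PRIMARY KEY,
        order_id    INTEGER NOT NULL,
        return_date TEXT NOT NULL
    );
    -- Reject any return dated more than 90 days after its order.
    CREATE TRIGGER returns_within_90_days
    BEFORE INSERT ON returns
    FOR EACH ROW
    WHEN julianday(NEW.return_date) - (
        SELECT julianday(order_date) FROM orders WHERE order_id = NEW.order_id
    ) > 90
    BEGIN
        SELECT RAISE(ABORT, 'return window of 90 days exceeded');
    END;
    INSERT INTO orders VALUES (1, '2024-01-15');
    """
)


def try_return(order_id, return_date):
    """Return True if the return is accepted, False if the trigger blocks it."""
    try:
        conn.execute(
            "INSERT INTO returns (order_id, return_date) VALUES (?, ?)",
            (order_id, return_date),
        )
        return True
    except sqlite3.DatabaseError:
        return False


within_window = try_return(1, "2024-02-01")   # 17 days after the order
outside_window = try_return(1, "2024-06-01")  # well past 90 days
print(within_window, outside_window)
```

Because the rule lives in a trigger rather than a declarative constraint, nothing in the table definition advertises it, which is one reason rules in this category are easy to lose track of.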
Integrity vs Quality vs Accuracy
Three closely related concepts get used loosely. The distinctions matter when designing controls.
- Data integrity — Preservation. Does the data remain unchanged and structurally valid through its lifecycle? Concerned with corruption, unauthorized modification, transformation correctness, and structural constraints.
- Data quality — Fitness for use. Does the data serve the consumer's purpose well? Broader concept; quality includes integrity but also accuracy, completeness, timeliness, uniqueness, validity, and consistency in the broader sense.
- Data accuracy — Truth correspondence. Does the data match reality? A subset concern of data quality. A customer record can have perfect integrity (faithfully preserved exactly as entered) and poor accuracy (the name was misspelled at entry).
An organization can have rigorous integrity and weak quality (the data hasn't been corrupted, but it was wrong to begin with and is now stale, incomplete, and inconsistent across systems). Or rigorous quality and weak integrity (regular cleansing keeps the data accurate, but pipeline bugs occasionally drop rows or duplicate them). Programs that address only one of the three tend to be surprised by the other two.
Threats to Data Integrity
Modern data ecosystems expose data to a wide range of integrity threats. The categories overlap, but each has distinct mitigations.
- Hardware-level corruption — Disk failures, bit rot, memory errors. Mitigated by RAID, checksums, error-correcting codes, replication, and cloud storage classes that handle this transparently.
- Software bugs in producers — Application code that writes inconsistent rows, misuses transactions, or violates business rules. Mitigated by database constraints, code review, integration testing, and contract testing between producer services.
- Pipeline transformation errors — Joins that lose rows, aggregations that double-count, schema drift that produces NULL columns. Mitigated by dbt tests, Great Expectations, row-count reconciliation, and lineage-driven impact analysis.
- Schema drift in sources — Source systems silently add, remove, or rename columns. Mitigated by schema contracts, automated drift detection, and quarantine policies for new fields.
- Replication and synchronization issues — Multi-region replication lag, CDC failures, eventual-consistency anomalies. Mitigated by ACID guarantees where available, idempotent processing, and reconciliation jobs that compare snapshots.
- Unauthorized modification — Direct database access bypassing application controls, insider tampering, ransomware encryption. Mitigated by least-privilege access, change auditing, write-once storage, cryptographic signing, and immutable backups.
- Incomplete or failed transactions — Partially applied updates that leave inconsistent state. Mitigated by atomic transactions (ACID), saga patterns for distributed transactions, and reconciliation.
- Human error in manual operations — DBAs running ad-hoc updates without WHERE clauses, analysts editing source data in Excel and re-uploading. Mitigated by removing direct production access, peer-reviewed scripts, and policy.
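Several of the mitigations above rely on reconciliation: comparing what a source holds with what arrived downstream. The following is a minimal sketch in plain Python, assuming rows are dictionaries that share a unique key column; `row_fingerprint`, `reconcile`, and the sample data are hypothetical names and values, not part of any particular tool.

```python
import hashlib


def row_fingerprint(row):
    """Stable SHA-256 digest of a row's values, independent of dict key order."""
    canonical = "|".join(f"{key}={row[key]!r}" for key in sorted(row))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def reconcile(source_rows, target_rows, key):
    """Compare two snapshots and report dropped, extra, and altered rows."""
    source = {row[key]: row_fingerprint(row) for row in source_rows}
    target = {row[key]: row_fingerprint(row) for row in target_rows}
    shared_keys = source.keys() & target.keys()
    return {
        "row_count_matches": len(source_rows) == len(target_rows),
        "missing_in_target": sorted(source.keys() - target.keys()),
        "unexpected_in_target": sorted(target.keys() - source.keys()),
        "changed": sorted(k for k in shared_keys if source[k] != target[k]),
    }


source_snapshot = [
    {"id": 1, "amount": 100},
    {"id": 2, "amount": 250},
    {"id": 3, "amount": 75},
]
# Simulated pipeline output: row 3 was dropped and row 2 was altered.
target_snapshot = [
    {"id": 1, "amount": 100},
    {"id": 2, "amount": 205},
]

report = reconcile(source_snapshot, target_snapshot, key="id")
print(report)
```

At real volumes the same comparison would normally run inside the database or warehouse rather than in application memory, but the checks are the same: counts, missing keys, unexpected keys, and content differences.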
Preserving Data Integrity at Scale
Preserving integrity in a modern multi-system data landscape requires controls at every layer.
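- Enforce structural constraints at the database layer. Primary keys, foreign keys, NOT NULL, CHECK, and UNIQUE constraints catch the most common integrity violations at the cheapest possible point. Lakehouse table formats are narrowing a historical gap here: Delta Lake enforces NOT NULL and CHECK constraints on write, and Apache Iceberg supports required (non-nullable) fields. Primary and foreign key declarations, where a platform offers them, are often informational rather than enforced, so verify what your engine actually checks.
- Use ACID transactions wherever available. ACID guarantees atomicity, consistency, isolation, and durability for each transaction. Lakehouse table formats and modern warehouses provide these guarantees, though often scoped to a single table, so confirm the scope before relying on multi-table atomicity. Design pipelines so that failures leave clean states, not partial writes.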
- Test transformations. dbt tests and equivalent frameworks validate that transformations preserve row counts, satisfy business rules, and maintain referential relationships. Tests run as part of the pipeline; failures fail the build.
- Track lineage and reconcile. Lineage tells you what should match between systems. Reconciliation jobs compare what does match and surface drifts as alerts. Mature data engineering treats reconciliation as a recurring health check, not a once-per-audit panic.
- Audit changes. Every mutation — schema changes, mass updates, security policy edits — produces an immutable audit log. Auditors need this. Incident responders need this. Compliance teams need this. Build it as a first-class output of the platform, not as a retrofit.
- Use immutable storage for evidence. Keep regulatory records, point-in-time backups, and event logs on write-once-read-many (WORM) storage so they cannot be silently rewritten, and sign or hash stored records so that tampering is detectable.
- Minimize manual interventions. Every ad-hoc database update is a potential integrity violation. Mature environments move ad-hoc fixes to peer-reviewed scripts checked into version control and executed under audit.
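Two of the controls above, atomic transactions and audited changes, can be combined in one small sketch: a data fix that writes its audit entry in the same transaction as the change, so both are committed or neither is. The `accounts` and `audit_log` tables and the `transfer` helper are hypothetical; the sketch relies on the `sqlite3` connection's context manager, which commits on success and rolls back when the block raises.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE accounts (
        account_id INTEGER PRIMARY KEY,
        balance    INTEGER NOT NULL CHECK (balance >= 0)
    );
    CREATE TABLE audit_log (
        audit_id  INTEGER PRIMARY KEY,
        actor     TEXT NOT NULL,
        action    TEXT NOT NULL,
        logged_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
    );
    INSERT INTO accounts VALUES (1, 100), (2, 50);
    """
)


def transfer(actor, source_id, target_id, amount):
    """Move funds and record the change; commit both or neither."""
    try:
        with conn:  # commits on success, rolls back if the block raises
            conn.execute(
                "INSERT INTO audit_log (actor, action) VALUES (?, ?)",
                (actor, f"transfer {amount} from {source_id} to {target_id}"),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE account_id = ?",
                (amount, target_id),
            )
            # If this debit violates the CHECK, the credit and audit row are undone.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE account_id = ?",
                (amount, source_id),
            )
        return True
    except sqlite3.IntegrityError:
        return False


first_ok = transfer("alice", 1, 2, 30)    # valid: 100 -> 70, 50 -> 80
second_ok = transfer("alice", 1, 2, 500)  # would overdraw account 1: rolled back

balances = dict(conn.execute("SELECT account_id, balance FROM accounts"))
audit_rows = conn.execute("SELECT COUNT(*) FROM audit_log").fetchone()[0]
print(first_ok, second_ok, balances, audit_rows)
```

The failed transfer leaves no trace: the credit that had already been applied is undone along with the audit row, so the audit log only ever describes changes that actually took effect. A tamper-evident log would additionally need the append-only or hashed storage described above.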
Integrity in Regulated Contexts
Several regulatory regimes treat data integrity as a hard requirement with specific operational implications.
- BCBS 239 (Basel Committee, banking risk data) — Principle 3 explicitly requires accuracy and integrity of risk data, including reconciliation to authoritative sources and validation of aggregation. Banks under BCBS 239 invest heavily in integrity controls precisely because supervisors test them.
- FDA 21 CFR Part 11 (life sciences electronic records) — Mandates audit trails, electronic signatures, and integrity controls for electronic records used in regulated drug, device, and biologics work. The acronym ALCOA+ (Attributable, Legible, Contemporaneous, Original, Accurate, plus Complete, Consistent, Enduring, Available) summarizes the regulator's expectations.
- SOX (US Sarbanes-Oxley, financial reporting) — Requires CFOs and CEOs to certify the integrity of financial reporting, which depends on integrity of the data flowing into financial statements. SOX programs include detailed controls over financial data flows.
- GDPR Article 5(1)(d) (EU personal data) — Requires personal data to be "accurate and, where necessary, kept up to date" — an integrity-adjacent obligation tied to subject rights.
- DORA (EU financial sector) — Article 9 requires financial entities to maintain integrity, availability, and confidentiality of ICT data — backed by personal management accountability.
- HIPAA (US health records) — The Security Rule's "integrity" standard requires controls to ensure ePHI is not altered or destroyed in unauthorized ways.
In each of these regimes, integrity is not aspirational. It is testable, with specific controls regulators inspect. Organizations that have built integrity infrastructure for one regime (banks for BCBS 239, life sciences for Part 11) tend to find the same infrastructure satisfies large parts of the others.
Conclusion
Data integrity is the unglamorous foundation under everything else organizations do with data. Quality programs, analytics, machine learning, AI grounding, and regulatory compliance all assume integrity — and silently fail when the assumption is wrong. The classic database concepts (entity, referential, domain, user-defined integrity) remain the starting point, extended in the modern stack by ACID guarantees in lakehouses, lineage-driven reconciliation, immutable audit trails, and cryptographic verification where the stakes warrant it. The discipline rewards organizations that build the boring foundation early. It punishes the ones that defer it until a regulator, an auditor, or a customer-facing incident forces the conversation.