
Databricks + dbt: How the Modern Transformation Stack Works

dbt turns SQL SELECT statements into tested, versioned, documented transformation pipelines. Databricks provides the compute engine and Delta Lake storage. Together, they give analytics engineers a workflow where transformations are code-reviewed like software, tested on every run, and automatically documented.

This combination has become the default architecture for teams that want to apply software engineering practices to data without leaving SQL. You write a SELECT, dbt compiles it into DDL, and Databricks executes it at scale. The result is a Delta table registered in Unity Catalog, ready for downstream BI tools, ML models, or reverse ETL.

TL;DR

dbt handles transformation logic (SQL models, tests, documentation) while Databricks provides the compute engine and Delta Lake storage. Analytics engineers write SELECT statements, dbt compiles them into DDL, and Databricks executes them at scale. The stack supports incremental processing, automated testing, CI/CD pipelines, and auto-generated documentation. Use dbt Cloud for managed execution or dbt Core (open source) with Databricks SQL warehouses.

Why Databricks + dbt Became the Default Stack

Before dbt, most data teams transformed data using custom Python scripts orchestrated by Airflow. The scripts were hard to test, rarely documented, and difficult for new team members to read. When something broke at 3 AM, debugging meant tracing through hundreds of lines of PySpark or pandas code.

dbt changed the workflow by making SQL the unit of transformation. Each model is a SELECT statement in its own file. Dependencies between models are declared with ref() calls, and dbt resolves them into a directed acyclic graph (DAG) automatically. Tests are defined in YAML alongside the models. Documentation is generated from the same YAML and the SQL comments.
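As a sketch of how this looks in practice, here is a mart model that declares its upstream dependencies through ref() calls (model and column names are illustrative, not from any real project):

```sql
-- models/marts/fct_orders.sql (illustrative model and column names)
SELECT
    o.order_id,
    o.customer_id,
    o.ordered_at,
    p.amount AS revenue
FROM {{ ref('stg_orders') }} AS o          -- dbt resolves this to the stg_orders relation
LEFT JOIN {{ ref('stg_payments') }} AS p   -- and records the edge in the DAG
    ON o.order_id = p.order_id
```

dbt parses the ref() calls to build the DAG, so stg_orders and stg_payments always run before this model — no manual ordering required.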

Databricks changed the compute layer by providing SQL warehouses — serverless, auto-scaling endpoints optimized for SQL workloads. Instead of spinning up Spark clusters and managing JVM configs, analytics engineers point dbt at a SQL warehouse and let Databricks handle the rest. The results land as Delta Lake tables with ACID transactions, time travel, and schema enforcement built in.

The combination removed the need for separate orchestration and transformation tools. A data team that previously maintained 40 Airflow DAGs with custom Python operators can replace them with 200 dbt models running on a single SQL warehouse — with better testing, better documentation, and lower operational overhead.

Architecture: How the Pieces Fit Together

The dbt + Databricks stack has four stages. Raw data arrives from source systems through ingestion tools like Fivetran or Airbyte and lands as staging tables in Delta Lake. dbt reads those staging tables, applies transformations organized into staging, intermediate, and mart layers, and writes the results back as new Delta tables or views. A SQL warehouse executes every statement. Unity Catalog governs access and metadata throughout.

[Diagram: dbt + Databricks transformation pipeline — raw sources (databases, APIs, files/S3) flow through ingestion (Fivetran/Airbyte) into Delta Lake staging tables, then through dbt transform layers (stg_ staging, int_ intermediate, fct_/dim_ marts) to serving (BI tools, ML models, reverse ETL), with Unity Catalog and a Dawiso governance layer (catalog, business glossary, cross-platform lineage) spanning the whole flow.]

The model layering pattern keeps transformations organized. Staging models (stg_) rename columns, cast types, and clean raw data — one model per source table. Intermediate models (int_) join staging models and apply business logic. Mart models (fct_, dim_) produce the final, analysis-ready tables organized by business domain.
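A staging model in this pattern stays deliberately thin. A minimal sketch, with source and column names invented for illustration:

```sql
-- models/staging/stg_orders.sql — one staging model per source table (names illustrative)
SELECT
    id            AS order_id,                      -- rename to the project's convention
    user_id       AS customer_id,
    CAST(created_at AS TIMESTAMP) AS ordered_at,    -- cast types once, at the edge
    LOWER(status) AS order_status                   -- light cleaning only
FROM {{ source('shop', 'orders') }}
```

Business logic deliberately lives downstream in intermediate and mart models, so every consumer starts from the same cleaned columns.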

dbt manages the dependency order automatically. When you run dbt build, it compiles every model's SELECT into DDL, resolves the DAG, and executes statements in the correct sequence. If a staging model changes, dbt knows which intermediate and mart models depend on it and rebuilds them too.

Materializations on Databricks

dbt supports several materializations; four matter most on Databricks, and the right choice depends on table size, query patterns, and freshness requirements.

View is the lightest option. dbt creates a SQL view — no data is stored, no rebuild is needed, and the result is always fresh. Views work well for staging models and simple transformations. The trade-off is that every query re-executes the underlying SQL.

Table materializes the full result as a Delta table on each run. Query performance is fast because the data is pre-computed, but rebuild time scales with data volume. Tables suit mart-layer models under roughly 10 million rows.

Incremental processes only new or changed rows. dbt adds a WHERE clause filtering on a timestamp or ID column and merges the results into the existing table. This is essential for large fact tables — a 500-million-row events table that takes 45 minutes as a full table rebuild might take 2 minutes incrementally.
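A minimal incremental model sketch, assuming a reliable event timestamp (table and column names are illustrative):

```sql
-- models/marts/fct_events.sql — incremental sketch
{{
    config(
        materialized='incremental',
        unique_key='event_id'
    )
}}

SELECT event_id, user_id, event_type, event_at
FROM {{ ref('stg_events') }}

{% if is_incremental() %}
  -- On incremental runs, only pull rows newer than what the target already holds
  WHERE event_at > (SELECT MAX(event_at) FROM {{ this }})
{% endif %}
```

On the first run the is_incremental() block is skipped and the full table is built; on later runs dbt merges only the filtered rows into the existing Delta table.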

Snapshot captures Type 2 slowly changing dimensions. dbt compares the current state of a source table against its previous snapshot and creates historical records with dbt_valid_from and dbt_valid_to timestamps. Use snapshots when you need to track how dimension attributes change over time — for example, a customer's pricing tier or an employee's department.
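A snapshot definition sketch using the timestamp strategy (source and column names are illustrative; check your dbt version's snapshot configuration syntax):

```sql
-- snapshots/customers_snapshot.sql — Type 2 SCD sketch
{% snapshot customers_snapshot %}

{{
    config(
        unique_key='customer_id',
        strategy='timestamp',
        updated_at='updated_at'
    )
}}

SELECT customer_id, pricing_tier, updated_at
FROM {{ source('crm', 'customers') }}

{% endsnapshot %}
```

Each run, dbt closes out changed rows by setting dbt_valid_to and inserts a new row with a fresh dbt_valid_from.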

[Diagram: materialization decision tree — small tables (<10M rows): view for simple transformations (no storage cost), otherwise table (full rebuild). Large tables (>10M rows): incremental if a reliable timestamp exists (process new rows only), otherwise table partitioned by date. Need historical tracking? Use snapshot (Type 2 SCD).]

A practical rule: switch to incremental when your table exceeds 10 million rows or when a full rebuild takes more than 5 minutes. Below that threshold, the added complexity of incremental logic (merge keys, timestamp filters, late-arriving data handling) usually isn't worth it.

Testing and Data Quality

dbt's testing framework runs assertions against your data after every build. Tests are defined in YAML schema files alongside the models they validate.

Built-in schema tests cover the most common checks: unique (no duplicates on a key column), not_null (required fields are populated), accepted_values (status fields contain only known values), and relationships (foreign keys reference existing records in the parent table). You declare them in a few lines of YAML, and dbt generates the SQL to verify them.
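Declared in YAML, the built-in tests look like this (model, column, and value names are illustrative):

```yaml
# models/marts/schema.yml — built-in tests declared next to the model
version: 2

models:
  - name: fct_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: order_status
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'returned']
      - name: customer_id
        tests:
          - relationships:
              to: ref('dim_customers')
              field: customer_id
```

dbt compiles each declaration into a SQL query that returns violating rows; a non-empty result fails the test.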

Custom data tests handle business-specific rules. You write a SQL query that returns rows violating the rule. If the query returns any rows, the test fails. For example, a test catching orders with negative revenue:

```sql
SELECT order_id, revenue
FROM fct_orders
WHERE revenue < 0
```

This test caught a real bug at one team running the stack: 2% of orders showed negative revenue because refund records were being double-counted during a migration from the old ERP. The test flagged it on the first production run, before any dashboard showed the corrupted numbers.

Tests run as part of dbt build. A failure stops the pipeline and prevents downstream models from materializing. This is the key difference from traditional ETL, where tests — if they exist — are usually a separate, disconnected step that nobody checks until something visibly breaks.

Over 40,000 companies use dbt in production, with the median project running 150+ tests per build. Teams that adopt dbt's testing framework report catching data quality issues 5x faster than with manual validation processes.

— dbt Labs, The Next Layer of the Modern Data Stack

Performance Optimization

dbt handles the transformation logic, but performance depends on how the underlying Delta tables are structured and how the Databricks SQL warehouse is configured.

OPTIMIZE and Z-ORDER are the highest-impact optimizations. OPTIMIZE compacts small files into larger ones, reducing the number of file reads per query. Z-ORDER co-locates related data within files based on column values — if dashboards always filter by region and date, Z-ordering on those columns can reduce scan times by 50-80%. Run OPTIMIZE weekly on tables over 100 GB; use Z-ORDER on your most-filtered columns.

Liquid clustering is the newer alternative to Z-ordering and partitioning. It handles data layout automatically and re-clusters incrementally as new data arrives. Use it for new tables; for existing tables with established partition schemes, migration requires a full rewrite.
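Both layout techniques are plain SQL statements run against the warehouse (table and column names here are illustrative):

```sql
-- Compact small files and co-locate data on the most-filtered columns
OPTIMIZE fct_events ZORDER BY (region, event_date);

-- Liquid clustering on a new table: Databricks manages the layout
-- automatically and re-clusters incrementally as data arrives
CREATE TABLE fct_events_v2 (
    event_id   BIGINT,
    region     STRING,
    event_date DATE
) CLUSTER BY (region, event_date);
```

OPTIMIZE/Z-ORDER is typically scheduled as a maintenance job; CLUSTER BY is a one-time declaration at table creation.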

SQL warehouse sizing matters more than most teams realize. A warehouse that's too small causes queries to queue and spill to disk. A warehouse that's too large wastes DBU spend. Start with a Medium warehouse, monitor query times and queue depth in the SQL warehouse history, and scale from there. Enable Photon — it accelerates SQL execution through a C++ engine that runs alongside Spark and typically cuts query times by 30-50%.

Incremental model strategies also affect performance. The merge strategy handles upserts cleanly but requires a full table scan on the target side. The append strategy is faster but only works for immutable event data. For very large tables, combine incremental models with partition pruning: filter the source data to only the partitions that could contain new records.
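The strategy is selected in the model config. A sketch of the two options described above (one config per model file — these are shown together only for contrast):

```sql
-- Upserts: late-arriving or updated rows, matched on a key
{{ config(materialized='incremental', incremental_strategy='merge', unique_key='event_id') }}

-- Immutable event streams: insert-only, no target-side matching needed
{{ config(materialized='incremental', incremental_strategy='append') }}
```

With merge, pairing the model with a source-side filter on the partition column (e.g., only the last few days of event_date) keeps both sides of the merge small.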

CI/CD and Deployment

dbt projects use the same Git workflow as application code: feature branches, pull requests, code review, and automated testing before merge.

[Diagram: dbt CI/CD pipeline — developer writes SQL model → git push to feature branch → CI runs dbt build --select state:modified+ → on pass, PR review and merge, then production dbt build on the SQL warehouse; on fail, fix and re-push.]

The CI step runs dbt build --select state:modified+, which builds only the models that changed and their downstream dependents. This keeps CI fast — a project with 300 models might only rebuild 5-10 during a typical PR. Tests run as part of the build. If any test fails, the PR is blocked.
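A CI sketch of this step, assuming GitHub Actions, a secret you define, and a downloaded production manifest at ./prod-artifacts (all workflow details are illustrative, not a prescribed setup):

```yaml
# .github/workflows/dbt-ci.yml — state-based CI sketch
name: dbt-ci
on: pull_request

jobs:
  build-changed-models:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install dbt-databricks
      # Compare against the production manifest to find modified models,
      # then build only those models and their downstream dependents
      - run: dbt build --select state:modified+ --defer --state ./prod-artifacts
        env:
          DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}
```

The --defer flag lets unchanged upstream models resolve to their production tables, so CI doesn't rebuild the whole DAG.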

Environment strategy separates concerns cleanly. Each developer gets their own schema (e.g., dev_jane) so experiments don't affect other team members. A staging environment runs the full build against production-scale data as a final QA step. The production environment serves the tables that BI tools and other consumers read from.
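In dbt Core, these environments map to targets in profiles.yml. A sketch for the dbt-databricks adapter (host, http_path, catalog, and schema values are placeholders):

```yaml
# profiles.yml sketch for dbt-databricks
my_project:
  target: dev
  outputs:
    dev:
      type: databricks
      catalog: dev_catalog
      schema: dev_jane            # per-developer schema
      host: adb-1234567890123456.7.azuredatabricks.net
      http_path: /sql/1.0/warehouses/abc123
      token: "{{ env_var('DATABRICKS_TOKEN') }}"
    prod:
      type: databricks
      catalog: prod
      schema: analytics
      host: adb-1234567890123456.7.azuredatabricks.net
      http_path: /sql/1.0/warehouses/abc123
      token: "{{ env_var('DATABRICKS_TOKEN') }}"
```

Switching environments is then a flag: dbt build --target prod.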

For execution, teams choose between dbt Cloud (managed scheduling, built-in IDE, job history) and dbt Core with an external orchestrator like Airflow or Dagster. dbt Cloud is simpler to operate. dbt Core gives more control over the execution environment and works better for teams already running Airflow for non-dbt workloads.

Documentation, Lineage, and Dawiso Integration

dbt generates documentation and lineage graphs from model references. Running dbt docs generate produces a searchable website showing every model, its columns, descriptions, tests, and the DAG of dependencies. The persist_docs config pushes these descriptions into Databricks table and column comments, making them visible in Unity Catalog's UI.
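Enabling persist_docs is a small project-level config (project name is a placeholder):

```yaml
# dbt_project.yml — push model and column descriptions into Unity Catalog
models:
  my_project:
    +persist_docs:
      relation: true    # table-level comments
      columns: true     # column-level comments
```

After the next build, the descriptions from your schema YAML appear as table and column comments in the Unity Catalog UI.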

But dbt documentation has a scope limitation: it only knows about what's inside the dbt project. It doesn't see the upstream ingestion pipelines, the downstream BI reports, the Excel files someone uploads manually, or the other data platforms in the organization. For a team running dbt alongside Snowflake, a legacy Oracle warehouse, and 50 Power BI dashboards, dbt docs cover maybe 30% of the data estate.

Databricks' Unity Catalog and dbt docs share a common lineage model, but neither extends beyond its own ecosystem. Organizations with multi-platform environments need a cross-platform catalog to close the visibility gap.

— Databricks, Simplified Analytics Engineering with Databricks and dbt Labs

This is where Dawiso fits. The data catalog indexes dbt models alongside non-dbt sources, BI reports, and external datasets. The business glossary maps dbt model columns to business terms — so "fct_orders.revenue" is linked to the canonical definition of "revenue" used across the organization. Data lineage traces the full path: source database to ingestion tool to dbt staging model to mart table to Power BI dashboard.

Through the Model Context Protocol (MCP), AI agents can query Dawiso for column definitions, data freshness, lineage, and metric ownership — all through a standardized protocol. An AI copilot helping an analyst write a dbt model can look up whether a column exists, how it's defined, and which downstream reports depend on it, without the analyst leaving their IDE.

Conclusion

The dbt + Databricks stack works because it meets analytics engineers where they are: in SQL. dbt provides the software engineering practices — version control, testing, documentation, CI/CD — and Databricks provides the compute and storage. The result is transformation pipelines that are faster to build, easier to debug, and possible to hand off to new team members without tribal knowledge. The remaining gap — cross-platform visibility — closes when a catalog like Dawiso connects dbt's project-scoped documentation to the broader data estate.
