
What Is Databricks? Architecture, Use Cases, and How It Fits the Modern Data Stack

Databricks is a cloud-based analytics platform built on Apache Spark that unifies data engineering, SQL analytics, and machine learning on a single runtime. At its core is the lakehouse architecture — a design pattern that combines the low-cost, flexible storage of a data lake with the reliability and performance features of a data warehouse. Data lives in open formats (Delta Lake on cloud object storage like S3, ADLS, or GCS), and workloads from batch ETL to real-time ML training run against that same data without copying it between systems.

What makes Databricks distinct is the combination of open formats, multi-cloud deployment (AWS, Azure, GCP), and a single platform for engineers writing Python pipelines, analysts running SQL, and data scientists training models. Organizations adopt it to replace the data lake + data warehouse split that forces duplicate data, duplicate ETL, and duplicate governance. For a detailed comparison with the main alternative, see Databricks vs. Snowflake.

TL;DR

Databricks is a cloud-based lakehouse platform that unifies data engineering, SQL analytics, and machine learning on a single runtime. It stores data in open formats (Delta Lake on cloud object storage), runs on AWS, Azure, and GCP, and uses Apache Spark for distributed processing. Organizations adopt it to consolidate separate data lakes and warehouses into one governed platform.

Lakehouse Architecture Explained

The traditional data architecture separates storage into two tiers. A data lake holds raw, unstructured, and semi-structured data at low cost but lacks ACID transactions, schema enforcement, and the query performance analysts expect. A data warehouse provides those guarantees but stores data in proprietary formats, charges premium prices for storage and compute, and cannot handle unstructured data like images, logs, or sensor readings.

The lakehouse pattern eliminates this split. Data lands in cloud object storage (S3, ADLS, GCS) in Parquet format. Delta Lake — an open-source storage layer — adds ACID transactions, schema enforcement, time travel (querying previous versions), and streaming support on top of those Parquet files. The result: one copy of data that supports warehouse-grade queries and lake-grade flexibility.

A concrete example: a retailer storing clickstream data, point-of-sale transactions, and inventory snapshots. In a traditional setup, clickstream goes to S3, transactions go to Redshift, and inventory feeds go to both. With a Databricks lakehouse, all three datasets land in a single Delta Lake. The data engineering team processes them with Spark, the analytics team queries them with SQL, and the data science team trains demand-forecasting models — all against the same governed data, without a single copy pipeline between systems.

Diagram: Lakehouse architecture. Cloud object storage (S3 / ADLS / GCS: Parquet files, low cost, unlimited scale) forms the storage layer; Delta Lake adds the reliability layer (ACID transactions, schema enforcement, time travel, streaming support); SQL analytics, data engineering, ML/AI, and streaming workloads all run on top.

Core Components

Delta Lake is the storage layer that makes the lakehouse possible. It wraps Parquet files in a transaction log, giving you ACID guarantees on cloud object storage. You can roll back to previous data versions (time travel), merge streaming and batch data into the same tables, and enforce schemas so malformed records don't corrupt production datasets. Delta Lake is open source, meaning your data is never locked into Databricks — other engines can read it directly.
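The transaction-log idea behind time travel can be illustrated with a toy model in pure Python. This is a deliberately simplified sketch, not Delta's actual implementation: the real log is a series of JSON commit files stored alongside the Parquet data files.

```python
# Toy model of a versioned table: each commit appends a batch to a
# transaction log, and "time travel" replays the log up to a version.
# Conceptual sketch only -- real Delta Lake stores JSON commit files
# next to Parquet data files on object storage.

class ToyDeltaTable:
    def __init__(self):
        self._log = []  # ordered list of committed row batches

    def commit(self, rows):
        """Atomically append a batch of rows as a new table version."""
        self._log.append(list(rows))
        return len(self._log) - 1  # version number of this commit

    def read(self, version_as_of=None):
        """Read the table as of a given version (default: latest)."""
        upto = len(self._log) if version_as_of is None else version_as_of + 1
        return [row for batch in self._log[:upto] for row in batch]

table = ToyDeltaTable()
v0 = table.commit([{"sku": "A", "qty": 3}])
v1 = table.commit([{"sku": "B", "qty": 5}])

latest = table.read()                     # both batches visible
as_of_v0 = table.read(version_as_of=v0)   # only the first batch
```

In Databricks itself, the equivalent versioned read is `spark.read.format("delta").option("versionAsOf", 0).load(path)`.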

Apache Spark runtime handles distributed processing. Databricks maintains an optimized fork of Spark that includes the Photon engine — a C++ vectorized execution engine that accelerates SQL and DataFrame operations. For data engineering workloads, Spark processes anything from gigabytes to petabytes across autoscaling clusters. You write Python, SQL, Scala, or R; the runtime parallelizes execution across available nodes.

Unity Catalog is Databricks' governance layer. It provides centralized access control, audit logging, data lineage tracking, and discoverability across all tables, views, models, and files in the lakehouse. Unity Catalog enforces permissions at the table, column, and row level — critical for organizations in regulated industries. It also tracks which notebooks and jobs read from or write to each table, creating an automated lineage graph.
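In practice, Unity Catalog permissions are managed with standard SQL GRANT statements against the three-level namespace. The catalog, schema, table, and group names below are illustrative:

```sql
-- Give a BI group read access to one reporting schema
-- (catalog, schema, and group names are illustrative)
GRANT USE CATALOG ON CATALOG sales TO `bi_analysts`;
GRANT USE SCHEMA ON SCHEMA sales.reporting TO `bi_analysts`;
GRANT SELECT ON TABLE sales.reporting.orders TO `bi_analysts`;
```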

Databricks SQL gives analysts a familiar SQL interface to query Delta Lake tables. It supports serverless compute (no cluster management), integrates with BI tools like Power BI and Tableau through standard JDBC/ODBC connectors, and includes a built-in visualization layer for ad-hoc exploration. For teams connecting BI tools, see connecting Power BI to Databricks.

MLflow is an open-source platform for managing the ML lifecycle: experiment tracking, model versioning, model registry, and deployment. Databricks integrates MLflow natively, so data scientists can log experiments, compare model performance, promote models to production, and serve predictions — all within the same workspace where the training data lives.

Collaborative notebooks provide an interactive workspace where engineers, analysts, and data scientists work in Python, SQL, Scala, or R within the same notebook. Real-time collaboration, version control integration, and inline visualizations replace the pattern of emailing Jupyter notebooks between teams.

Where Databricks Fits in the Modern Data Stack

Databricks occupies the central "transform and serve" layer of the modern data stack. It sits between ingestion tools and consumption tools, acting as the compute and storage engine where raw data becomes analytics-ready datasets and ML models.

Upstream, ingestion tools like Fivetran and Airbyte pull data from source systems (databases, SaaS APIs, event streams) into the lakehouse. Within Databricks, teams use Spark, SQL, and dbt to transform raw data into clean, modeled tables. Downstream, BI tools (Power BI, Tableau, Looker) connect to Databricks SQL for dashboards and reports. ML models trained in Databricks serve predictions to applications via REST APIs.

Governance spans the full width. Unity Catalog handles access control and lineage within Databricks. For organizations running multiple platforms — Databricks plus Snowflake, plus SaaS tools, plus legacy databases — an external data catalog like Dawiso provides cross-platform governance.

Diagram: Databricks in the modern data stack. Sources (databases, SaaS APIs, event streams) flow through ingestion tools (Fivetran, Airbyte) into Databricks (Delta Lake + Spark: transform, serve, and train with dbt, SQL, and Python), then out to BI tools (Power BI, Tableau, Looker) and AI/ML serving, with Dawiso providing cross-platform governance, a catalog, and a business glossary across the full stack.

Common Use Cases

Data engineering pipelines. A fintech company ingests transaction data from payment processors, enriches it with customer profiles from a CRM, and writes clean, deduplicated tables to Delta Lake. Spark handles the heavy lifting — deduplication, schema validation, type casting — while Delta Live Tables (DLT) manages pipeline orchestration and data quality expectations. The resulting tables feed both compliance reports and real-time fraud scoring.

ML model training and serving. A logistics company trains route-optimization models on historical delivery data. Feature engineering, model training, hyperparameter tuning, and experiment tracking all happen within Databricks. MLflow logs every run. When a model beats the current production version, it gets promoted through the model registry and deployed to a serving endpoint — no handoff to a separate ML platform required.

SQL analytics and reporting. A SaaS company uses Databricks SQL to power operational dashboards. Product managers query user engagement metrics through a serverless SQL warehouse connected to Tableau. The same Delta Lake tables that feed the dashboards also feed the churn-prediction model — one copy of data, two workloads, no sync pipeline.

Real-time streaming. A media company processes clickstream events from its platform through Spark Structured Streaming. Events land in Delta Lake within seconds, feeding a real-time recommendation engine and a near-real-time content-performance dashboard. The same architecture handles both the streaming ingestion and the batch aggregations that run overnight for weekly reports.

The data lakehouse market is projected to grow from $9.3 billion in 2024 to $48.2 billion by 2031, driven by organizations consolidating separate lake and warehouse infrastructure into unified platforms.

— Fortune Business Insights, Data Lakehouse Market Report

Pricing and Cost Model

Databricks pricing confuses teams because there are two separate bills. Databricks charges for compute in Databricks Units (DBUs) — a normalized measure of processing power. Your cloud provider separately charges for the VMs running those workloads, the storage holding your data, and any network egress.

DBU rates vary by workload type. Jobs Compute (batch processing) is the cheapest; All-Purpose Compute (interactive notebooks) costs 2-3x more per DBU. Three edition tiers — Standard, Premium, Enterprise — add a multiplier on top. Premium (~1.5x Standard) adds role-based access control, audit logging, and Unity Catalog support. Enterprise (~2x Standard) adds enhanced security and compliance features.
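The two-part bill can be sketched with back-of-the-envelope arithmetic. The rates below are illustrative placeholders, not published prices; actual DBU rates vary by workload type, edition, and cloud, so check the Databricks and cloud provider pricing pages for real figures.

```python
def estimate_monthly_cost(dbu_per_hour, hours_per_day, days,
                          dbu_rate, tier_multiplier, vm_rate_per_hour):
    """Rough monthly estimate: Databricks DBU charges + cloud VM charges.

    All rates are illustrative -- the cloud provider also bills storage
    and network egress separately, which this sketch ignores.
    """
    hours = hours_per_day * days
    dbu_cost = dbu_per_hour * hours * dbu_rate * tier_multiplier
    infra_cost = hours * vm_rate_per_hour
    return dbu_cost + infra_cost

# Example: a jobs cluster consuming ~8 DBU/hour, 3 hours nightly,
# a hypothetical $0.15/DBU on Premium (1.5x), plus ~$2.00/hour in VMs.
monthly = estimate_monthly_cost(
    dbu_per_hour=8, hours_per_day=3, days=30,
    dbu_rate=0.15, tier_multiplier=1.5, vm_rate_per_hour=2.00)
# -> 342.0 (162 in DBU charges + 180 in VM charges, illustrative)
```

The point of the exercise: the Databricks line and the cloud-infrastructure line scale independently, which is why both must be monitored.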

For a small team running daily ETL and ad-hoc analysis, total monthly costs typically land between $1,500 and $3,000. Mid-size production deployments with SQL analytics and ML training run $15,000 to $25,000/month. The biggest cost levers are workload type selection, cluster auto-termination settings, and commit-plan discounts. For a detailed breakdown, see the Databricks pricing guide. For cloud-specific cost differences, see Azure vs. AWS vs. GCP comparison.

Governance and Security

Unity Catalog is Databricks' answer to governance. It provides a centralized metastore that tracks every table, view, ML model, and file across all workspaces. Access control operates at three levels: catalog (top-level namespace), schema (database), and object (table/view/model). Column-level and row-level security allow fine-grained restrictions — an analyst in EMEA sees only EMEA customer data, even though the table contains global records.

Data lineage in Unity Catalog is automatic. When a notebook reads from table A, transforms the data, and writes to table B, Unity Catalog records that dependency. This lineage graph surfaces in the UI and through APIs, making impact analysis straightforward: before changing a column in a source table, you can see every downstream table and dashboard that depends on it.

Security features include encryption at rest and in transit, VPC/VNet deployment for network isolation, SSO integration with Azure AD, Okta, and other identity providers, and compliance certifications including SOC 2, HIPAA, HITRUST, and GDPR. Audit logs capture every data access, query execution, and permission change.

Where Unity Catalog falls short is scope: it governs only what lives inside Databricks. Organizations running Databricks alongside Snowflake, SaaS tools, and legacy databases need a cross-platform governance layer — which is where external catalogs like Dawiso fill the gap.

Limitations and Trade-offs

Vendor dependency despite open formats. Delta Lake is open source and Parquet files are portable, but the Databricks runtime optimizations (Photon, adaptive query execution, Delta Live Tables) only work within Databricks. Moving away means losing those performance advantages and rewriting pipeline orchestration. Open format is not the same as zero switching cost.

Cost complexity. The two-bill model (DBU + cloud infrastructure) makes cost estimation difficult. A team that spins up All-Purpose clusters for interactive exploration without auto-termination policies can generate unexpectedly large bills. Unlike Snowflake's single-bill model, Databricks requires active cost management across two vendors.

Spark learning curve. Teams with deep SQL skills but no Spark experience face a ramp-up period. While Databricks SQL provides a familiar SQL interface, the full power of the platform — pipeline orchestration, streaming, ML — requires understanding distributed computing concepts. This is not an afternoon of training.

Cold-start latency on serverless. Serverless SQL warehouses eliminate cluster management but introduce cold-start delays when a warehouse has been idle. For dashboards with infrequent usage, the first query after idle may take 10-30 seconds to spin up — a noticeable wait for business users accustomed to instant results.

Organizations adopting Databricks' lakehouse platform achieved a 360% return on investment over three years, with data engineering teams reducing pipeline development time by 40%.

— Forrester, The Total Economic Impact of Databricks

How Dawiso Complements Databricks

Unity Catalog governs what lives inside Databricks. Dawiso governs the full stack — Databricks alongside Snowflake, SaaS applications, BI tools, and legacy databases. The two are complementary, not competitive.

Diagram: Unity Catalog vs. Dawiso governance scope. Unity Catalog covers Databricks only (Delta tables, ML models, notebooks, volumes/files); Dawiso covers the cross-platform scope (Databricks, Snowflake, SaaS and BI tools, plus a business glossary).

Dawiso's data catalog indexes metadata from Databricks Unity Catalog alongside metadata from every other system in the organization. Business users search for datasets in one place, regardless of whether the data lives in a Delta table, a Snowflake view, or a SaaS API. Lineage in Dawiso spans the full pipeline — from source system through Databricks transformation to the Power BI dashboard that presents the result.

The business glossary in Dawiso ensures that "revenue," "active customer," and "churn rate" mean the same thing whether the metric is computed in Databricks SQL, queried in Snowflake, or displayed in a Tableau dashboard. Without a shared glossary, different teams build conflicting definitions — a problem that Unity Catalog does not solve because it only sees Databricks assets.

Through the Model Context Protocol (MCP), AI agents can access Dawiso's catalog programmatically — looking up column definitions, checking data freshness, retrieving lineage, and verifying metric ownership through a standardized protocol. This is how data governance scales from manual processes to AI-assisted operations across the full data stack.

Conclusion

Databricks solves a real architectural problem: the data lake + warehouse split that creates duplicate data, duplicate pipelines, and governance gaps. The lakehouse pattern — Delta Lake on cloud storage with Spark for compute — eliminates that duplication. For organizations whose primary workloads combine data engineering, SQL analytics, and ML, it is the most integrated option available. The trade-offs are real — cost complexity, Spark learning curve, and vendor-specific optimizations — but for teams willing to invest in the platform, the consolidation payoff is substantial. The governance gap beyond Databricks' own boundary is where a cross-platform catalog like Dawiso becomes essential.
