How Are Data Products Connected?
Data products are not standalone assets — they form dependency graphs. An "Order Summary" product consumes data from "Customer Profile" and "Inventory" products. A "Churn Prediction" model depends on "Usage Metrics" and "Support Tickets." Each connection introduces a coupling point that must be managed through clear interfaces, contracts, and shared metadata.
Understanding how these connections work — through APIs, events, contracts, and a shared data catalog — is the difference between a coherent data ecosystem and a tangle of point-to-point integrations that no one fully understands.
Data products connect through APIs (REST, GraphQL, gRPC), event streams (Kafka, pub-sub), and shared metadata (data catalog, business glossary). Connections are governed by data contracts that define schemas, SLAs, and versioning policies. The three topology patterns — point-to-point, hub-and-spoke, and mesh — each trade off simplicity against scalability. A data catalog is the discovery layer that makes connections possible.
Three Connectivity Topologies
How data products connect to each other follows one of three patterns. Each trades off simplicity, governance, and scalability differently.
The number of point-to-point connections grows quadratically with the number of data products. An organization with 50 data products faces up to 2,450 potential connections — far beyond what any central team can manage manually.
— Zhamak Dehghani, Data Mesh
Point-to-point is the simplest topology. Two products connect directly: one calls the other's API or subscribes to its event stream. This works well when the organization has fewer than ten products and connections are stable. But point-to-point connections grow quadratically. Counting each ordered producer-consumer pair, 10 products can have up to 90 connections, 20 products up to 380, and 50 products up to 2,450. At that scale, no one has a complete picture of what depends on what, and changing any product's schema risks breaking consumers you did not know existed.
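The arithmetic is easy to verify. A minimal sketch, counting each ordered producer-consumer pair as a potential connection:

```python
def max_connections(n_products: int) -> int:
    """Upper bound on point-to-point connections: each ordered
    producer-consumer pair (A -> B and B -> A are distinct), n * (n - 1)."""
    return n_products * (n_products - 1)

for n in (10, 20, 50):
    print(f"{n} products -> up to {max_connections(n)} connections")
```

Even at 50 products the bound reaches 2,450, which is why manual tracking of dependencies stops working well before that point.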
Hub-and-spoke routes all connections through a central integration layer — a data platform, API gateway, or integration middleware. Every product publishes data to the hub; every consumer reads from it. This gives the central team visibility over all connections and simplifies monitoring and governance. The trade-off is bottleneck risk: every new connection requires the platform team to build or configure a route, and the hub becomes a single point of failure.
Mesh connectivity lets products connect directly through standardized interfaces, with a shared catalog providing discovery. Products advertise their interfaces and contracts in the catalog; consumers find and connect without routing through a central team. This is the most scalable pattern but requires mature standards for interface design, contract management, and metadata registration.
Technical Interfaces
Three interface patterns handle the vast majority of data product connections. Each trades off latency, coupling, and operational complexity differently.
REST APIs provide synchronous request-response access. A dashboard calls the Customer API to retrieve profile data; the API returns the current record. REST works well for on-demand queries, low-to-moderate volume, and use cases where the consumer needs the latest state. The coupling is moderate: the consumer depends on the API schema and endpoint availability, but communication is stateless and cacheable.
Event streams (Apache Kafka, Google Pub/Sub, Amazon Kinesis) provide asynchronous data propagation. When an order is placed, the Order product publishes an event to a topic. The Inventory product consumes that event and recalculates stock levels. The Fraud Detection product consumes the same event and runs risk scoring. Producers and consumers are decoupled — neither knows the other exists. This pattern excels for real-time updates, fan-out scenarios, and systems where multiple consumers need the same data. The trade-off is operational complexity: running a Kafka cluster requires expertise, and debugging event-driven flows is harder than tracing a synchronous API call.
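The fan-out behavior can be illustrated without a real broker. This is a toy in-memory publish-subscribe sketch, not the Kafka API; the topic name and the two consumers (inventory and fraud scoring) are illustrative:

```python
from collections import defaultdict
from typing import Callable

class Broker:
    """Toy in-memory event broker illustrating publish/subscribe fan-out."""
    def __init__(self) -> None:
        self._subscribers: dict[str, list[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[dict], None]) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, event: dict) -> None:
        # The producer does not know who consumes; every subscriber gets a copy.
        for handler in self._subscribers[topic]:
            handler(event)

broker = Broker()
stock: dict[str, int] = {"widget": 100}
risk_flags: list[tuple[str, bool]] = []

# Two independent consumers of the same "orders" topic.
broker.subscribe("orders", lambda e: stock.update({e["sku"]: stock[e["sku"]] - e["qty"]}))
broker.subscribe("orders", lambda e: risk_flags.append((e["order_id"], e["qty"] > 50)))

broker.publish("orders", {"order_id": "o-1", "sku": "widget", "qty": 3})
print(stock)       # {'widget': 97}
print(risk_flags)  # [('o-1', False)]
```

The Order product publishes once; inventory and fraud scoring each react without knowing the other exists.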
Change Data Capture (CDC) tracks changes at the database level. Instead of requiring the source product to publish events explicitly, CDC tools like Debezium capture inserts, updates, and deletes from the database transaction log and forward them as events. This keeps downstream products in sync without modifying the source application code. CDC is especially useful when connecting legacy systems that cannot be refactored to publish events natively.
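The core CDC mechanic is replaying a change log into a replica. A simplified sketch of that semantics (not Debezium's actual event format, which is richer):

```python
def apply_change(replica: dict, change: dict) -> None:
    """Apply one captured insert/update/delete to a downstream replica."""
    op, key = change["op"], change["key"]
    if op in ("insert", "update"):
        replica[key] = change["row"]
    elif op == "delete":
        replica.pop(key, None)

# Illustrative entries captured from a source transaction log.
change_log = [
    {"op": "insert", "key": 1, "row": {"name": "Ada", "tier": "gold"}},
    {"op": "update", "key": 1, "row": {"name": "Ada", "tier": "platinum"}},
    {"op": "insert", "key": 2, "row": {"name": "Grace", "tier": "silver"}},
    {"op": "delete", "key": 2},
]

replica: dict = {}
for change in change_log:
    apply_change(replica, change)
print(replica)  # {1: {'name': 'Ada', 'tier': 'platinum'}}
```

The source application never publishes anything itself; the log is the interface, which is what makes CDC viable for legacy systems.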
Data Contracts
A data contract is the agreement between a producer and a consumer. It specifies exactly what the consumer can depend on, and what the producer commits to delivering.
A contract has four components. The schema definition specifies fields, data types, and validation rules using formats like Avro, Protobuf, or JSON Schema. The quality guarantees commit to freshness (data no older than X minutes), completeness (at least Y% of rows present), and accuracy (validated against source). The SLA commitments define availability (99.9% uptime) and latency (sub-100ms response). And the versioning policy governs how changes are introduced: additive changes (new optional fields) are backward-compatible minor versions; breaking changes (removed fields, type changes) require a major version bump and a migration period.
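The four components can be captured in a single structure. A minimal sketch, with illustrative field names and thresholds (real contracts would use Avro, Protobuf, or JSON Schema rather than Python types):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DataContract:
    """Illustrative contract record: schema, quality, SLA, and version."""
    schema: dict                 # field name -> expected Python type
    max_staleness_minutes: int   # freshness guarantee
    min_completeness_pct: float  # completeness guarantee
    availability_pct: float      # SLA commitment
    version: str                 # semantic version, e.g. "2.1.0"

    def validate_record(self, record: dict) -> list:
        """Return schema violations for one record (empty list = valid)."""
        errors = []
        for field, expected in self.schema.items():
            if field not in record:
                errors.append(f"missing field: {field}")
            elif not isinstance(record[field], expected):
                errors.append(f"{field}: expected {expected.__name__}")
        return errors

orders = DataContract(
    schema={"order_id": str, "amount": float},
    max_staleness_minutes=5, min_completeness_pct=99.0,
    availability_pct=99.9, version="2.1.0",
)
print(orders.validate_record({"order_id": "o-1", "amount": 42.0}))  # []
print(orders.validate_record({"order_id": "o-2"}))  # ['missing field: amount']
```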
For example, an "Order" data product publishes a contract guaranteeing 99.9% availability, sub-100ms latency, and data no older than 5 minutes. If the team needs to rename a field, they release a new major version, maintain both versions during a transition window, and notify consumers through the catalog. Breaking the contract without following the versioning policy is equivalent to shipping a breaking API change without warning.
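The versioning policy itself can be checked mechanically. A sketch that classifies a schema change under the rules above, assuming schemas are represented as simple field-to-type mappings:

```python
def classify_change(old: dict, new: dict) -> str:
    """Classify a schema change per the versioning policy: removed fields
    or type changes are breaking (major bump); new fields are additive
    (minor bump); anything else needs no version change."""
    removed = old.keys() - new.keys()
    retyped = {f for f in old.keys() & new.keys() if old[f] != new[f]}
    added = new.keys() - old.keys()
    if removed or retyped:
        return "major"  # breaking: new major version plus migration period
    if added:
        return "minor"  # backward-compatible addition
    return "none"

v1 = {"order_id": "string", "amount": "float"}
print(classify_change(v1, {**v1, "currency": "string"}))  # minor
print(classify_change(v1, {"order_id": "string"}))        # major
```

A check like this can run in CI against the published contract, so a breaking change cannot ship under a minor version bump by accident.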
Discovery Through Metadata
Connections only work if consumers can find producers. Without a catalog, discovery defaults to asking colleagues on Slack, searching wikis, or emailing the data team — the data equivalent of hardcoded IP addresses. This does not scale past a handful of products.
A data catalog provides the discovery layer. Consumers search for data products by domain, schema, owner, or business term. Each product's catalog entry includes schema definitions, lineage (which products feed this one and which consume it), quality scores, access policies, and the current contract version. A product manager looking for customer churn data can search "churn," find the "Churn Prediction" product, review its schema and quality metrics, check who owns it, and request access — all without leaving the catalog.
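The lookup itself is simple once the metadata exists. A toy in-memory sketch of catalog search; the entries, owners, and scores are invented for illustration:

```python
# Each entry carries the metadata a consumer needs before depending on a product.
catalog = [
    {"name": "Churn Prediction", "domain": "analytics", "owner": "ml-team",
     "tags": ["churn", "retention"], "quality_score": 0.97, "contract": "1.4.0"},
    {"name": "Customer Profile", "domain": "crm", "owner": "crm-team",
     "tags": ["customer", "identity"], "quality_score": 0.99, "contract": "3.0.0"},
]

def search(term: str) -> list:
    """Match a keyword against product name, domain, owner, or tags."""
    term = term.lower()
    return [
        entry for entry in catalog
        if term in entry["name"].lower()
        or term == entry["domain"]
        or term == entry["owner"]
        or term in entry["tags"]
    ]

hits = search("churn")
print([(e["name"], e["owner"], e["quality_score"]) for e in hits])
# [('Churn Prediction', 'ml-team', 0.97)]
```

The hard part is not the search; it is keeping the entries accurate, which is why catalog registration belongs in the product's publishing workflow rather than being a manual afterthought.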
The average data practitioner spends 30% of their time searching for and understanding data before they can use it. A data catalog reduces this to under 10%.
— Alation, State of Data Culture Maturity Report
Lineage and Impact Analysis
When data products form a dependency graph, changes upstream propagate downstream. If the "Customer" product changes its schema — renames a field, changes a data type, or alters a business rule — every downstream consumer is affected. Without lineage tracking, the team making the change has no way to know who will break.
Data lineage tracks these dependencies as a directed graph. Before making any change to a data product, the team reviews lineage to identify all downstream consumers. If the "Customer" product feeds "Order Summary," "Churn Prediction," and "Revenue Report," the team knows that a schema change affects three products across two domains. They can coordinate the migration, update the contract version, and notify affected teams — all before the change goes live.
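Impact analysis on that graph is a standard traversal. A sketch using the products named above, with lineage edges pointing from producer to consumer:

```python
from collections import deque

# Directed lineage graph: producer -> list of direct consumers.
lineage = {
    "Customer": ["Order Summary", "Churn Prediction"],
    "Order Summary": ["Revenue Report"],
}

def downstream(product: str, graph: dict) -> set:
    """Breadth-first traversal: every product transitively affected by
    a change to `product` (its blast radius)."""
    affected: set = set()
    queue = deque([product])
    while queue:
        for consumer in graph.get(queue.popleft(), []):
            if consumer not in affected:
                affected.add(consumer)
                queue.append(consumer)
    return affected

print(downstream("Customer", lineage))
# {'Order Summary', 'Churn Prediction', 'Revenue Report'}
```

Note that "Revenue Report" is affected even though it never touches "Customer" directly; transitive dependencies are exactly what manual tracking misses.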
This is especially critical for regulatory reporting. When an auditor asks "where does this number on the compliance report come from?", lineage provides an auditable chain from the report back through every transformation and source product. Without lineage, the answer is "we think it comes from somewhere in the warehouse."
Quality Across the Graph
Quality degrades as data moves through the dependency graph. A data product consuming three upstream products inherits any quality issues in all three. If the "Customer" product has 98% completeness and the "Inventory" product has 97% completeness, the "Order Summary" product that joins both starts with a theoretical maximum of around 95% completeness — and that is before its own processing introduces any additional gaps.
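The compounding is multiplicative in the simple case where missing rows on either side drop out of the join independently:

```python
# Best-case completeness of a join is roughly the product of the inputs'
# completeness rates (assuming independent gaps), before local losses.
customer_completeness = 0.98
inventory_completeness = 0.97
order_summary_max = customer_completeness * inventory_completeness
print(f"{order_summary_max:.4f}")  # 0.9506 -> roughly 95%
```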
Monitoring quality at connection points is the defense. Schema validation catches structural changes before they propagate — a field type change in the source is rejected at ingestion rather than silently corrupting downstream calculations. Freshness checks verify that upstream data has actually arrived within the SLA window; a dashboard showing "data as of 2 hours ago" when the SLA is 5 minutes indicates a broken connection. Completeness thresholds flag when a source delivery is suspiciously small — receiving 1,000 rows when the norm is 100,000 likely indicates a failed extraction rather than a quiet Tuesday.
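The freshness and completeness checks above amount to two small predicates. A sketch with illustrative thresholds:

```python
from datetime import datetime, timedelta, timezone

def check_freshness(last_arrival: datetime, sla_minutes: int, now: datetime) -> bool:
    """True if the most recent upstream delivery is within the SLA window."""
    return now - last_arrival <= timedelta(minutes=sla_minutes)

def check_completeness(row_count: int, expected: int, threshold: float = 0.9) -> bool:
    """Flag suspiciously small deliveries relative to the historical norm."""
    return row_count >= expected * threshold

now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
stale = now - timedelta(hours=2)
print(check_freshness(stale, sla_minutes=5, now=now))  # False: SLA broken
print(check_completeness(1_000, expected=100_000))     # False: likely failed extraction
```

Running checks like these at ingestion, before transformation, is what keeps a broken connection from masquerading as quiet data.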
Circuit breaker patterns prevent a single bad upstream product from corrupting the entire downstream graph. If the "Customer" product starts delivering malformed data, the circuit breaker halts propagation to "Order Summary," "Churn Prediction," and "Revenue Report" rather than letting bad data flow through. The downstream products serve stale-but-correct data until the upstream issue is resolved — which is almost always better than serving fresh-but-wrong data.
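The stale-but-correct fallback can be sketched in a few lines. This is a minimal illustration of the pattern, not a production implementation (which would also track failure counts, recovery probes, and alerting):

```python
class CircuitBreaker:
    """On a failed validation, halt propagation and serve the last
    known-good payload instead of fresh-but-wrong data."""
    def __init__(self) -> None:
        self.last_good = None

    def propagate(self, payload: dict, is_valid) -> tuple:
        """Return (payload_to_serve, is_fresh)."""
        if is_valid(payload):
            self.last_good = payload
            return payload, True
        return self.last_good, False  # circuit open: stale but correct

breaker = CircuitBreaker()
valid = lambda p: isinstance(p.get("churn_rate"), float)
print(breaker.propagate({"churn_rate": 0.04}, valid))   # ({'churn_rate': 0.04}, True)
print(breaker.propagate({"churn_rate": "bad"}, valid))  # ({'churn_rate': 0.04}, False)
```

The second call receives malformed data, so downstream consumers keep seeing the last validated value until the upstream issue is fixed.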
Where Dawiso Fits
Dawiso provides the three layers that make data product connectivity work at scale.
Discovery. The data catalog lets consumers find and understand products without asking around. Search by domain, schema, business term, or owner. Review quality scores and contract versions before building a dependency.
Shared definitions. The business glossary ensures that "customer" means the same thing across all products. When the CRM domain's "Customer" product and the Finance domain's "Revenue Report" both reference "active customer," the glossary provides a single canonical definition — preventing the downstream joins that silently produce wrong numbers because two teams defined the same term differently.
Dependency tracking. Data lineage shows how products connect, which would be impacted by changes, and where quality issues originate. Before any schema change, the team checks lineage to understand the blast radius.
Through the Model Context Protocol (MCP), AI agents can query Dawiso's catalog to discover available data products, check quality scores, understand the dependency graph, and assess the impact of proposed changes — enabling automated connection orchestration and impact analysis.
Conclusion
Data product connectivity is not primarily a technical problem — it is a metadata problem. The technical interfaces (APIs, events, CDC) are well-established. The hard part is making connections discoverable, governed, and resilient through shared metadata, clear contracts, and dependency tracking that shows who depends on what.