Machine Learning Operations (MLOps)
MLOps applies DevOps principles to machine learning — automating the path from model development to production deployment and ongoing monitoring. The discipline exists because ML systems have failure modes that traditional software does not: data drift, training-serving skew, and model decay over time.
A software bug produces the same wrong output every time it runs. A degraded ML model produces outputs that look plausible but are quietly wrong — and get worse as the data it was trained on falls out of date. MLOps builds the engineering infrastructure to detect, prevent, and fix these failures before they reach business outcomes.
MLOps is the discipline of deploying, monitoring, and maintaining ML models in production. It covers experiment tracking, CI/CD for models, automated retraining, and drift detection. Most ML projects fail not in the lab but in production — models degrade as data changes, features drift, and business conditions shift. MLOps closes this gap with automated pipelines, continuous monitoring, and governed metadata about datasets, features, and model lineage.
Why MLOps Exists
The gap between ML experimentation and production deployment is where most AI projects die. A data scientist's Jupyter notebook is not a production system. It runs on a local machine, reads from a snapshot of data, uses manually installed packages, and produces results that no one else can reproduce.
ML systems introduce dependencies that traditional software doesn't have. A web application depends on its code and its runtime environment. An ML model depends on its code, its runtime, the data it was trained on, the features it is served, and the statistical distribution of its inputs. When any of these change — and in a live business environment, they all change continuously — the model's performance shifts.
A fraud detection model trained on 2023 transaction data sees accuracy drop 12% by Q2 2024 because spending patterns shifted. A recommendation engine that worked well during holiday season produces irrelevant suggestions in January because the input distribution changed. A credit scoring model breaks silently when an upstream system renames a column from "annual_income" to "yearly_salary." These are not software bugs. They are data-system dependencies that only MLOps practices can catch and manage.
The MLOps Lifecycle
The MLOps lifecycle is a continuous loop, not a linear pipeline. Each stage feeds back into the others, and skipping any stage creates a gap that surfaces as a production failure.
Data ingestion pulls from source systems and validates incoming data against expected schemas, types, and distributions. When a source system changes its API response format or renames a field, data validation catches the break before it reaches the model.
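A schema check at ingestion can be sketched in a few lines. This is an illustrative example, not any particular library's API; the field names and expected types are assumptions:

```python
# Minimal sketch of ingestion-time schema validation: reject records whose
# fields or types drift from the expected contract before they reach the
# feature pipeline. Field names here are illustrative.
EXPECTED_SCHEMA = {"transaction_id": str, "amount": float, "annual_income": float}

def validate_record(record: dict) -> list[str]:
    """Return a list of schema violations for one incoming record."""
    errors = []
    for field, expected_type in EXPECTED_SCHEMA.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            errors.append(f"{field}: expected {expected_type.__name__}, "
                          f"got {type(record[field]).__name__}")
    return errors

# A renamed upstream column ("annual_income" -> "yearly_salary") is caught here:
bad = {"transaction_id": "t1", "amount": 10.0, "yearly_salary": 50000.0}
print(validate_record(bad))  # -> ['missing field: annual_income']
```

Production systems typically express the same idea through a validation framework rather than hand-rolled checks, but the contract is the same: fail loudly at the boundary, not silently inside the model.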
Feature engineering transforms raw data into the inputs models consume. A "days_since_last_purchase" feature requires knowing the current date, the purchase timestamp, and whether "purchase" means completed orders only or includes pending ones. Feature definitions must be computed identically in training and serving; inconsistency between the two is the most common source of silent production errors.
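The standard defense is to have both pipelines import the same feature function rather than re-implementing the logic twice. A minimal sketch, with the function name and the "completed orders only" rule as assumptions:

```python
from datetime import date

# One canonical feature definition that both training and serving import.
# The business rule (pending orders excluded) lives here, in exactly one place.
def days_since_last_purchase(last_completed_order: date, as_of: date) -> int:
    """Days since the customer's last *completed* order (pending excluded)."""
    return (as_of - last_completed_order).days

# Training and serving call the identical code path:
print(days_since_last_purchase(date(2024, 1, 1), date(2024, 1, 31)))  # -> 30
```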
Training fits model parameters to historical data. Automated training pipelines handle hyperparameter tuning, cross-validation, and experiment tracking. Each training run records the dataset version, feature set, hyperparameters, and evaluation metrics.
Validation tests the trained model against holdout data, checks for bias across demographic segments, and compares performance against the currently deployed model. A new model only proceeds to deployment if it meets predefined quality gates.
Deployment pushes the validated model to production with canary rollout or A/B testing. Containerized serving ensures the production environment matches the training environment — eliminating "works on my machine" failures.
Monitoring tracks prediction distributions, latency, accuracy metrics, and input data drift in real time. When monitoring detects that the model's input distribution has shifted beyond a threshold, it triggers the retraining loop — closing the cycle.
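The shift-beyond-a-threshold decision can be made concrete with any distribution-distance metric. One common choice is the Population Stability Index; the bucketing and the 0.2 threshold below are conventional defaults, not universal constants:

```python
import math

# Illustrative drift check using the Population Stability Index (PSI),
# computed over two binned probability distributions of equal length.
def psi(expected: list[float], actual: list[float]) -> float:
    """PSI between the training-time and serving-time input distributions."""
    return sum((a - e) * math.log(a / e)
               for e, a in zip(expected, actual) if e > 0 and a > 0)

training_dist = [0.25, 0.25, 0.25, 0.25]  # input distribution at training time
serving_dist = [0.10, 0.20, 0.30, 0.40]   # distribution observed in production

if psi(training_dist, serving_dist) > 0.2:  # 0.2 is a commonly cited threshold
    print("drift detected - trigger retraining")
```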
Only 36% of organizations have moved any ML model to production. The remainder are stuck in experimental phases, often because they lack the engineering infrastructure to operationalize models reliably.
— Algorithmia, 2021 Enterprise Trends in Machine Learning
Key MLOps Practices
Five practices separate teams that ship ML models reliably from those stuck in notebook-to-production purgatory.
Experiment tracking and reproducibility. Every training run logs its dataset version, code commit, hyperparameters, random seed, and evaluation metrics. Six months later, when someone asks "why did model v3 perform better than v4?", the experiment log provides the answer. Without tracking, teams cannot compare runs, reproduce results, or diagnose regressions. Tools like MLflow and Weights & Biases provide this infrastructure.
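What such a run record contains can be sketched in plain Python. Tools like MLflow and W&B manage this for you, with UIs and run-comparison views; the field names below are illustrative:

```python
import json
import time

# Bare-bones sketch of an experiment-tracking record: everything needed to
# reproduce or compare a training run, appended to a local log file.
def log_run(dataset_version, code_commit, params, seed, metrics):
    record = {
        "timestamp": time.time(),
        "dataset_version": dataset_version,
        "code_commit": code_commit,
        "hyperparameters": params,
        "random_seed": seed,
        "metrics": metrics,
    }
    with open("runs.jsonl", "a") as f:
        f.write(json.dumps(record) + "\n")
    return record

run = log_run("orders_v12", "a1b2c3d", {"lr": 0.01, "depth": 6}, 42,
              {"auc": 0.91})
```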
CI/CD for ML. Traditional CI/CD tests code. ML CI/CD also tests data and models. The pipeline validates input data schemas, checks feature distributions, runs model evaluation against a holdout set, and compares performance metrics to the currently deployed model. Only runs that pass all quality gates proceed to deployment. This prevents shipping a model that scores well on training data but fails on production distributions.
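A quality gate reduces to a comparison between the candidate's metrics and the deployed baseline. A minimal sketch, with the metric name, floor, and regression tolerance all as assumptions:

```python
# Sketch of an ML quality gate: a candidate model deploys only if it clears
# an absolute floor AND does not regress against the production model.
def passes_gates(candidate: dict, deployed: dict,
                 min_auc: float = 0.85,
                 regression_tolerance: float = 0.01) -> bool:
    if candidate["auc"] < min_auc:  # absolute quality floor
        return False
    if candidate["auc"] < deployed["auc"] - regression_tolerance:
        return False  # no regressions versus the deployed model
    return True

print(passes_gates({"auc": 0.90}, {"auc": 0.89}))  # True -> proceed to deploy
print(passes_gates({"auc": 0.84}, {"auc": 0.80}))  # False -> below the floor
```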
Feature management and feature stores. Feature stores provide a shared registry of engineered features that can be reused across models. Instead of each model team reimplementing "customer_lifetime_value" differently, a feature store provides one governed computation that all models consume. This eliminates inconsistency, reduces duplicated engineering, and creates a single source of truth for feature definitions.
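The "one governed computation" idea can be sketched as a registry that every team reads from instead of re-deriving the feature. The decorator pattern and business rule below are illustrative assumptions, not a feature-store product's API:

```python
# Sketch of the feature-store idea: one canonical definition per feature,
# looked up by name, instead of per-team re-implementations.
FEATURE_REGISTRY = {}

def feature(name):
    """Register a function as the single canonical definition of a feature."""
    def wrap(fn):
        FEATURE_REGISTRY[name] = fn
        return fn
    return wrap

@feature("customer_lifetime_value")
def customer_lifetime_value(orders):
    # Canonical rule (assumed for illustration): completed orders only.
    return sum(o["revenue"] for o in orders if o["status"] == "completed")

# Every model team looks the feature up instead of re-deriving it:
clv = FEATURE_REGISTRY["customer_lifetime_value"]
print(clv([{"revenue": 120.0, "status": "completed"},
           {"revenue": 80.0, "status": "pending"}]))  # -> 120.0
```

Real feature stores such as Feast add what this sketch omits: point-in-time correctness for training data and low-latency serving of the same values online.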
Model monitoring and drift detection. Production models need two kinds of monitoring: data drift (has the input distribution changed from what the model was trained on?) and concept drift (has the relationship between inputs and outputs changed?). Consider a loan-default model trained on pre-pandemic data and serving post-pandemic applications: the input distribution shifted (different income patterns), and the concept changed (different default dynamics). Both kinds of drift require retraining — but detecting them requires monitoring infrastructure that most teams lack.
Model registry and versioning. A model registry tracks every deployed model version, its training data, its evaluation metrics, and its deployment status. When a production incident occurs, the registry shows exactly which model version is running, what data it was trained on, and whether a rollback candidate exists. Without a registry, debugging production issues starts with "which model is even running right now?" — a question that shouldn't take an hour to answer.
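The registry's core is a mapping from model version to its lineage and deployment status, so "which model is running?" is a lookup, not an investigation. A minimal in-memory sketch with assumed field names:

```python
# Sketch of a model registry: every version maps to its training data,
# metrics, and deployment status; promotion demotes the previous production
# version, preserving it as a rollback candidate.
registry = {}

def register(name, version, dataset_version, metrics):
    registry[(name, version)] = {
        "dataset_version": dataset_version,
        "metrics": metrics,
        "status": "staged",
    }

def promote(name, version):
    for (n, _), entry in registry.items():
        if n == name and entry["status"] == "production":
            entry["status"] = "archived"  # keep as rollback candidate
    registry[(name, version)]["status"] = "production"

register("fraud", "v3", "tx_2023_q4", {"auc": 0.91})
register("fraud", "v4", "tx_2024_q1", {"auc": 0.93})
promote("fraud", "v4")
```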
The Metadata Problem in MLOps
Every MLOps pipeline depends on metadata, whether the team manages it or not. The question is whether that metadata is governed or left as tribal knowledge that evaporates when someone changes teams.
What dataset version was used for training? Without dataset versioning and a data catalog entry that records the extraction date, filter criteria, and source tables, reproducing a training run is guesswork. A model retrained on "the same data" that actually includes three months of additional records produces different results — and no one can explain why.
Which features were included? A feature named "avg_order_value" means one thing to the team that built it and something slightly different to the team that consumes it. One includes shipping; the other doesn't. A business glossary with canonical feature definitions prevents this ambiguity from compounding across models that share features.
What business rules define the target variable? "Churn" means "no purchase in 90 days" to the marketing team and "subscription cancellation" to the product team. A model trained on one definition and evaluated against the other produces misleading accuracy metrics. The target variable definition belongs in the data governance layer, not in a data scientist's notebook.
Who owns the model? When a model produces a wrong prediction that costs money, who is responsible for investigating and fixing it? Model ownership, along with dataset ownership and feature ownership, needs to be tracked in metadata — not in Slack threads.
What data quality issues exist in the training set? If 15% of the "customer_age" column contains null values that were imputed with the mean, and a new data pipeline fixes the nulls with actual values, the model's behavior changes unpredictably. Data lineage and quality metadata document these known issues so downstream consumers — including ML pipelines — can account for them.
Hidden technical debt in ML systems is dominated by data dependencies, not code. Configuration, data collection, feature extraction, and monitoring infrastructure dwarf the ML code itself.
— Sculley et al., Hidden Technical Debt in Machine Learning Systems, NeurIPS 2015
MLOps Tool Landscape
MLOps tools organize by function. Most teams adopt one tool per function and integrate them into a cohesive pipeline.
Experiment tracking: MLflow (open source, broad ecosystem support) and Weights & Biases (collaborative features, visualization). Both track experiment parameters, metrics, and artifacts. MLflow is self-hosted; W&B is primarily cloud-hosted with an enterprise on-premises option.
Orchestration: Apache Airflow (general-purpose workflow scheduling, large community) and Kubeflow (Kubernetes-native, ML-specific primitives). Airflow is the default for organizations already running it for data pipelines; Kubeflow suits teams with Kubernetes expertise who want tighter ML integration.
Model serving: Seldon and BentoML (open source, Kubernetes-based serving with A/B testing and canary deployments), plus cloud-native options like AWS SageMaker, Azure ML, and Vertex AI endpoints. The choice depends on whether the organization runs its own infrastructure or delegates to a cloud provider.
Monitoring: Evidently (open source, data and model drift detection) and WhyLabs (managed platform, real-time monitoring). Both track input distributions, prediction distributions, and feature drift — the metrics that traditional application monitoring doesn't capture.
Feature stores: Feast (open source, integrates with existing data infrastructure) and Tecton (managed, real-time feature serving). Feature stores ensure training and serving features are computed identically — preventing the most common source of silent production errors.
Common MLOps Failures
MLOps failures are distinct from software failures. They are often silent, gradual, and invisible to traditional monitoring.
Training-serving skew. The training pipeline computes "average_order_value" by dividing total revenue by order count, including returns. The serving pipeline uses a different query that excludes returns. Same feature name, different values. The model makes predictions based on inputs it never saw during training. This is the most common MLOps failure and the hardest to detect because each pipeline individually produces correct results — they just don't agree with each other.
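The usual countermeasure is a parity test: compute the same feature through both code paths on the same records and assert they agree. The two definitions below reproduce the returns-handling bug described above (names and data are illustrative):

```python
# Two independently "correct" implementations of the same feature name.
def avg_order_value_training(orders):
    # Training pipeline: total revenue / order count, returns INCLUDED.
    return sum(o["revenue"] for o in orders) / len(orders)

def avg_order_value_serving(orders):
    # Serving pipeline: returns EXCLUDED. Same name, different semantics.
    kept = [o for o in orders if not o["is_return"]]
    return sum(o["revenue"] for o in kept) / len(kept)

orders = [{"revenue": 100.0, "is_return": False},
          {"revenue": -100.0, "is_return": True}]

# A parity check in CI would fail here, surfacing the skew before deployment:
print(avg_order_value_training(orders) == avg_order_value_serving(orders))  # False
```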
Silent model degradation. A model's accuracy drops from 94% to 81% over four months as customer behavior evolves. Without monitoring that compares prediction distributions to training distributions, no one notices until a quarterly business review reveals that the model's recommendations are no longer effective. Data observability tools catch this drift in real time.
Reproducibility failures. A data scientist retrains a model and gets different results than the original training run. The training data changed (new records added), the feature engineering code was updated (a colleague fixed a bug), or the Python package versions drifted. Without experiment tracking and environment pinning, reproducing a specific model version is impossible.
Data quality regressions. An upstream system migrates to a new database, and the schema changes: a column that was INTEGER becomes VARCHAR. The feature pipeline doesn't crash — it silently coerces the string to NaN and the model trains on missing values it never expected. Data validation at ingestion time, with schema checks tracked in the catalog, catches this before it propagates.
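The check that catches this is a type assertion at ingestion, before any coercion happens. A minimal sketch with an assumed column name:

```python
# Ingestion-time type check: fail loudly on the INTEGER -> VARCHAR migration
# instead of letting downstream code silently coerce strings to NaN.
def check_numeric(column_name, values):
    bad = [v for v in values if not isinstance(v, (int, float))]
    if bad:
        raise TypeError(f"{column_name}: {len(bad)} non-numeric values, "
                        f"e.g. {bad[0]!r}")

check_numeric("customer_age", [34, 51, 29])        # passes
try:
    check_numeric("customer_age", [34, "51", 29])  # schema changed upstream
except TypeError as e:
    print(e)  # customer_age: 1 non-numeric values, e.g. '51'
```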
How Dawiso Supports MLOps
Dawiso's data catalog provides the metadata layer that MLOps pipelines depend on but rarely build for themselves. Dataset documentation, column-level lineage, and business glossary definitions give ML engineers the context they need to build reliable features and debug production failures.
When a feature pipeline breaks, lineage shows which upstream source changed. Instead of manually tracing data through ETL jobs and database views, engineers query the catalog for the full data path from source system to feature store to model input. This reduces incident investigation time from hours to minutes.
Through MCP, MLOps tools can programmatically query Dawiso for dataset definitions, check data freshness before triggering a training run, and verify feature ownership. An automated pipeline can refuse to train if a required dataset hasn't been updated within its expected freshness window — a decision that requires metadata about data schedules and SLAs that the catalog provides.
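The decision logic of such a freshness gate is simple once the metadata is available. This is a hypothetical sketch: the catalog lookup is a stand-in (Dawiso's actual MCP interface is not shown), and the dataset name and SLA window are assumptions:

```python
from datetime import datetime, timedelta, timezone

# Freshness gate: refuse to train if the dataset's last update falls
# outside its SLA window. "last_updated" would come from the catalog.
def fresh_enough(last_updated: datetime, sla: timedelta) -> bool:
    return datetime.now(timezone.utc) - last_updated <= sla

# e.g. metadata fetched from the catalog before triggering a training run:
last_updated = datetime.now(timezone.utc) - timedelta(hours=30)
if not fresh_enough(last_updated, sla=timedelta(hours=24)):
    print("orders dataset stale - refusing to train")
```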
Dawiso also tracks dataset governance status — marking datasets as "governed," "draft," or "deprecated." MLOps teams can enforce policies that prevent training on deprecated datasets or datasets that haven't passed quality checks, closing the gap between data governance and model reliability.
Conclusion
MLOps is not about adding infrastructure for the sake of engineering rigor. It exists because ML models fail differently than software: silently, gradually, and in ways that traditional monitoring misses. The lifecycle — data, features, training, validation, deployment, monitoring — is a continuous loop where each stage depends on metadata from the others. Organizations that treat MLOps as a metadata problem, not just an infrastructure problem, ship models that stay reliable in production long after the data scientist who built them has moved to the next project.