MLOps: machine learning model lifecycle in production

Context

Why MLOps? The problem of ML technical debt

Moving a Jupyter notebook to production is one of the most underestimated challenges in machine learning. Models degrade over time (drift), dependencies evolve, data changes, and without rigorous practices, maintaining a model in production quickly becomes a technical debt sink.

MLOps (Machine Learning Operations) addresses this problem by applying DevOps principles — automation, reproducibility, monitoring, CI/CD — to the ML lifecycle. The goal: deploy models quickly, reliably, and keep them performing over time.

Scale of ML technical debt

The seminal paper 'Hidden Technical Debt in Machine Learning Systems' (Sculley et al., Google, NeurIPS 2015) showed that ML code itself represents only a small fraction of a real-world ML system. Configuration, data collection, feature extraction, data verification, serving infrastructure, and monitoring make up most of the real complexity.

Sculley et al. - Hidden Technical Debt in Machine Learning Systems, NeurIPS 2015

Foundations

The ML model lifecycle

The ML lifecycle comprises several sequential but iterative phases, each of which can be automated and versioned.

1. Data preparation and feature engineering

Collection and validation of source data, cleaning and handling of missing values, feature creation (transformation, encoding, normalization), train/validation/test split with attention to data leakage, and dataset versioning (DVC, Delta Lake, Iceberg). A centralized feature store can store and serve features consistently between training and inference.
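One common way to limit data leakage is to split chronologically and fit preprocessing statistics on the training split only. The sketch below is a minimal illustration under assumptions: the `event_date` and `amount` fields, the cut-off dates, and the helper name `time_based_split` are all hypothetical.

```python
from datetime import date


def time_based_split(rows, train_end, valid_end, date_key="event_date"):
    """Split rows chronologically: train < train_end <= valid < valid_end <= test."""
    train, valid, test = [], [], []
    for row in rows:
        d = row[date_key]
        if d < train_end:
            train.append(row)
        elif d < valid_end:
            valid.append(row)
        else:
            test.append(row)
    return train, valid, test


# Hypothetical rows; in practice these would come from a versioned dataset.
rows = [
    {"event_date": date(2024, 1, 10), "amount": 12.0},
    {"event_date": date(2024, 2, 15), "amount": 30.0},
    {"event_date": date(2024, 3, 20), "amount": 7.5},
    {"event_date": date(2024, 4, 5), "amount": 18.0},
]
train, valid, test = time_based_split(rows, date(2024, 3, 1), date(2024, 4, 1))

# Fit normalization statistics on the training split only, then reuse them
# unchanged on validation and test data, so no future information leaks in.
mean_amount = sum(r["amount"] for r in train) / len(train)
```

A random split would be acceptable for data with no temporal structure; the chronological variant is shown because it mirrors how the model will be used in production, on data newer than anything it was trained on.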

2. Experimentation and model selection

Tracked experimentation: MLflow Tracking, Weights & Biases, Comet ML or Neptune allow logging hyperparameters, metrics, artifacts, and code for each run. Hyperparameter tuning (Optuna, Ray Tune, Keras Tuner). Selection based on business metrics (not just loss) and robust cross-validation. Experiments must be reproducible — fixed environment (Docker), fixed seed, fixed library versions.
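To make the reproducibility requirement concrete, here is a minimal sketch of what a tracked run records. It does not use the API of MLflow or any other tracker; `run_experiment`, the `val_score` metric, and the fake "training" formula are hypothetical stand-ins chosen so the example runs with the standard library alone.

```python
import hashlib
import json
import random


def run_experiment(params, seed):
    """Return a run record that is identical for identical params and seed."""
    rng = random.Random(seed)  # local RNG: no hidden global state
    # Stand-in for real training: the metric depends only on params and seed.
    noise = rng.gauss(0.0, 0.01)
    metric = 1.0 / (1.0 + params["learning_rate"]) + noise
    record = {
        "params": params,
        "seed": seed,
        "metrics": {"val_score": round(metric, 6)},
    }
    # Content-derived run id: the same inputs always give the same id.
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    record["run_id"] = hashlib.sha256(payload).hexdigest()[:12]
    return record


first = run_experiment({"learning_rate": 0.1}, seed=42)
second = run_experiment({"learning_rate": 0.1}, seed=42)
print(first == second)
```

A real tracker would also store the code version, the environment, and the artifacts; the point here is only that a fixed seed and logged parameters make a run repeatable and comparable.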

3. Evaluation and validation before deployment

A model should not be deployed just because it clears an accuracy threshold. Pre-deployment evaluation includes performance tests by segment (fairness, bias), evaluation on recent data from outside the training period, robustness tests (perturbations, adversarial inputs), and verification that serving latency and throughput meet their SLAs.
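The segment-level check can be sketched as a gate that fails when any segment falls below a floor, even if the overall score looks fine. This is an illustrative sketch: the record fields (`segment`, `y_true`, `y_pred`), the thresholds, and the function names are assumptions, and accuracy stands in for whatever metric the project actually uses.

```python
def accuracy(pairs):
    """Share of (y_true, y_pred) pairs that match."""
    return sum(1 for y_true, y_pred in pairs if y_true == y_pred) / len(pairs)


def evaluation_gate(records, min_overall, min_per_segment):
    """Pass only if the overall score and every segment score clear their floors."""
    by_segment = {}
    for r in records:
        by_segment.setdefault(r["segment"], []).append((r["y_true"], r["y_pred"]))
    overall = accuracy([(r["y_true"], r["y_pred"]) for r in records])
    per_segment = {s: accuracy(p) for s, p in by_segment.items()}
    failing = sorted(s for s, acc in per_segment.items() if acc < min_per_segment)
    return {
        "passed": overall >= min_overall and not failing,
        "overall": overall,
        "per_segment": per_segment,
        "failing_segments": failing,
    }


# Hypothetical predictions: segment A is perfect, segment B is at 50%.
records = [
    {"segment": "A", "y_true": 1, "y_pred": 1},
    {"segment": "A", "y_true": 0, "y_pred": 0},
    {"segment": "A", "y_true": 1, "y_pred": 1},
    {"segment": "A", "y_true": 0, "y_pred": 0},
    {"segment": "B", "y_true": 1, "y_pred": 0},
    {"segment": "B", "y_true": 0, "y_pred": 0},
]
result = evaluation_gate(records, min_overall=0.8, min_per_segment=0.7)
```

Here the overall accuracy (5 of 6) clears the 0.8 floor, yet the gate fails because segment B does not, which is exactly the case a single global threshold would miss.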

4. Deployment and serving

Deployment strategies: blue/green (instant switch), canary (progressive rollout), shadow mode (parallel prediction without production impact), A/B testing (version comparison). Serving modes: REST API (FastAPI + Docker), batch scoring (Spark, dbt + SQL), edge deployment (ONNX, TFLite, Core ML). The model registry stores artifacts, versions, and metadata for each model promoted to staging and then to production.
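A canary rollout needs a stable way to decide which requests reach the new version. One possible approach, sketched below, hashes a user identifier into a bucket from 0 to 99 so that the same user always lands on the same side and raising the percentage only adds users to the candidate. The function name, the `stable`/`candidate` labels, and the user id format are hypothetical.

```python
import hashlib


def route_request(user_id, canary_percent):
    """Deterministically route a user to the 'candidate' or 'stable' model."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # stable bucket in 0..99
    return "candidate" if bucket < canary_percent else "stable"


# Progressive rollout: the candidate's share of traffic grows step by step.
users = [f"user-{i}" for i in range(10_000)]
for percent in (5, 20, 100):
    share = sum(route_request(u, percent) == "candidate" for u in users) / len(users)
    print(f"canary at {percent}%: candidate share is {share:.3f}")
```

In practice this routing usually lives in a load balancer or service mesh rather than in application code; the sketch only shows the bucketing logic.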

Automation

CI/CD for ML: pipelines and model registry

ML CI/CD is different from classic application CI/CD: in addition to code, you need to version and test data, features, and models.

Typical ML CI/CD pipeline

Trigger (push, schedule, or new data batch) → data validation (schema, distribution, completeness) → reproducible feature engineering → model training → automatic evaluation (metrics vs baseline and vs model in prod) → registration in model registry if OK → serving integration tests (latency, output format) → deployment to staging → smoke tests → promotion to production. Tools: GitHub Actions, GitLab CI, Kubeflow Pipelines, Prefect, Airflow, Vertex AI Pipelines.
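The first stages of the chain above can be sketched as a function that stops at the first failed gate. This is a simplified illustration, not the API of any orchestration tool: `validate_batch`, `run_pipeline`, the `x`/`y` columns, and the toy mean-predictor model are all assumptions, and the later stages (integration tests, staging, promotion) are left out.

```python
def validate_batch(rows, required_columns):
    """Return a list of problems; an empty list means the batch is accepted."""
    problems = []
    if not rows:
        problems.append("empty batch")
    for i, row in enumerate(rows):
        for col in required_columns:
            if row.get(col) is None:
                problems.append(f"row {i}: missing {col}")
    return problems


def run_pipeline(train_rows, holdout_rows, production_score, train_fn, score_fn,
                 required_columns=("x", "y")):
    """Data validation -> training -> evaluation against the model in production."""
    problems = validate_batch(train_rows, required_columns)
    if problems:
        return {"status": "rejected", "stage": "data_validation", "details": problems}
    model = train_fn(train_rows)
    score = score_fn(model, holdout_rows)
    if score <= production_score:
        return {"status": "rejected", "stage": "evaluation",
                "details": [f"score {score} does not beat {production_score}"]}
    return {"status": "registered", "stage": "registry", "model": model, "score": score}


# Toy model: predict the mean of y. Score: negative mean absolute error.
def train_mean_model(rows):
    return sum(r["y"] for r in rows) / len(rows)


def negative_mae(model, rows):
    return -sum(abs(r["y"] - model) for r in rows) / len(rows)


train_rows = [{"x": 1, "y": 1.0}, {"x": 2, "y": 2.0}, {"x": 3, "y": 3.0}]
holdout_rows = [{"x": 4, "y": 2.0}, {"x": 5, "y": 4.0}]
outcome = run_pipeline(train_rows, holdout_rows, -1.5, train_mean_model, negative_mae)
```

Failing early on data validation avoids spending training time on a batch that would produce an unusable model anyway.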

Model registry: versioning and model governance

A model registry is the centralized catalog of ML models: it stores artifacts (weights, preprocessing pipeline), metadata (metrics, dataset used, hyperparameters), stages (Staging, Production, Archived) and transitions with approval. MLflow Model Registry, Vertex AI Model Registry, SageMaker Model Registry and Weights & Biases Registry are the most widespread solutions. Without a registry, the 'model in prod' is often a .pkl file somewhere on a server, with no traceability.
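To illustrate stages and approved transitions, here is a minimal in-memory registry. It is a teaching sketch and does not reproduce the API of MLflow or any other product: the class names, the allowed-transition table, the rule that promoting a version archives the previous production version, and the `s3://example-bucket/...` paths are all assumptions.

```python
from dataclasses import dataclass, field

# Assumed lifecycle: None -> Staging -> Production -> Archived.
ALLOWED_TRANSITIONS = {
    "None": {"Staging"},
    "Staging": {"Production", "Archived"},
    "Production": {"Archived"},
    "Archived": set(),
}


@dataclass
class ModelVersion:
    version: int
    artifact_uri: str
    metrics: dict
    stage: str = "None"
    history: list = field(default_factory=list)  # (from_stage, to_stage, approver)


class ModelRegistry:
    def __init__(self):
        self._versions = {}  # model name -> list of ModelVersion

    def register(self, name, artifact_uri, metrics):
        versions = self._versions.setdefault(name, [])
        mv = ModelVersion(len(versions) + 1, artifact_uri, dict(metrics))
        versions.append(mv)
        return mv

    def transition(self, name, version, new_stage, approved_by):
        mv = self._versions[name][version - 1]
        if new_stage not in ALLOWED_TRANSITIONS[mv.stage]:
            raise ValueError(f"{mv.stage} -> {new_stage} is not allowed")
        if new_stage == "Production":
            # Keep a single production version: archive the current one.
            for other in self._versions[name]:
                if other.stage == "Production":
                    other.history.append(("Production", "Archived", approved_by))
                    other.stage = "Archived"
        mv.history.append((mv.stage, new_stage, approved_by))
        mv.stage = new_stage
        return mv

    def production_version(self, name):
        for mv in self._versions.get(name, []):
            if mv.stage == "Production":
                return mv
        return None


registry = ModelRegistry()
registry.register("churn", "s3://example-bucket/churn/1", {"auc": 0.81})
registry.transition("churn", 1, "Staging", approved_by="alice")
registry.transition("churn", 1, "Production", approved_by="bob")
registry.register("churn", "s3://example-bucket/churn/2", {"auc": 0.84})
registry.transition("churn", 2, "Staging", approved_by="alice")
registry.transition("churn", 2, "Production", approved_by="bob")
```

After these calls, version 2 is in production, version 1 is archived, and each version carries its own transition history, which is the traceability a loose .pkl file cannot provide.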

CT (Continuous Training) vs classic CI/CD

ML CI/CD adds Continuous Training (CT): in addition to testing and deploying code, models are automatically retrained on new data when a condition is met (detected drift, performance below a threshold, new weekly batch). This is the loop that keeps models up-to-date without manual intervention.

Infrastructure

Feature store: centralizing and reusing ML features

A feature store is a platform that centralizes the creation, storage, documentation, and serving of ML features. It solves a key problem: preventing each team from recalculating the same features differently, with a risk of training/serving skew (features calculated differently at training and inference time).

Feature store architecture

A feature store has two parts: the offline store (batch database for training — S3, BigQuery, Snowflake) and the online store (low-latency database for real-time inference — Redis, DynamoDB, Bigtable). The feature pipeline synchronizes both. Major tools: Feast (open-source, cloud-agnostic), Tecton (managed, enterprise), Vertex AI Feature Store (GCP), SageMaker Feature Store (AWS), Databricks Feature Engineering.
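The offline/online split can be sketched with two in-memory structures fed by the same ingestion call. This is a toy model, not the interface of Feast or any managed product: `TinyFeatureStore`, its method names, the integer timestamps, and the `orders_7d` feature are hypothetical. The point-in-time lookup on the offline side is an added assumption about how training sets are typically built without leaking future values.

```python
class TinyFeatureStore:
    def __init__(self):
        self.offline = []  # append-only history: (entity_id, timestamp, features)
        self.online = {}   # entity_id -> (timestamp, latest features)

    def ingest(self, entity_id, timestamp, features):
        """One write path feeds both stores, so they cannot diverge in logic."""
        self.offline.append((entity_id, timestamp, dict(features)))
        current = self.online.get(entity_id)
        if current is None or timestamp >= current[0]:
            self.online[entity_id] = (timestamp, dict(features))

    def read_online(self, entity_id):
        """Low-latency path for inference: latest values only."""
        entry = self.online.get(entity_id)
        return dict(entry[1]) if entry else None

    def read_as_of(self, entity_id, as_of):
        """Training path: latest values known at time `as_of`, never later ones."""
        candidates = [(ts, f) for eid, ts, f in self.offline
                      if eid == entity_id and ts <= as_of]
        if not candidates:
            return None
        return dict(max(candidates, key=lambda c: c[0])[1])


store = TinyFeatureStore()
store.ingest("user-1", 1, {"orders_7d": 2})
store.ingest("user-1", 5, {"orders_7d": 4})

serving_features = store.read_online("user-1")     # latest value for inference
training_features = store.read_as_of("user-1", 3)  # value as known at time 3
```

A production system would back the offline side with a warehouse or object store and the online side with a key-value database, as listed above; the sketch keeps only the read and write paths.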

Training/serving skew: the classic trap

Training/serving skew occurs when features used during training are calculated differently during inference (e.g., normalization with different statistics, different encoding, different time windows). A feature store guarantees that the same logic is applied in both contexts. Without a feature store, this type of bug is silent and can degrade production performance without obvious alerts.
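The normalization case is easy to reproduce. In the sketch below, the same raw value yields two different model inputs depending on whether serving reuses the training statistics or recomputes them on a recent window. The function names and the sample values are hypothetical.

```python
import math


def fit_scaler(values):
    """Compute mean and (population) standard deviation of the given values."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return {"mean": mean, "std": math.sqrt(var) or 1.0}  # avoid division by zero


def transform(value, scaler):
    return (value - scaler["mean"]) / scaler["std"]


training_values = [10.0, 20.0, 30.0]
scaler = fit_scaler(training_values)  # fitted once, stored with the model artifact

# Consistent: serving reuses the statistics fitted at training time.
served_ok = transform(40.0, scaler)

# Skewed: serving recomputes statistics on a recent window of production data.
recent_window = [30.0, 40.0, 50.0]
served_skewed = transform(40.0, fit_scaler(recent_window))

print(served_ok, served_skewed)
```

Nothing crashes in the skewed path, which is why the bug is silent: the model simply receives a value on a different scale from the one it learned on.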

Production

ML monitoring: data drift, concept drift, and alerts

ML monitoring is fundamentally different from application monitoring. An API can be 'up' (OK latency, no 5xx errors) while producing incorrect predictions due to a change in input data. ML monitoring watches prediction quality, not just infrastructure.

Types of drift and detection methods

Data drift (covariate shift): the distribution of inputs changes relative to the training distribution. Detection: statistical tests and distance measures (Kolmogorov-Smirnov, Chi-square, Population Stability Index) comparing production features with training features.

Concept drift: the relationship between inputs and output changes, so the model may have been trained on data that no longer reflects the real world. It is harder to detect because it requires production labels (ground truth).

Performance drift: model metrics (precision, recall, RMSE) and business KPIs degrade. Measuring it requires a production labeling pipeline.
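Two of the detection measures named above can be computed in a few lines. The sketch below implements the Population Stability Index with equal-width bins derived from the training sample, and the two-sample Kolmogorov-Smirnov statistic (the statistic only, without a p-value). The binning scheme, the `eps` floor for empty bins, and the synthetic data are assumptions; libraries such as SciPy or Evidently offer tested implementations.

```python
import math


def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population Stability Index of `actual` against `expected` (training)."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if all values are equal

    def proportions(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1  # clip out-of-range values
        return [max(c / len(values), eps) for c in counts]  # floor avoids log(0)

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


def ks_statistic(sample_a, sample_b):
    """Largest gap between the two empirical cumulative distribution functions."""
    a, b = sorted(sample_a), sorted(sample_b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


training = [i / 100 for i in range(100)]   # synthetic training feature
production = [v + 0.5 for v in training]   # same shape, shifted by 0.5

print(psi(training, production), ks_statistic(training, production))
```

A frequently quoted rule of thumb treats a PSI above 0.25 as a significant shift, but thresholds should be calibrated per feature rather than taken as universal.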

ML monitoring tools

Evidently AI (open-source, HTML reports and Grafana dashboards), Arize AI, WhyLabs, Fiddler (enterprise), Seldon Alibi Detect (Python library). Cloud platforms have their own solutions: Vertex AI Model Monitoring, SageMaker Model Monitor, Azure ML Data Drift. ML monitoring typically integrates with the existing observability stack (Prometheus, Grafana, Datadog).

Retraining: when and how?

There are two retraining strategies. Schedule-based retraining (every week, every month) is simple but may retrain unnecessarily or not often enough. Trigger-based retraining (when drift exceeds a threshold or performance drops below one) is more precise but requires a reliable monitoring pipeline. In both cases, retraining runs through the automated pipeline: new dataset → automatic training pipeline → model registry → deployment if metrics are OK.
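The two strategies can be combined in a single decision function, with the schedule acting as a safety net when no trigger fires. This is a sketch under assumptions: the function name, the default thresholds (30 days, a drift score of 0.25, a minimum live metric of 0.80), and the use of a single drift score such as PSI are illustrative choices, not recommendations.

```python
def should_retrain(days_since_training, drift_score, live_metric,
                   max_age_days=30, drift_threshold=0.25, min_metric=0.80):
    """Return (decision, reasons) from schedule, drift, and performance checks."""
    reasons = []
    if days_since_training >= max_age_days:
        reasons.append("schedule")
    if drift_score > drift_threshold:
        reasons.append("data_drift")
    # live_metric is None while production labels are not yet available.
    if live_metric is not None and live_metric < min_metric:
        reasons.append("performance")
    return bool(reasons), reasons


print(should_retrain(3, 0.05, 0.90))   # fresh model, no drift, good metric
print(should_retrain(3, 0.40, None))   # drift detected before labels arrive
print(should_retrain(45, 0.40, 0.70))  # every condition fires
```

Returning the reasons alongside the decision makes the retraining event auditable: the pipeline run can record why it was started.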

Method

Anchoring MLOps concepts with spaced repetition

MLOps combines infrastructure concepts (feature store, model registry, pipelines), methodology (ML CI/CD, Continuous Training), and statistics (drift, statistical tests). Flashcards help you anchor each concept on its own and then connect the concepts into an end-to-end workflow.

Essential MLOps flashcards

Key cards cover the 4 phases of the ML lifecycle, the differences between data drift, concept drift, and performance drift, the role of the feature store and training/serving skew, deployment strategies (blue/green, canary, shadow), the role of the model registry, and the associated tools (MLflow, Feast, Evidently). These topics come up frequently in ML Engineer and senior Data Scientist interviews.

Frequently asked questions about MLOps

What is MLOps?

MLOps (Machine Learning Operations) is a set of practices combining DevOps, Data Engineering, and Machine Learning to industrialize the ML model lifecycle. It covers data and model versioning, automated training and deployment pipelines, production monitoring, and retraining. The goal is to go from a notebook to a reliable, maintainable ML system.

What is the difference between MLOps and DevOps?

DevOps automates the lifecycle of application code (build, test, deploy, monitor). MLOps extends these principles to ML: you also need to version data and models (not just code), test models (not just the application), monitor prediction quality (not just infrastructure), and manage retraining as data evolves. ML CI/CD is therefore a superset of classic CI/CD.

What is a feature store?

A feature store is a platform that centralizes the creation, storage, and serving of ML features. It includes an offline store for training (batch, high capacity) and an online store for real-time inference (low latency). It ensures that features are calculated the same way at training and in production, preventing training/serving skew. Tools: Feast, Tecton, Vertex AI Feature Store, SageMaker Feature Store.

What is drift in ML?

Drift in ML refers to changes in the data, or in the relationships between variables, that degrade model performance over time. Data drift (covariate shift): the distribution of inputs changes relative to training. Concept drift: the relationship between inputs and output changes (the world evolves). Performance drift: model and business metrics degrade. Each type is detected differently and requires an adapted response.

What is a model registry?

A model registry is the centralized catalog of ML models: it stores artifacts (weights, preprocessing pipeline), metadata (metrics, dataset, hyperparameters), lifecycle stages (Staging, Production, Archived), and transitions with history. Without a registry, production models are not traceable. Tools: MLflow Model Registry, Vertex AI, SageMaker, Weights & Biases.

What are ML deployment strategies?

Blue/green: one deployment instantly replaces another, easy rollback. Canary: traffic is progressively redirected (5% → 20% → 100%) to the new version. Shadow mode: the new model makes predictions in parallel without user impact, useful for validating before going live. A/B testing: two versions serve different user segments to compare performance. The choice depends on risk and volume.

What are the most used MLOps tools?

Experimentation: MLflow, Weights & Biases, Comet ML. Pipelines: Kubeflow, Prefect, Airflow, Vertex AI Pipelines. Feature store: Feast, Tecton, Vertex AI Feature Store. Serving: BentoML, Seldon, KServe, FastAPI + Docker. Model registry: MLflow, Vertex AI, SageMaker. Monitoring: Evidently AI, Arize AI, WhyLabs. Data versioning: DVC. Most cloud providers (GCP Vertex AI, AWS SageMaker, Azure ML) also offer integrated MLOps platforms.

