Deploy ML models to production with pipelines, monitoring, serving, and reproducibility best practices.

Install

openclaw skills install mlops

Quick Reference

Topic               File                 Key Trap
CI/CD and DAGs      pipelines.md         Coupling training/inference deps
Model serving       serving.md           Cold start with large models
Drift and alerts    monitoring.md        Only technical metrics
Versioning          reproducibility.md   Not versioning preprocessing
GPU infrastructure  gpu.md               GPU request = full device

Critical Traps

Training-Serving Skew:

  • Preprocessing in notebook ≠ preprocessing in service → silent bugs
  • Pandas in notebook → memory growth in a long-running production service (convert to native types at the service boundary)
  • Feature store values at training time ≠ values at serving time unless joins are point-in-time correct
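
To illustrate the feature store point, here is a minimal sketch of a point-in-time lookup using only the standard library. The function name, the feature history, and the timestamps are hypothetical; a real feature store would provide its own join mechanism.

```python
from bisect import bisect_right


def point_in_time_value(history, as_of):
    """Return the latest feature value recorded at or before `as_of`.

    `history` is a list of (timestamp, value) pairs sorted by timestamp.
    Returns None when no value had been recorded yet at `as_of`.
    """
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, as_of)
    return history[idx - 1][1] if idx > 0 else None


# Hypothetical feature history for one user: (unix_time, purchases_30d)
history = [(100, 2), (200, 5), (300, 9)]

# A label observed at t=250 should be paired with the value known at t=250,
# not the most recent value, which would leak future information into training.
training_value = point_in_time_value(history, as_of=250)
```

Joining on the latest value instead of the value as of the label timestamp is one common way training features end up differing from what the service sees at request time.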

GPU Memory:

  • nvidia.com/gpu: 1 (set under resources.limits) reserves the ENTIRE GPU, not a share of its memory
  • MIG/MPS sharing has real limitations (not plug-and-play)
  • OOM on GPU can kill the pod without leaving useful logs
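
As a rough sketch of the whole-device rule, the helper below builds the `resources` section of a container spec as a Python dict. The helper name is hypothetical; the `nvidia.com/gpu` resource name and the whole-number requirement reflect how Kubernetes schedules GPUs through the NVIDIA device plugin.

```python
def gpu_container_resources(gpu_count):
    """Build the `resources` section of a container spec for whole GPUs.

    GPUs are declared under `limits`; fractional values are rejected here
    because the scheduler only hands out whole devices.
    """
    is_whole = isinstance(gpu_count, int) and not isinstance(gpu_count, bool)
    if not is_whole or gpu_count < 1:
        raise ValueError("nvidia.com/gpu must be a positive whole number")
    return {"limits": {"nvidia.com/gpu": gpu_count}}


# One full device is reserved for the container, regardless of how much
# GPU memory the model actually uses.
resources = gpu_container_resources(1)
```

Sharing a device between pods needs MIG or MPS configured at the cluster level; it cannot be expressed by asking for `0.5` here.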

Model Versioning ≠ Code Versioning:

  • Model artifacts need separate versioning (MLflow, W&B, DVC)
  • Training data version + preprocessing version + code version = reproducibility
  • Rollback requires keeping old model versions deployable
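
One way to make the "data + preprocessing + code" rule concrete is to derive a single fingerprint from all three versions and store it with the model artifact. This is a minimal sketch; the function name and the version strings are hypothetical, and tools such as MLflow or DVC have their own tracking mechanisms.

```python
import hashlib
import json


def reproducibility_fingerprint(data_version, preprocessing_version, code_version):
    """Combine the three versions into one short, deterministic identifier."""
    payload = json.dumps(
        {
            "data": data_version,
            "preprocessing": preprocessing_version,
            "code": code_version,
        },
        sort_keys=True,  # stable key order keeps the hash deterministic
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


# Hypothetical version identifiers for one training run.
fingerprint = reproducibility_fingerprint(
    data_version="dataset-2024-05-01",
    preprocessing_version="prep-v3",
    code_version="git-abc1234",
)
```

Changing any one of the three inputs yields a different fingerprint, so a model trained with new preprocessing cannot be mistaken for the old one even when the code commit is unchanged.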

Drift Detection Timing:

  • Retraining trigger isn't just "drift > threshold" → cost/benefit matters
  • Delayed ground truth makes concept drift detection lag by weeks
  • Upstream data pipeline changes cause drift without model issues
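
The cost/benefit point can be sketched as follows. The drift score here is the population stability index (PSI) over pre-binned fractions, chosen only as an example metric; the threshold, gain, and cost values are hypothetical and would come from your own business context.

```python
import math


def psi(expected_fractions, actual_fractions, eps=1e-6):
    """Population stability index between two binned distributions.

    Each argument is a list of bin fractions summing to about 1.
    `eps` guards against log(0) for empty bins.
    """
    if len(expected_fractions) != len(actual_fractions):
        raise ValueError("both distributions must use the same bins")
    total = 0.0
    for e, a in zip(expected_fractions, actual_fractions):
        e, a = max(e, eps), max(a, eps)
        total += (a - e) * math.log(a / e)
    return total


def should_retrain(drift_score, drift_threshold, expected_gain, retraining_cost):
    """Retrain only when drift is significant AND the gain outweighs the cost."""
    return drift_score > drift_threshold and expected_gain > retraining_cost


# Hypothetical training-time vs. live feature distributions over four bins.
score = psi([0.25, 0.25, 0.25, 0.25], [0.10, 0.20, 0.30, 0.40])
decision = should_retrain(score, drift_threshold=0.2,
                          expected_gain=1000.0, retraining_cost=500.0)
```

A high score alone does not justify retraining: if the drift traces back to an upstream pipeline change, fixing the pipeline may be cheaper than retraining the model.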

Scope

This skill ONLY covers:

  • CI/CD pipelines for models
  • Model serving and scaling
  • Monitoring and drift detection
  • Reproducibility practices
  • GPU infrastructure patterns

Does NOT cover: ML algorithms, feature engineering, hyperparameter tuning.