MLOps

MLOps is a practice that automates and standardizes the entire lifecycle of machine learning model development, training, deployment, and monitoring, enabling the continuous operation of models in production environments.
"Building a Model" and "Operating a Model" Are Different Jobs
Even if you can build a highly accurate model in a Jupyter notebook, keeping it running stably in production requires an entirely different skill set. Updating training data, retraining models, version control, A/B testing, and detecting performance degradation: handling all of this by hand breaks down regardless of team size.
MLOps applies the DevOps philosophy to machine learning, but it faces challenges that conventional software deployment does not. Three things must be version-controlled together (code, data, and model weights), model performance degrades over time as the data distribution shifts (drift), and experiments must be reproducible.
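To make the "three things versioned together" idea concrete, here is a minimal sketch that ties a code revision to content hashes of the data and the weights. The function names (`fingerprint`, `build_manifest`), the revision string, and the byte strings are hypothetical stand-ins for illustration, not the API of any particular tool.

```python
import hashlib
import json


def fingerprint(payload: bytes) -> str:
    """Return a short SHA-256 content hash for a blob of bytes."""
    return hashlib.sha256(payload).hexdigest()[:12]


def build_manifest(code_rev: str, data: bytes, weights: bytes) -> dict:
    """Tie one code revision to the exact data and weights it was run with."""
    return {
        "code_rev": code_rev,                  # e.g. a Git commit hash
        "data_hash": fingerprint(data),        # changes when training data changes
        "weights_hash": fingerprint(weights),  # changes when the model is retrained
    }


# Hypothetical stand-ins for a real dataset file and a serialized model.
manifest_v1 = build_manifest("abc1234", b"id,label\n1,0\n2,1\n", b"\x00\x01\x02")
manifest_v2 = build_manifest("abc1234", b"id,label\n1,0\n2,1\n3,1\n", b"\x00\x01\x02")

print(json.dumps(manifest_v1, indent=2))
# Same code and weights, different data: the data hashes must differ.
print("data changed:", manifest_v1["data_hash"] != manifest_v2["data_hash"])
```

Storing such a manifest next to each trained model is one simple way to answer "which data and code produced this model?" later on. Dedicated data-versioning tools handle large files and remote storage, which this sketch does not attempt.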
Components of an MLOps Pipeline
Data Pipeline: Automates the collection, preprocessing, and validation of training data. Because data quality places an upper bound on model quality, this is often regarded as the most critical layer.
Experiment Tracking: Tools like MLflow, Weights & Biases, and Comet are used to record hyperparameters, learning curves, and evaluation metrics, ensuring reproducibility of experiments.
Model Registry: Stores trained models with versioning and manages the promotion flow from staging to production.
Serving: Exposes models as APIs. Inference engines and servers such as vLLM, TensorRT-LLM, and Triton Inference Server are commonly used.
Monitoring: Tracks not only inference latency and error rates, but also data drift (shifts in the input data distribution) and model drift (gradual degradation of accuracy over time). It is also common to automatically trigger retraining when a drift or accuracy metric crosses a threshold.
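The drift check and retraining trigger described under Monitoring can be sketched with the Population Stability Index (PSI), one of several common drift measures. This is an illustrative sketch, not a prescribed method: the function names, the equal-width binning, and the 0.2 threshold (a widely quoted rule of thumb) are all assumptions.

```python
import math


def psi(expected, actual, bins=10):
    """Population Stability Index between a reference sample and a live sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if the reference is constant

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), bins - 1)  # clamp values outside the reference range
            counts[idx] += 1
        # A small floor avoids log(0) when a bin is empty.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


def should_retrain(reference, live, threshold=0.2):
    """Return True when input drift exceeds the (assumed) PSI threshold."""
    return psi(reference, live) > threshold


# Synthetic feature values: the live sample is the reference shifted by +3.0.
reference = [i / 100 for i in range(1000)]
shifted = [v + 3.0 for v in reference]

print("no drift  ->", should_retrain(reference, reference))
print("shifted   ->", should_retrain(reference, shifted))
```

In a real pipeline the `True` result would typically enqueue a retraining job or raise an alert rather than print. PSI only looks at input distributions, so it detects data drift; tracking accuracy against delayed ground-truth labels is still needed to catch model drift.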
MLOps in the Age of LLMs
The rise of LLMs has given rise to a derivative concept called "LLMOps." It brings operational challenges that traditional MLOps did not face, or faced only in a milder form, including prompt version control, evaluation of RAG pipelines, configuring guardrails, and optimizing inference costs. The toolchain has also expanded to include LLM-specific offerings such as LangSmith, Braintrust, and Arize AI.
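As a rough illustration of prompt version control, the sketch below keeps every revision of a named prompt template and addresses each one by a content hash. `PromptRegistry` and its methods are hypothetical names invented for this example; they do not reflect the API of LangSmith, Braintrust, or any other product.

```python
import hashlib
from typing import Optional


class PromptRegistry:
    """Keeps every version of a named prompt template, addressed by content hash."""

    def __init__(self):
        self._versions = {}  # name -> list of (digest, template), oldest first

    def register(self, name: str, template: str) -> str:
        """Store a template and return its version id (short content hash)."""
        digest = hashlib.sha256(template.encode("utf-8")).hexdigest()[:8]
        history = self._versions.setdefault(name, [])
        # Skip the write when the latest version is already identical.
        if not history or history[-1][0] != digest:
            history.append((digest, template))
        return digest

    def get(self, name: str, version: Optional[str] = None) -> str:
        """Return the latest template, or a specific version if one is given."""
        history = self._versions[name]
        if version is None:
            return history[-1][1]
        for digest, template in history:
            if digest == version:
                return template
        raise KeyError(f"{name}@{version} not found")


registry = PromptRegistry()
v1 = registry.register("summarize", "Summarize the text: {text}")
v2 = registry.register("summarize", "Summarize the text in three bullets: {text}")

# Pin an evaluation run to v1 even though v2 is now the latest version.
print(registry.get("summarize", v1).format(text="MLOps automates the ML lifecycle."))
```

Logging the version id alongside each model response makes it possible to attribute a quality regression to a specific prompt change, in the same way a commit hash attributes a bug to a code change.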