
Adaptive ML Inference Platform

A production MLOps system that uses a PPO reinforcement learning agent to dynamically route video frames across three YOLOv8 variants — cutting latency by 58% and cost by 42% without sacrificing accuracy.

IE7374 MLOps · 2026
Northeastern University
YOLOv8 · PPO · Kubernetes · GKE

Autonomous robots and drones that run a single heavy ML model spend 40–50 ms on every inference frame, creating dangerous blind spots at high speed. At the same time, 60–70% of compute is wasted on simple scenes that a lightweight model would handle just as well. This project tackles both problems with an intelligent routing layer.

The Core Idea

Deploy three YOLOv8 variants (Nano, Small, Large) simultaneously. Train a PPO agent to observe each incoming frame and pick the optimal model in real time — trading off accuracy, latency, and cost based on scene complexity.
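
At inference time the whole idea reduces to one forward pass of the policy followed by a dispatch. A minimal sketch, assuming a trained Stable Baselines3 checkpoint and stock Ultralytics weights; all file names here are illustrative, not the project's actual artifacts:

Python
# One PPO forward pass picks which YOLOv8 variant runs on this frame.
import numpy as np
from stable_baselines3 import PPO
from ultralytics import YOLO

MODELS = {
    0: YOLO("yolov8n.pt"),  # Nano: cheap and fast
    1: YOLO("yolov8s.pt"),  # Small: middle ground
    2: YOLO("yolov8l.pt"),  # Large: accurate but expensive
}
agent = PPO.load("ppo_router")  # hypothetical checkpoint name

def route_and_detect(frame: np.ndarray, obs: np.ndarray):
    action, _ = agent.predict(obs, deterministic=True)  # 0, 1, or 2
    return MODELS[int(action)](frame)  # run only the chosen variant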

Results

58% faster. 2.6× the throughput. 42% cheaper.

Avg latency: 20 ms, down from 48 ms (−58%)
Throughput: 55 FPS, up from 21 FPS (2.6×)
Cost: $115 per 1M inferences, down from $200 (−42%)
Accuracy: 95%+ retained (safety-critical performance)
GPU utilization: 30%, down from 95% (3× headroom)

At scale — 1,000 robots running at 30 FPS — the system saves $72,000 per month in compute costs.

Routing Behaviour Learned

The agent learned to think like an engineer

| Scene type | Objects | Model selected | Routing % |
|---|---|---|---|
| Simple / sparse | ≤ 2 | YOLOv8-Nano | ~70% |
| Moderate | 3–7 | YOLOv8-Small | ~20% |
| Complex / dense | ≥ 8 | YOLOv8-Large | ~10% |
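
In code, the behaviour the agent converged to is roughly this heuristic. It is shown for intuition only: the agent decides from frame features before detection runs, so it never sees an object count directly.

Python
def route_by_complexity(object_count: int) -> str:
    """Rough equivalent of the routing policy the agent learned."""
    if object_count <= 2:
        return "yolov8n"  # simple / sparse scene
    if object_count <= 7:
        return "yolov8s"  # moderate scene
    return "yolov8l"      # complex / dense scene
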
Architecture

Three-tier system end-to-end

Tier 01
Data Pipeline
8-stage Apache Airflow DAG converts COCO 2017 (123K images) to YOLO format with DVC caching, schema validation, anomaly detection, and bias-aware stratified splitting (a DAG skeleton is sketched after this list).
Tier 02
RL Agent & Training
PPO agent with a 1028-dim observation space (32×32 grayscale pixels plus Canny edge-density features). Warm-started via Behavioral Cloning to overcome value-function collapse. The reward balances quality, latency, and cost (see the sketch after this list).
Tier 03
Production Serving
FastAPI backend with WebSocket streaming, Streamlit dashboard, MLflow session logging, Kubernetes deployment on GKE with HPA autoscaling and HTTPS ingress.
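
For Tier 01, the pipeline is a strictly linear 8-stage DAG. A skeleton of how it might be wired, with stage names assumed from the description above rather than taken from the project's actual task IDs:

Python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical callables, one per stage described in Tier 01.
from stages import (download_coco, validate_schema, detect_anomalies,
                    convert_to_yolo, check_bias, stratified_split,
                    dvc_push, report_stats)

with DAG(dag_id="coco_to_yolo", start_date=datetime(2026, 1, 1),
         schedule=None, catchup=False) as dag:
    callables = [download_coco, validate_schema, detect_anomalies,
                 convert_to_yolo, check_bias, stratified_split,
                 dvc_push, report_stats]
    tasks = [PythonOperator(task_id=fn.__name__, python_callable=fn)
             for fn in callables]
    for upstream, downstream in zip(tasks, tasks[1:]):
        upstream >> downstream  # linear 8-stage chain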
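
For Tier 02, the exact reward is project-specific; this is only a hedged sketch of the quality/latency/cost balance described above, with placeholder weights rather than the trained values:

Python
def reward(map_score: float, latency_ms: float, cost_usd: float,
           w_q: float = 1.0, w_l: float = 0.01, w_c: float = 100.0) -> float:
    """Detection quality minus latency and cost penalties (illustrative weights)."""
    return w_q * map_score - w_l * latency_ms - w_c * cost_usd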

Data flow

Video frame → Feature extraction (1028-dim vector) → PPO agent decision → Selected YOLO model + fixed baseline run in parallel → JSON detections → WebSocket to Streamlit UI → MLflow logs metrics per session.
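
The 1028-dim observation in this flow is cheap to build. One plausible reading of "32×32 grayscale + Canny edge density" is 1,024 pixel values plus 4 edge statistics; the exact extra features are an assumption:

Python
import cv2
import numpy as np

def extract_features(frame: np.ndarray) -> np.ndarray:
    """Frame -> 1028-dim observation: 1024 pixels + 4 edge-density stats."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    pixels = cv2.resize(gray, (32, 32)).astype(np.float32).ravel() / 255.0
    edges = cv2.Canny(gray, 100, 200)
    h, w = edges.shape
    global_density = edges.mean() / 255.0  # fraction of edge pixels
    # Assumed extras: edge density in left / centre / right thirds.
    thirds = [edges[:, i * w // 3:(i + 1) * w // 3].mean() / 255.0
              for i in range(3)]
    return np.concatenate([pixels, [global_density], thirds]).astype(np.float32)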
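
The WebSocket leg of the flow reduces to a small endpoint. A sketch under assumed names; the route path, payload shape, and run_routed_inference are illustrative:

Python
import cv2
import numpy as np
from fastapi import FastAPI, WebSocket

app = FastAPI()

@app.websocket("/ws/infer")  # hypothetical route
async def infer(ws: WebSocket):
    await ws.accept()
    while True:
        data = await ws.receive_bytes()  # one JPEG-encoded frame
        frame = cv2.imdecode(np.frombuffer(data, np.uint8), cv2.IMREAD_COLOR)
        detections = run_routed_inference(frame)  # PPO routing + YOLO, as above
        await ws.send_json({"detections": detections})  # consumed by Streamlit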

Training Journey

7 failed attempts before finding the solution

Getting the RL agent to work was the hardest part. After 7 different training approaches all failed, we diagnosed the root cause: PPO value-function collapse on this domain. The fix was Behavioral Cloning, a supervised-learning warm-start derived from analytically optimal labels.

Solution

Pre-profile all three models on the dataset to generate optimal routing labels analytically. Train a BC classifier on those labels first, then fine-tune with PPO. The final BC classifier reached 43.5% accuracy against a 33% random baseline, and its balanced ~33%/33%/33% routing distribution confirmed a healthy, unbiased warm-start policy before PPO fine-tuning sharpened it into the scene-dependent split shown earlier.
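
Concretely, "analytically optimal" can be read as: score every profiled frame under the same quality/latency/cost trade-off as the reward and take the argmax over the three models. A sketch, assuming per-model profiling arrays; the data layout and weights are illustrative:

Python
import numpy as np

W_Q, W_L, W_C = 1.0, 0.01, 100.0  # placeholder trade-off weights

def optimal_labels(profile: dict) -> np.ndarray:
    """profile[m] holds per-frame arrays: 'map', 'lat' (ms), 'cost' ($)."""
    scores = np.stack(
        [W_Q * profile[m]["map"] - W_L * profile[m]["lat"] - W_C * profile[m]["cost"]
         for m in ("nano", "small", "large")],
        axis=1)                    # shape (n_frames, 3)
    return scores.argmax(axis=1)   # 0 = Nano, 1 = Small, 2 = Large

# These labels train the BC classifier; its weights then initialise the
# PPO policy network before fine-tuning.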

Production Monitoring

Drift detection & automated retraining

Every 6 hours
Drift Detection
Evidently AI runs as a Kubernetes CronJob, comparing live inference distributions against the reference dataset and alerting on significant drift (a minimal sketch follows this list).
On drift / decay
Automated Retraining
CI/CD pipeline (GitHub Actions) triggers model retraining and re-deployment via rolling update on GKE. Slack notifications keep the team informed.
Continuous
Prometheus + Grafana
Latency, throughput, GPU utilization, and routing balance (each model must receive >20% of frames) are scraped and dashboarded in real time.
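
The drift check itself is compact. A sketch with Evidently's Report API, assuming feature snapshots stored as Parquet; the paths and the retraining hook are illustrative:

Python
import pandas as pd
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

reference = pd.read_parquet("reference_features.parquet")  # illustrative path
current = pd.read_parquet("live_window_features.parquet")

report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference, current_data=current)

# DataDriftPreset's first metric reports a dataset-level drift flag.
if report.as_dict()["metrics"][0]["result"]["dataset_drift"]:
    trigger_retraining()  # hypothetical hook into the GitHub Actions pipeline
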
Tech Stack

Full-stack MLOps

ML / Training
YOLOv8 · Stable Baselines3 · PyTorch · ONNX Runtime
Data Pipeline
Apache Airflow · DVC · COCO 2017
Serving
FastAPI · WebSocket · Streamlit · MLflow
Infra & MLOps
Kubernetes (GKE) · Docker · Prometheus · Grafana · Evidently AI · GitHub Actions

Setup

Running the project

Bash
# 1. Data pipeline
cd Data-Pipeline
docker-compose up -d          # starts Airflow + DVC
# Trigger the 8-stage DAG from Airflow UI at localhost:8080

# 2. Train the RL agent
cd model_pipeline/src/RL
python train_bc.py              # Behavioral Cloning warm-start
python train_ppo.py             # PPO fine-tune

# 3. Launch the serving stack
python serve_fastapi.py         # FastAPI + WebSocket at :8000
streamlit run dashboard.py      # Streamlit UI at :8501

# 4. Deploy to Kubernetes
kubectl apply -f infra/k8s/

Outcome

End-to-end MLOps — from raw data to self-healing production.

This project demonstrates the full MLOps lifecycle: a reproducible data pipeline, an RL agent that learns optimal routing from scratch, a production serving layer with real-time monitoring, and automated drift detection with self-triggered retraining — all containerised and deployed to GKE. The system delivers enterprise-grade performance improvements while keeping compute costs under control at scale.

Hemanth Sai .M
MS AI · Northeastern University