
Adaptive ML Inference Platform

A production MLOps system that uses a PPO reinforcement learning agent to dynamically route video frames across three YOLOv8 variants — cutting latency by 58% and cost by 42% without sacrificing accuracy.

IE7374 MLOps · 2026
Northeastern University
YOLOv8 · PPO · Kubernetes · GKE

Autonomous robots and drones that run a single heavy ML model spend 40–50 ms on every inference frame, creating dangerous blind spots at high speed. At the same time, 60–70% of compute is wasted on simple scenes that a lightweight model would handle just as well. This project tackles both problems with an intelligent routing layer.

The Core Idea

Deploy three YOLOv8 variants (Nano, Small, Large) simultaneously. Train a PPO agent to observe each incoming frame and pick the optimal model in real time — trading off accuracy, latency, and cost based on scene complexity.
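
At inference time the whole idea reduces to one forward pass of the policy followed by a dispatch. A minimal sketch, assuming a trained Stable Baselines3 checkpoint and stock Ultralytics weights; all file names here are illustrative, not the project's actual artifacts:

Python
# One PPO forward pass picks which YOLOv8 variant runs on this frame.
import numpy as np
from stable_baselines3 import PPO
from ultralytics import YOLO

MODELS = {
    0: YOLO("yolov8n.pt"),  # Nano: cheap and fast
    1: YOLO("yolov8s.pt"),  # Small: middle ground
    2: YOLO("yolov8l.pt"),  # Large: accurate but expensive
}
agent = PPO.load("ppo_router")  # hypothetical checkpoint name

def route_and_detect(frame: np.ndarray, obs: np.ndarray):
    action, _ = agent.predict(obs, deterministic=True)  # 0, 1, or 2
    return MODELS[int(action)](frame)  # run only the chosen variant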

Results

58% faster. 2.6× the throughput. 42% cheaper.

Avg latency: 20 ms, down from 48 ms (−58%)
Throughput: 55 FPS, up from 21 FPS (2.6×)
Cost: $115 per 1M inferences, down from $200 (−42%)
Accuracy: 95%+ retained (safety-critical performance)
GPU utilization: 30%, down from 95% (3× headroom)

At scale — 1,000 robots running at 30 FPS — the system saves $72,000 per month in compute costs.

Routing Behaviour Learned

The agent learned to think like an engineer

| Scene type | Objects | Model selected | Routing % |
|---|---|---|---|
| Simple / sparse | ≤ 2 | YOLOv8-Nano | ~70% |
| Moderate | 3–7 | YOLOv8-Small | ~20% |
| Complex / dense | ≥ 8 | YOLOv8-Large | ~10% |
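
In code, the behaviour the agent converged to is roughly this heuristic. It is shown for intuition only: the agent decides from frame features before detection runs, so it never sees an object count directly.

Python
def route_by_complexity(object_count: int) -> str:
    """Rough equivalent of the routing policy the agent learned."""
    if object_count <= 2:
        return "yolov8n"  # simple / sparse scene
    if object_count <= 7:
        return "yolov8s"  # moderate scene
    return "yolov8l"      # complex / dense scene
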
Architecture

Three-tier system end-to-end

Tier 01
Data Pipeline
8-stage Apache Airflow DAG converts COCO 2017 (123K images) to YOLO format with DVC caching, schema validation, anomaly detection, and bias-aware stratified splitting (a DAG skeleton is sketched after this list).
Tier 02
RL Agent & Training
PPO agent with a 1028-dim observation space (32×32 grayscale pixels plus Canny edge-density features). Warm-started via Behavioral Cloning to overcome value-function collapse. The reward balances quality, latency, and cost (see the sketch after this list).
Tier 03
Production Serving
FastAPI backend with WebSocket streaming, Streamlit dashboard, MLflow session logging, Kubernetes deployment on GKE with HPA autoscaling and HTTPS ingress.
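
For Tier 01, the pipeline is a strictly linear 8-stage DAG. A skeleton of how it might be wired, with stage names assumed from the description above rather than taken from the project's actual task IDs:

Python
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical callables, one per stage described in Tier 01.
from stages import (download_coco, validate_schema, detect_anomalies,
                    convert_to_yolo, check_bias, stratified_split,
                    dvc_push, report_stats)

with DAG(dag_id="coco_to_yolo", start_date=datetime(2026, 1, 1),
         schedule=None, catchup=False) as dag:
    callables = [download_coco, validate_schema, detect_anomalies,
                 convert_to_yolo, check_bias, stratified_split,
                 dvc_push, report_stats]
    tasks = [PythonOperator(task_id=fn.__name__, python_callable=fn)
             for fn in callables]
    for upstream, downstream in zip(tasks, tasks[1:]):
        upstream >> downstream  # linear 8-stage chain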
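
For Tier 02, the exact reward is project-specific; this is only a hedged sketch of the quality/latency/cost balance described above, with placeholder weights rather than the trained values:

Python
def reward(map_score: float, latency_ms: float, cost_usd: float,
           w_q: float = 1.0, w_l: float = 0.01, w_c: float = 100.0) -> float:
    """Detection quality minus latency and cost penalties (illustrative weights)."""
    return w_q * map_score - w_l * latency_ms - w_c * cost_usd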

Data flow

Video frame → Feature extraction (1028-dim vector) → PPO agent decision → Selected YOLO model + fixed baseline run in parallel → JSON detections → WebSocket to Streamlit UI → MLflow logs metrics per session.
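
The 1028-dim observation in this flow is cheap to build. One plausible reading of "32×32 grayscale + Canny edge density" is 1,024 pixel values plus 4 edge statistics; the exact extra features are an assumption:

Python
import cv2
import numpy as np

def extract_features(frame: np.ndarray) -> np.ndarray:
    """Frame -> 1028-dim observation: 1024 pixels + 4 edge-density stats."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    pixels = cv2.resize(gray, (32, 32)).astype(np.float32).ravel() / 255.0
    edges = cv2.Canny(gray, 100, 200)
    h, w = edges.shape
    global_density = edges.mean() / 255.0  # fraction of edge pixels
    # Assumed extras: edge density in left / centre / right thirds.
    thirds = [edges[:, i * w // 3:(i + 1) * w // 3].mean() / 255.0
              for i in range(3)]
    return np.concatenate([pixels, [global_density], thirds]).astype(np.float32)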
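
The WebSocket leg of the flow reduces to a small endpoint. A sketch under assumed names; the route path, payload shape, and run_routed_inference are illustrative:

Python
import cv2
import numpy as np
from fastapi import FastAPI, WebSocket

app = FastAPI()

@app.websocket("/ws/infer")  # hypothetical route
async def infer(ws: WebSocket):
    await ws.accept()
    while True:
        data = await ws.receive_bytes()  # one JPEG-encoded frame
        frame = cv2.imdecode(np.frombuffer(data, np.uint8), cv2.IMREAD_COLOR)
        detections = run_routed_inference(frame)  # PPO routing + YOLO, as above
        await ws.send_json({"detections": detections})  # consumed by Streamlit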

Training Journey

7 failed attempts before finding the solution

Getting the RL agent to work was the hardest part. After 7 different training approaches all failed, we diagnosed the root cause: PPO value-function collapse on this domain. The fix was Behavioral Cloning, a supervised-learning warm-start derived from analytically optimal labels.

Solution

Pre-profile all three models on the dataset to generate optimal routing labels analytically. Train a BC classifier on those labels first, then fine-tune with PPO. The final BC classifier reached 43.5% accuracy against a 33% random baseline, and its balanced ~33%/33%/33% routing distribution confirmed a healthy, unbiased warm-start policy before PPO fine-tuning sharpened it into the scene-dependent split shown earlier.
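
Concretely, "analytically optimal" can be read as: score every profiled frame under the same quality/latency/cost trade-off as the reward and take the argmax over the three models. A sketch, assuming per-model profiling arrays; the data layout and weights are illustrative:

Python
import numpy as np

W_Q, W_L, W_C = 1.0, 0.01, 100.0  # placeholder trade-off weights

def optimal_labels(profile: dict) -> np.ndarray:
    """profile[m] holds per-frame arrays: 'map', 'lat' (ms), 'cost' ($)."""
    scores = np.stack(
        [W_Q * profile[m]["map"] - W_L * profile[m]["lat"] - W_C * profile[m]["cost"]
         for m in ("nano", "small", "large")],
        axis=1)                    # shape (n_frames, 3)
    return scores.argmax(axis=1)   # 0 = Nano, 1 = Small, 2 = Large

# These labels train the BC classifier; its weights then initialise the
# PPO policy network before fine-tuning.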

Production Monitoring

Drift detection & automated retraining

Every 6 hours
Drift Detection
Evidently AI runs as a Kubernetes CronJob, comparing live inference distributions against the reference dataset and alerting on significant drift (a minimal sketch follows this list).
On drift / decay
Automated Retraining
CI/CD pipeline (GitHub Actions) triggers model retraining and re-deployment via rolling update on GKE. Slack notifications keep the team informed.
Continuous
Prometheus + Grafana
Latency, throughput, GPU utilization, and routing balance (each model must receive >20% of frames) are scraped and dashboarded in real time.
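
The drift check itself is compact. A sketch with Evidently's Report API, assuming feature snapshots stored as Parquet; the paths and the retraining hook are illustrative:

Python
import pandas as pd
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

reference = pd.read_parquet("reference_features.parquet")  # illustrative path
current = pd.read_parquet("live_window_features.parquet")

report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference, current_data=current)

# DataDriftPreset's first metric reports a dataset-level drift flag.
if report.as_dict()["metrics"][0]["result"]["dataset_drift"]:
    trigger_retraining()  # hypothetical hook into the GitHub Actions pipeline
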
Tech Stack

Full-stack MLOps

ML / Training
YOLOv8 · Stable Baselines3 · PyTorch · ONNX Runtime
Data Pipeline
Apache Airflow · DVC · COCO 2017
Serving
FastAPI · WebSocket · Streamlit · MLflow
Infra & MLOps
Kubernetes (GKE) · Docker · Prometheus · Grafana · Evidently AI · GitHub Actions

Setup

Running the project

Bash
# 1. Data pipeline
cd Data-Pipeline
docker-compose up -d          # starts Airflow + DVC
# Trigger the 8-stage DAG from Airflow UI at localhost:8080

# 2. Train the RL agent
cd model_pipeline/src/RL
python train_bc.py              # Behavioral Cloning warm-start
python train_ppo.py             # PPO fine-tune

# 3. Launch the serving stack
python serve_fastapi.py         # FastAPI + WebSocket at :8000
streamlit run dashboard.py      # Streamlit UI at :8501

# 4. Deploy to Kubernetes
kubectl apply -f infra/k8s/

Outcome

End-to-end MLOps — from raw data to self-healing production.

This project demonstrates the full MLOps lifecycle: a reproducible data pipeline, an RL agent that learns optimal routing from scratch, a production serving layer with real-time monitoring, and automated drift detection with self-triggered retraining — all containerised and deployed to GKE. The system delivers enterprise-grade performance improvements while keeping compute costs under control at scale.

Hemanth Sai .M
MS AI · Northeastern University