Kubernetes MLOps Platform

January 02, 2024

Kubernetes-MLOps contains the results of a master's thesis, “Automation and Orchestration for Machine Learning Pipelines”, by Filip Melberg and Vasiliki Kostara, supervised by Hamid Ebadi at Infotiv AB. It provides scripts and documentation for setting up a fully reproducible Kubernetes-based ML training cluster.

High-Level Design

Two ML projects are supported:

Object Detection (YOLOv5)

Containerized training of a forklift/people object detector on a Kubernetes cluster.

Distributed Data Parallel (DDP) Training

Distributed GPU training of the model, with the cluster extended by additional worker nodes.
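
As a rough illustration of the data-parallel idea, the sketch below shows how a dataset can be split so that each worker trains on its own slice in every epoch. It is written in plain Python and mirrors the padded, strided partitioning commonly used by distributed samplers. It is not taken from the thesis code; the function name `shard_indices` and its parameters are hypothetical.

```python
import math


def shard_indices(dataset_size: int, world_size: int, rank: int) -> list[int]:
    """Return the sample indices that worker `rank` processes in one epoch.

    Hypothetical helper for illustration only (not part of the repository).
    """
    if dataset_size <= 0 or world_size <= 0:
        raise ValueError("dataset_size and world_size must be positive")
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")

    # Every worker gets the same number of samples so that all workers
    # perform the same number of training steps and stay in lockstep.
    per_worker = math.ceil(dataset_size / world_size)
    total = per_worker * world_size

    # Pad by wrapping around to the start when the split is uneven.
    indices = [i % dataset_size for i in range(total)]

    # Strided slice: rank 0 takes positions 0, W, 2W, ...; rank 1 takes 1, W+1, ...
    return indices[rank:total:world_size]


if __name__ == "__main__":
    # Ten samples split across three workers.
    for rank in range(3):
        print(rank, shard_indices(10, 3, rank))
    # 0 [0, 3, 6, 9]
    # 1 [1, 4, 7, 0]
    # 2 [2, 5, 8, 1]
```

In actual DDP training each worker would also hold a full copy of the model and average gradients with the other workers after every step; the sketch covers only the data split.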

High-Level Requirements

  1. Automated and reproducible (Docker, scripts, config files)
  2. Open-source / free solution
  3. Use of a cluster for training and load balancing
  4. Version control for code, data, and models

Infrastructure

The setup includes:

  • Kubernetes control plane and worker nodes on VirtualBox VMs (Ubuntu 22.04)
  • Local image registry and Samba storage
  • Kubernetes Dashboard
  • Networked via Bridged Adapter (all nodes on the same LAN)
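
To make the pieces above more concrete, here is a minimal sketch of how a containerized training run could be described as a Kubernetes Job that pulls its image from the local registry and mounts shared storage. It is an assumption about one possible layout, not the repository's actual manifests: the registry address `registry.local:5000`, the image name, the claim name `training-data`, the mount path `/data`, and the training command are all hypothetical. The optional GPU limit uses the `nvidia.com/gpu` resource name, which is only available when the NVIDIA device plugin is installed on the cluster.

```python
import json


def training_job_manifest(name, image, command, pvc_name, gpus=0):
    """Build a batch/v1 Job manifest as a plain dict (hypothetical helper)."""
    container = {
        "name": "trainer",
        "image": image,
        "command": command,
        # Shared dataset/model storage mounted into the container.
        "volumeMounts": [{"name": "data", "mountPath": "/data"}],
    }
    if gpus:
        # Assumes the NVIDIA device plugin exposes GPUs to the scheduler.
        container["resources"] = {"limits": {"nvidia.com/gpu": gpus}}

    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "backoffLimit": 0,  # do not retry a failed training run
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [container],
                    "volumes": [
                        {
                            "name": "data",
                            "persistentVolumeClaim": {"claimName": pvc_name},
                        }
                    ],
                }
            },
        },
    }


if __name__ == "__main__":
    manifest = training_job_manifest(
        name="yolov5-train",
        image="registry.local:5000/yolov5-train:latest",  # hypothetical image
        command=["python", "train.py"],                   # hypothetical command
        pvc_name="training-data",                         # hypothetical claim
    )
    print(json.dumps(manifest, indent=2))
```

Because kubectl accepts JSON as well as YAML, the printed manifest could be piped to `kubectl apply -f -` once the names are adapted to the actual cluster.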

Funding

This research was carried out within the SMILE IV project, financed by Vinnova, FFI, under grant number 2023-00789.

Kubernetes-MLOps GitHub repository