Ibrahim Ghali

JupyterHub on Kubernetes

JupyterHub on Kubernetes

Overview

JupyterHub on Kubernetes with Task Queue and Resource Optimization is a high-availability platform designed to support multiple users running Jupyter notebooks in parallel for AI, ML, and scientific computing workloads. The system integrates dynamic resource scheduling, persistent storage, and container orchestration, ensuring fair distribution of CPU/GPU resources and uninterrupted access to user environments.

This project was completed as a Final Year Project for the National Engineering Degree in Computer Science, Data Engineering Option – Faculty of Sciences of Sfax, showcasing expertise in scalable, production-ready data platforms.

Deployment Architecture

  • High-Availability Kubernetes Cluster: Three master nodes with HAProxy for load balancing and worker nodes hosting JupyterHub environments.
  • Persistent Storage: NFS provides shared storage across all pods, ensuring users’ notebooks and data are preserved.
  • Dynamic User Environments: Each user is provisioned a dedicated JupyterLab instance, dynamically managed by Kubernetes for scalability and isolation.
  • Resource Optimization: Apache YuniKorn manages CPU/GPU scheduling with a task queue, ensuring fair and efficient allocation across multiple users.

Technologies Used

  • Kubernetes: Container orchestration, high availability, and fault tolerance
  • JupyterHub: Multi-user management of isolated JupyterLab environments
  • Helm: Deployment automation and package management
  • Apache YuniKorn: Fine-grained task scheduling for CPU/GPU workloads
  • NFS: Shared persistent storage for notebooks and user data
  • HAProxy: Load balancing for Kubernetes master nodes

Key Features

  • Multi-user access to isolated, production-ready Jupyter environments
  • Dynamic, fair CPU/GPU resource scheduling with YuniKorn
  • Persistent notebook storage for reliable user data management
  • High-availability cluster architecture to minimize downtime
  • Support for resource-intensive AI/ML and scientific computing workloads

Business & Technical Impact

This platform demonstrates the ability to deploy and manage complex, scalable, and high-performance data infrastructures. It highlights skills in cloud-native architecture, container orchestration, resource optimization, and multi-user platform management, enabling organizations to maximize hardware utilization, ensure data persistence, and support intensive computational workloads.

Future Improvements

  • Monitoring & Observability: Integration with Prometheus and Grafana for real-time visualization of CPU/GPU/memory usage and custom dashboards for cluster administration.