MLOps Podcast Classifier

1. Project Overview
Podcasts contain a wide variety of content, making manual classification of suitability for children slow, inconsistent, and error-prone. This project implements an end-to-end MLOps pipeline that automates data ingestion, processing, and annotation, along with the training, deployment, and monitoring of a model that classifies podcast content as suitable or unsuitable for children. The system is designed for reproducibility, scalability, and reliability.
2. Problem Statement
- Manual classification of podcast content is time-consuming and inaccurate.
- Real-time processing of large-scale podcast metadata is challenging.
- Tracking changes in data and models manually introduces risk and reduces reproducibility.
- Maintaining secure, production-ready deployments adds operational complexity.
3. Solution Overview
The pipeline provides a fully automated, production-ready MLOps solution:
- Data Ingestion: Podcast metadata, including keywords and descriptions, is consumed in real-time from Kafka topics.
- Data Processing & Annotation: Raw data is processed and annotated for model training. Dataset versioning and tracking are handled with DVC, with remote storage on a VPC hosted by OXAHOST.
- Model Training & Experiment Tracking: A classification model predicts whether podcast content is suitable for children. Experiments and metadata are tracked in MLflow, with PostgreSQL storing experiment metadata. The best model is registered in MLflow Registry for deployment.
- CI/CD Pipeline: Jenkins automates build, test, and deployment processes. New data changes pushed to GitHub trigger retraining and redeployment.
- Model Deployment: The best model is served via a FastAPI API, enabling real-time recommendations. All services are containerized using Docker for portability and reproducibility.
- Monitoring & Security: Production data is monitored for data drift to maintain model reliability. Front-end and back-end communication is secured via HTTPS, and DNS is managed through Cloudflare.
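As one illustration of the annotation step, the sketch below applies a keyword-based heuristic over the metadata fields mentioned above (keywords and descriptions) to attach a provisional label. The blocklist terms and field names are assumptions for illustration, not the project's actual schema or labeling logic:

```python
# Minimal sketch of the annotation step: flag a podcast metadata record as
# child-suitable or not using a keyword blocklist.
# The blocklist and field names below are illustrative assumptions.
BLOCKLIST = {"explicit", "violence", "gambling", "true crime"}

def annotate(metadata: dict) -> dict:
    """Attach a provisional 'child_suitable' label to one podcast record."""
    text = " ".join([
        metadata.get("description", ""),
        " ".join(metadata.get("keywords", [])),
    ]).lower()
    flagged = any(term in text for term in BLOCKLIST)
    return {**metadata, "child_suitable": not flagged}

record = {"title": "Science for Kids", "keywords": ["education", "kids"],
          "description": "Fun experiments explained simply."}
print(annotate(record)["child_suitable"])  # → True
```

In the real pipeline these provisional labels would feed the training set, with human review correcting the heuristic's mistakes over time.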
4. System Architecture
The data flows from Kafka into the processing and annotation layer, then to DVC for versioning. Models are trained and tracked in MLflow, with the best model deployed via FastAPI to the front-end. Jenkins automates the CI/CD pipeline, and monitoring ensures data drift is detected in production.
Key technologies include Kafka, DVC, Jenkins, MLflow, FastAPI, Docker, PostgreSQL, Cloudflare, and VPC.
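The service topology described above could be wired together with Docker Compose roughly as follows; the images, versions, ports, and service names are illustrative placeholders, not the project's actual configuration:

```yaml
# Illustrative sketch only: images, ports, and credentials are assumptions.
services:
  kafka:
    image: confluentinc/cp-kafka:7.6.0        # real-time metadata ingestion
    ports: ["9092:9092"]
  postgres:
    image: postgres:16                        # MLflow experiment metadata store
    environment:
      POSTGRES_PASSWORD: example
  mlflow:
    image: ghcr.io/mlflow/mlflow:v2.14.1      # tracking server + model registry
    command: mlflow server --backend-store-uri postgresql://postgres:example@postgres/mlflow
    ports: ["5000:5000"]
    depends_on: [postgres]
  api:
    build: .                                  # FastAPI model-serving image
    ports: ["8000:8000"]
    depends_on: [mlflow]
  jenkins:
    image: jenkins/jenkins:lts                # CI/CD automation
    ports: ["8080:8080"]
```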
5. Key Features & Benefits
The pipeline delivers:
- Real-time recommendations through automatic classification of podcast content suitability for children.
- CI/CD automation that reduces manual intervention in retraining and deployment.
- Experiment tracking with MLflow and data versioning with DVC, which ensure reproducibility and version control.
- Model registration in the MLflow Registry, centralizing production-ready models.
- Low-latency API serving via FastAPI.
- Continuous data drift monitoring for ongoing model reliability.
- Docker containerization for reproducible, portable, and scalable deployments.
- HTTPS and Cloudflare DNS for secure communication and domain management.
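The drift monitoring mentioned above can be sketched with a Population Stability Index (PSI) check comparing a reference feature distribution against a fresh production batch. The bin count, alert threshold, and sample data here are illustrative assumptions, not the project's actual monitoring setup:

```python
import math

# Hedged sketch of a drift check: Population Stability Index (PSI) between
# a reference sample and production data for one numeric feature.
# Bin count and the 0.25 alert threshold are common but illustrative choices.

def psi(expected: list, actual: list, bins: int = 10) -> float:
    """PSI between two samples of one feature (higher = more drift)."""
    lo, hi = min(expected), max(expected)
    span = (hi - lo) or 1.0  # avoid division by zero for constant features

    def fractions(values):
        counts = [0] * bins
        for v in values:
            idx = min(max(int((v - lo) / span * bins), 0), bins - 1)
            counts[idx] += 1
        return [max(c / len(values), 1e-6) for c in counts]  # avoid log(0)

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

reference = [i / 100 for i in range(100)]      # baseline: uniform on [0, 1)
drifted = [0.8 + i / 500 for i in range(100)]  # production batch shifted right

print(psi(reference, reference))       # identical data → zero drift
print(psi(reference, drifted) > 0.25)  # shifted batch exceeds alert threshold
```

In production, a check like this would run on a schedule over incoming Kafka data and trigger retraining or an alert when the score crosses the threshold.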
6. Before → After
Before the pipeline, classification was manual, slow, and error-prone; tracking data and model changes was difficult, and reproducibility was limited. After implementing the pipeline, the system classifies content automatically in real time, tracks datasets and experiments with DVC and MLflow, deploys models automatically via CI/CD, and monitors for data drift while maintaining security and scalability.
7. Outcomes & Impact
The pipeline provides real-time content classification while reducing operational complexity. Automation via CI/CD cuts manual retraining and deployment effort. DVC and MLflow ensure reproducibility and consistent version control. Docker and VPC deployment, along with HTTPS and Cloudflare DNS, keep operations scalable and secure. Data drift monitoring maintains reliability and accuracy in production.
8. Conclusion
The MLOps pipeline provides reliable, automated, and real-time classification of podcast content, reducing operational complexity while ensuring reproducibility, security, and scalability. It transforms raw podcast data into actionable recommendations efficiently and consistently.