Podcast Semantic Search Pipeline

1. Problem Statement
Most podcast platforms rely on shallow metadata (title, description, author) to power search and discovery.
This approach fails because podcasts are long, unstructured audio files where valuable topics are embedded deep inside conversations.
As a result:
- Users cannot search podcasts by topics mentioned inside episodes
- Search accuracy is low for MP3/MP4-based content
- Raw audio files are expensive to store and unusable for analytics or ML
- Existing systems do not scale with content volume
Example
A podcast discusses Gaza and references the Ottoman Empire.
Searching for “Ottoman Empire” returns no results because the term is not present in the podcast metadata.
2. Project Objective
Design and implement a scalable, event-driven pipeline that:
- Converts unstructured podcast audio into searchable text
- Enriches content using LLM-generated metadata
- Enables accurate semantic search
- Serves as a data foundation for analytics and ML pipelines
- Treats Kafka as the Single Source of Truth (SSOT) for governance and reprocessing
3. Core Design Principles
- Accuracy over latency (for STT and enrichment)
- Event-driven architecture
- Kafka as SSOT
- Loose coupling between services
- Production-oriented design under real resource constraints
- Clear separation between ingestion, processing, enrichment, and consumption
4. High-Level Architecture
The system is built around Kafka as the backbone.
All services communicate via events, not direct dependencies.
Main components:
- RSS scraping service
- STT service (Faster-Whisper)
- LLM enrichment service (Llama 3.2 3B)
- Kafka & Kafka Connect
- Elasticsearch
- FastAPI
- Trino + Apache Superset
- PostgreSQL (Superset metadata)
5. Kafka as Single Source of Truth (SSOT)
Kafka is the authoritative data layer in the system.
Why Kafka is critical
- Multiple consumers depend on the same data (STT, LLM, search, analytics, ML)
- Enables replayability without re-scraping or re-downloading audio
- Preserves data lineage for governance
- Allows independent scaling of producers and consumers
- Feeds external MLOps pipelines for model training and versioning
Kafka topics represent facts, not intermediate files.
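A minimal sketch of the pattern, assuming the kafka-python client and JSON-encoded values (the source specifies neither the client library nor the serialization format):

import json
from kafka import KafkaProducer

# Producing an immutable fact to the metadata topic; any downstream
# consumer (STT, LLM, search, analytics, ML) can replay it later
# without re-scraping the feed or re-downloading the audio.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("raw_podcast_metadata", {
    "title": "Episode 42",
    "author": "Example Host",
    "audio_url": "https://example.com/ep42.mp3",
})
producer.flush()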
6. End-to-End Data Flow
6.1 Podcast Ingestion
- RSS feeds are scraped
- Metadata extracted:
  - Title
  - Duration
  - Publication date
  - Author
  - Audio URL
- Audio files downloaded and stored
Produced Kafka topics:
raw_podcast_metadata
audio_ready
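A minimal ingestion sketch, assuming feedparser for RSS parsing and kafka-python as the producer (neither library is named in the source); the field names are illustrative:

import json
import feedparser
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

feed = feedparser.parse("https://example.com/podcast.rss")
for entry in feed.entries:
    metadata = {
        "title": entry.get("title"),
        "duration": entry.get("itunes_duration"),
        "published": entry.get("published"),
        "author": entry.get("author"),
        "audio_url": entry.enclosures[0].href if entry.enclosures else None,
    }
    # Fact 1: the episode's metadata exists.
    producer.send("raw_podcast_metadata", metadata)
    # Fact 2: once the file is downloaded and stored (omitted), the audio is ready.
    producer.send("audio_ready", {"audio_url": metadata["audio_url"]})
producer.flush()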
6.2 Speech-to-Text (STT)
Model Evaluation
- Vosk: fast, low accuracy
- Whisper / Faster-Whisper: higher accuracy, higher compute cost
Accuracy was prioritized due to downstream dependencies (search relevance, LLM quality, ML training).
Model Used
faster-distil-whisper-small.en
Performance Benchmarks
- CPU (Intel i5 12th Gen, 12 cores):
  - ~35 minutes for a 3-hour podcast
- GPU (RTX 3050, 4GB VRAM):
  - ~3–4 minutes for the same audio
- GPU utilization reached ~100% during concurrent STT + LLM workloads
Optimizations Applied
- Voice Activity Detection (VAD)
- Batch processing
- Chunked transcription
Produced Kafka topic:
transcription_chunks
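A minimal transcription sketch using the faster-whisper library; the model path matches the setup step in section 12, and the Kafka producer side is omitted:

from faster_whisper import WhisperModel

# Device and compute type match the RTX 3050 benchmark above;
# use device="cpu" to reproduce the CPU numbers.
model = WhisperModel("models/systran-distil-small", device="cuda", compute_type="float16")

# vad_filter applies Voice Activity Detection so silence is skipped.
# Batch processing is additionally available via
# faster_whisper.BatchedInferencePipeline in newer releases.
segments, info = model.transcribe("episode.mp3", vad_filter=True)

for seg in segments:
    chunk = {"start": seg.start, "end": seg.end, "text": seg.text}
    # Each chunk would be produced to the transcription_chunks topic here.
    print(chunk)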
6.3 LLM-Based Metadata Enrichment
Model Choice
- Llama 3.2 3B
- Selected because it fits the available hardware while maintaining acceptable output quality
Processing Strategy
- Fixed-size context windows
- Each transcription chunk processed independently
- Metadata generated:
  - Topics
  - Keywords
  - Descriptions
Produced Kafka topic:
enriched_metadata
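A sketch of the per-chunk enrichment step. The serving stack is an assumption (a local Ollama-style endpoint); the source does not specify how the model is hosted, and the prompt is illustrative:

import json
import requests

PROMPT = (
    "Extract topics, keywords, and a one-sentence description from this "
    'podcast transcript chunk. Answer as JSON with keys "topics", '
    '"keywords", "description".\n\n'
)

def enrich_chunk(chunk_text: str) -> dict:
    # Assumption: Llama 3.2 3B served via a local Ollama endpoint.
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "llama3.2:3b", "prompt": PROMPT + chunk_text, "stream": False},
        timeout=120,
    )
    # Each result would be produced to the enriched_metadata topic.
    return json.loads(resp.json()["response"])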
6.4 Known Limitation & Designed Optimization
Identified Issue
- Similar transcription chunks may generate redundant metadata
- Aggregation across chunks can introduce noise
Proposed (Not Implemented) Optimization
- Use an embedding model to:
  - Embed each transcription chunk
  - Compare semantic similarity
  - Skip LLM inference for redundant chunks
This was not implemented due to compute limitations and project scope, representing a conscious trade-off rather than a design flaw.
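A sketch of the proposed (unimplemented) deduplication step, assuming sentence-transformers for embeddings; the model choice and threshold are illustrative:

from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice
SIMILARITY_THRESHOLD = 0.9                          # illustrative; would need tuning

def chunks_worth_enriching(chunks: list[str]) -> list[str]:
    kept, kept_embeddings = [], []
    for chunk in chunks:
        emb = embedder.encode(chunk, convert_to_tensor=True)
        # Skip LLM inference when a semantically similar chunk was already kept.
        if any(float(util.cos_sim(emb, e)) >= SIMILARITY_THRESHOLD for e in kept_embeddings):
            continue
        kept.append(chunk)
        kept_embeddings.append(emb)
    return kept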
7. Search & Indexing
Elasticsearch
- Kafka Connect Elasticsearch Sink consumes enriched_metadata
- Indexed data includes:
  - Raw metadata
  - Transcriptions
  - LLM-generated topics and keywords
This enables:
- Full-text search over transcription content
- Metadata-based filtering
- Improved recall and relevance
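A sketch of the kind of query this enables, using the official Elasticsearch Python client; the index name follows the Kafka Connect sink default (the topic name), and the field names are assumptions about the document schema:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Full-text search across the transcription and the LLM-generated fields,
# which is what makes "Ottoman Empire" findable even when it never appears
# in the original feed metadata.
results = es.search(index="enriched_metadata", query={
    "multi_match": {
        "query": "Ottoman Empire",
        "fields": ["transcription", "topics", "keywords", "description", "title"],
    }
})
for hit in results["hits"]["hits"]:
    print(hit["_score"], hit["_source"].get("title"))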
8. API Layer
FastAPI
- Search podcasts by metadata and transcription content
- Retrieve enriched podcast results
- Track user actions (search, click, listen)
All user actions are:
- Produced to Kafka
- Indexed into a separate Elasticsearch index for analytics
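A minimal FastAPI sketch combining both responsibilities; the endpoint path, the user_actions topic name, and the event shape are assumptions:

import json
from fastapi import FastAPI
from kafka import KafkaProducer
from elasticsearch import Elasticsearch

app = FastAPI()
es = Elasticsearch("http://elasticsearch:9200")
producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

@app.get("/search")
def search(q: str):
    # The search itself is a user action: produced to Kafka first, then
    # sunk into a separate Elasticsearch index for analytics.
    producer.send("user_actions", {"action": "search", "query": q})
    results = es.search(index="enriched_metadata", query={
        "multi_match": {"query": q, "fields": ["transcription", "topics", "keywords"]}
    })
    return [hit["_source"] for hit in results["hits"]["hits"]]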
9. Analytics & Visualization
Stack
- Trino for analytical queries
- Apache Superset for dashboards
- PostgreSQL for Superset metadata
Planned KPIs
- Search-to-play conversion rate
- Most listened categories
- Topic popularity
- User engagement metrics
Dashboards were planned but not finalized, as project focus was on building the ingestion, enrichment, and MLOps-ready pipeline.
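A sketch of how the search-to-play conversion KPI could be computed through the Trino Python client; the catalog, schema, and table names are assumptions about how Trino is wired to the user-action data:

import trino

conn = trino.dbapi.connect(host="localhost", port=8080, user="analytics")
cur = conn.cursor()
# Listens as a fraction of searches; nullif guards against division by zero.
cur.execute("""
    SELECT count_if(action = 'listen') * 1.0
           / nullif(count_if(action = 'search'), 0) AS search_to_play_rate
    FROM elasticsearch.default.user_actions
""")
print(cur.fetchall())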
10. ML Pipeline Integration
Kafka topics act as training data sources for downstream ML pipelines:
- LLM-generated metadata used as weak labels
- Transcriptions used for feature extraction
- User behavior data for personalization models
An external MLOps pipeline consumes Kafka topics for:
- Model training
- Retraining
- Versioning
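A sketch of an ML-side consumer replaying a topic from the beginning; the (features, weak labels) pair format and field names are illustrative:

import json
from kafka import KafkaConsumer

# auto_offset_reset="earliest" replays the topic from the start, which is
# what lets Kafka double as the training-data source.
consumer = KafkaConsumer(
    "enriched_metadata",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    record = message.value
    # Transcription text as features, LLM-generated topics as weak labels.
    example = (record.get("transcription"), record.get("topics"))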
11. Containerization & Deployment
Containerization
All services are containerized and deployed using Docker Compose:
- Scraper
- STT service
- LLM service
- Kafka
- Elasticsearch
- Kafka Connect
- Trino
- FastAPI
- Superset
- PostgreSQL
Custom images (scraper, STT, LLM) are hosted on Docker Hub under ibrahimghali.
Repository:
https://hub.docker.com/repositories/ibrahimghali
Infrastructure Used
- CPU: Intel i5 12th Gen (12 cores) for initial testing
- GPU: RTX 3050 (4GB VRAM)
- High utilization during concurrent STT + LLM workloads
- Demonstrates real-world resource constraints
Scaling Considerations
- Kafka enables horizontal scaling of producers and consumers
- Elasticsearch and Kafka should be deployed as clusters in production
- Shared volumes used for development only
- NFS or Ceph recommended for production storage
12. Setup and Initialization
Kafka Connect Internal Topics
Topic Config
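# Kafka Connect stores its own state in three compacted internal topics;
# create them before starting Connect. Replication factor 1 matches the
# single-broker development setup.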
docker exec -it kafka kafka-topics --create \
--topic connect-offsets \
--bootstrap-server localhost:9092 \
--replication-factor 1 \
--partitions 1 \
--config cleanup.policy=compact
docker exec -it kafka kafka-topics --create \
--topic connect-configs \
--bootstrap-server localhost:9092 \
--replication-factor 1 \
--partitions 1 \
--config cleanup.policy=compact
docker exec -it kafka kafka-topics --create \
--topic connect-status \
--bootstrap-server localhost:9092 \
--replication-factor 1 \
--partitions 1 \
--config cleanup.policy=compact
STT Model Setup
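# Download the faster-distil-whisper-small.en weights from Hugging Face
# into a local models directory.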
mkdir -p models/systran-distil-small
cd models/systran-distil-small
wget https://huggingface.co/Systran/faster-distil-whisper-small.en/resolve/main/model.bin
wget https://huggingface.co/Systran/faster-distil-whisper-small.en/resolve/main/tokenizer.json
wget https://huggingface.co/Systran/faster-distil-whisper-small.en/resolve/main/preprocessor_config.json
Elasticsearch Sink Connector
ES Sink
curl -X POST http://localhost:8083/connectors \
-H "Content-Type: application/json" \
-d '{
"name": "es-sink",
"config": {
"connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
"tasks.max": "1",
"topics": "enriched_metadata",
"key.ignore": "true",
"schema.ignore": "true",
"connection.url": "http://elasticsearch:9200",
"type.name": "_doc",
"behavior.on.null.values": "delete"
}
}'
curl http://localhost:8083/connectors/es-sink/status
13. Data Architecture (Medallion Model)
The pipeline follows a medallion architecture, stored in a shared Docker volume (../data).
Shared volumes were used for development only. NFS was planned for production but not deployed due to resource limitations.
Layers
- Bronze
  - Raw audio files (.mp3)
  - Raw scraped metadata (.json)
- Silver
  - Cleaned transcripts (.txt)
  - Structured episode metadata (.json)
- Gold
  - LLM-enriched metadata (.json)
  - Aggregated outputs
  - Chunk-level enrichment results
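A sketch of the resulting on-disk layout (directory and file names are illustrative, not the actual development snapshot):

data/
  bronze/
    episode_001.mp3
    episode_001.json
  silver/
    episode_001_transcript.txt
    episode_001_metadata.json
  gold/
    episode_001_enriched.json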
The dataset contains:
- 12 directories
- 266 files (development snapshot)
14. Final Summary
This project demonstrates how unstructured podcast audio can be transformed into searchable, enriched, and ML-ready data using an event-driven architecture. Kafka serves as the system’s single source of truth, enabling scalability, governance, and future ML pipelines, while all components are designed and deployed under real-world infrastructure constraints.