Podcast Semantic Search Pipeline

1. Problem Statement
Most podcast platforms rely on shallow metadata (title, description, author) to power search and discovery.
This approach fails because podcasts are long, unstructured audio files where valuable topics are embedded deep inside conversations.
As a result:
- Users cannot search podcasts by topics mentioned inside episodes
- Search accuracy is low for MP3/MP4-based content
- Raw audio files are expensive to store and unusable for analytics or ML
- Existing systems do not scale with content volume
Example
A podcast discusses Gaza and references the Ottoman Empire.
Searching for “Ottoman Empire” returns no results because the term is not present in the podcast metadata.
2. Project Objective
Design and implement a scalable, event-driven pipeline that:
- Converts unstructured podcast audio into searchable text
- Enriches content using LLM-generated metadata
- Enables accurate semantic search
- Serves as a data foundation for analytics and ML pipelines
- Treats Kafka as the Single Source of Truth (SSOT) for governance and reprocessing
3. Core Design Principles
- Accuracy over latency (for STT and enrichment)
- Event-driven architecture
- Kafka as SSOT
- Loose coupling between services
- Production-oriented design under real resource constraints
- Clear separation between ingestion, processing, enrichment, and consumption
4. High-Level Architecture
The system is built around Kafka as the backbone.
All services communicate via events, not direct dependencies.
Main components:
- RSS scraping service
- STT service (Faster-Whisper)
- LLM enrichment service (Llama 3.2 3B)
- Kafka & Kafka Connect
- Elasticsearch
- FastAPI
- Trino + Apache Superset
- PostgreSQL (Superset metadata)
5. Kafka as Single Source of Truth (SSOT)
Kafka is the authoritative data layer in the system.
Why Kafka is critical
- Multiple consumers depend on the same data (STT, LLM, search, analytics, ML)
- Enables replayability without re-scraping or re-downloading audio
- Preserves data lineage for governance
- Allows independent scaling of producers and consumers
- Feeds external MLOps pipelines for model training and versioning
Kafka topics represent facts, not intermediate files.
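A minimal sketch of the pattern, assuming the kafka-python client and JSON-encoded values (the source specifies neither the client library nor the serialization format):

import json
from kafka import KafkaProducer

# Producing an immutable fact to the metadata topic; any downstream
# consumer (STT, LLM, search, analytics, ML) can replay it later
# without re-scraping the feed or re-downloading the audio.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("raw_podcast_metadata", {
    "title": "Episode 42",
    "author": "Example Host",
    "audio_url": "https://example.com/ep42.mp3",
})
producer.flush()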
6. End-to-End Data Flow
6.1 Podcast Ingestion
- RSS feeds are scraped
- Metadata extracted:
  - Title
  - Duration
  - Publication date
  - Author
  - Audio URL
- Audio files downloaded and stored
Produced Kafka topics:
raw_podcast_metadata
audio_ready
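A minimal ingestion sketch, assuming feedparser for RSS parsing and kafka-python as the producer (neither library is named in the source); the field names are illustrative:

import json
import feedparser
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

feed = feedparser.parse("https://example.com/podcast.rss")
for entry in feed.entries:
    metadata = {
        "title": entry.get("title"),
        "duration": entry.get("itunes_duration"),
        "published": entry.get("published"),
        "author": entry.get("author"),
        "audio_url": entry.enclosures[0].href if entry.enclosures else None,
    }
    # Fact 1: the episode's metadata exists.
    producer.send("raw_podcast_metadata", metadata)
    # Fact 2: once the file is downloaded and stored (omitted), the audio is ready.
    producer.send("audio_ready", {"audio_url": metadata["audio_url"]})
producer.flush()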
6.2 Speech-to-Text (STT)
Model Evaluation
- Vosk: fast, low accuracy
- Whisper / Faster-Whisper: higher accuracy, higher compute cost
Accuracy was prioritized due to downstream dependencies (search relevance, LLM quality, ML training).
Model Used
faster-distil-whisper-small.en
Performance Benchmarks
- CPU (Intel i5 12th Gen, 12 cores):
  - ~35 minutes for a 3-hour podcast
- GPU (RTX 3050, 4GB VRAM):
  - ~3–4 minutes for the same audio
- GPU utilization reached ~100% during concurrent STT + LLM workloads
Optimizations Applied
- Voice Activity Detection (VAD)
- Batch processing
- Chunked transcription
Produced Kafka topic:
transcription_chunks
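A minimal transcription sketch using the faster-whisper library; the model path matches the setup step in section 12, and the Kafka producer side is omitted:

from faster_whisper import WhisperModel

# Device and compute type match the RTX 3050 benchmark above;
# use device="cpu" to reproduce the CPU numbers.
model = WhisperModel("models/systran-distil-small", device="cuda", compute_type="float16")

# vad_filter applies Voice Activity Detection so silence is skipped.
# Batch processing is additionally available via
# faster_whisper.BatchedInferencePipeline in newer releases.
segments, info = model.transcribe("episode.mp3", vad_filter=True)

for seg in segments:
    chunk = {"start": seg.start, "end": seg.end, "text": seg.text}
    # Each chunk would be produced to the transcription_chunks topic here.
    print(chunk)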
6.3 LLM-Based Metadata Enrichment
Model Choice
- Llama 3.2 3B
- Selected because it fits the available hardware while maintaining acceptable output quality
Processing Strategy
- Fixed-size context windows
- Each transcription chunk processed independently
- Metadata generated:
  - Topics
  - Keywords
  - Descriptions
Produced Kafka topic:
enriched_metadata
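A sketch of the per-chunk enrichment step. The serving stack is an assumption (a local Ollama-style endpoint); the source does not specify how the model is hosted, and the prompt is illustrative:

import json
import requests

PROMPT = (
    "Extract topics, keywords, and a one-sentence description from this "
    'podcast transcript chunk. Answer as JSON with keys "topics", '
    '"keywords", "description".\n\n'
)

def enrich_chunk(chunk_text: str) -> dict:
    # Assumption: Llama 3.2 3B served via a local Ollama endpoint.
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "llama3.2:3b", "prompt": PROMPT + chunk_text, "stream": False},
        timeout=120,
    )
    # Each result would be produced to the enriched_metadata topic.
    return json.loads(resp.json()["response"])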
6.4 Known Limitation & Designed Optimization
Identified Issue
- Similar transcription chunks may generate redundant metadata
- Aggregation across chunks can introduce noise
Proposed (Not Implemented) Optimization
- Use an embedding model to:
  - Embed each transcription chunk
  - Compare semantic similarity
  - Skip LLM inference for redundant chunks
This was not implemented due to compute limitations and project scope, representing a conscious trade-off rather than a design flaw.
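A sketch of the proposed (unimplemented) deduplication step, assuming sentence-transformers for embeddings; the model choice and threshold are illustrative:

from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice
SIMILARITY_THRESHOLD = 0.9                          # illustrative; would need tuning

def chunks_worth_enriching(chunks: list[str]) -> list[str]:
    kept, kept_embeddings = [], []
    for chunk in chunks:
        emb = embedder.encode(chunk, convert_to_tensor=True)
        # Skip LLM inference when a semantically similar chunk was already kept.
        if any(float(util.cos_sim(emb, e)) >= SIMILARITY_THRESHOLD for e in kept_embeddings):
            continue
        kept.append(chunk)
        kept_embeddings.append(emb)
    return kept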
7. Search & Indexing
Elasticsearch
- Kafka Connect Elasticsearch Sink consumes enriched_metadata
- Indexed data includes:
  - Raw metadata
  - Transcriptions
  - LLM-generated topics and keywords
This enables:
- Full-text search over transcription content
- Metadata-based filtering
- Improved recall and relevance
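A sketch of the kind of query this enables, using the official Elasticsearch Python client; the index name follows the Kafka Connect sink default (the topic name), and the field names are assumptions about the document schema:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Full-text search across the transcription and the LLM-generated fields,
# which is what makes "Ottoman Empire" findable even when it never appears
# in the original feed metadata.
results = es.search(index="enriched_metadata", query={
    "multi_match": {
        "query": "Ottoman Empire",
        "fields": ["transcription", "topics", "keywords", "description", "title"],
    }
})
for hit in results["hits"]["hits"]:
    print(hit["_score"], hit["_source"].get("title"))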
8. API Layer
FastAPI
- Search podcasts by metadata and transcription content
- Retrieve enriched podcast results
- Track user actions (search, click, listen)
All user actions are:
- Produced to Kafka
- Indexed into a separate Elasticsearch index for analytics
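A minimal FastAPI sketch combining both responsibilities; the endpoint path, the user_actions topic name, and the event shape are assumptions:

import json
from fastapi import FastAPI
from kafka import KafkaProducer
from elasticsearch import Elasticsearch

app = FastAPI()
es = Elasticsearch("http://elasticsearch:9200")
producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

@app.get("/search")
def search(q: str):
    # The search itself is a user action: produced to Kafka first, then
    # sunk into a separate Elasticsearch index for analytics.
    producer.send("user_actions", {"action": "search", "query": q})
    results = es.search(index="enriched_metadata", query={
        "multi_match": {"query": q, "fields": ["transcription", "topics", "keywords"]}
    })
    return [hit["_source"] for hit in results["hits"]["hits"]]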
9. Analytics & Visualization
Stack
- Trino for analytical queries
- Apache Superset for dashboards
- PostgreSQL for Superset metadata
Planned KPIs
- Search-to-play conversion rate
- Most listened categories
- Topic popularity
- User engagement metrics
Dashboards were planned but not finalized, as project focus was on building the ingestion, enrichment, and MLOps-ready pipeline.
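A sketch of how the search-to-play conversion KPI could be computed through the Trino Python client; the catalog, schema, and table names are assumptions about how Trino is wired to the user-action data:

import trino

conn = trino.dbapi.connect(host="localhost", port=8080, user="analytics")
cur = conn.cursor()
# Listens as a fraction of searches; nullif guards against division by zero.
cur.execute("""
    SELECT count_if(action = 'listen') * 1.0
           / nullif(count_if(action = 'search'), 0) AS search_to_play_rate
    FROM elasticsearch.default.user_actions
""")
print(cur.fetchall())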
10. ML Pipeline Integration
Kafka topics act as training data sources for downstream ML pipelines:
- LLM-generated metadata used as weak labels
- Transcriptions used for feature extraction
- User behavior data for personalization models
An external MLOps pipeline consumes Kafka topics for:
- Model training
- Retraining
- Versioning
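A sketch of an ML-side consumer replaying a topic from the beginning; the (features, weak labels) pair format and field names are illustrative:

import json
from kafka import KafkaConsumer

# auto_offset_reset="earliest" replays the topic from the start, which is
# what lets Kafka double as the training-data source.
consumer = KafkaConsumer(
    "enriched_metadata",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    record = message.value
    # Transcription text as features, LLM-generated topics as weak labels.
    example = (record.get("transcription"), record.get("topics"))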
11. Containerization & Deployment
Containerization
All services are containerized and deployed using Docker Compose:
- Scraper
- STT service
- LLM service
- Kafka
- Elasticsearch
- Kafka Connect
- Trino
- FastAPI
- Superset
- PostgreSQL
Custom images (scraper, STT, LLM) are hosted on Docker Hub under ibrahimghali.
Repository:
https://hub.docker.com/repositories/ibrahimghali
Infrastructure Used
- CPU: Intel i5 12th Gen (12 cores) for initial testing
- GPU: RTX 3050 (4GB VRAM)
- High utilization during concurrent STT + LLM workloads
- Demonstrates real-world resource constraints
Scaling Considerations
- Kafka enables horizontal scaling of producers and consumers
- Elasticsearch and Kafka should be deployed as clusters in production
- Shared volumes used for development only
- NFS or Ceph recommended for production storage
12. Setup and Initialization
Kafka Connect Internal Topics
Topic Config
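# Kafka Connect stores its own state in three compacted internal topics;
# create them before starting Connect. Replication factor 1 matches the
# single-broker development setup.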
docker exec -it kafka kafka-topics --create \
--topic connect-offsets \
--bootstrap-server localhost:9092 \
--replication-factor 1 \
--partitions 1 \
--config cleanup.policy=compact
docker exec -it kafka kafka-topics --create \
--topic connect-configs \
--bootstrap-server localhost:9092 \
--replication-factor 1 \
--partitions 1 \
--config cleanup.policy=compact
docker exec -it kafka kafka-topics --create \
--topic connect-status \
--bootstrap-server localhost:9092 \
--replication-factor 1 \
--partitions 1 \
--config cleanup.policy=compact
STT Model Setup
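# Download the faster-distil-whisper-small.en weights from Hugging Face
# into a local models directory.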
mkdir -p models/systran-distil-small
cd models/systran-distil-small
wget https://huggingface.co/Systran/faster-distil-whisper-small.en/resolve/main/model.bin
wget https://huggingface.co/Systran/faster-distil-whisper-small.en/resolve/main/tokenizer.json
wget https://huggingface.co/Systran/faster-distil-whisper-small.en/resolve/main/preprocessor_config.json
Elasticsearch Sink Connector
ES Sink
curl -X POST http://localhost:8083/connectors \
-H "Content-Type: application/json" \
-d '{
"name": "es-sink",
"config": {
"connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
"tasks.max": "1",
"topics": "enriched_metadata",
"key.ignore": "true",
"schema.ignore": "true",
"connection.url": "http://elasticsearch:9200",
"type.name": "_doc",
"behavior.on.null.values": "delete"
}
}'
curl http://localhost:8083/connectors/es-sink/status
13. Data Architecture (Medallion Model)
The pipeline follows a medallion architecture, stored in a shared Docker volume (../data).
Shared volumes were used for development only. NFS was planned for production but not deployed due to resource limitations.
Layers
- Bronze
  - Raw audio files (.mp3)
  - Raw scraped metadata (.json)
- Silver
  - Cleaned transcripts (.txt)
  - Structured episode metadata (.json)
- Gold
  - LLM-enriched metadata (.json)
  - Aggregated outputs
  - Chunk-level enrichment results
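A sketch of the resulting on-disk layout (directory and file names are illustrative, not the actual development snapshot):

data/
  bronze/
    episode_001.mp3
    episode_001.json
  silver/
    episode_001_transcript.txt
    episode_001_metadata.json
  gold/
    episode_001_enriched.json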
The dataset contains:
- 12 directories
- 266 files (development snapshot)
14. Final Summary
This project demonstrates how unstructured podcast audio can be transformed into searchable, enriched, and ML-ready data using an event-driven architecture. Kafka serves as the system’s single source of truth, enabling scalability, governance, and future ML pipelines, while all components are designed and deployed under real-world infrastructure constraints.