Ibrahim Ghali

Podcast Semantic Search Pipeline

Event-Driven Podcast Intelligence Platform

1. Problem Statement

Most podcast platforms rely on shallow metadata (title, description, author) to power search and discovery.
This approach fails because podcasts are long, unstructured audio files where valuable topics are embedded deep inside conversations.

As a result:

  • Users cannot search podcasts by topics mentioned inside episodes
  • Search accuracy is low for MP3/MP4-based content
  • Raw audio files are expensive to store and unusable for analytics or ML
  • Existing systems do not scale with content volume

Example
A podcast discusses Gaza and references the Ottoman Empire.
Searching for “Ottoman Empire” returns no results because the term is not present in the podcast metadata.

2. Project Objective

Design and implement a scalable, event-driven pipeline that:

  • Converts unstructured podcast audio into searchable text
  • Enriches content using LLM-generated metadata
  • Enables accurate semantic search
  • Serves as a data foundation for analytics and ML pipelines
  • Treats Kafka as the Single Source of Truth (SSOT) for governance and reprocessing

3. Core Design Principles

  • Accuracy over latency (for STT and enrichment)
  • Event-driven architecture
  • Kafka as SSOT
  • Loose coupling between services
  • Production-oriented design under real resource constraints
  • Clear separation between ingestion, processing, enrichment, and consumption

4. High-Level Architecture

[Diagram: Podcast Semantic Search Pipeline Architecture]

The system is built around Kafka as the backbone.
All services communicate via events, not direct dependencies.

Main components:

  • RSS scraping service
  • STT service (Faster-Whisper)
  • LLM enrichment service (LLaMA 3.2 (3B))
  • Kafka & Kafka Connect
  • Elasticsearch
  • FastAPI
  • Trino + Apache Superset
  • PostgreSQL (Superset metadata)

5. Kafka as Single Source of Truth (SSOT)

Kafka is the authoritative data layer in the system.

Why Kafka is critical

  • Multiple consumers depend on the same data (STT, LLM, search, analytics, ML)
  • Enables replayability without re-scraping or re-downloading audio
  • Preserves data lineage for governance
  • Allows independent scaling of producers and consumers
  • Feeds external MLOps pipelines for model training and versioning

Kafka topics represent facts, not intermediate files.
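
To make the replayability claim concrete, the sketch below (assuming the kafka-python client, a broker on localhost:9092, and JSON-encoded events; the group id is illustrative) replays an entire topic simply by attaching a new consumer group that starts from the earliest offset:

import json
from kafka import KafkaConsumer

# A consumer group that has never committed offsets starts from offset 0,
# replaying every fact in the topic without re-scraping or re-downloading.
consumer = KafkaConsumer(
    "raw_podcast_metadata",
    bootstrap_servers="localhost:9092",
    group_id="reindex-replay",          # fresh, previously unused group id
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for record in consumer:
    event = record.value                # the original event, exactly as produced
    print(record.offset, event.get("title"))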

6. End-to-End Data Flow

6.1 Podcast Ingestion

  • RSS feeds are scraped
  • Metadata extracted:
    • Title
    • Duration
    • Publication date
    • Author
    • Audio URL
  • Audio files downloaded and stored

Produced Kafka topics:

  • raw_podcast_metadata
  • audio_ready
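
A minimal ingestion sketch, assuming feedparser and kafka-python (the feed URL is a placeholder, and iTunes tag names vary across feeds):

import json
import feedparser
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

feed = feedparser.parse("https://example.com/podcast.rss")  # placeholder URL
for entry in feed.entries:
    producer.send("raw_podcast_metadata", {
        "title": entry.get("title"),
        "duration": entry.get("itunes_duration"),
        "published": entry.get("published"),
        "author": entry.get("author"),
        # The RSS enclosure carries the downloadable audio URL.
        "audio_url": entry.enclosures[0].href if entry.enclosures else None,
    })

producer.flush()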

6.2 Speech-to-Text (STT)

Model Evaluation

  • Vosk: fast, low accuracy
  • Whisper / Faster-Whisper: higher accuracy, higher compute cost

Accuracy was prioritized due to downstream dependencies (search relevance, LLM quality, ML training).

Model Used

  • faster-distil-whisper-small.en

Performance Benchmarks

  • CPU (Intel i5 12th Gen, 12 cores):
    • ~35 minutes for a 3-hour podcast
  • GPU (RTX 3050, 4GB VRAM):
    • ~3–4 minutes for the same audio
    • GPU utilization reached ~100% during concurrent STT + LLM workloads

Optimizations Applied

  • Voice Activity Detection (VAD)
  • Batch processing
  • Chunked transcription

Produced Kafka topic:

  • transcription_chunks
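
A transcription sketch, assuming the faster-whisper package and the model directory from the setup section; the one-event-per-segment chunking shown here is illustrative:

import json
from faster_whisper import WhisperModel
from kafka import KafkaProducer

# On CPU-only hosts, use device="cpu" with compute_type="int8".
model = WhisperModel("models/systran-distil-small",
                     device="cuda", compute_type="float16")

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# vad_filter applies Silero VAD so non-speech regions are skipped entirely.
segments, info = model.transcribe("episode.mp3", vad_filter=True, beam_size=5)

for i, seg in enumerate(segments):
    producer.send("transcription_chunks", {
        "episode_id": "ep-001",         # illustrative id
        "chunk_index": i,
        "start": seg.start,
        "end": seg.end,
        "text": seg.text,
    })

producer.flush()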

6.3 LLM-Based Metadata Enrichment

Model Choice

  • LLaMA 3.2 (3B)
  • Selected due to resource constraints while maintaining acceptable quality

Processing Strategy

  • Fixed-size context windows
  • Each transcription chunk processed independently
  • Metadata generated:
    • Topics
    • Keywords
    • Descriptions

Produced Kafka topic:

  • enriched_metadata
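
An enrichment sketch for a single chunk. It assumes the model is served behind an Ollama-style HTTP endpoint ("llama3.2" is an illustrative model tag); the project's actual serving layer may differ:

import json
import requests

PROMPT = ("Extract topics, keywords, and a short description from this "
          "podcast transcript chunk. Respond as JSON with keys "
          "'topics', 'keywords', 'description':\n\n{chunk}")

def enrich(chunk_text: str) -> dict:
    resp = requests.post(
        "http://localhost:11434/api/generate",  # default Ollama port
        json={
            "model": "llama3.2",                # illustrative model tag
            "prompt": PROMPT.format(chunk=chunk_text),
            "format": "json",                   # constrain output to valid JSON
            "stream": False,
        },
        timeout=120,
    )
    resp.raise_for_status()
    return json.loads(resp.json()["response"])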

6.4 Known Limitation & Designed Optimization

Identified Issue

  • Similar transcription chunks may generate redundant metadata
  • Aggregation across chunks can introduce noise

Proposed (Not Implemented) Optimization

  • Use an embedding model to:
    • Embed each transcription chunk
    • Compare semantic similarity
    • Skip LLM inference for redundant chunks

This was not implemented due to compute limitations and project scope, representing a conscious trade-off rather than a design flaw.
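
A sketch of how the proposed deduplication could look, assuming the sentence-transformers library; the model choice and similarity threshold are illustrative:

from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # small, CPU-friendly model
SIM_THRESHOLD = 0.9                                 # illustrative cutoff

def is_redundant(chunk_text: str, seen_embeddings: list) -> bool:
    """Return True if a semantically near-duplicate chunk was already enriched."""
    emb = embedder.encode(chunk_text, convert_to_tensor=True)
    for prev in seen_embeddings:
        if util.cos_sim(emb, prev).item() >= SIM_THRESHOLD:
            return True                 # skip the LLM call for this chunk
    seen_embeddings.append(emb)
    return False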

7. Search & Indexing

Elasticsearch

  • Kafka Connect Elasticsearch Sink consumes enriched_metadata
  • Indexed data includes:
    • Raw metadata
    • Transcriptions
    • LLM-generated topics and keywords

This enables:

  • Full-text search over transcription content
  • Metadata-based filtering
  • Improved recall and relevance
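
Returning to the opening example, a query for "Ottoman Empire" now matches transcript text and LLM-generated topics rather than only metadata. A search sketch, assuming the elasticsearch-py 8.x client (field names depend on the sink's mapping and are illustrative):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

resp = es.search(
    index="enriched_metadata",
    query={
        "multi_match": {
            "query": "Ottoman Empire",
            "fields": ["text", "topics", "keywords", "title"],
        }
    },
)

for hit in resp["hits"]["hits"]:
    print(hit["_score"], hit["_source"].get("title"))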

8. API Layer

FastAPI

  • Search podcasts by metadata and transcription content
  • Retrieve enriched podcast results
  • Track user actions (search, click, listen)

All user actions are:

  • Produced to Kafka
  • Indexed into a separate Elasticsearch index for analytics
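
A minimal endpoint sketch, assuming FastAPI together with the Elasticsearch and kafka-python clients; the route shape, field names, and the user_actions topic name are illustrative:

import json
from elasticsearch import Elasticsearch
from fastapi import FastAPI
from kafka import KafkaProducer

app = FastAPI()
es = Elasticsearch("http://elasticsearch:9200")
producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

@app.get("/search")
def search(q: str, user_id: str):
    results = es.search(
        index="enriched_metadata",
        query={"multi_match": {"query": q, "fields": ["text", "topics"]}},
    )
    # The action is produced to Kafka first; a separate sink indexes it
    # into its own Elasticsearch index for analytics.
    producer.send("user_actions",
                  {"user_id": user_id, "action": "search", "query": q})
    return results["hits"]["hits"]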

9. Analytics & Visualization

Stack

  • Trino for analytical queries
  • Apache Superset for dashboards
  • PostgreSQL for Superset metadata

Planned KPIs

  • Search-to-play conversion rate
  • Most listened categories
  • Topic popularity
  • User engagement metrics

Dashboards were planned but not finalized, as the project's focus was on building the ingestion, enrichment, and MLOps-ready pipeline.
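
To illustrate how one planned KPI could be computed, here is a sketch using the Trino Python client; the catalog, schema, and user_actions table names are illustrative:

import trino

conn = trino.dbapi.connect(
    host="localhost", port=8080, user="analyst",
    catalog="elasticsearch", schema="default",  # illustrative catalog/schema
)
cur = conn.cursor()

# Search-to-play conversion: share of searches that led to a listen.
cur.execute("""
    SELECT
      CAST(count_if(action = 'listen') AS double)
        / nullif(count_if(action = 'search'), 0) AS search_to_play_rate
    FROM user_actions
""")
print(cur.fetchall())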

10. ML Pipeline Integration

Kafka topics act as training data sources for downstream ML pipelines:

  • LLM-generated metadata used as weak labels
  • Transcriptions used for feature extraction
  • User behavior data for personalization models

An external MLOps pipeline consumes Kafka topics for:

  • Model training
  • Retraining
  • Versioning
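
A sketch of how an external pipeline might assemble a weakly labeled dataset from these topics, assuming kafka-python and an illustrative (episode_id, chunk_index) join key:

import json
from kafka import KafkaConsumer

def load_topic(topic: str) -> dict:
    consumer = KafkaConsumer(
        topic,
        bootstrap_servers="localhost:9092",
        auto_offset_reset="earliest",
        consumer_timeout_ms=5000,       # stop iterating once the topic is drained
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )
    return {(m.value["episode_id"], m.value["chunk_index"]): m.value
            for m in consumer}

chunks = load_topic("transcription_chunks")
labels = load_topic("enriched_metadata")

# Pair each transcript chunk (features) with its LLM topics (weak labels).
dataset = [
    {"text": chunks[k]["text"], "weak_labels": labels[k].get("topics", [])}
    for k in chunks.keys() & labels.keys()
]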

11. Containerization & Deployment

Containerization

All services are containerized and deployed using Docker Compose:

  • Scraper
  • STT service
  • LLM service
  • Kafka
  • Elasticsearch
  • Kafka Connect
  • Trino
  • FastAPI
  • Superset
  • PostgreSQL

Custom images (scraper, STT, LLM) are hosted on Docker Hub under the ibrahimghali namespace:
https://hub.docker.com/repositories/ibrahimghali

Infrastructure Used

  • CPU: Intel i5 12th Gen (12 cores) for initial testing
  • GPU: RTX 3050 (4GB VRAM)
    • High utilization during concurrent STT + LLM workloads
    • Demonstrates real-world resource constraints

Scaling Considerations

  • Kafka enables horizontal scaling of producers and consumers
  • Elasticsearch and Kafka should be deployed as clusters in production
  • Shared volumes used for development only
  • NFS or Ceph recommended for production storage

12. Setup and Initialization

Kafka Connect Internal Topics


# Kafka Connect (distributed mode) stores its state in three compacted internal topics.
docker exec -it kafka kafka-topics --create \
  --topic connect-offsets \
  --bootstrap-server localhost:9092 \
  --replication-factor 1 \
  --partitions 1 \
  --config cleanup.policy=compact

docker exec -it kafka kafka-topics --create \
  --topic connect-configs \
  --bootstrap-server localhost:9092 \
  --replication-factor 1 \
  --partitions 1 \
  --config cleanup.policy=compact

docker exec -it kafka kafka-topics --create \
  --topic connect-status \
  --bootstrap-server localhost:9092 \
  --replication-factor 1 \
  --partitions 1 \
  --config cleanup.policy=compact

STT Model Setup


# Download the CTranslate2 model files; faster-whisper also needs
# config.json and vocabulary.json to load the model.
mkdir -p models/systran-distil-small
cd models/systran-distil-small

wget https://huggingface.co/Systran/faster-distil-whisper-small.en/resolve/main/model.bin
wget https://huggingface.co/Systran/faster-distil-whisper-small.en/resolve/main/config.json
wget https://huggingface.co/Systran/faster-distil-whisper-small.en/resolve/main/tokenizer.json
wget https://huggingface.co/Systran/faster-distil-whisper-small.en/resolve/main/vocabulary.json
wget https://huggingface.co/Systran/faster-distil-whisper-small.en/resolve/main/preprocessor_config.json

Elasticsearch Sink Connector


# Register the Elasticsearch sink that consumes enriched_metadata.
curl -X POST http://localhost:8083/connectors \
  -H "Content-Type: application/json" \
  -d '{
    "name": "es-sink",
    "config": {
      "connector.class": "io.confluent.connect.elasticsearch.ElasticsearchSinkConnector",
      "tasks.max": "1",
      "topics": "enriched_metadata",
      "key.ignore": "true",
      "schema.ignore": "true",
      "connection.url": "http://elasticsearch:9200",
      "type.name": "_doc",
      "behavior.on.null.values": "delete"
    }
  }'

# Verify the connector and its task are RUNNING.
curl http://localhost:8083/connectors/es-sink/status


13. Data Architecture (Medallion Model)

The pipeline follows a medallion architecture, stored in a shared Docker volume (../data).

Shared volumes were used for development only. NFS was planned for production but not deployed due to resource limitations.

Layers

  • Bronze
    • Raw audio files (.mp3)
    • Raw scraped metadata (.json)
  • Silver
    • Cleaned transcripts (.txt)
    • Structured episode metadata (.json)
  • Gold
    • LLM-enriched metadata (.json)
    • Aggregated outputs
    • Chunk-level enrichment results

The dataset contains:

  • 12 directories
  • 266 files (development snapshot)

14. Final Summary

This project demonstrates how unstructured podcast audio can be transformed into searchable, enriched, and ML-ready data using an event-driven architecture. Kafka serves as the system’s single source of truth, enabling scalability, governance, and future ML pipelines, while all components are designed and deployed under real-world infrastructure constraints.