The Problem
Podcast production is tedious. For every hour of recorded audio, there are hours of post-production: transcribing, writing show notes, pulling highlight clips, generating social media snippets, uploading to hosting platforms. Most independent podcasters either burn out doing it manually or pay someone $200-500 per episode to handle it.
We wanted to see if we could build a pipeline where you drop in a raw audio file and everything else just happens — transcription, show notes, timestamps, highlight clips, multi-platform publishing. Not a wrapper around one API call, but a real production system that handles the messy edge cases: variable audio quality, multi-speaker detection, comedy timing that AI summaries tend to butcher, and the per-client customization that every podcast needs.
The Approach
The pipeline runs in 8 sequential stages, each isolated so that a failed stage can be retried without re-running the others. Raw audio comes in, FFmpeg normalizes it (loudness, format, sample rate), then Whisper transcribes with word-level timestamps. From there, Ollama generates show notes using configurable prompts — this matters because a comedy podcast needs different show notes than a business interview. Highlight detection identifies the most engaging segments using a combination of transcript analysis and audio energy levels. Finally, the publishing step pushes formatted content to each platform's API.
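The stage isolation can be sketched as a simple retry wrapper. This is a minimal illustration, not the project's actual orchestration code; the stage names, retry counts, and payload shape are assumptions:

```python
import time

def run_stage(stage_fn, payload, max_retries=3, backoff_s=2.0):
    """Run one pipeline stage, retrying on failure with linear backoff.

    Each stage takes the payload dict and returns an updated one, so a
    failed stage can be retried without re-running earlier stages.
    """
    for attempt in range(1, max_retries + 1):
        try:
            return stage_fn(payload)
        except Exception:
            if attempt == max_retries:
                raise
            time.sleep(backoff_s * attempt)

def run_pipeline(stages, payload, **retry_opts):
    """Run named stages in order, recording which ones completed."""
    for name, stage_fn in stages:
        payload = run_stage(stage_fn, payload, **retry_opts)
        payload.setdefault("completed", []).append(name)
    return payload
```

In practice each stage would be a function like `normalize_audio` or `transcribe`, and the completed-stage list would be persisted (see the database section below) rather than held in memory.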
We built this across 49 phases using the GSD (Get Shit Done) methodology, which is our approach to breaking large projects into plannable, committable increments. Each phase has explicit requirements, a plan, execution, and a summary. It sounds rigid, but it is what lets a side project actually ship instead of languishing at 60% completion forever.
The hardest part was not the AI — it was the audio processing. FFmpeg has roughly infinite configuration options, and getting consistent output from wildly inconsistent input (phone recordings, USB mics, studio setups) required more debugging than the entire ML pipeline.
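For reference, the normalization step can be expressed as a single ffmpeg invocation using the EBU R128 loudnorm filter. The target values below are common defaults for podcast audio, not necessarily the settings this pipeline ships with:

```python
def build_normalize_cmd(src, dst, lufs=-16.0, true_peak=-1.5, lra=11.0,
                        sample_rate=44100):
    """Build an ffmpeg command that normalizes loudness (loudnorm filter)
    and resamples to consistent mono output, so downstream stages see
    uniform input regardless of the recording setup."""
    return [
        "ffmpeg", "-y", "-i", src,
        "-af", f"loudnorm=I={lufs}:TP={true_peak}:LRA={lra}",
        "-ar", str(sample_rate), "-ac", "1",
        dst,
    ]
```

The command list can be handed straight to `subprocess.run`. Two-pass loudnorm (measure first, then apply the measured values) is more accurate than this single-pass form, at the cost of decoding the file twice.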
Technical Decisions
Whisper over cloud transcription APIs. We started with AssemblyAI and Deepgram. Both are solid, but per-minute pricing gets expensive when you are processing hours of audio across multiple clients. Running Whisper locally on an RTX 3070 (PyTorch CUDA 12.4) eliminated the recurring cost entirely. Transcription quality is comparable, and we have full control over the model size and language settings.
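A minimal sketch of the local transcription step, assuming the openai-whisper package; the `flatten_words` helper is a hypothetical convenience for timestamp lookups, not the project's actual interface:

```python
def flatten_words(result):
    """Flatten Whisper's nested segment/word output into
    (start_seconds, end_seconds, word) tuples."""
    words = []
    for seg in result.get("segments", []):
        for w in seg.get("words", []):
            words.append((w["start"], w["end"], w["word"].strip()))
    return words

def transcribe_words(audio_path, model_size="medium"):
    """Transcribe locally with word-level timestamps (uses CUDA if available)."""
    import whisper  # pip install openai-whisper; imported lazily
    model = whisper.load_model(model_size)
    result = model.transcribe(audio_path, word_timestamps=True)
    return flatten_words(result)
```

Word-level timestamps are what make the later highlight-clip extraction possible: a clip boundary can snap to the nearest word rather than the nearest segment.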
Ollama for show notes generation. We needed show notes that could be customized per client — different tone, different structure, different emphasis. Cloud LLM APIs work, but Ollama running locally means zero API costs and no rate limits during development. The prompts are stored in per-client YAML configuration files, so adding a new podcast client means writing a config file, not changing code.
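The per-client prompt flow can be sketched as follows. The config keys (`show_notes`, `tone`, `template`) and the default model name are illustrative assumptions; the real call goes to Ollama's `/api/generate` endpoint, which with `stream: false` returns a single JSON object:

```python
import json
import urllib.request

def render_prompt(client_config, transcript):
    """Fill a per-client prompt template loaded from that client's YAML.
    The key names here are hypothetical, not the project's schema."""
    notes_cfg = client_config["show_notes"]
    return notes_cfg["template"].format(tone=notes_cfg["tone"],
                                        transcript=transcript)

def generate_show_notes(client_config, transcript,
                        host="http://localhost:11434"):
    """Send the rendered prompt to a local Ollama server (non-streaming)."""
    body = json.dumps({
        "model": client_config.get("model", "llama3"),
        "prompt": render_prompt(client_config, transcript),
        "stream": False,
    }).encode()
    req = urllib.request.Request(f"{host}/api/generate", data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

Because the template lives in the client's config file, changing a podcast's tone is a one-line YAML edit rather than a code change.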
Multi-client architecture from day one. Each podcast gets its own YAML config defining output directories, prompt templates, publishing targets, and audio processing settings. This was a deliberate choice — we wanted this to be ready to sell as a service, not just a personal tool. The config system handles everything from FFmpeg normalization parameters to which platforms receive the final output.
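A per-client config along these lines might look like the following. Every key name and value here is an illustrative example, not the project's actual schema:

```yaml
# clients/morning-laughs.yaml — hypothetical example client
client: morning-laughs
output_dir: /data/out/morning-laughs
audio:
  target_lufs: -16
  sample_rate: 44100
show_notes:
  tone: playful
  template: |
    Write {tone} show notes for this transcript:
    {transcript}
publish:
  - youtube
  - spotify
```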
PostgreSQL for state tracking, not just flat files. Early versions wrote everything to disk. That works for one podcast, but when you are tracking transcription status, publishing history, and client billing across multiple shows, you need a real database. Railway hosts the PostgreSQL instance alongside the pipeline.
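A sketch of what the state table might look like; the column names are assumptions, and `mark_stage` expects a psycopg2-style connection object:

```python
EPISODES_DDL = """
CREATE TABLE IF NOT EXISTS episodes (
    id           SERIAL PRIMARY KEY,
    client_slug  TEXT NOT NULL,
    audio_path   TEXT NOT NULL,
    stage        TEXT NOT NULL DEFAULT 'ingested',
    published_at TIMESTAMPTZ,
    created_at   TIMESTAMPTZ NOT NULL DEFAULT now()
);
"""

def mark_stage(conn, episode_id, stage):
    """Record that an episode reached a given pipeline stage, so a crash
    mid-run can resume from the last completed stage instead of restarting."""
    with conn.cursor() as cur:
        cur.execute("UPDATE episodes SET stage = %s WHERE id = %s",
                    (stage, episode_id))
    conn.commit()
```

The `stage` column is what turns the retry logic from "re-run everything" into "resume where we left off" — the flat-file version had no equivalent.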
What We Learned
The v1.5 milestone added prospect outreach — we identified 4 potential podcast clients and built the tooling to onboard them. This was the transition from "interesting project" to "potential business." The pipeline can handle a new client in under an hour: create a YAML config, run a test episode through the pipeline, verify the outputs, and enable the cron schedule.
One thing we would do differently: we would have invested in a compliance checker earlier. YouTube flagged Episode 29 for a content policy violation that the automated pipeline did not catch. We built a compliance checker after that, but it should have been in the pipeline from the start. Automated systems need automated guardrails.
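Even a naive first pass would have helped here. A minimal sketch, assuming a per-platform list of flagged phrases (the matching logic and phrase list are both hypothetical; a real checker is platform-specific and far more nuanced):

```python
def check_compliance(transcript, flagged_phrases):
    """Return the flagged phrases that appear in the transcript
    (case-insensitive substring match). A first-line guardrail,
    not a substitute for reviewing platform policy."""
    lowered = transcript.lower()
    return [p for p in flagged_phrases if p.lower() in lowered]
```

Running a check like this before the publishing stage turns a platform strike into a pipeline halt, which is exactly the kind of automated guardrail an automated system needs.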
This project taught us that the gap between "works for me" and "works for anyone" is mostly configuration management and error handling — not the core algorithm. The AI and audio processing were solved relatively quickly. Making it robust enough that a non-technical podcast host could rely on it took the other 40 phases.
Read the full technical deep-dive on how we built a podcast pipeline that runs itself.
Want a similar pipeline for your podcast? Book a free call to discuss automation.