
How We Built a Podcast Pipeline That Runs Itself

From manual episode production to a fully automated pipeline — Whisper transcription, Ollama show notes, FFmpeg processing, and multi-platform publishing.

9 min read


A friend of ours runs a comedy podcast. Every week he'd record an episode, then spend three to four hours on post-production: trimming audio, writing show notes, generating social clips, uploading to Spotify and YouTube. He asked if we could help automate any of it.

We said we could probably automate all of it.

That was about six months ago. What started as a weekend favor turned into a 49-phase engineering project with 763 tests, multi-client support, and a pipeline that takes raw audio in one end and publishes finished episodes out the other. Here's how it works and what we learned building it.

The Architecture

The pipeline is a Python application that runs as a sequence of steps, each one feeding into the next. The core flow looks like this:

Audio Ingest → Whisper Transcription → Content Generation (Ollama) → FFmpeg Audio Processing → Multi-Platform Publishing

Each step is its own module in pipeline/steps/, orchestrated by a central runner. The runner maintains a context object that accumulates outputs as it moves through the chain:

class PipelineRunner:
    def __init__(self, config: Config, context: PipelineContext):
        self.config = config
        self.context = context
        self.steps = [
            IngestStep(),
            TranscriptionStep(),
            ContentStep(),
            AudioStep(),
            VideoStep(),
            PublishStep(),
        ]
 
    async def run(self):
        for step in self.steps:
            if not step.should_run(self.context):
                logger.info(f"Skipping {step.name} — not needed")
                continue
            await step.execute(self.config, self.context)

Each step has a should_run check so the pipeline can resume from where it left off if something fails. The context gets serialized to PostgreSQL between steps, so a network timeout on YouTube upload doesn't mean re-transcribing the whole episode.
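
The resume check itself is simple. Here's a sketch assuming a hypothetical completed_steps set on the context (the real schema persisted to Postgres may differ in the details):

```python
from dataclasses import dataclass, field

@dataclass
class PipelineContext:
    episode_id: str
    completed_steps: set = field(default_factory=set)  # hypothetical checkpoint field

class TranscriptionStep:
    name = "transcription"

    def should_run(self, context: PipelineContext) -> bool:
        # Skip steps a previous run already checkpointed.
        return self.name not in context.completed_steps

    def mark_done(self, context: PipelineContext) -> None:
        context.completed_steps.add(self.name)

ctx = PipelineContext(episode_id="ep-042")
step = TranscriptionStep()
ran_first_time = step.should_run(ctx)    # fresh run: step executes
step.mark_done(ctx)                      # recorded after the context is serialized
ran_second_time = step.should_run(ctx)   # resumed run: step is skipped
```

On a rerun, the loader hydrates completed_steps from the database and the runner falls straight through to the first unfinished step.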

Why Python Over Node

We went back and forth on this. Node would have been fine for the orchestration and API calls, but the audio processing story in Python is significantly better. Between pydub for audio manipulation, whisper bindings for transcription, and the ability to shell out to FFmpeg with well-typed wrappers, Python was the pragmatic choice.

The other factor: we were already running PyTorch with CUDA on an RTX 3070 for the Whisper model. Keeping everything in one runtime meant the transcription step could load the model once and reuse it across episodes, rather than spinning up a separate process.

FFmpeg: The Unglamorous Essential

We spend more time on FFmpeg command construction than we'd like to admit. It's the backbone of every audio and video operation in the pipeline, and getting the flags right matters more than it seems.

Here's a simplified version of how the pipeline builds clip commands:

def build_clip_command(
    source: Path,
    start: float,
    duration: float,
    output: Path,
    fade_duration: float = 0.5,
) -> list[str]:
    return [
        "ffmpeg", "-y",
        "-ss", str(start),
        "-i", str(source),
        "-t", str(duration),
        "-af", f"afade=t=in:d={fade_duration},afade=t=out:st={duration - fade_duration}:d={fade_duration}",
        "-c:a", "aac", "-b:a", "192k",
        str(output),
    ]

The real version handles video input too — vertical cropping for Instagram Reels, audio extraction from video sources, muxing processed audio back onto video clips. When a client uploads MP4 instead of WAV, the pipeline detects the format automatically and adjusts the processing chain.
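
A rough sketch of that branch, with illustrative helper names of our own. We're showing an extension check for brevity; a robust version would probe the container with ffprobe rather than trusting filenames:

```python
from pathlib import Path

VIDEO_EXTS = {".mp4", ".mov", ".mkv"}

def needs_audio_extraction(source: Path) -> bool:
    # Naive check: treat known video containers as needing an extract pass.
    return source.suffix.lower() in VIDEO_EXTS

def build_extract_command(source: Path, wav_out: Path) -> list[str]:
    # Pull the audio track out of a video upload as 48 kHz stereo WAV.
    return [
        "ffmpeg", "-y", "-i", str(source),
        "-vn",                  # drop the video stream
        "-ar", "48000", "-ac", "2",
        str(wav_out),
    ]

cmd = build_extract_command(Path("episode.mp4"), Path("episode.wav"))
```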

One lesson we keep relearning: the "boring" tooling problems (FFmpeg flags, file format detection, audio normalization) take more debugging time than the "interesting" AI problems. Whisper just works. Getting consistent audio levels across clips from different recording setups? That's where the hours go.
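
To make the leveling point concrete, this is the kind of FFmpeg invocation involved: a simplified single-pass loudnorm sketch targeting the podcast-typical -16 LUFS, not necessarily the exact flags in the pipeline (two-pass loudnorm is more accurate but needs a measurement run first):

```python
def build_loudnorm_command(source: str, output: str,
                           target_lufs: float = -16.0) -> list[str]:
    # Single-pass EBU R128 normalization: -1.5 dBTP true-peak ceiling,
    # loudness range of 11 LU, typical podcast delivery targets.
    return [
        "ffmpeg", "-y", "-i", source,
        "-af", f"loudnorm=I={target_lufs}:TP=-1.5:LRA=11",
        "-c:a", "aac", "-b:a", "192k",
        output,
    ]

cmd = build_loudnorm_command("raw.wav", "leveled.m4a")
```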

Show Notes That Don't Sound Like a Robot

The content generation step uses Ollama to produce show notes, episode descriptions, social media posts, and topic highlights from the transcript. The key insight was that generic prompts produce generic output. Each client has a voice persona configuration:

# clients/comedy-podcast.yaml
voice_persona:
  style: "irreverent comedy, self-deprecating humor"
  avoid: "corporate language, inspirational quotes"
  examples:
    - "Two guys discuss whether cereal is soup. No conclusion reached."
    - "An argument about the best gas station snack escalates predictably."

The persona gets injected into every content generation prompt, so the show notes for a comedy podcast don't read like a TED talk summary. This was one of the first things we had to fix when we started thinking about multi-client support — the original version had the podcast's humor hardcoded everywhere.
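
The injection itself is plain string assembly. A sketch with a hypothetical build_show_notes_prompt helper (the real prompt templates are longer):

```python
def build_show_notes_prompt(persona: dict, transcript: str) -> str:
    # Prepend the client's voice persona so the model writes in the
    # show's register instead of a generic summary tone.
    examples = "\n".join(f"- {e}" for e in persona.get("examples", []))
    return (
        f"Write show notes in this style: {persona['style']}.\n"
        f"Avoid: {persona['avoid']}.\n"
        f"Example descriptions:\n{examples}\n\n"
        f"Transcript:\n{transcript}"
    )

persona = {
    "style": "irreverent comedy, self-deprecating humor",
    "avoid": "corporate language, inspirational quotes",
    "examples": ["Two guys discuss whether cereal is soup. No conclusion reached."],
}
prompt = build_show_notes_prompt(persona, "transcript text here")
```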

The Multi-Client Problem

The single hardest architectural decision was making the pipeline configurable per client without rebuilding anything. The original version had the podcast name, censor word list, Dropbox paths, and YouTube credentials scattered across a dozen files.

We solved this with a YAML-based client config system. Each client gets a file in clients/, and a ClientConfig loader patches the global Config class at runtime.
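
A sketch of the patching, with a plain dict standing in for the parsed YAML (yaml.safe_load would do the file loading; the Config field names here are illustrative):

```python
from dataclasses import dataclass

@dataclass
class Config:
    # Global defaults; real fields include Dropbox paths and YouTube credentials.
    podcast_name: str = "default"
    dropbox_path: str = "/ingest"
    censor_words: tuple = ()

def apply_client_config(config: Config, client_data: dict) -> Config:
    # Overwrite only keys the Config actually defines; everything
    # else in the YAML is ignored by this sketch.
    for key, value in client_data.items():
        if hasattr(config, key):
            setattr(config, key, value)
    return config

cfg = apply_client_config(Config(), {
    "podcast_name": "comedy-podcast",
    "censor_words": ("word1", "word2"),
    "unknown_key": "ignored",
})
```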

The CLI accepts a --client flag, and everything downstream just reads from config. No conditional logic in the pipeline steps — they don't know or care which client is running.

What Automation Actually Looks Like

The happy path is beautiful: drop an audio file, run the command, come back in twenty minutes to a fully published episode with show notes, clips, and social posts.

The reality involves more babysitting than we'd like. Whisper occasionally hallucinates timestamps on quiet sections. Ollama sometimes generates show notes that miss the best segment of the episode. YouTube's API returns inscrutable errors when your OAuth token expires at 2 AM.

We've added retry logic with exponential backoff on the API calls, a compliance checker that flags potential copyright issues before publishing (learned that one the hard way after a YouTube strike), and dead-letter queues for failed uploads. But the pipeline still needs a human sanity check before publishing. Full autonomy is a goal, not a current reality.
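
The backoff wrapper is nothing exotic. A condensed sketch, with a flaky stand-in for the upload call:

```python
import asyncio
import random

async def with_retries(fn, attempts: int = 4, base_delay: float = 1.0):
    # Exponential backoff with jitter; re-raises after the last attempt.
    for attempt in range(attempts):
        try:
            return await fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            await asyncio.sleep(delay)

calls = 0

async def flaky_upload():
    # Fails twice, then succeeds, like a transient API error would.
    global calls
    calls += 1
    if calls < 3:
        raise RuntimeError("transient API error")
    return "published"

result = asyncio.run(with_retries(flaky_upload, base_delay=0.01))
```

Uploads that exhaust their retries land in the dead-letter queue for manual review instead of blocking the rest of the pipeline.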

The compliance checker was a direct response to getting a YouTube copyright strike on episode 29. Now every episode gets scanned for flagged content before the publish step runs. Sometimes the best features come from the worst days.

The Numbers

49 pipeline phases. 763 tests. Roughly twenty minutes from raw audio to published episode.

What We'd Change

If we started over, we'd separate the transcription service into its own deployable unit. Right now it runs in-process, which means the whole pipeline needs a GPU-capable host even though only one step uses the GPU. A transcription microservice that accepts audio and returns timestamped text would let the orchestrator run on a cheap VPS.

We'd also invest earlier in structured logging. We added it eventually, but the first few months of debugging production issues with print() statements were not our finest engineering hours.
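
The stdlib alone gets you most of the way. A minimal JSON formatter sketch (field names illustrative):

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    # One JSON object per line: trivially grep-able, trivially queryable.
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "step": getattr(record, "step", None),
            "message": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("pipeline")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# What a formatted record looks like:
line = JsonFormatter().format(
    logging.LogRecord("pipeline", logging.INFO, "pipeline.py", 0,
                      "upload complete", None, None)
)
```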

The pipeline works. It runs itself for one client and is ready for more. The engineering challenge is solved — now the hard part is finding podcasters who want to pay for it. But that's a different kind of problem, and one we're still figuring out.


Want the same automation for your podcast? See how our pipeline works or book a free call to discuss your show.
