Case Study
00:51:15 Kristin (Bixie): I think you're great. I like you a lot. I respect you, I love a woman in charge. But I would like to understand how this city has a fucking dragon?
00:51:22 Kristin (Bixie): And we're not calling that as an ally. Like, who is this just on your situation? We have a dragon?
00:51:30 Kristin (Bixie): fight with us, like, I just don't understand why we're not calling that before we call a group of mercenaries.
00:52:45 Connor Koblinski (DM/Grand Maven): our most powerful chess piece. This city has been protected by Jamun-Saar for centuries. If he were to leave this plain, we would be open to attack.
00:53:10 Kristin (Bixie): Listen, I'll talk to Jason. He's a dragon, and there's a seismic shifting of the earth beneath his feet with devils and fucking ghosts, and it's absolutely despicable down there, he's just sitting on his throne.
00:53:18 Kristin (Bixie): feet crossed? Like, I want to talk to this dragon while he's not fighting for his people.
00:53:35 Kristin (Bixie): I don't like any of that. All this bureaucracy, I'm not testifying. Let's just get the guy to come fight with us against the shit we just saw.
00:53:42 Connor Koblinski (DM/Grand Maven): She rolls a 19, she says, I can have you meet by... with him tomorrow, morning at 8 AM.
00:53:48 Kristin (Bixie): I'm available.
00:53:55 Lex (Eas): Bix, let me... Maybe I'll come with you, and just, like, watch you.
00:53:58 Lex (Eas): I'll be your muscle.
00:54:02 Kristin (Bixie): Sure. I love dragons. I think they're very powerful and very cool. Why is everyone, like, weirdly nervous all the time?
00:54:10 Kristin (Bixie): We're just gonna go talk to this guy and, like, get him to fight with us. It's... we have a valuable cause. The fucking ground is shifting.
I'm a professional Dungeon Master, and I'm always experimenting with fun, engaging recaps for my players. This is what happens when you point AI at a 3-hour recording of friends playing D&D and ask it to find the best 60 seconds.
The challenge: a 3-hour session is ~2,000 lines of cross-talk, laughter, rule debates, and snack breaks. Buried inside are 5-7 moments of genuine storytelling magic. A human editor finds them intuitively. An AI needs a map.
Building that map is what this project is about.
Each stage transforms the data.
Turn a messy video transcript into structured data
WEBVTT

1
00:00:01.000 --> 00:00:05.000
Connor Koblinski (DM): Alright, welcome back everyone. Let's do a quick recap.

2
00:00:06.000 --> 00:00:15.000
Connor Koblinski (DM): So last session, you all made it to the Astral Sea. You're on the remains of a dead god's skeleton, this massive floating corpse drifting through silver mist...
{
  "cueId": 2,
  "speaker": "Connor Koblinski (DM)",
  "text": "So last session, you all made it to the Astral Sea...",
  "startMs": 6000,
  "endMs": 15000,
  "isDM": true,
  "durationSec": 9
}
The parser doesn't rely on a hardcoded name. It analyzes speaking patterns across the full session: who speaks the most, who has the longest utterances, who speaks after long silences. The person matching that profile gets flagged as DM.
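That profiling pass can be sketched in a few lines. The weights and field names here are illustrative, not the production heuristic - the point is that the DM falls out of the statistics, not a config file:

```python
from collections import defaultdict

def guess_dm(cues, long_silence_ms=5000):
    """Pick the speaker whose pattern looks most DM-like:
    lots of total talk time, long monologues, and a habit of
    being the one who breaks long silences."""
    stats = defaultdict(lambda: {"talk_ms": 0, "longest_ms": 0, "after_silence": 0})
    prev_end = 0
    for cue in cues:
        s = stats[cue["speaker"]]
        dur = cue["endMs"] - cue["startMs"]
        s["talk_ms"] += dur
        s["longest_ms"] = max(s["longest_ms"], dur)
        if cue["startMs"] - prev_end >= long_silence_ms:
            s["after_silence"] += 1
        prev_end = cue["endMs"]

    # Illustrative weighting: monologues and silence-breaking count extra.
    def score(s):
        return s["talk_ms"] + 2 * s["longest_ms"] + 3000 * s["after_silence"]

    return max(stats, key=lambda speaker: score(stats[speaker]))
```

The same function works on either transcript format, since it only needs speakers and timestamps.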
Zoom generates two transcript types: transcript.vtt (with speaker labels) and cc.vtt (auto-caption, no speakers). The parser handles both, inferring speakers from timing patterns in the caption-only format.
Zoom labels show real names: "Kristin (Bixie)". The system maps these to canonical character names using a knowledge base, so downstream stages see "Bixie" not "Kristin."
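The mapping itself is simple; a sketch with a hypothetical slice of the knowledge base (the real one covers every player and NPC):

```python
import re

# Hypothetical knowledge-base slice: Zoom label -> canonical character.
CHARACTER_MAP = {
    "Kristin": "Bixie",
    "Lex": "Eas",
}

def canonical_speaker(zoom_label: str) -> str:
    """Map 'Kristin (Bixie)' or bare 'Kristin' to 'Bixie'.
    Unknown labels pass through unchanged."""
    # Prefer the parenthetical when Zoom already shows the character name.
    m = re.match(r"^(.*?)\s*\((.+)\)$", zoom_label)
    if m:
        return m.group(2)
    return CHARACTER_MAP.get(zoom_label, zoom_label)
```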
Analyze audio for energy spikes, silence patterns, and emotional peaks
ffmpeg scans the full audio track for loudness peaks. The system identifies the top 7 energy spikes, ranked by intensity. A spike right after a long silence is weighted higher - that's the "dramatic reveal" pattern.
Gemini 2.5 Flash listens to 3-minute audio clips around each peak. It returns structured data: laughter intensity, gasps, voice acting quality, table energy. This is one of the cheapest AI calls in the pipeline (~$0.02 per clip).
In D&D, the biggest moments follow the same audio pattern: the DM says something dramatic, the table goes quiet, then erupts. The system specifically looks for these silence-then-spike patterns because they almost always mark a pivotal story beat.
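The silence-then-spike weighting can be shown with a toy stand-in for the ffmpeg scan. The thresholds and the 2x boost are illustrative, not tuned values; `samples` is a mono float array:

```python
def energy_peaks(samples, window=100, silence_rms=0.05,
                 silence_windows=5, top_n=7):
    """Rank fixed windows by RMS energy, boosting spikes that follow
    a run of quiet windows. Returns (start_index, weighted_energy)
    pairs, loudest first."""
    rms = []
    for i in range(0, len(samples) - window + 1, window):
        chunk = samples[i:i + window]
        rms.append((sum(x * x for x in chunk) / window) ** 0.5)

    peaks = []
    quiet_run = 0
    for w, level in enumerate(rms):
        weight = 1.0
        if quiet_run >= silence_windows:
            weight = 2.0  # the "dramatic reveal" pattern: silence, then spike
        peaks.append((w * window, level * weight))
        quiet_run = quiet_run + 1 if level < silence_rms else 0
    return sorted(peaks, key=lambda p: -p[1])[:top_n]
```

In production the loudness curve comes from ffmpeg rather than raw sample math, but the ranking logic is the same shape.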
Out of 2,000 lines, identify the 5-7 that matter
The AI classifies every moment it finds into one of 8 types. Each type has different visual treatment downstream:
Every moment gets a four-beat emotional arc: Setup, Build, Peak, Payoff. This isn't decoration - it drives the Director's storyboard decisions in Phase 5. A moment without a clear arc gets ranked lower.
For the "Bixie Looting" moment, the AI wrote: "One player has a profound spiritual moment, another immediately starts looting. The punchline 'That's just economics' is quotable and perfectly in character." That's not just identification - it's editorial judgment about why the moment works.
Turn fantasy jargon into something an image AI can actually draw
Image generation AI has no idea what "Bazzoxan" is. If you send it the word raw, you get garbage. But "a dark basalt military outpost with scarred walls under gray-green clouds" gives it something it can actually render. This translation layer is what lets different AI models collaborate without corrupting each other's data.
A zero-cost safety net catches any raw proper nouns that slip past the translation layer. Simple string matching with smart article handling - "the Astral Sea" becomes "an infinite expanse of shimmering void" without doubling up articles. Zero API cost, prevents expensive image generation failures.
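A sketch of that safety net, with a hypothetical two-entry glossary (the real knowledge base is much larger). Storing the article as part of the term is what prevents the doubled-article bug:

```python
# Hypothetical glossary entries: lore term -> drawable description.
GLOSSARY = {
    "the Astral Sea": "an infinite expanse of shimmering void",
    "Bazzoxan": "a dark basalt military outpost with scarred walls",
}

def scrub_proper_nouns(prompt: str) -> str:
    """Replace raw lore terms with drawable descriptions.

    Longest terms first, so 'the Astral Sea' wins over any shorter
    overlap; the article travels with the term, so we never emit
    'the an infinite expanse of shimmering void'."""
    for term in sorted(GLOSSARY, key=len, reverse=True):
        prompt = prompt.replace(term, GLOSSARY[term])
    return prompt
```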
This entire stage runs on Claude Haiku in parallel with other analysis. It processes in under 14 seconds and adds functionally nothing to the bill. But without it, every downstream image generation call would be working with nonsense.
An AI director plans the storyboard, shot by shot
Click a sequence block to see what the Director planned
This is the most expensive AI call in the pipeline - and the most important. The Director decides what the audience sees, in what order, for how long. It needs to understand narrative pacing, comedic timing, visual composition, and emotional arcs simultaneously. Cheaper models produce technically correct but creatively flat storyboards.
The Director chooses a framing strategy based on moment type: Cold Open (start with action), Stakes-Then-Payoff (front-load tension), Character Showcase (one person's journey), Comedic Timing (setup-punchline), Crescendo Build (steady escalation). Each strategy changes the shot order and pacing.
After the Director returns its plan, 5 automated guardrails enforce constraints: cue order verification, DM fabrication limits, AI-generated line caps, structural validation, and table-talk QC. The Director is creative. The guardrails keep it honest.
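Two of those guardrails can be sketched in a few lines - the field names and the cap value are illustrative, not the production schema:

```python
def run_guardrails(storyboard, transcript_cue_ids, max_ai_lines=2):
    """Fabrication check + AI-line cap: every quoted line must trace
    back to a real transcript cue, and the Director only gets a small
    budget of invented bridge lines. Returns a list of violations;
    empty means the plan passes."""
    problems = []
    ai_lines = 0
    for seq in storyboard:
        if seq.get("source") == "ai_generated":
            ai_lines += 1
        elif seq["cueId"] not in transcript_cue_ids:
            problems.append(f"fabricated cueId {seq['cueId']}")
    if ai_lines > max_ai_lines:
        problems.append(f"{ai_lines} AI-generated lines (cap {max_ai_lines})")
    return problems
```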
Five layers of checking before a single pixel is drawn
Validates timing math, field presence, duration bounds. Every sequence type has required fields - dialogue needs a speaker, dice moments need a roll value. Hard pass/fail.
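The field check is the cheapest kind of gate - a lookup table, no AI. A sketch with two illustrative sequence types (the real schema covers all eight):

```python
# Illustrative required fields per sequence type.
REQUIRED = {
    "dialogue": {"speaker", "text", "durationMs"},
    "dice": {"roll", "durationMs"},
}

def validate_sequence(seq: dict) -> list:
    """Hard pass/fail: returns the sorted list of missing fields
    (empty list means the sequence is structurally valid)."""
    needed = REQUIRED.get(seq.get("type"), set())
    return sorted(needed - seq.keys())
```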
Reviews three dimensions: Framing & Pacing (does it hook in 2 seconds?), Character Fidelity (do portraits match character cards?), Scene Coherence (does the background fit the story?). Needs 2 of 3 to pass. Retries up to 3x.
Passed (2/3) — Real QC catch from Session 103: the AI confused the player with the character
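The 2-of-3 vote with a hard retry cap looks roughly like this - `reviewers` stands in for the three model calls (framing, fidelity, coherence), each returning pass/fail:

```python
def review_with_retries(storyboard, reviewers, max_retries=3, need=2):
    """Run the review dimensions; pass when `need` of them agree,
    retrying up to `max_retries` times before giving up."""
    for attempt in range(1, max_retries + 1):
        votes = sum(1 for check in reviewers if check(storyboard))
        if votes >= need:
            return {"passed": True, "votes": votes, "attempt": attempt}
    return {"passed": False, "votes": votes, "attempt": attempt}
```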
Seven-dimension scoring rubric. Each dimension gets 0-20 points. This is a report card, not a gate - it flags weaknesses for the human to review without blocking the pipeline.
Score: 105/140 (B) — Real scores from Session 103
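The rubric-to-grade mapping might look like this. Only the structure - seven dimensions, 0-20 points each - comes from the pipeline; the letter bands are my assumption for illustration:

```python
DIMENSIONS = 7
MAX_PER_DIM = 20  # 0-20 points each, 140 total

def letter_grade(total, bands=((0.9, "A"), (0.75, "B"), (0.6, "C"))):
    """Report-card grade, not a gate: flags weaknesses for human
    review without blocking the pipeline. Band cutoffs are assumed."""
    pct = total / (DIMENSIONS * MAX_PER_DIM)
    for cutoff, grade in bands:
        if pct >= cutoff:
            return grade
    return "D"
```

Under these assumed bands, Session 103's 105/140 (75%) lands exactly at the B cutoff.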
After images are generated, Gemini Vision inspects its own work. Three separate checks: portrait consistency (does the face match the character card?), background accuracy (does the setting match the scene?), and action frame coherence (do sequential frames look like the same character?).
The "Script Supervisor" - checks that visual details stay consistent across the entire moment. Does the same character look the same in sequence 1 and sequence 6? Does the background palette drift between shots in the same location?
Early in development, I gave image generation an unlimited retry budget. The QC would fail a frame, regenerate it, fail again, regenerate - 200+ images for a single sequence. Typical generation cost: $0.40. That run: $6+.
The fix wasn't "better AI." It was better rubrics and hard retry caps. AI quality control needs the same thing human quality control needs: clear criteria, not vibes.
Two of the five QC layers cost literally nothing - they're rules-based string matching and field validation. The other three cost a combined $0.012. The most expensive single image generation costs more than all five QC layers combined. Prevention is cheaper than regeneration, every time.
Each QC dimension has specific, measurable criteria. "Is the character fidelity good?" is a bad check. "Does the portrait description match the canonical character card within 3 key visual attributes (hair color, skin tone, signature items)?" is a good one. If you can't explain what "good" means in specific terms, you can't explain it to an AI either.
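Turning that "good check" into code makes the difference obvious - the attribute names below are illustrative, not the pipeline's exact card schema:

```python
KEY_ATTRIBUTES = ("hair_color", "skin_tone", "signature_item")

def fidelity_ok(portrait: dict, character_card: dict, min_matches=3) -> bool:
    """Measurable criteria instead of a vibe: the portrait must match
    the canonical character card on the key visual attributes."""
    matches = sum(
        1 for attr in KEY_ATTRIBUTES
        if portrait.get(attr) == character_card.get(attr)
    )
    return matches >= min_matches
```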
Decide what you hear: full performance, a quick bark, or just text
Japanese RPGs solved this exact problem decades ago. You can't voice every line in a 60-hour game, so you tier them: important scenes get full voice, supporting dialogue gets a bark (a short vocalization), and exposition gets text with a character beep. The same logic applies to a 60-second Short with 8-12 dialogue lines.
Full Voice goes to: climactic lines, emotional peaks, great punchlines, dramatic NPC voices. Bark goes to: supporting dialogue that benefits from hearing the voice. Text Only is reserved for AI-generated bridge text where no audio recording exists. Hard rule: if real recorded audio exists, it's never text-only.
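The tiering decision reduces to a short function. Field names here are illustrative; the one rule taken directly from the pipeline is that recorded audio is never downgraded to text:

```python
def voice_tier(line: dict) -> str:
    """JRPG-style dialogue tiering: full performance, bark, or text.
    The hard rule runs first - real recorded audio is never text-only."""
    if not line.get("has_recorded_audio"):
        return "text"   # AI-generated bridge text only
    if line.get("is_peak") or line.get("is_punchline") or line.get("npc_voice"):
        return "full"   # climactic lines get the full performance
    return "bark"       # supporting dialogue: a short vocalization
```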
Turn descriptions into 16-bit pixel art portraits, backgrounds, and action frames
"Lithe half-elf woman with white-blonde hair swept up in a high bun, pointed ears. Fair skin, sharp angular cheekbones, sly confident smirk. Dark leather vest over white shirt..."
Each character gets a canonical 512x512 "token" portrait. Mood variants and mouth animation frames are generated using the token as a reference image, ensuring visual consistency. If bixie_triumphant already exists in the knowledge base, the system skips generation entirely.
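The reuse-first lookup is the simplest part of the stage, and the one that saves the most money. A sketch (the cache shape and `generate` callback are illustrative):

```python
def get_or_generate(asset_id: str, cache: dict, generate):
    """Check the knowledge base before paying for generation.
    Returns (asset_path, was_generated)."""
    if asset_id in cache:
        return cache[asset_id], False   # e.g. bixie_triumphant already exists
    cache[asset_id] = generate(asset_id)  # the expensive image call
    return cache[asset_id], True
```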
Image generation dominates the per-video cost. The entire analysis pipeline (parsing, highlights, summary, director, QC) costs ~$0.11; image generation costs ~$0.45-$0.57 - roughly 80-85% of the total. That's why asset reuse matters - if a character portrait already exists, we skip the most expensive call entirely.
16-bit SNES-era aesthetic: think Octopath Traveler, Final Fantasy VI, Chrono Trigger. Every image generation prompt starts with the same style prefix to maintain consistency across sessions, characters, and backgrounds.
Parts become a whole - HTML animation, then MP4 video
Each scene is first assembled as a self-contained HTML page with CSS animations, dialogue timing, and embedded audio. This gives a live preview before committing to video export. Puppeteer then captures frames from the browser, and ffmpeg encodes them into the final MP4.
For manual editing, the system also exports a Premiere Pro-compatible XML with all clips on separate tracks. Room tone goes on an independent A2 track. This lets a human editor make final adjustments without re-running the pipeline.
How I think about building AI-powered creative systems
I told Claude: you are the CTO of this project. Research the wise answer to technical questions. Execute solutions that drive our bottom-line metrics.
It found Apple's local audio analysis models on my machine. I didn't even know I had access to them. That discovery saved real money by running processes locally instead of hitting paid APIs for every audio analysis call.
The catch: this only works if you slow down and actually understand what it's proposing. You're the CEO - you set the vision, you approve the architecture, you make the judgment calls about what matters. But you don't have to be the one who knows the fastest way to extract audio peaks from a 3-hour file. That's the CTO's job.
When Hollywood adapts a book into a movie, they don't hand it to one person. There's a director, a script supervisor, a line editor, a lore manager, a QC team. I built my AI pipeline the same way.
When you're stringing AI calls together, every handoff is a potential break point.
Here's a real example: the Director reorders dialogue lines for dramatic effect. The old system used line numbers to match audio to dialogue. Line numbers shifted when the Director moved things around. Audio beds broke. Video and audio desynced by 81 seconds.
The fix: immutable IDs from the original transcript (cueIds) as the single source of truth across all 13 pipeline stages. The Director can reorder whatever it wants - the cueId never changes.
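The join itself is tiny once the key is immutable - a sketch of attaching audio by cueId rather than by position (field names are illustrative):

```python
def attach_audio(storyboard, audio_by_cue):
    """Join audio beds to dialogue by immutable cueId, not list index.
    The Director can reorder sequences freely; the join key never shifts."""
    return [
        {**seq, "audio": audio_by_cue.get(seq["cueId"])}
        for seq in storyboard
    ]
```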
Same principle with proper noun translation. Gemini can't draw "Bazzoxan." But it can draw "a dark basalt military outpost with scarred walls." The translation layer is the same kind of contract at the handoff: each model receives data in a form it can actually use, so none of them corrupts the others' work.
AI gets you 80% of the way there. The last 20% is the entire project.
It always feels like one more data point will close the gap. If only the AI understood that the laughter was loud vs. quiet. If only it could detect sarcasm. If only it knew the emotional history between these two characters.
You can think of a thousand data points. But what you're evaluating as a human - especially around art and communication with other people - is so minute and fine-tuned that it doesn't matter if you have every signal. You still benefit from a human interjecting 10-20% of the vision.
So I stopped building an automator and started building a tool that presents information to a director and gets quick decisions back.
What I'd do differently and why it matters
I started with text because it felt lightweight. I got a great animation engine out of it. Then I went back to add audio - and had to restructure everything.
What I thought I was doing: building incrementally. Text first, audio later, keep it simple.
What actually happened: the text-first architecture made assumptions about timing, sequencing, and data flow that audio broke. The animation engine assumed character-count-based timing. Audio has real durations. The two systems disagreed on how long every single line should be on screen.
If I could redo it, I'd do a lightweight test of my most complex data problem first - even a rough prototype of the audio integration - before building the visual pipeline around text-only assumptions.
I added energy analysis. Then emotion detection. Then laughter intensity. Then voice acting quality scores. Each one helped - a little.
But the moments the AI consistently missed were the ones that required understanding relationships between players, inside jokes, callbacks to earlier sessions. Things that are technically "data" but practically impossible to capture in a pipeline.
There's a seductive logic: if only I add one more signal, the AI will finally get it. It won't. The gap between "good enough to be useful" and "good enough to replace a human editor" is not a data problem. It's a judgment problem. And judgment is what humans are for.
Early on, my quality control was basically "check if this is good." That's not a rubric. That's a vibe.
The $6 story: I gave image generation an unlimited retry budget. QC would fail a frame, regenerate, fail again, regenerate - 200+ images for one sequence. Typical cost: $0.40. That run: $6+. And the output wasn't even better, because the QC criteria were vague enough that "passing" was essentially random.
The fix: specific, measurable rubrics (seven-dimension scoring), hard retry caps, layered checks at different cost tiers. Two of my five QC layers cost literally zero dollars - they're rules-based checks that don't need AI at all.
Every AI call should justify its existence.
Originally, the system analyzed all 7 highlighted moments upfront with full audio processing. That's ~$2.29 each - about $16 total - before a human even looked at the results. Most of those moments get rejected.
The fix: deferred processing. The system does lightweight analysis on everything, presents the results to a human, and only runs the expensive audio pipeline on the single moment the human selects. That one change saved ~85% of the audio budget.
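The arithmetic behind that claim, as a minimal cost model (the per-moment figure is the one from this project; everything else is just the upfront-vs-deferred comparison):

```python
COST_PER_MOMENT = 2.29  # full audio processing per moment, USD

def audio_cost(n_moments: int, deferred: bool) -> float:
    """Upfront: pay for every candidate moment. Deferred: lightweight
    pass on all, full audio pipeline only on the one the human picks."""
    return COST_PER_MOMENT * (1 if deferred else n_moments)

# With 7 candidate moments, deferring saves 6/7 of the audio budget.
savings = 1 - audio_cost(7, deferred=True) / audio_cost(7, deferred=False)
```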
Model tiering matters too. Haiku at $0.001 per call handles validation. Sonnet at $0.005 handles creative review. Opus handles the Director call because that's the one place where creative judgment actually moves the needle. Using Opus for everything would cost ~$15 per video instead of $0.60.
I'm Connor Koblinski. I'm an educator, an AI trainer, and a professional Dungeon Master. I've built curriculum for Ramp, Snapchat, Niantic, and Solana. I teach people to use AI for things they couldn't do before - not to automate things they already know how to do.
This project started because I wanted better recaps for my D&D players. It turned into a 22,000-line AI orchestration pipeline that taught me more about data engineering, creative AI, and system design than any course or certification could.
The whole thing is a little silly. Pixel art animations of tabletop role-playing games. But underneath it is real science: multi-model orchestration, structured data pipelines, rubric-based quality control, cost optimization, and the hard problem of turning unstructured human interaction into meaningful data.
I believe AI should expand what you can do, not just speed up what you already do. This project is proof of that philosophy.