Case Study
00:51:15 Kristin (Bixie): I think you're great. I like you a lot. I respect you, I love a woman in charge. But I would like to understand how this city has a fucking dragon?
00:51:22 Kristin (Bixie): And we're not calling that as an ally. Like, who is this just on your situation? We have a dragon?
00:51:30 Kristin (Bixie): fight with us, like, I just don't understand why we're not calling that before we call a group of mercenaries.
00:52:45 Connor Koblinski (DM/Grand Maven): our most powerful chess piece. This city has been protected by Jamun-Saar for centuries. If he were to leave this plain, we would be open to attack.
00:53:10 Kristin (Bixie): Listen, I'll talk to Jason. He's a dragon, and there's a seismic shifting of the earth beneath his feet with devils and fucking ghosts, and it's absolutely despicable down there, he's just sitting on his throne.
00:53:18 Kristin (Bixie): feet crossed? Like, I want to talk to this dragon while he's not fighting for his people.
00:53:35 Kristin (Bixie): I don't like any of that. All this bureaucracy, I'm not testifying. Let's just get the guy to come fight with us against the shit we just saw.
00:53:42 Connor Koblinski (DM/Grand Maven): She rolls a 19, she says, I can have you meet by... with him tomorrow, morning at 8 AM.
00:53:48 Kristin (Bixie): I'm available.
00:53:55 Lex (Eas): Bix, let me... Maybe I'll come with you, and just, like, watch you.
00:53:58 Lex (Eas): I'll be your muscle.
00:54:02 Kristin (Bixie): Sure. I love dragons. I think they're very powerful and very cool. Why is everyone, like, weirdly nervous all the time?
00:54:10 Kristin (Bixie): We're just gonna go talk to this guy and, like, get him to fight with us. It's... we have a valuable cause. The fucking ground is shifting.
I'm a professional Dungeon Master, and I'm always experimenting with fun, engaging recaps for my players. This is what happens when you point AI at a 3-hour recording of friends playing D&D and ask it to find the best 60 seconds.
The challenge: a 3-hour session is ~2,000 lines of cross-talk, laughter, rule debates, and snack breaks. Buried inside are 5-7 moments of genuine storytelling magic. A human editor finds them intuitively. An AI needs a map.
Building that map is what this project is about.
Each stage transforms the data.
Turn a messy video transcript into structured data
WEBVTT

1
00:00:01.000 --> 00:00:05.000
Connor Koblinski (DM): Alright, welcome back everyone. Let's do a quick recap.

2
00:00:06.000 --> 00:00:15.000
Connor Koblinski (DM): So last session, you all made it to the Astral Sea. You're on the remains of a dead god's skeleton, this massive floating corpse drifting through silver mist...
{
  "cueId": 2,
  "speaker": "Connor Koblinski (DM)",
  "text": "So last session, you all made it to the Astral Sea...",
  "startMs": 6000,
  "endMs": 15000,
  "isDM": true,
  "durationSec": 9
}
The parser doesn't rely on a hardcoded name. It analyzes speaking patterns across the full session: who speaks the most, who has the longest utterances, who speaks after long silences. The person matching that profile gets flagged as DM.
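That profiling pass can be sketched in a few lines. The weights and field names here are illustrative, not the production heuristic - the point is that the DM falls out of the statistics, not a config file:

```python
from collections import defaultdict

def guess_dm(cues, long_silence_ms=5000):
    """Pick the speaker whose pattern looks most DM-like:
    lots of total talk time, long monologues, and a habit of
    being the one who breaks long silences."""
    stats = defaultdict(lambda: {"talk_ms": 0, "longest_ms": 0, "after_silence": 0})
    prev_end = 0
    for cue in cues:
        s = stats[cue["speaker"]]
        dur = cue["endMs"] - cue["startMs"]
        s["talk_ms"] += dur
        s["longest_ms"] = max(s["longest_ms"], dur)
        if cue["startMs"] - prev_end >= long_silence_ms:
            s["after_silence"] += 1
        prev_end = cue["endMs"]

    # Illustrative weighting: monologues and silence-breaking count extra.
    def score(s):
        return s["talk_ms"] + 2 * s["longest_ms"] + 3000 * s["after_silence"]

    return max(stats, key=lambda speaker: score(stats[speaker]))
```

The same function works on either transcript format, since it only needs speakers and timestamps.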
Zoom generates two transcript types: transcript.vtt (with speaker labels) and cc.vtt (auto-caption, no speakers). The parser handles both, inferring speakers from timing patterns in the caption-only format.
Zoom labels show real names: "Kristin (Bixie)". The system maps these to canonical character names using a knowledge base, so downstream stages see "Bixie" not "Kristin."
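The mapping itself is simple; a sketch with a hypothetical slice of the knowledge base (the real one covers every player and NPC):

```python
import re

# Hypothetical knowledge-base slice: Zoom label -> canonical character.
CHARACTER_MAP = {
    "Kristin": "Bixie",
    "Lex": "Eas",
}

def canonical_speaker(zoom_label: str) -> str:
    """Map 'Kristin (Bixie)' or bare 'Kristin' to 'Bixie'.
    Unknown labels pass through unchanged."""
    # Prefer the parenthetical when Zoom already shows the character name.
    m = re.match(r"^(.*?)\s*\((.+)\)$", zoom_label)
    if m:
        return m.group(2)
    return CHARACTER_MAP.get(zoom_label, zoom_label)
```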
Analyze audio for energy spikes, silence patterns, and emotional peaks
ffmpeg scans the full audio track for loudness peaks. The system identifies the top 7 energy spikes, ranked by intensity. A spike right after a long silence is weighted higher - that's the "dramatic reveal" pattern.
Gemini 2.5 Flash listens to 3-minute audio clips around each peak. It returns structured data: laughter intensity, gasps, voice acting quality, table energy. This is one of the cheapest AI calls in the pipeline (~$0.02 per clip).
In D&D, the biggest moments follow the same audio pattern: the DM says something dramatic, the table goes quiet, then erupts. The system specifically looks for these silence-then-spike patterns because they almost always mark a pivotal story beat.
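The silence-then-spike weighting can be shown with a toy stand-in for the ffmpeg scan. The thresholds and the 2x boost are illustrative, not tuned values; `samples` is a mono float array:

```python
def energy_peaks(samples, window=100, silence_rms=0.05,
                 silence_windows=5, top_n=7):
    """Rank fixed windows by RMS energy, boosting spikes that follow
    a run of quiet windows. Returns (start_index, weighted_energy)
    pairs, loudest first."""
    rms = []
    for i in range(0, len(samples) - window + 1, window):
        chunk = samples[i:i + window]
        rms.append((sum(x * x for x in chunk) / window) ** 0.5)

    peaks = []
    quiet_run = 0
    for w, level in enumerate(rms):
        weight = 1.0
        if quiet_run >= silence_windows:
            weight = 2.0  # the "dramatic reveal" pattern: silence, then spike
        peaks.append((w * window, level * weight))
        quiet_run = quiet_run + 1 if level < silence_rms else 0
    return sorted(peaks, key=lambda p: -p[1])[:top_n]
```

In production the loudness curve comes from ffmpeg rather than raw sample math, but the ranking logic is the same shape.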
Out of 2,000 lines, identify the 5-7 that matter
The AI classifies every moment it finds into one of 8 types. Each type has different visual treatment downstream:
Every moment gets a four-beat emotional arc: Setup, Build, Peak, Payoff. This isn't decoration - it drives the Director's storyboard decisions in Phase 5. A moment without a clear arc gets ranked lower.
For the "Bixie Looting" moment, the AI wrote: "One player has a profound spiritual moment, another immediately starts looting. The punchline 'That's just economics' is quotable and perfectly in character." That's not just identification - it's editorial judgment about why the moment works.
Turn fantasy jargon into something an image AI can actually draw
Image generation AI has no idea what "Bazzoxan" is. If you send it the word raw, you get garbage. But "a dark basalt military outpost with scarred walls under gray-green clouds" gives it something it can actually render. This translation layer is what lets different AI models collaborate without corrupting each other's data.
A zero-cost safety net catches any raw proper nouns that slip past the translation layer. Simple string matching with smart article handling - "the Astral Sea" becomes "an infinite expanse of shimmering void" without doubling up articles. Zero API cost, prevents expensive image generation failures.
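A sketch of that safety net, with a hypothetical two-entry glossary (the real knowledge base is much larger). Storing the article as part of the term is what prevents the doubled-article bug:

```python
# Hypothetical glossary entries: lore term -> drawable description.
GLOSSARY = {
    "the Astral Sea": "an infinite expanse of shimmering void",
    "Bazzoxan": "a dark basalt military outpost with scarred walls",
}

def scrub_proper_nouns(prompt: str) -> str:
    """Replace raw lore terms with drawable descriptions.

    Longest terms first, so 'the Astral Sea' wins over any shorter
    overlap; the article travels with the term, so we never emit
    'the an infinite expanse of shimmering void'."""
    for term in sorted(GLOSSARY, key=len, reverse=True):
        prompt = prompt.replace(term, GLOSSARY[term])
    return prompt
```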
This entire stage runs on Claude Haiku in parallel with other analysis. It processes in under 14 seconds and adds functionally nothing to the bill. But without it, every downstream image generation call would be working with nonsense.
An AI director plans the storyboard, shot by shot
Click a sequence block to see what the Director planned
This is the most expensive AI call in the pipeline - and the most important. The Director decides what the audience sees, in what order, for how long. It needs to understand narrative pacing, comedic timing, visual composition, and emotional arcs simultaneously. Cheaper models produce technically correct but creatively flat storyboards.
The Director chooses a framing strategy based on moment type: Cold Open (start with action), Stakes-Then-Payoff (front-load tension), Character Showcase (one person's journey), Comedic Timing (setup-punchline), Crescendo Build (steady escalation). Each strategy changes the shot order and pacing.
After the Director returns its plan, 5 automated guardrails enforce constraints: cue order verification, DM fabrication limits, AI-generated line caps, structural validation, and table-talk QC. The Director is creative. The guardrails keep it honest.
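Two of those guardrails can be sketched in a few lines - the field names and the cap value are illustrative, not the production schema:

```python
def run_guardrails(storyboard, transcript_cue_ids, max_ai_lines=2):
    """Fabrication check + AI-line cap: every quoted line must trace
    back to a real transcript cue, and the Director only gets a small
    budget of invented bridge lines. Returns a list of violations;
    empty means the plan passes."""
    problems = []
    ai_lines = 0
    for seq in storyboard:
        if seq.get("source") == "ai_generated":
            ai_lines += 1
        elif seq["cueId"] not in transcript_cue_ids:
            problems.append(f"fabricated cueId {seq['cueId']}")
    if ai_lines > max_ai_lines:
        problems.append(f"{ai_lines} AI-generated lines (cap {max_ai_lines})")
    return problems
```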
Five layers of checking before a single pixel is drawn
Validates timing math, field presence, duration bounds. Every sequence type has required fields - dialogue needs a speaker, dice moments need a roll value. Hard pass/fail.
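The field check is the cheapest kind of gate - a lookup table, no AI. A sketch with two illustrative sequence types (the real schema covers all eight):

```python
# Illustrative required fields per sequence type.
REQUIRED = {
    "dialogue": {"speaker", "text", "durationMs"},
    "dice": {"roll", "durationMs"},
}

def validate_sequence(seq: dict) -> list:
    """Hard pass/fail: returns the sorted list of missing fields
    (empty list means the sequence is structurally valid)."""
    needed = REQUIRED.get(seq.get("type"), set())
    return sorted(needed - seq.keys())
```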
Reviews three dimensions: Framing & Pacing (does it hook in 2 seconds?), Character Fidelity (do portraits match character cards?), Scene Coherence (does the background fit the story?). Needs 2 of 3 to pass. Retries up to 3x.
Passed (2/3) — Real QC catch from Session 103: the AI confused the player with the character
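The 2-of-3 vote with a hard retry cap looks roughly like this - `reviewers` stands in for the three model calls (framing, fidelity, coherence), each returning pass/fail:

```python
def review_with_retries(storyboard, reviewers, max_retries=3, need=2):
    """Run the review dimensions; pass when `need` of them agree,
    retrying up to `max_retries` times before giving up."""
    for attempt in range(1, max_retries + 1):
        votes = sum(1 for check in reviewers if check(storyboard))
        if votes >= need:
            return {"passed": True, "votes": votes, "attempt": attempt}
    return {"passed": False, "votes": votes, "attempt": attempt}
```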
Seven-dimension scoring rubric. Each dimension gets 0-20 points. This is a report card, not a gate - it flags weaknesses for the human to review without blocking the pipeline.
Score: 105/140 (B) — Real scores from Session 103
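The rubric-to-grade mapping might look like this. Only the structure - seven dimensions, 0-20 points each - comes from the pipeline; the letter bands are my assumption for illustration:

```python
DIMENSIONS = 7
MAX_PER_DIM = 20  # 0-20 points each, 140 total

def letter_grade(total, bands=((0.9, "A"), (0.75, "B"), (0.6, "C"))):
    """Report-card grade, not a gate: flags weaknesses for human
    review without blocking the pipeline. Band cutoffs are assumed."""
    pct = total / (DIMENSIONS * MAX_PER_DIM)
    for cutoff, grade in bands:
        if pct >= cutoff:
            return grade
    return "D"
```

Under these assumed bands, Session 103's 105/140 (75%) lands exactly at the B cutoff.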
After images are generated, Gemini Vision inspects its own work. Three separate checks: portrait consistency (does the face match the character card?), background accuracy (does the setting match the scene?), and action frame coherence (do sequential frames look like the same character?).
The "Script Supervisor" - checks that visual details stay consistent across the entire moment. Does the same character look the same in sequence 1 and sequence 6? Does the background palette drift between shots in the same location?
Early in development, I gave image generation an unlimited retry budget. The QC would fail a frame, regenerate it, fail again, regenerate - 200+ images for a single sequence. Typical generation cost: $0.40. That run: $6+.
The fix wasn't "better AI." It was better rubrics and hard retry caps. AI quality control needs the same thing human quality control needs: clear criteria, not vibes.
Two of the five QC layers cost literally nothing - they're rules-based string matching and field validation. The other three cost a combined $0.012. The most expensive single image generation costs more than all five QC layers combined. Prevention is cheaper than regeneration, every time.
Each QC dimension has specific, measurable criteria. "Is the character fidelity good?" is a bad check. "Does the portrait description match the canonical character card within 3 key visual attributes (hair color, skin tone, signature items)?" is a good one. If you can't explain what "good" means in specific terms, you can't explain it to an AI either.
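Turning that "good check" into code makes the difference obvious - the attribute names below are illustrative, not the pipeline's exact card schema:

```python
KEY_ATTRIBUTES = ("hair_color", "skin_tone", "signature_item")

def fidelity_ok(portrait: dict, character_card: dict, min_matches=3) -> bool:
    """Measurable criteria instead of a vibe: the portrait must match
    the canonical character card on the key visual attributes."""
    matches = sum(
        1 for attr in KEY_ATTRIBUTES
        if portrait.get(attr) == character_card.get(attr)
    )
    return matches >= min_matches
```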
Decide what you hear: full performance, a quick bark, or just text
Japanese RPGs solved this exact problem decades ago. You can't voice every line in a 60-hour game, so you tier them: important scenes get full voice, supporting dialogue gets a bark (a short vocalization), and exposition gets text with a character beep. The same logic applies to a 60-second Short with 8-12 dialogue lines.
Full Voice goes to: climactic lines, emotional peaks, great punchlines, dramatic NPC voices. Bark goes to: supporting dialogue that benefits from hearing the voice. Text Only is reserved for AI-generated bridge text where no audio recording exists. Hard rule: if real recorded audio exists, it's never text-only.
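The tiering decision reduces to a short function. Field names here are illustrative; the one rule taken directly from the pipeline is that recorded audio is never downgraded to text:

```python
def voice_tier(line: dict) -> str:
    """JRPG-style dialogue tiering: full performance, bark, or text.
    The hard rule runs first - real recorded audio is never text-only."""
    if not line.get("has_recorded_audio"):
        return "text"   # AI-generated bridge text only
    if line.get("is_peak") or line.get("is_punchline") or line.get("npc_voice"):
        return "full"   # climactic lines get the full performance
    return "bark"       # supporting dialogue: a short vocalization
```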
Turn descriptions into 16-bit pixel art portraits, backgrounds, and action frames
"Lithe half-elf woman with white-blonde hair swept up in a high bun, pointed ears. Fair skin, sharp angular cheekbones, sly confident smirk. Dark leather vest over white shirt..."
Each character gets a canonical 512x512 "token" portrait. Mood variants and mouth animation frames are generated using the token as a reference image, ensuring visual consistency. If bixie_triumphant already exists in the knowledge base, the system skips generation entirely.
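The reuse-first lookup is the simplest part of the stage, and the one that saves the most money. A sketch (the cache shape and `generate` callback are illustrative):

```python
def get_or_generate(asset_id: str, cache: dict, generate):
    """Check the knowledge base before paying for generation.
    Returns (asset_path, was_generated)."""
    if asset_id in cache:
        return cache[asset_id], False   # e.g. bixie_triumphant already exists
    cache[asset_id] = generate(asset_id)  # the expensive image call
    return cache[asset_id], True
```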
Image generation dominates the per-video cost. The entire analysis pipeline (parsing, highlights, summary, director, QC) costs ~$0.11; image generation costs ~$0.45-$0.57 - roughly 80-85% of the total. That's why asset reuse matters - if a character portrait already exists, we skip the most expensive call entirely.
16-bit SNES-era aesthetic: think Octopath Traveler, Final Fantasy VI, Chrono Trigger. Every image generation prompt starts with the same style prefix to maintain consistency across sessions, characters, and backgrounds.
Parts become a whole - HTML animation, then MP4 video
Each scene is first assembled as a self-contained HTML page with CSS animations, dialogue timing, and embedded audio. This gives a live preview before committing to video export. Puppeteer then captures frames from the browser, and ffmpeg encodes them into the final MP4.
For manual editing, the system also exports a Premiere Pro-compatible XML with all clips on separate tracks. Room tone goes on an independent A2 track. This lets a human editor make final adjustments without re-running the pipeline.
How I think about building AI-powered creative systems
I told Claude: you are the CTO of this project. Research the wise answer to technical questions. Execute solutions that drive our bottom-line metrics.
It found Apple's local audio analysis models on my machine. I didn't even know I had access to them. That discovery saved real money by running processes locally instead of hitting paid APIs for every audio analysis call.
The catch: this only works if you slow down and actually understand what it's proposing. You're the CEO - you set the vision, you approve the architecture, you make the judgment calls about what matters. But you don't have to be the one who knows the fastest way to extract audio peaks from a 3-hour file. That's the CTO's job.
When Hollywood adapts a book into a movie, they don't hand it to one person. There's a director, a script supervisor, a line editor, a lore manager, a QC team. I built my AI pipeline the same way.
When you're stringing AI calls together, every handoff is a potential break point.
Here's a real example: the Director reorders dialogue lines for dramatic effect. The old system used line numbers to match audio to dialogue. Line numbers shifted when the Director moved things around. Audio beds broke. Video and audio desynced by 81 seconds.
The fix: immutable IDs from the original transcript (cueIds) as the single source of truth across all 13 pipeline stages. The Director can reorder whatever it wants - the cueId never changes.
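The join itself is tiny once the key is immutable - a sketch of attaching audio by cueId rather than by position (field names are illustrative):

```python
def attach_audio(storyboard, audio_by_cue):
    """Join audio beds to dialogue by immutable cueId, not list index.
    The Director can reorder sequences freely; the join key never shifts."""
    return [
        {**seq, "audio": audio_by_cue.get(seq["cueId"])}
        for seq in storyboard
    ]
```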
Same principle with proper noun translation. Gemini can't draw "Bazzoxan." But it can draw "a dark basalt military outpost with scarred walls." The translation layer is the same kind of contract at the handoff: each model receives data in a form it can actually use, so none of them corrupts the others' work.
AI gets you 80% of the way there. The last 20% is the entire project.
It always feels like one more data point will close the gap. If only the AI understood that the laughter was loud vs. quiet. If only it could detect sarcasm. If only it knew the emotional history between these two characters.
You can think of a thousand data points. But what you're evaluating as a human - especially around art and communication with other people - is so minute and fine-tuned that it doesn't matter if you have every signal. You still benefit from a human interjecting 10-20% of the vision.
So I stopped building an automator and started building a tool that presents information to a director and gets quick decisions back.
What I'd do differently and why it matters
I started with text because it felt lightweight. I got a great animation engine out of it. Then I went back to add audio - and had to restructure everything.
What I thought I was doing: building incrementally. Text first, audio later, keep it simple.
What actually happened: the text-first architecture made assumptions about timing, sequencing, and data flow that audio broke. The animation engine assumed character-count-based timing. Audio has real durations. The two systems disagreed on how long every single line should be on screen.
If I could redo it, I'd do a lightweight test of my most complex data problem first - even a rough prototype of the audio integration - before building the visual pipeline around text-only assumptions.
I added energy analysis. Then emotion detection. Then laughter intensity. Then voice acting quality scores. Each one helped - a little.
But the moments the AI consistently missed were the ones that required understanding relationships between players, inside jokes, callbacks to earlier sessions. Things that are technically "data" but practically impossible to capture in a pipeline.
There's a seductive logic: if only I add one more signal, the AI will finally get it. It won't. The gap between "good enough to be useful" and "good enough to replace a human editor" is not a data problem. It's a judgment problem. And judgment is what humans are for.
Early on, my quality control was basically "check if this is good." That's not a rubric. That's a vibe.
The $6 story: I gave image generation an unlimited retry budget. QC would fail a frame, regenerate, fail again, regenerate - 200+ images for one sequence. Typical cost: $0.40. That run: $6+. And the output wasn't even better, because the QC criteria were vague enough that "passing" was essentially random.
The fix: specific, measurable rubrics (seven-dimension scoring), hard retry caps, layered checks at different cost tiers. Two of my five QC layers cost literally zero dollars - they're rules-based checks that don't need AI at all.
Every AI call should justify its existence.
Originally, the system analyzed all 7 highlighted moments upfront with full audio processing. That's ~$2.29 each - about $16 total - before a human even looked at the results. Most of those moments get rejected.
The fix: deferred processing. The system does lightweight analysis on everything, presents the results to a human, and only runs the expensive audio pipeline on the single moment the human selects. That one change saved ~85% of the audio budget.
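The arithmetic behind that claim, as a minimal cost model (the per-moment figure is the one from this project; everything else is just the upfront-vs-deferred comparison):

```python
COST_PER_MOMENT = 2.29  # full audio processing per moment, USD

def audio_cost(n_moments: int, deferred: bool) -> float:
    """Upfront: pay for every candidate moment. Deferred: lightweight
    pass on all, full audio pipeline only on the one the human picks."""
    return COST_PER_MOMENT * (1 if deferred else n_moments)

# With 7 candidate moments, deferring saves 6/7 of the audio budget.
savings = 1 - audio_cost(7, deferred=True) / audio_cost(7, deferred=False)
```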
Model tiering matters too. Haiku at $0.001 per call handles validation. Sonnet at $0.005 handles creative review. Opus handles the Director call because that's the one place where creative judgment actually moves the needle. Using Opus for everything would cost ~$15 per video instead of $0.60.
I'm Connor Koblinski. I'm an educator, an AI trainer, and a professional Dungeon Master. I've built curriculum for Ramp, Snapchat, Niantic, and Solana. I teach people to use AI for things they couldn't do before - not to automate things they already know how to do.
This project started because I wanted better recaps for my D&D players. It turned into a 22,000-line AI orchestration pipeline that taught me more about data engineering, creative AI, and system design than any course or certification could.
The whole thing is a little silly. Pixel art animations of tabletop role-playing games. But underneath it is real science: multi-model orchestration, structured data pipelines, rubric-based quality control, cost optimization, and the hard problem of turning unstructured human interaction into meaningful data.
I believe AI should expand what you can do, not just speed up what you already do. This project is proof of that philosophy.