Case Study · The D&D Knowledge Graph
"Walk me through the main villain's story across four years of play."
Session 27: Alyxian is the first name-drop in an ancient legend. Our heroes have no context.
Session 63: The party learns he is a prisoner held in the Netherdeep. Originally a hero, now possibly corrupted.
Session 102: Calli kills Alyxian's archdevil avatar in combat. The Jewel of Three Prayers exalts into its final form.
Session 114: The party learns why Alyxian is trapped in the Netherdeep by the devils.
Pulled from 17 events, 24 nodes, 9 sessions. ~2,800 tokens of subgraph context.
Four years. Thousands of threads. Zero organization — until now.
≈ 10,000× cheaper per answer · same question · better grounding
This graph is the result of four years of labor and love. My friends and I have been telling the same story in our D&D campaign across 115 sessions. The graph tracks over a hundred characters. Twenty-five locations. Every relationship, every betrayal, every promise kept and broken.
I built this visual graph for two reasons. First, to be a brainstorming and shortcut tool for myself. With hundreds of hours of gameplay over the years, there are things that even I, the Game Master and author of many of these characters, forget when I am preparing the next session.
Second, I built this for my other AI projects. AI tools can easily query this graph and extract meaningful knowledge from it. What would have required scanning 10 million tokens' worth of context can now be done with a 3,000-token query. Typically, an AI tool has to crawl large amounts of text, and users are left hoping it pulls out the right context. This graph lets both me and my tools see relationships over time and get the whole picture of our story without blowing the token budget.
The actual graph. Live. Click any node. Ask it questions.
Runs on a free-tier Cloudflare Worker with rate limits. If the AI button is slow, it's probably Haiku warming up. Try things like: "What are the biggest turning points in this campaign?" · "Who has the party saved from death the most?" · "How has the main villain's story unfolded?"
Can't see the embed? Open the full-screen version.
The Problem
Everybody wants an AI tool that remembers every decision their team has made. Sales teams want one that knows the last 18 months of customer history. You probably want one that's read all your Slack messages, every email thread, every meeting transcript, every strategy doc, so it can write like you and know what you know.
The appeal is obvious. The implementation doesn't happen by accident.
In April of 2026, I realized the issue I was facing with my D&D campaign and my AI tools was the same one every business using AI faces in context engineering. I have 120 sessions of Dungeons & Dragons over four years, with my notes scattered across Zoom transcripts, DM notes, shared Google Docs, and a pen-and-paper journal. Just processing the transcripts was a feat of its own: each session runs longer than three hours, with dozens of open plot points and characters in play at any given moment.
If I wanted an AI to answer a real question about my campaign, I'd need to feed it all my context in a way it could read. Even with ten sub-agents sweeping in parallel, they still couldn't traverse enough to tell me what an NPC first said about the main villain in session 47 and cross-reference it with what the party learned in session 103. Compressed into a markdown file, the campaign info still runs to over a hundred pages with huge plot holes.
Every query burned through hundreds of thousands of tokens and still didn't get me what I wanted.
So I built a graph instead. Three thousand tokens of relevant subgraph answers any question I have about the campaign over time. The same pattern works for anything a business wants to remember.
Five stages, turning raw transcript into queryable memory. Click any phase to expand.
Speech-to-text mangles every proper noun. Clean this common data problem before doing anything else.
If you skip name normalization, your graph ends up with ten different nodes for the same character. The AI confidently extracts relationships between "Hojbejrg" and "Hodge" as if they are two people. Every downstream query is corrupted.
I built the correction table by hand from four years of watching how Zoom transcripts butcher fantasy names. Each entry gets injected into the extraction prompt as ground truth. When the AI sees "Anchor El" in a transcript, it knows that means Ank'Harel.
Every company has a version of this. Product names. Customer names. Internal acronyms. Slack handles. If you want useful AI over your data, you need a canonical dictionary to differentiate signal from noise.
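In code, a correction pass like this can be a simple dictionary sweep over the transcript before extraction ever runs. A minimal sketch; the entries and the `normalize_names` helper are illustrative stand-ins, not the pipeline's actual table:

```python
import re

# Hand-built corrections: canonical name for each known transcription mangle.
# These entries are illustrative examples, not the full four-year table.
NAME_CORRECTIONS = {
    "Hojbejrg": "Hoj",
    "Hodge": "Hoj",
    "Anchor El": "Ank'Harel",
}

def normalize_names(text: str) -> str:
    """Replace known speech-to-text errors with canonical names."""
    for wrong, canonical in NAME_CORRECTIONS.items():
        # Word-boundary match so partial names are not clobbered.
        text = re.sub(rf"\b{re.escape(wrong)}\b", canonical, text)
    return text
```

The same table doubles as prompt material: each entry is injected into the extraction prompt as ground truth, so the model resolves variants before they ever become nodes.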
The DM recap at the start of each session tells you what mattered in the last one. Use that.
Every D&D session starts the same way. The DM says "last time we saw our intrepid heroes, you..." and spends two or three minutes summarizing what happened. That recap is the DM's own judgment about what actually mattered in the last session. It's a free, high-quality summary, embedded in the next session's transcript.
So when I'm extracting structured data from session 80, I don't process it blindly. I first process session 81, save its recap as a brief summary, and then when the extractor runs on session 80, I inject session 81's recap as priority guidance: focus on these moments. The DM thought they mattered.
This means the pipeline works backward. Newest session first. Each session's extraction is primed with what came after.
I processed sessions in natural order. Oldest first. It felt right. Each extraction came out clean but missed the DM's own priority judgment. Flipping the order gave me higher-confidence extractions and fewer invented plot points, without changing the extractor at all.
After each extraction, the session's dm_recap_summary field is stored on the graph's session node. When the prompt for the previous session is built, the pipeline looks up the next session, pulls that summary, and injects it as priority guidance. Events that appear in the recap get a confidence bump.
Meeting notes have the same shape. Next week's kickoff starts with "last week we decided X." That's a free, pre-existing, high-quality summary of what mattered. Most meeting-intelligence tools don't use it because they process chronologically. Reverse the order.
Turn ~8,000 lines of transcript into structured entities, relationships, and events.
01:12:45 Sammi (Calli): He's bleeding out. Kristin, how many hit points is that?
01:12:49 Kristin (Bixie): He's at zero. First failed save.
01:12:52 Wilson (Hodim): I'm using my last 4th level slot. Cure Wounds on Hoj.
01:13:04 Connor (DM): He stabilizes. Jordan, you owe him.
01:13:08 Jordan (Hoj): Yeah. I owe him.
{
  "event": "Hodim saves Hoj from death",
  "subtype": "save",
  "participants": ["char_hodim", "char_hoj"],
  "emotional_tone": "relief",
  "animation_potential": 8,
  "evidence": "He stabilizes. You owe him.",
  "confidence": 0.9,
  "session": 103,
  "new_edges": [
    {"source": "char_hoj", "target": "char_hodim", "relationship": "RESCUED_BY"}
  ]
}
Five Claude Sonnet sub-agents run at the same time, each on one session. Each session takes three to four minutes. Twenty sessions are processed in under twenty minutes instead of a couple of hours.
Every entity and edge carries a confidence score, a source quote, and a session number. Without provenance, you can't verify anything the AI produced, and you can't tell what it was sure of from what it guessed.
The extraction prompt has one explicit rule: if a name doesn't appear in the transcript, use a descriptive ID like "char_unnamed_tech_elf." Don't invent proper nouns. That single rule catches more hallucinations than the QC layer ever did.
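The prompt rule pairs naturally with a cheap post-hoc check: any extracted proper noun that never appears in the transcript is suspect. A sketch, assuming entities carry the `id` and `name` fields shown in the JSON above and that descriptive IDs contain "unnamed":

```python
def unverified_names(entities: list[dict], transcript: str) -> list[str]:
    """Flag extracted proper nouns with no textual evidence in the transcript."""
    lowered = transcript.lower()
    return [
        e["name"] for e in entities
        if "unnamed" not in e["id"]           # descriptive IDs are exempt by design
        and e["name"].lower() not in lowered  # a proper noun must appear in the source
    ]
```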
Three cheap checks before anything touches the graph.
[ERROR] char_grumpy_mushroom_demon_door marked ALLIED_WITH char_bixie.
Enemies are not allies of the party — likely extraction error.
The validator caught the AI confidently inventing an alliance with a literal demonic door. Edge removed before merge.
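A validator of that shape is a few lines of deterministic code. This sketch assumes the edge and entity fields from the extraction JSON; the real validator's schema may differ:

```python
# Rule: an entity whose subtype is "enemy" or "creature" should never come
# out of extraction ALLIED_WITH a player character.
HOSTILE_SUBTYPES = {"enemy", "creature"}

def flag_false_alliances(edges: list[dict], entities: dict[str, dict]) -> list[dict]:
    """Return edges that claim an alliance between a hostile entity and a PC."""
    flagged = []
    for edge in edges:
        if edge["relationship"] != "ALLIED_WITH":
            continue
        ends = (entities.get(edge["source"], {}), entities.get(edge["target"], {}))
        if any(e.get("subtype") in HOSTILE_SUBTYPES for e in ends) and \
           any(e.get("type") == "pc" for e in ends):
            flagged.append(edge)
    return flagged
```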
Session 100 extraction → FIRST_MET(Hodim, Father Greenbriar)
Session 90 extraction → FIRST_MET(Hodim, Father Greenbriar) ⚠ conflict
→ Merge prunes Session 100 edge, keeps Session 90 (earlier in story time)
Because I process sessions in reverse order, the FIRST_MET from session 100 lands in the graph first. Then when session 90 produces the same first-meeting claim, merge recognizes the conflict and replaces the later one with the earlier one. The chronology stays honest even when the processing order doesn't.
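The merge rule for FIRST_MET conflicts can be sketched like this, assuming edges carry the `source`, `target`, `relationship`, and `session` fields from the extraction JSON:

```python
def merge_edge(graph: dict, edge: dict) -> None:
    """Insert an edge; a FIRST_MET conflict resolves to the earlier session."""
    key = (edge["source"], edge["target"], edge["relationship"])
    current = graph.get(key)
    if current is None or (edge["relationship"] == "FIRST_MET"
                           and edge["session"] < current["session"]):
        graph[key] = edge  # earlier story time wins, whatever order we processed in
```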
Every rule that can be a rule should be a rule. Rules are free, fast, deterministic, and explainable. I only use AI where I can't write the rule. For ninety percent of graph QC, I can write the rule.
Merge tracks which session each entity first appears in. When you click a node, the detail panel shows "First seen in Session 83." This metadata powers spoiler filtering when you want a player-safe version of the graph.
Your CRM has duplicate contact records. Your knowledge base has the same article written four times by four different people. Entity resolution is cheap to automate and worth every hour you spend on it. Nothing downstream works without it.
Natural language in. Relevant subgraph out. Answer rendered with clickable evidence.
Sending the full 1,067-node graph to the AI on every query would cost twenty times more per call and give you worse answers. Focused context beats bulk context. Extract the 1% of the graph that matters, and send that.
GitHub Pages is static hosting. It can't hold an API key. Without a proxy, the Anthropic key would be in the page source, where anyone could grab it. A Cloudflare Worker holds the key on the server side, enforces rate limits (5/min, 50/day per IP, 500/day global), and locks CORS for this case study's domain.
Haiku, 1,024 max output tokens, daily request caps in KV storage. Worst case for a viral moment: about $15 per month. Normal case: pennies. The API key itself has a monthly spend cap set in the Anthropic console as final insurance.
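The real proxy is a Cloudflare Worker with counters in KV storage; this Python sketch only illustrates the shape of the cap logic, using the limits quoted above:

```python
import time
from collections import defaultdict

# Limits from the case study: 5/min and 50/day per IP, 500/day global.
LIMITS = {"ip_minute": 5, "ip_day": 50, "global_day": 500}

class RateLimiter:
    def __init__(self):
        self.hits = defaultdict(list)  # key -> list of request timestamps

    def _count(self, key: str, window: float) -> int:
        now = time.time()
        self.hits[key] = [t for t in self.hits[key] if now - t < window]
        return len(self.hits[key])

    def allow(self, ip: str) -> bool:
        """Check all three caps before recording the request."""
        if self._count(f"min:{ip}", 60) >= LIMITS["ip_minute"]:
            return False
        if self._count(f"day:{ip}", 86400) >= LIMITS["ip_day"]:
            return False
        if self._count("day:global", 86400) >= LIMITS["global_day"]:
            return False
        for key in (f"min:{ip}", f"day:{ip}", "day:global"):
            self.hits[key].append(time.time())
        return True
```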
How I think about building AI systems that have memory
Context engineering is the part of AI work that doesn't make the demo video but determines whether your output is usable. You're not prompting. You're deciding what the model gets to see, in what order, at what fidelity, before the model even starts thinking.
D&D is my sandbox for practicing this difficult art-slash-science. I have absurd amounts of relational data with no business consequences if I get it wrong. Every mistake teaches me something I can apply to a client's real data problem next week.
Companies that figure out their own context engineering will outcompete companies that don't. It's that simple.
"Tell me about Hoj" gives you a paragraph. A structured entity with 17 fields and 45 edges gives you something you can filter, cross-reference, aggregate, and re-query.
Summaries are terminal. Once you write one, the rest of the information is compressed away. A graph is generative. You can always ask a new question because the underlying entities still have structure.
Rule I follow: if a piece of information might be relevant to more than one future question, store it in a structured format. Not as a summary.
Every extraction has a confidence score. Every edge has a source quote. Every quote is tied to a session number. Click an edge in the graph, and you see where it came from.
This is how you make AI output trustworthy. Not by spending more tokens, but by making it show its work and trace beliefs back to the origin of the data.
If you can't audit the AI's reasoning trail, you can never confidently act on its output.
Most AI apps try to cram context into every prompt. Send the whole document. Send the whole conversation history. Send every possibly-relevant file. It's expensive, slow, and the model still loses track of what matters.
The alternative is to put context in a graph. At query time, extract the small subgraph that actually matters for the specific question. Pass only that.
One of my queries uses about 3,000 tokens of context instead of the ten million that a full transcript dump would need. That's roughly four orders of magnitude cheaper per call. Compounded across a company's daily AI usage, it's the difference between an AI tool you can afford to use and one you can't.
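Focused retrieval can be as simple as a bounded breadth-first expansion from the entities a question mentions. A sketch; the adjacency-map format is an assumption, not the live graph's schema:

```python
def relevant_subgraph(adj: dict[str, set[str]], seeds: set[str],
                      hops: int = 2) -> set[str]:
    """Breadth-first expansion to `hops` steps from the seed entities."""
    frontier, seen = set(seeds), set(seeds)
    for _ in range(hops):
        # All unvisited neighbors of the current frontier.
        frontier = {n for node in frontier for n in adj.get(node, set())} - seen
        seen |= frontier
    return seen  # the small slice of the graph worth sending to the model
```

From a 1,000-node graph, a two-hop neighborhood of the question's entities is typically a few dozen nodes: the 1% that matters.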
D&D is my favorite hobby for trying out my latest AI experiments. But context engineering, done effectively, could save your business a lot of pain and suffering when it comes to AI.
Every decision your team has made is documented somewhere. The strategy doc from Q3 last year. The Slack thread where you decided to cut that feature. The customer call that made you pivot.
When a new hire asks, "why do we do it this way," they get a shrug. When your AI assistant gets the same question, it makes something up.
Same pattern as D&D: extract decisions, participants, and outcomes into a graph. Ask it "why did we choose vendor X" and get a grounded answer with links to the actual thread.
You've been selling to this company for eighteen months. Three account managers have talked to them. There's a folder of call recordings, a thread of email replies, a pile of Loom videos.
Your AI tool sees none of that. It writes a generic email.
Same pattern as D&D: a customer call is a transcript, same shape as a D&D session. Extract the relationship graph. Next time someone writes outreach to that customer, the AI has actual context to work from.
This is my world. A training team runs forty-plus learning experiences a year. They record debriefs, write retros, capture feedback. Six months later, that knowledge lives in someone's head. If that person leaves, it's gone.
A knowledge graph over training artifacts lets instructors ask "what do we know about how software engineers respond to case-study format" and get grounded answers across years of prior work. Not summaries. Evidence.
Same pattern as D&D: session transcripts are training artifacts. Learner behavior is character behavior. The graph remembers what each cohort did, so the next cohort's design is smarter.
If you want to build something like this at your company, that's what my AI implementation work is for.
See how I help teams with AI implementation
Things I'd do differently, and why they matter for any AI pipeline you build
I processed sessions in natural order at first. Oldest to newest. Every extraction was blind to what came next. The extractions came out fine.
Then I realized: every session starts with a recap. That recap is the DM's summary of what mattered in the last session. If I process session 81 first, I can use its recap as priority guidance when I extract session 80.
Flipping the order gave me higher-confidence extractions and fewer invented plot points. I didn't upgrade the model. I didn't lengthen the prompt. I re-sequenced the work so the information I already had was available at the right time.
The extractor will confidently mark hostile NPCs as ALLIED_WITH the party about five percent of the time. Once I noticed the pattern, I wrote a six-line validator that flags any edge where an entity with subtype: enemy or subtype: creature has an ALLIED_WITH relationship with a PC. Catches it every time.
The lesson isn't that AI is bad at this. The lesson is that AI pipelines need monitoring, just like any other data pipeline. You can predict the specific ways yours will fail. You can write deterministic checks for those specific failures. Do it once. It saves you every time.
Teams that skip this step end up with dashboards full of confident nonsense. Teams that bake it in get trustworthy systems that get better, not worse, as they grow.
My early extractions had no source quotes. Everything looked clean until I tried to verify a specific edge and realized I had no way to check if it was real. Retrofitting provenance meant re-running every extraction. Expensive. Painful. Avoidable.
Now every field carries a source quote, session number, and confidence score. When the AI says "Eas has romantic tension with Bixie," I click the edge and see the quote that produced it. Five seconds to verify instead of five minutes of re-reading the transcript.
The bigger win: an AI querying this graph can make its own judgment calls about what to trust and what to hedge on, because it can see the evidence behind every claim. Accuracy is not the goal. Verifiability is.
Five sub-agents, working in parallel, complete five sessions in the time it would take one agent to complete one.
But merging has to be serial. Two extractions trying to write to the graph file at the same time produce corruption that cost me a day of debugging before I figured out the cause.
Concurrency where the work is independent. Discipline where shared state changes. Knowing the difference is most of the job.
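The split can be sketched with a thread pool for the independent extractions and a single-threaded merge loop for the shared graph. `extract` is a stand-in for the real sub-agent call:

```python
from concurrent.futures import ThreadPoolExecutor

def extract(session_id: int) -> list[dict]:
    """Stand-in for a sub-agent extraction; pretend each session yields one edge."""
    return [{"session": session_id, "relationship": "KNOWS"}]

def run_pipeline(session_ids: list[int]) -> list[dict]:
    graph: list[dict] = []
    with ThreadPoolExecutor(max_workers=5) as pool:
        # Parallel: each session's extraction is independent of the others.
        results = pool.map(extract, session_ids)
    # Serial: only one writer ever touches the shared graph.
    for edges in results:
        graph.extend(edges)
    return graph
```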
I'm Connor Koblinski. I'm an educator, AI trainer, and professional Dungeon Master. I've built curriculum for Ramp, Snapchat, Niantic, and Solana. I teach people to use AI for things they couldn't do before, not to automate things they already know how to do.
This case study is the companion to the D&D to animated film pipeline. That one is about turning transcripts into video. This one is about turning transcripts into memory.
They share the same source data. They solve different problems. Together, they're my argument that AI is most valuable when you give it structure to work with.
Built April 2026. Companion piece: the D&D-to-animated-film shorts pipeline.