Fine-tuning a reasoning-first LLM to generate sitcom screenplays with on-brand humor, character voice, and multi-step setups.
Most LLM demos focus on generic chat or coding. I wanted to show something different:
- A single sitcom (The Office) with strong, recognizable character voices.
- Each sample includes both a planning / reasoning trace and the final screenplay.
- An end-to-end pipeline: data curation, SFT, RFT, automated evaluation, and visualization.
This case study doubles as:
Goal: Given a high-level sitcom situation (e.g., "Michael uses Pam's Post-its to avoid work calls"), generate both a structured planning / reasoning trace and a final screenplay in the show's voice.
The training dataset uses a structured reasoning-first approach where each sample includes a comprehensive creative blueprint before the screenplay. The reasoning trace contains:
1. Storyline Goal: defines the narrative purpose, core conflict, and comedic goal.
2. Character Objectives: each character's immediate want or need.
3. Character Dynamics: interpersonal conflicts and alliances.
4. Meta Reasoning: the writers' room approach, i.e., why this is funny.
5. Primary Comedy Engine: Cringe, Dramatic Irony, Absurdity, Escalation.
6. Beat Sheet: Inciting Incident → Rising Action → Climax → Resolution.
7. Talking Head Strategies: how characters use confessionals for comedy.
8. Comedy Tropes Applied: the specific comedic devices used.
This structured approach teaches the model to think like an Emmy-winning TV writer before generating the final screenplay.
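To make the schema concrete, here is what a single training record could look like. This is a minimal sketch; the field names and example values are illustrative, not the dataset's exact keys.

```python
# One illustrative training record (hypothetical field names, not the exact schema).
# The reasoning block mirrors the eight components above; the screenplay is the
# supervised target the model learns to produce after its reasoning trace.
sample = {
    "prompt": "Michael uses Pam's Post-It notes to avoid work calls and appear busy in his office.",
    "reasoning": {
        "storyline_goal": "Michael wants to look indispensable while doing no actual work.",
        "character_objectives": {"Michael": "Appear busy and important.", "Pam": "Get her Post-Its back."},
        "character_dynamics": "Pam's dry patience vs. Michael's neediness; Oscar as grounded observer.",
        "meta_reasoning": "The gap between Michael's self-image and reality drives the cringe.",
        "primary_comedy_engine": "Cringe + Escalation",
        "beat_sheet": ["Inciting Incident", "Rising Action", "Climax", "Resolution"],
        "talking_head_strategies": "Pam's confessionals quietly contradict Michael's version of events.",
        "comedy_tropes": ["dramatic irony", "prop comedy", "talking-head contradiction"],
    },
    "screenplay": "INT. DUNDER MIFFLIN, SCRANTON - MICHAEL'S OFFICE - DAY\n...",
}
```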
Three model variants are compared:
- Gemma-3 1B — Original LLM (no domain fine-tuning)
- CoT Reasoning — Supervised fine-tune on reasoning + screenplay
- Model Grader — Reinforcement fine-tune using LLM-as-judge rewards
The judge evaluates each generated screenplay using eight weighted metrics that capture both technical quality and stylistic authenticity:
| Metric | Weight | Focus |
|---|---|---|
| Character Consistency | 25% | Does dialogue perfectly align with each character's established persona? |
| Humor Quality & Specificity | 25% | Is the humor effective and consistent with The Office's comedic DNA? |
| Narrative Coherence | 15% | Does the screenplay follow a logical comedic progression? |
| Style Fidelity | 15% | Authentic mockumentary techniques (talking heads, camera glances)? |
| Dialogue Plausibility | 5% | Natural and conversational while being witty? |
| Creative Plausibility | 5% | Fresh ideas that fit within the show's reality? |
| Formatting Accuracy | 5% | Strict adherence to structure? |
| Relevance to Storyline | 5% | Reflects the provided scenario? |
The final score is a weighted average with the highest weights on Character Consistency and Humor Quality—the two elements that define The Office's unique voice.
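In code, the aggregation is a weighted sum over the judge's per-metric ratings. A minimal sketch, assuming each rating is already on a 0-1 scale (the metric key names are illustrative):

```python
# Collapse the eight per-metric judge ratings into one 0-1 reward.
# Weights follow the table above; metric key names are illustrative.
WEIGHTS = {
    "character_consistency": 0.25,
    "humor_quality": 0.25,
    "narrative_coherence": 0.15,
    "style_fidelity": 0.15,
    "dialogue_plausibility": 0.05,
    "creative_plausibility": 0.05,
    "formatting_accuracy": 0.05,
    "relevance_to_storyline": 0.05,
}

def weighted_score(metric_scores: dict[str, float]) -> float:
    """Weighted average of per-metric scores, each assumed to be in [0, 1]."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9  # weights sum to 100%
    return sum(WEIGHTS[name] * metric_scores[name] for name in WEIGHTS)
```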
How quickly does the policy learn to please the sitcom-style judge?
The reward progression plot shows step-wise rewards (light red) and a 20-step rolling average (bold red):
You can think of this as the model gradually learning: "Don't just be coherent—be character-consistent, witty, and structurally Office-like."
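The rolling average in that plot is just a 20-step moving mean over the raw per-step rewards. A quick sketch, assuming rewards is the list of per-step scalars logged during RFT:

```python
import numpy as np

def rolling_average(rewards: list[float], window: int = 20) -> np.ndarray:
    """Smooth the noisy per-step reward curve with a simple moving mean."""
    kernel = np.ones(window) / window
    return np.convolve(rewards, kernel, mode="valid")
```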

I evaluated all three models on a held-out set of sitcom prompts, scoring each output with a domain-tuned LLM-as-judge (0–1 scale, normalized).

Visually, you can see the "cloud" of scores moving to the right as training progresses from Base → SFT → RFT.

Key Takeaway: Fine-tuning doesn't just help a few cherry-picked cases—it shifts the overall quality level up.
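The distribution shift is easy to reproduce from the raw judge scores. A sketch, assuming base_scores, sft_scores, and rft_scores hold the per-prompt scores (0-1) on the 100 held-out prompts:

```python
import matplotlib.pyplot as plt

def plot_score_distributions(base_scores, sft_scores, rft_scores):
    """Boxplot of per-prompt judge scores for the three model variants."""
    fig, ax = plt.subplots(figsize=(7, 4))
    ax.boxplot([base_scores, sft_scores, rft_scores], labels=["Base", "SFT", "RFT"])
    ax.set_ylabel("Judge score (0-1)")
    ax.set_title("Held-out judge scores by model")
    fig.tight_layout()
    fig.savefig("score_distributions.png")
```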
Compare the output of the Base model, SFT model, and the final RFT model. Notice how the RFT model captures the specific "mockumentary" style and character voices much better. Each model column includes a qualitative analysis showing the progressive improvement from Base → SFT → RFT.
Prompt: Michael uses Pam's Post-It notes to avoid work calls and appear busy in his office.
Characters: Michael, Pam, Oscar, Kevin, Ryan, Jan
Setting: Dunder Mifflin Scranton branch, Michael's office area
Base analysis: the base model stays in meta "reasoning" mode instead of delivering a scene.
SFT analysis: SFT introduces a playable scene with Michael dodging responsibility using Post-its.
RFT analysis: RFT sharpens Michael's voice, adds better visual gags around the notes, and leans into Pam's dry reactions and Oscar's grounded commentary.
The final model typically shows the best performance.

The complete training pipeline flows from a base open-source model through SFT (supervised fine-tuning on custom reasoning + screenplay data) to RFT (reinforcement fine-tuning with PPO using LLM-as-judge rewards). Continuous evaluation and monitoring ensure quality improvement at each stage.
Training Dataset

| Item | Value |
|---|---|
| Total Scenes | 500 reasoning-chain + screenplay pairs |
| Evaluation Set | 100 held-out prompts |
| Format | Structured reasoning trace + screenplay |

Supervised Fine-Tuning (SFT)

| Setting | Value |
|---|---|
| Training Epochs | 3 |
| Total Steps | 48 |
| Batch Size | 8 |
| Learning Rate | 5e-05 → 1.16e-06 |
| Optimizer | AdamW |
| Best Checkpoint | Step 45 |
| Fine-tuning Method | LoRA (r=128, α=128) |
| Training Loss | 2.558 → 2.161 |
| Eval Loss | 2.478 → 2.311 |
| Hardware | 1 × A40 (48 GB VRAM) |
| Frameworks | TRL, Unsloth, PyTorch |
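For orientation, here is a minimal sketch of how the SFT stage could be wired with TRL and PEFT under the settings above. The dataset path, model checkpoint name, and column layout are assumptions, and the actual run also used Unsloth for memory-efficient training.

```python
# Illustrative SFT setup: LoRA fine-tuning on reasoning-trace + screenplay pairs.
# File names and the model checkpoint string are placeholders.
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# Each record is assumed to hold a single "text" field: prompt + reasoning + screenplay.
dataset = load_dataset("json", data_files="office_sft_train.jsonl", split="train")

peft_config = LoraConfig(r=128, lora_alpha=128, lora_dropout=0.05, task_type="CAUSAL_LM")

args = SFTConfig(
    output_dir="gemma3-office-sft",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    learning_rate=5e-5,        # decays toward ~1e-6 with the default scheduler
    optim="adamw_torch",
    logging_steps=5,
    save_strategy="epoch",
)

trainer = SFTTrainer(
    model="google/gemma-3-1b-it",  # base checkpoint name is an assumption
    train_dataset=dataset,
    args=args,
    peft_config=peft_config,
)
trainer.train()
```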
Reinforcement Fine-Tuning (RFT)

| Setting | Value |
|---|---|
| Reward Model | LLM-as-judge (GPT-5) |
| Training Steps | 130 |
| Batch Size | 4 |
| Learning Rate | 3.8e-06 → 2.78e-06 |
| KL Divergence | 0.15 – 0.46 |
| Reward Progression | 0.419 → 0.523 |
| Completion Length | 182–878 tokens |
| Total Tokens | ~2.88M |
| Gradient Clipping | Applied |
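And a compressed sketch of the RFT loop, assuming the older TRL PPOTrainer interface and a judge_score helper (for example, the weighted aggregation shown earlier wrapped around a judge call); the prompt source, checkpoint path, and generation settings are placeholders.

```python
# Illustrative PPO loop: the policy generates a screenplay, the LLM judge scores it,
# and the scalar reward drives the update. Follows the pre-0.12 TRL PPO API;
# judge_score and prompt_batches are placeholders, not real project code.
import torch
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

sft_ckpt = "gemma3-office-sft"  # best SFT checkpoint (placeholder path)
tokenizer = AutoTokenizer.from_pretrained(sft_ckpt)
model = AutoModelForCausalLMWithValueHead.from_pretrained(sft_ckpt)
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(sft_ckpt)

config = PPOConfig(model_name=sft_ckpt, learning_rate=3.8e-6, batch_size=4, mini_batch_size=4)
ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer)

for prompts in prompt_batches:  # iterable of 4-prompt batches (placeholder)
    queries = [tokenizer(p, return_tensors="pt").input_ids.squeeze(0) for p in prompts]
    responses = ppo_trainer.generate(queries, return_prompt=False, max_new_tokens=900)
    texts = [tokenizer.decode(r, skip_special_tokens=True) for r in responses]
    rewards = [torch.tensor(judge_score(p, t)) for p, t in zip(prompts, texts)]
    ppo_trainer.step(queries, responses, rewards)  # PPO update against the judge reward
```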
This project touches most parts of the modern LLM lifecycle:
- Problem framing: turn a fuzzy idea ("Office-style scenes") into a concrete objective with measurable rewards.
- Data design: design a reasoning + screenplay schema, and build prompts and reference scripts to teach the model structure and style.
- Training: run SFT to anchor the model in domain behavior, then layer RFT on top to align with a style-aware judge.
- Evaluation: implement LLM-as-judge scoring, visualize distributions (boxplots, histograms) and training reward curves, and curate hero examples that connect metrics to human-perceived quality.
- Communication: package the work as a case study that reads like a product launch: clear problem definition, before/after comparisons, and visuals that non-experts can understand.
This trained model lays the foundation for a real-world AI-powered screenwriting co-pilot: a tool for aspiring writers, professional screenwriters, and showrunners to accelerate their creative process while maintaining artistic control.
Instead of replacing writers, the system acts as an intelligent collaborator that helps execute ideas, ensures consistency, and handles the mechanical aspects of screenplay formatting while the human focuses on story and vision.
Brainstorming & Structure Planning
The entry point for writers. This agent helps develop high-level story concepts into structured narratives.
Dialogue & Scene Execution
This is where our trained RFT model powers the system. It takes the structured reasoning trace and generates a production-ready screenplay with authentic character voices.
Script Review & Consistency Checking
Monitors the entire script across multiple scenes, ensuring character arcs, running gags, and show mythology remain consistent.
Iterative Refinement
Provides alternative phrasings, comedic variations, and line-by-line improvements while preserving the writer's intent.
Writers spend 70% of their time on mechanical work: formatting, ensuring consistency, rewriting dialogue variations. This system handles the grunt work, letting writers focus on creative decisions and story vision. What takes weeks in a traditional writers' room can be iteratively refined in days.
Unlike generic LLMs, our fine-tuned models are experts in specific show formats. Train separate models for sitcoms, dramas, thriller formats—each deeply understanding the genre's unique storytelling patterns, pacing requirements, and audience expectations.
Every writer edit becomes training data. When users accept/reject suggestions, the system learns their preferences. Over time, the agent adapts to individual writing styles while maintaining show-level consistency—creating a personalized co-pilot.
Subscription tiers: Aspiring writers get basic agents. Professional showrunners pay for multi-episode management, team collaboration, and custom model fine-tuning on their show's existing episodes. Studios license enterprise versions with proprietary IP training.
From Technical Demo to Market Reality
This case study proves the foundational tech works: we can fine-tune models to understand nuanced creative domains, evaluate quality with specialized judges, and generate content that improves measurably across training.
The product layer—agentic workflows, human-in-the-loop refinement, team collaboration tools—turns this technical capability into a tool that practicing writers would actually pay to use. It's not about automating creativity; it's about amplifying human storytellers with AI that understands their craft.