Chapter 6: Evaluating Production GenAI Apps
Comprehensive evaluation framework covering the eval flywheel (design/pre-prod/post-prod phases), evaluation methods, RAG metrics, CI/CD for LLMs, and LLM-as-a-judge techniques.
Evaluating Production GenAI Apps (Evals Flywheel)
The mental model
LLMs aren’t deterministic software → “unit tests” aren’t enough. Production GenAI requires an evaluation system that continuously turns failures into measurable test cases. This eval flywheel is the #1 difference between “cool demo” and “reliable product.”
1) What you’re evaluating: Model vs System (don’t confuse these)
- LLM eval = academic benchmarks (MMLU, etc.) → measures general model capability.
- LLM system eval = your app end-to-end (prompt + RAG + tools + guardrails + data) → measures your product reliability.
Heuristic: A model leaderboard score does not predict your RAG chatbot's performance. Build your own benchmarks.
2) The Eval Flywheel: 3 phases you must run
A) Design phase: “in-app” real-time correction
Goal: cheaply catch common failures during runtime (fast assertions + one retry).
Examples:
- Codegen: import check → execute → if error, loop once with the traceback (see the sketch after this list).
- RAG: grade retrieval relevance → generate → grade faithfulness → retry/regenerate if needed.
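A minimal sketch of the in-app loop for the codegen example, assuming a generic `generate(prompt)` callable (whatever your stack exposes, e.g. a LangChain chain's `invoke`); `generate_and_verify` and `MAX_RETRIES` are illustrative names, not a prescribed API:

```python
import traceback

MAX_RETRIES = 1  # one cheap retry keeps latency and cost bounded

def generate_and_verify(task: str, generate) -> str:
    """Generate code, run it, and retry once with the traceback on failure."""
    prompt = f"Write Python code for: {task}"
    code = ""
    for _ in range(MAX_RETRIES + 1):
        code = generate(prompt)
        try:
            compile(code, "<llm-output>", "exec")  # cheap syntax check first
            exec(code, {})  # runtime errors (incl. bad imports) surface here; sandbox this in prod
            return code     # passed the in-app assertion
        except Exception:
            prompt = (
                f"The code below failed.\n\nCode:\n{code}\n\n"
                f"Traceback:\n{traceback.format_exc()}\n\n"
                "Fix it and return only the corrected code."
            )
    return code  # return the last attempt; the caller decides how to degrade
```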
B) Pre-production: offline eval + regression tests
Goal: benchmark changes before shipping; prevent regressions. Start small: 50–100 high-quality examples are enough to begin.
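A hedged pytest sketch of what that regression suite can look like; `golden_set.jsonl`, `run_app`, and the `must_contain` field are placeholders for your own dataset file, pipeline entry point, and per-case assertions:

```python
import json
import pytest

def run_app(query: str) -> str:
    """Placeholder for your real pipeline (prompt + RAG + tools + guardrails)."""
    raise NotImplementedError("wire this to your application entry point")

def load_golden(path: str = "golden_set.jsonl") -> list[dict]:
    with open(path) as f:
        return [json.loads(line) for line in f]

@pytest.mark.parametrize("case", load_golden(), ids=lambda c: str(c.get("id")))
def test_regression(case: dict) -> None:
    answer = run_app(case["input"])
    # Cheap deterministic checks here; nuanced scoring belongs to the LLM-judge suite.
    assert answer.strip(), "empty answer"
    for must_have in case.get("must_contain", []):
        assert must_have.lower() in answer.lower()
```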
C) Post-production: online eval + monitoring
Goal: score live traffic (reference-free), capture real failures, and bootstrap them back into the offline dataset.
[Figure: the eval flywheel]
3) Evaluation methods: pick the right tool for the job
| Method | Best for | Pros | Cons |
|---|---|---|---|
| Human eval | nuanced/subjective quality | gold standard | slow, costly, inconsistent |
| Code/heuristic eval | format/schema/invariants | fast, cheap, deterministic | rigid, misses nuance |
| LLM-as-a-judge | scalable nuanced scoring | flexible, automatable | judge bias/flakiness; needs calibration |
Heuristic: Use humans to define quality; use automation to enforce it repeatedly.
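To make the code/heuristic row concrete, here is a small sketch of deterministic output checks; the expected JSON keys and the length budget are illustrative, not a prescribed contract:

```python
import json

def heuristic_checks(output: str) -> list[str]:
    """Fast, deterministic assertions; returns failure reasons (empty list = pass)."""
    failures = []
    try:
        payload = json.loads(output)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    for key in ("answer", "sources"):          # adapt to your own output schema
        if key not in payload:
            failures.append(f"missing key: {key}")
    if len(str(payload.get("answer", ""))) > 2000:
        failures.append("answer exceeds length budget")
    return failures
```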
4) Pre-prod dataset: how to build it fast (without boiling the ocean)
Data sources
- Manually curated (domain experts) → best to start
- App logs → most realistic
- Synthetic → expand coverage and edge cases
Synthetic generation that doesn’t suck (practical recipe)
- chunk docs
- build context (similar chunks)
- generate query from context
- “evolve” query to add complexity (Evol-Instruct)
- generate the expected output based only on the context (full pipeline sketched below)
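A sketch of that recipe as one function, again assuming a generic `generate(prompt)` callable; the prompts are illustrative rather than a fixed Evol-Instruct template:

```python
def synthesize_case(chunk: str, neighbors: list[str], generate) -> dict:
    """Build one synthetic eval case from a document chunk and its similar chunks."""
    context = "\n\n".join([chunk, *neighbors])  # chunk + similar chunks = context
    query = generate(
        f"Write a realistic user question that is answerable ONLY from this context:\n{context}"
    )
    evolved = generate(  # "evolve" the query: harder, but still answerable from the same context
        "Rewrite this question so it requires multi-step reasoning, "
        f"without changing which context is needed:\n{query}"
    )
    expected = generate(
        "Answer the question using ONLY the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion:\n{evolved}"
    )
    return {"input": evolved, "expected_output": expected, "context": context}
```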
Heuristics
- Synthetic data is best for coverage gaps, not replacing real data.
- Always keep a human-curated “golden core”.
5) Metrics that matter (especially for RAG)
Retrieval metrics
- Contextual Precision: the share of retrieved docs that are actually relevant (penalizes noise)
- Contextual Recall: whether you retrieved all the docs needed to answer (penalizes missing info)
- Contextual Relevancy: overall relevance of the retrieved context to the query
Generation metrics
- Faithfulness: grounded in retrieved context (anti-hallucination)
- Answer Relevancy: answers the question
- Answer Correctness: aligns with expected output (when you have reference)
Note on BLEU/ROUGE: fast but often poorly correlated with humans for open-ended text. Use cautiously.
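For illustration, a DeepEval snippet scoring one RAG interaction on these metrics; class and method names reflect recent DeepEval releases (verify against your installed version), and each metric calls a judge model under the hood, so a judge/API-key configuration is required:

```python
from deepeval.metrics import (
    AnswerRelevancyMetric,
    ContextualPrecisionMetric,
    ContextualRecallMetric,
    FaithfulnessMetric,
)
from deepeval.test_case import LLMTestCase

case = LLMTestCase(
    input="What is our refund window?",
    actual_output="Refunds are accepted within 30 days of purchase.",
    expected_output="Customers can request a refund up to 30 days after purchase.",
    retrieval_context=["Policy: refunds are allowed within 30 days of purchase."],
)

for metric in (
    ContextualPrecisionMetric(),  # retrieval: relevant chunks, ranked ahead of noise
    ContextualRecallMetric(),     # retrieval: nothing necessary is missing
    FaithfulnessMetric(),         # generation: grounded in the retrieved context
    AnswerRelevancyMetric(),      # generation: actually answers the question
):
    metric.measure(case)
    print(type(metric).__name__, round(metric.score, 2))
```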
[Figure: RAG evaluation map]
6) CI/CD for LLM systems (how to keep it shippable)
Best practices
- Two-tier test suites: small critical subset on every PR; full suite nightly/for releases
- Cache LLM calls for unchanged inputs to cut cost (caching sketch below)
- Plan for flaky judges: human review queue for failures (don’t block merges forever)
Heuristic: Treat eval cost like cloud cost—budget it, optimize it, and put guardrails on it.
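A minimal sketch of the caching idea, assuming a generic `call_llm(prompt)` callable and deterministic eval settings (pinned model version, temperature 0); the cache directory name is arbitrary:

```python
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path(".llm_cache")  # keep out of version control

def cached_call(prompt: str, call_llm) -> str:
    """Return a cached response when this exact prompt has been evaluated before."""
    CACHE_DIR.mkdir(exist_ok=True)
    key = hashlib.sha256(prompt.encode()).hexdigest()
    path = CACHE_DIR / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text())["response"]
    response = call_llm(prompt)
    path.write_text(json.dumps({"prompt": prompt, "response": response}))
    return response
```

For the two-tier suites, one common pattern is a pytest marker (e.g. `@pytest.mark.critical`) so PRs run `pytest -m critical` while the unfiltered suite runs nightly.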
7) LLM-as-a-Judge: how to make it reliable (Critique Shadowing)
The most common failure mode is a vague "rate this 1–10" prompt that produces garbage scores. The fix is aligning the judge with your domain expert.
Critique Shadowing process
- pick one principal domain expert (the source of truth)
- collect 30–50 examples labeled pass/fail, each with a short critique
- build a judge prompt with few-shot critiques (see the sketch below)
- iterate until the judge agrees with the expert on >90% of cases
- error analysis → reveals where your system needs work
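A sketch of what that judge prompt can look like; the few-shot entries are hypothetical stand-ins for the expert's labeled pass/fail + critique pairs:

```python
# Hypothetical few-shot examples sourced from the principal domain expert's labels.
FEW_SHOT = [
    {
        "input": "How do I reset my password?",
        "output": "Click 'Forgot password' on the login page and follow the email link.",
        "critique": "Accurate, complete, and grounded in the product docs.",
        "verdict": "PASS",
    },
    {
        "input": "How do I reset my password?",
        "output": "Email the CEO and ask them to reset it for you.",
        "critique": "Not a supported workflow; would mislead the user.",
        "verdict": "FAIL",
    },
]

def build_judge_prompt(user_input: str, app_output: str) -> str:
    """Binary judge prompt: critique first, then a single PASS/FAIL verdict."""
    shots = "\n\n".join(
        f"Input: {ex['input']}\nOutput: {ex['output']}\n"
        f"Critique: {ex['critique']}\nVerdict: {ex['verdict']}"
        for ex in FEW_SHOT
    )
    return (
        "You are evaluating a support assistant's answer. Write a short critique, "
        "then a single verdict: PASS or FAIL.\n\n"
        f"{shots}\n\nInput: {user_input}\nOutput: {app_output}\nCritique:"
    )
```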
Judge best practices
- Start with binary Pass/Fail (more actionable than 1–10)
- Use pairwise comparisons for subjective criteria (style/helpfulness); see the pairwise sketch below
- Mitigate biases:
  - positional bias → swap A/B order and rerun
  - verbosity bias → instruct a preference for concision / normalize for length
  - self-enhancement bias → avoid judging with the same model family if possible
- Use a strong judge model even if your app model is cheaper
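A sketch of a pairwise judge with the positional-bias mitigation built in, assuming a generic `judge(prompt)` callable; a winner is only accepted when both orderings agree:

```python
def pairwise_judge(question: str, answer_a: str, answer_b: str, judge) -> str:
    """Ask twice with A/B swapped; disagreement between the runs counts as a tie."""
    def ask(first: str, second: str) -> str:
        reply = judge(
            f"Question: {question}\n\nAnswer A:\n{first}\n\nAnswer B:\n{second}\n\n"
            "Which answer is more helpful? Reply with exactly 'A' or 'B'."
        )
        return reply.strip().upper()[:1]  # tolerate minor formatting around the letter

    first_pass = ask(answer_a, answer_b)                      # original order
    second_pass = ask(answer_b, answer_a)                     # swapped order
    second_pass = {"A": "B", "B": "A"}.get(second_pass, "?")  # map back to original labels
    return first_pass if first_pass == second_pass else "TIE"
```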
[Figure: judge calibration loop]
8) Post-production: online eval + bootstrapping (where you actually win)
What to do in prod:
- Tracing: log the full lifecycle (input → retrieval → tool calls → output)
- Feedback signals
  - explicit: thumbs up/down
  - implicit: immediate rephrase, abandonment, copy-paste
- Reference-free scoring using LLM judges (faithfulness, toxicity, relevancy)
- Bootstrapping: convert failures into new offline test cases (see the sketch below)
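A sketch that ties these together: reference-free judging of a single trace plus bootstrapping of failures into a review queue. The trace fields, queue path, and `judge(prompt)` callable are assumptions about your tracing layer, not any specific tool's API:

```python
import json
from datetime import datetime, timezone

FAILURE_QUEUE = "new_eval_cases.jsonl"  # illustrative path; feeds the golden set after human review

def score_and_bootstrap(trace: dict, judge) -> None:
    """Score one live interaction without a reference answer; queue failures as future test cases."""
    reply = judge(
        "Is the answer fully supported by the context? Reply PASS or FAIL.\n\n"
        f"Context:\n{trace['context']}\n\nQuestion:\n{trace['input']}\n\nAnswer:\n{trace['output']}"
    ).strip().upper()
    judged_pass = reply.startswith("PASS")

    thumbs_down = trace.get("feedback") == "down"  # explicit user signal
    if not judged_pass or thumbs_down:
        case = {
            "input": trace["input"],
            "actual_output": trace["output"],
            "context": trace["context"],
            "captured_at": datetime.now(timezone.utc).isoformat(),
            "reason": "user_thumbs_down" if thumbs_down else "judge_fail",
        }
        with open(FAILURE_QUEUE, "a") as f:
            f.write(json.dumps(case) + "\n")
```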
Heuristic: Every incident is a gift—turn it into a regression test.
9) Tools landscape (what to pick when)
| Tool | Best for | Notes |
|---|---|---|
| LangSmith | tracing + datasets + evals | strong LangChain-native workflow |
| DeepEval | open-source eval + Pytest | good CI integration |
| RAGAS | RAG metrics | retrieval + generation metrics |
| Arize Phoenix | observability + eval | open-source stack |
| TruLens | tracing + feedback functions | lightweight option |
Default (AWS + LangChain): LangSmith for tracing/evals + RAGAS-style metrics where applicable.
10) Risk-adjusted quality bar (CTO sanity check)
Perfect eval is impossible; calibrate to risk:
- High-stakes (medical/finance/legal): very high bar for faithfulness + safety, human review, strict guardrails
- Low-stakes (internal summarization): tolerate a higher error rate, rely on review workflows
Heuristic: Don’t chase perfection—chase a measured, improving system.
“Minimum viable eval stack” (what I’d ship first)
- A golden set (50–100) + regression suite
- RAG metrics (precision/recall/faithfulness) if retrieval exists
- LLM judge calibrated via critique shadowing
- Prod tracing + feedback + bootstrapping pipeline
- CI policy: critical subset per PR, full suite nightly