Chapter 12.2: Production Testing & A/B Testing

Make safe, causal, repeatable improvements through production experiments

What this is (and isn’t)

  • Goal: make safe, causal, repeatable improvements to ML systems in production.
  • Not the goal: “ship and pray” or “offline metrics decide everything”.

1) Core mental models

Mental model A — “Experiment = Causality pipeline”

You’re building a pipeline that converts a change → randomized exposure → measured outcomes → causal decision.

If any stage is weak, your conclusion is weak:

  1. Assignment (randomization / bucketing)
  2. Exposure (did users actually see the treatment?)
  3. Measurement (logging + metric correctness)
  4. Analysis (valid stats, no peeking traps)
  5. Decision (ship/rollback/ramp + learning captured)

Mental model B — 4 outcome types you must track

For each variant, always think in:

  • Business outcomes (conversion, revenue, retention, cost)
  • Model quality (task metrics, calibration, relevance, fairness slices)
  • System health (latency, error rate, saturation, timeouts)
  • Safety/guardrails (policy violations, risky actions, fraud, abuse)

Mental model C — “Risk ladder” for launches

Start with the lowest blast radius and climb only when the current rung looks safe: Offline eval → Shadow → Canary → A/B → Ramp to 100%


2) Choose the right production testing method

Quick decision heuristic

  • Need causal impact on user/business metrics? → A/B test
  • Need safety/latency/regression confidence first? → Shadow / Canary
  • Ranking/reco changes with fast feedback needs? → Interleaving (often better sensitivity than A/B)

Comparison table (what to use when)

| Method | What it answers | Best for | Pros | Cons / gotchas |
| --- | --- | --- | --- | --- |
| Shadow (dark launch) | “Does it behave + perform safely under real traffic?” (no user impact) | New models, new features, infra changes | Lowest risk; great for latency/compat | No business signal; label delays still exist |
| Canary (gradual rollout) | “Is it safe and stable at a small %?” | High-risk systems, new serving stack | Detects outages quickly; controlled blast radius | Not causal unless randomized + measured carefully |
| A/B test (RCT) | “Does it cause metric lift?” | Product decisions, model swaps, feature changes | Gold standard for causality | Needs enough traffic + clean instrumentation |
| Interleaving | “Which ranking is better?” (higher sensitivity) | Search/reco/ranking | Faster signal, less variance | Not for arbitrary product metrics; requires a ranking setup |
| Bandit (adaptive) | “Maximize reward while learning” | When opportunity cost is high | Learns while optimizing | Harder analysis; non-stationarity + bias risks |

3) Designing a production experiment (practical checklist)

Step 1 — Write the experiment contract (1 page)

  • Hypothesis: “If we change X, metric Y improves because Z.”
  • Primary metric (single): the one you will optimize.
  • Guardrails: metrics that must not degrade beyond threshold (latency, errors, safety).
  • Target population: who is eligible, who is excluded, what % holdout if needed.
  • Decision rule: ship / don’t ship / iterate, and what counts as “win”.
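
A minimal sketch of this contract captured as reviewable code, assuming a hypothetical ExperimentContract dataclass (field names and values are illustrative, not any particular platform’s schema):

```python
from dataclasses import dataclass


@dataclass
class ExperimentContract:
    """One-page experiment contract captured as reviewable config (illustrative)."""
    experiment_id: str
    hypothesis: str                  # "If we change X, metric Y improves because Z."
    primary_metric: str              # the single metric the decision hinges on
    guardrails: dict[str, float]     # metric -> worst acceptable value
    target_population: str           # eligibility / exclusion description
    traffic_split: dict[str, float]  # variant -> intended share of eligible traffic
    decision_rule: str               # what counts as "win" / ship / iterate


contract = ExperimentContract(
    experiment_id="ranker_v7_vs_v6",
    hypothesis="If we swap ranker v6 for v7, CTR improves because of better tail coverage.",
    primary_metric="ctr",
    guardrails={"p95_latency_ms": 250.0, "error_rate": 0.001, "policy_violations_per_10k": 1.0},
    target_population="logged-in users, excluding internal accounts and bots",
    traffic_split={"control": 0.5, "treatment": 0.5},
    decision_rule="Ship if CTR lift >= MDE at the planned horizon and all guardrails hold.",
)
```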

Step 2 — Define what “success” means statistically

  • Pick:

    • MDE (minimum detectable effect): smallest lift worth shipping.
    • Power (common default: ~80%) and significance level (common default: 5%).
  • Decide upfront:

    • Fixed-horizon (run to planned sample size) or
    • Sequential testing (if you need to peek safely—use proper sequential methods, not ad-hoc)
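
A small sketch of the sample-size arithmetic for a proportion metric, assuming a two-sided two-proportion z-test and the usual closed-form approximation (all numbers are illustrative):

```python
from scipy.stats import norm


def samples_per_variant(baseline_rate: float, mde_abs: float,
                        alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate per-variant sample size for a fixed-horizon two-proportion test.

    baseline_rate: control conversion rate, e.g. 0.040
    mde_abs: smallest absolute lift worth shipping, e.g. 0.002 (4.0% -> 4.2%)
    """
    p1, p2 = baseline_rate, baseline_rate + mde_abs
    z_alpha = norm.ppf(1 - alpha / 2)   # two-sided significance level
    z_beta = norm.ppf(power)            # desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2
    return int(n) + 1


# e.g. 4.0% baseline, +0.2pp MDE, 80% power, 5% significance
print(samples_per_variant(0.040, 0.002))   # roughly 150k users per variant
```

The point of doing this upfront: if the required sample size exceeds the traffic you can realistically route, either raise the MDE or choose a more sensitive method (e.g. interleaving for ranking).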

Step 3 — Instrumentation rules (non-negotiable)

  • Log at minimum:

    • experiment_id, variant, assignment_id (stable), timestamp
    • exposure events (did treatment happen?)
    • outcome events + linkage keys for any delayed labels
  • Ensure idempotent logging (retries won’t double-count).
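
One way the fields above can look in code: a sketch of an event record with a deterministic event_id so that a retried write maps to the same id and de-duplicates downstream (field and function names are illustrative):

```python
import hashlib
import json
import time


def build_event(experiment_id: str, variant: str, assignment_id: str,
                event_type: str, payload: dict) -> dict:
    """Build an exposure/outcome event with a deterministic id for idempotent logging."""
    body = {
        "experiment_id": experiment_id,
        "variant": variant,
        "assignment_id": assignment_id,   # stable per user/unit
        "event_type": event_type,         # "exposure" or "outcome"
        "payload": payload,               # outcome value, linkage keys for delayed labels, ...
        "timestamp": time.time(),
    }
    # Hash only the identifying fields (not the timestamp), so a retry produces
    # the same event_id and downstream consumers can drop the duplicate.
    key = f"{experiment_id}:{assignment_id}:{event_type}:{json.dumps(payload, sort_keys=True)}"
    body["event_id"] = hashlib.sha256(key.encode()).hexdigest()
    return body
```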

Step 4 — Guardrail-first launch

Before you care about lift, you must ensure:

  • no new 5xx spikes
  • no latency regressions
  • no safety regressions
  • no data pipeline breakage

4) The failure modes that kill experiments (and how to catch them)

1) Sample Ratio Mismatch (SRM)

  • Symptom: the traffic split isn’t what you intended.
  • Causes: hashing changes, eligibility bugs, bot traffic, caching, miswired sticky sessions.
  • Practice: run SRM checks early and continuously; treat SRM as “invalid experiment until fixed”.
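
A minimal SRM check is a chi-square goodness-of-fit test of observed assignment counts against the intended split; the sketch below uses scipy and an illustrative alpha of 0.001:

```python
from scipy.stats import chisquare


def srm_check(observed_counts: dict, intended_split: dict, alpha: float = 0.001) -> bool:
    """Return True if observed assignment counts are consistent with the intended split.

    A tiny p-value means the split deviates more than chance allows:
    treat the experiment as invalid until the cause is found.
    """
    variants = sorted(intended_split)
    observed = [observed_counts[v] for v in variants]
    total = sum(observed)
    expected = [intended_split[v] * total for v in variants]
    _, p_value = chisquare(observed, f_exp=expected)
    return p_value >= alpha


# e.g. a "50/50" test that drifted: 50,800 vs 49,200 on 100k users
print(srm_check({"control": 50_800, "treatment": 49_200},
                {"control": 0.5, "treatment": 0.5}))   # False: already a red flag
```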

2) Novelty + learning effects

Short experiments can lie.

  • Users react to “newness”
  • Models may need time to warm up (exploration, caches, feedback loops)

Heuristic: run long enough to cover at least one full behavior cycle (often weekly patterns).

3) Interference + network effects

One user’s treatment affects another’s outcome (marketplaces, social, auctions).

  • Consider cluster randomization, geo splits, or careful scoping.

4) Metric invalidity

If the metric is wrong, the experiment is theatre.

  • Validate metric computation with replay/backfill
  • Ensure consistent definition across variants

5) Peeking without sequential methods

Stopping when p-value “looks good” inflates false positives.

  • Either commit to fixed-horizon or use sequential testing correctly.
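
To see why, here is a small A/A simulation (no true effect) in which a naive z-test is checked at 20 interim peeks and the experiment stops at the first “significant” result; the false-positive rate lands well above the nominal 5%. The setup and numbers are purely illustrative:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)


def aa_test_false_positive(n_users: int = 10_000, peeks: int = 20, alpha: float = 0.05) -> bool:
    """Simulate an A/A test with repeated peeking on a naive two-proportion z-test."""
    a = rng.binomial(1, 0.05, n_users)   # both variants share the same 5% conversion rate
    b = rng.binomial(1, 0.05, n_users)
    checkpoints = np.linspace(n_users // peeks, n_users, peeks, dtype=int)
    for n in checkpoints:
        p_pool = (a[:n].sum() + b[:n].sum()) / (2 * n)
        se = np.sqrt(2 * p_pool * (1 - p_pool) / n)
        if se == 0:
            continue
        z = (a[:n].mean() - b[:n].mean()) / se
        if 2 * (1 - norm.cdf(abs(z))) < alpha:
            return True                  # "significant" at some peek -> false positive
    return False


runs = 2_000
fp_rate = sum(aa_test_false_positive() for _ in range(runs)) / runs
print(f"false-positive rate with 20 peeks: {fp_rate:.2%}")   # well above the nominal 5%
```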

5) Metrics: how to structure them like a pro

Metric taxonomy (use this structure)

  • Primary: single north-star for the experiment
  • Secondary: supporting signals (diagnostics, segmentation)
  • Guardrails: must-not-break (latency, error rate, safety, cost)

Slicing rules (where wins/losses hide)

Always slice results by:

  • new vs returning users
  • geo/device/app version
  • latency buckets
  • head vs tail queries (search/reco)
  • cold-start vs warm-cache
  • key cohorts relevant to fairness/quality
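
A minimal pandas sketch of per-slice lift for one slice key (schema and column names are illustrative):

```python
import pandas as pd

# Illustrative schema: one row per exposed user with its variant, outcome, and slice keys.
events = pd.DataFrame({
    "variant":   ["control", "treatment", "control", "treatment", "control", "treatment"],
    "converted": [0, 1, 1, 1, 0, 0],
    "user_type": ["new", "new", "returning", "returning", "new", "returning"],
})

# Conversion rate per variant within each slice, plus the per-slice lift.
by_slice = (events
            .groupby(["user_type", "variant"])["converted"]
            .mean()
            .unstack("variant"))
by_slice["lift"] = by_slice["treatment"] - by_slice["control"]
print(by_slice)
```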

6) Production rollout strategy (recommended default)

Default rollout recipe (safe + fast)

  1. Shadow: verify parity, latency, resource usage, and failure modes
  2. Canary (1–5%): validate stability + guardrails
  3. A/B (10–50%): measure causal impact
  4. Ramp: 50% → 100% with guardrail monitoring
  5. Holdout (optional): keep small control % long-term to detect drift/regressions

7) Reference architectures (Mermaid)

A) Experiment lifecycle

Diagram 1

B) Traffic routing + measurement plane

Diagram 2

C) Choosing the method (fast decision map)

Diagram 3


8) Operational best practices (what seasoned teams do)

“Always-on” experiment platform capabilities

  • Deterministic bucketing (stable assignment)
  • Feature flags + config decoupled from deploy
  • Real-time guardrail dashboards
  • Automated SRM detection
  • Kill switch (instantly disable the treatment)
  • Replay/backfill for delayed labels (fraud, churn, refunds)
  • Decision log (why shipped, what learned)
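
For the deterministic-bucketing capability above, a common pattern is hash-based assignment: the same (experiment_id, unit_id) pair always lands in the same variant, independent of deploys, and different experiments get independent splits. A sketch under illustrative names:

```python
import hashlib


def assign_variant(experiment_id: str, unit_id: str, split: dict[str, float]) -> str:
    """Deterministically map a unit (user, session, request) to a variant.

    The hash depends only on (experiment_id, unit_id), so assignment is stable
    across requests and deploys, and independent across experiments.
    """
    digest = hashlib.sha256(f"{experiment_id}:{unit_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF      # roughly uniform value in [0, 1]
    cumulative = 0.0
    for variant, share in split.items():
        cumulative += share
        if bucket <= cumulative:
            return variant
    return list(split)[-1]                         # guard against float rounding


print(assign_variant("ranker_v7_vs_v6", "user_12345",
                     {"control": 0.5, "treatment": 0.5}))
```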

Practical guardrail thresholds (how to think)

Instead of one hard global number:

  • use SLOs (p95 latency, error budgets)
  • use cost guardrails (GPU hours / request, cache hit rate)
  • use safety guardrails (policy violations per 10k requests)
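
A sketch of how such guardrails might be evaluated mechanically during a canary or ramp (thresholds and metric names are assumptions for illustration, not recommendations):

```python
# Illustrative guardrail rules: per-metric limit plus the direction that counts as a breach.
GUARDRAILS = {
    "p95_latency_ms":            {"limit": 250.0, "direction": "max"},
    "error_rate":                {"limit": 0.001, "direction": "max"},
    "gpu_hours_per_1k_requests": {"limit": 0.30,  "direction": "max"},
    "cache_hit_rate":            {"limit": 0.85,  "direction": "min"},
    "policy_violations_per_10k": {"limit": 1.0,   "direction": "max"},
}


def guardrail_breaches(live_metrics: dict) -> list[str]:
    """Return the names of any guardrails the current metrics violate."""
    breaches = []
    for name, rule in GUARDRAILS.items():
        value = live_metrics.get(name)
        if value is None:
            breaches.append(f"{name}: missing (treat as a breach)")
        elif rule["direction"] == "max" and value > rule["limit"]:
            breaches.append(f"{name}: {value} > {rule['limit']}")
        elif rule["direction"] == "min" and value < rule["limit"]:
            breaches.append(f"{name}: {value} < {rule['limit']}")
    return breaches


# e.g. feed this from the real-time dashboard; any breach should trigger the kill switch.
print(guardrail_breaches({"p95_latency_ms": 310.0, "error_rate": 0.0004,
                          "gpu_hours_per_1k_requests": 0.2, "cache_hit_rate": 0.9,
                          "policy_violations_per_10k": 0.4}))
```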

9) “Ship/no-ship” decision checklist (copy/paste)

  • No SRM / assignment anomalies
  • Guardrails within limits (latency/errors/safety/cost)
  • Primary metric effect is statistically valid per planned method
  • Results consistent across key slices (or you understand why not)
  • No major logging gaps / exposure mismatch
  • Rollback plan tested; kill switch works
  • Learning captured (what worked, what didn’t, next bet)