Chapter 12.2: Production Testing & A/B Testing
Make safe, causal, repeatable improvements through production experiments
What this is (and isn’t)
- Goal: make safe, causal, repeatable improvements to ML systems in production.
- Not the goal: “ship and pray” or “offline metrics decide everything”.
1) Core mental models
Mental model A — “Experiment = Causality pipeline”
You’re building a pipeline that converts a change → randomized exposure → measured outcomes → causal decision.
If any stage is weak, your conclusion is weak:
- Assignment (randomization / bucketing)
- Exposure (did users actually see the treatment?)
- Measurement (logging + metric correctness)
- Analysis (valid stats, no peeking traps)
- Decision (ship/rollback/ramp + learning captured)
Mental model B — 4 outcome types you must track
For each variant, always think in:
- Business outcomes (conversion, revenue, retention, cost)
- Model quality (task metrics, calibration, relevance, fairness slices)
- System health (latency, error rate, saturation, timeouts)
- Safety/guardrails (policy violations, risky actions, fraud, abuse)
Mental model C — “Risk ladder” for launches
Start with lowest blast radius and climb only if safe: Offline eval → Shadow → Canary → A/B → Ramp to 100%
2) Choose the right production testing method
Quick decision heuristic
- Need causal impact on user/business metrics? → A/B test
- Need safety/latency/regression confidence first? → Shadow / Canary
- Ranking/reco changes where you need fast feedback? → Interleaving (often more sensitive than A/B)
Comparison table (what to use when)
| Method | What it answers | Best for | Pros | Cons / gotchas |
|---|---|---|---|---|
| Shadow (dark launch) | “Does it behave + perform safely under real traffic?” (no user impact) | New models, new features, infra changes | Lowest risk, great for latency/compat | No business signal; label delays still exist |
| Canary (gradual rollout) | “Is it safe at small % and stable?” | High-risk systems, new serving stack | Detects outages quickly; controlled blast radius | Not causal unless randomized + measured carefully |
| A/B test (RCT) | “Does it cause metric lift?” | Product decisions, model swaps, feature changes | Gold standard for causality | Needs enough traffic + clean instrumentation |
| Interleaving | “Which ranking is better?” with higher sensitivity | Search/reco/ranking | Faster signal, less variance | Not for arbitrary product metrics; requires ranking setup |
| Bandit (adaptive) | “Maximize reward while learning” | When opportunity cost is high | Learns while optimizing | Harder analysis; non-stationarity + bias risks |
3) Designing a production experiment (practical checklist)
Step 1 — Write the experiment contract (1 page)
- Hypothesis: “If we change X, metric Y improves because Z.”
- Primary metric (single): the one you will optimize.
- Guardrails: metrics that must not degrade beyond threshold (latency, errors, safety).
- Target population: who is eligible, who is excluded, what % holdout if needed.
- Decision rule: ship / don’t ship / iterate, and what counts as “win”.
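To make the contract concrete and reviewable, here is a minimal sketch of it as a Python dataclass; every field name and value is illustrative, not a standard schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ExperimentContract:
    """One-page experiment contract as code (hypothetical field names)."""
    experiment_id: str
    hypothesis: str                # "If we change X, metric Y improves because Z."
    primary_metric: str            # exactly one optimization target
    guardrails: dict[str, float]   # metric -> limit that must not be breached
    eligible_population: str       # who is in, who is excluded
    traffic_split: dict[str, float]  # variant -> intended share
    mde: float                     # minimum detectable effect (relative lift)
    power: float = 0.80
    alpha: float = 0.05
    decision_rule: str = "ship if primary lift >= MDE and all guardrails hold"

contract = ExperimentContract(
    experiment_id="ranker-v2-q3",
    hypothesis="If we swap ranker v1 for v2, CTR improves due to better tail coverage.",
    primary_metric="ctr",
    guardrails={"p95_latency_ms": 250.0, "error_rate": 0.001},
    eligible_population="logged-in users, excluding internal and bot traffic",
    traffic_split={"control": 0.5, "treatment": 0.5},
    mde=0.02,
)
```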
Step 2 — Define what “success” means statistically
Pick:
- MDE (minimum detectable effect): the smallest lift worth shipping.
- Power (common default: ~80%) and significance level (common default: 5%).
Decide upfront:
- Fixed-horizon (run to the planned sample size), or
- Sequential testing (if you need to peek safely, use proper sequential methods, not ad-hoc stopping).
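A back-of-envelope sample-size calculation using the standard two-proportion normal approximation; treat the result as a planning estimate, not a substitute for your platform's power tooling:

```python
import math
from statistics import NormalDist

def sample_size_per_variant(baseline_rate: float, mde_rel: float,
                            alpha: float = 0.05, power: float = 0.80) -> int:
    """Per-variant sample size for a two-proportion test (normal approximation).

    Assumes a two-sided test and an equal traffic split.
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + mde_rel)   # absolute rate implied by the relative MDE
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

# 5% baseline conversion, 2% relative MDE -> roughly 750k users per arm
print(sample_size_per_variant(0.05, 0.02))
```

Note how quickly small relative MDEs inflate the required sample; this is why the MDE should be the smallest lift *worth shipping*, not the smallest lift you can imagine.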
Step 3 — Instrumentation rules (non-negotiable)
Log at minimum:
- `experiment_id`, `variant`, `assignment_id` (stable), `timestamp`
- exposure events (did the treatment actually happen?)
- outcome events, plus linkage keys for any delayed labels
Ensure idempotent logging (retries won't double-count).
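One common way to get idempotent logging is a deterministic event key, so a retried emit of the same logical event collapses to one row downstream. A minimal sketch, with hypothetical field names rather than any specific platform's schema:

```python
import hashlib
import time

def make_event(experiment_id: str, variant: str, assignment_id: str,
               event_type: str) -> dict:
    """Build an experiment log event with a deterministic idempotency key."""
    event = {
        "experiment_id": experiment_id,
        "variant": variant,
        "assignment_id": assignment_id,   # stable per unit (user/session/device)
        "event_type": event_type,         # "assignment" | "exposure" | "outcome"
        "timestamp": time.time(),
    }
    # The key excludes the timestamp, so a retry of the same logical event
    # yields the same key and the sink can dedupe on it.
    key_material = f"{experiment_id}|{assignment_id}|{event_type}"
    event["event_key"] = hashlib.sha256(key_material.encode()).hexdigest()
    return event
```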
Step 4 — Guardrail-first launch
Before you care about lift, you must ensure:
- no new 5xx spikes
- no latency regressions
- no safety regressions
- no data pipeline breakage
4) The failure modes that kill experiments (and how to catch them)
1) Sample Ratio Mismatch (SRM)
Symptom: the traffic split isn't what you intended.
Causes: hashing changes, eligibility bugs, bot traffic, caching, miswired sticky sessions.
Practice: run SRM checks early and continuously; treat SRM as "invalid experiment until fixed".
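A minimal SRM check is a chi-square test of observed vs. intended counts. The sketch below uses scipy; the very small alpha is conventional for continuously running checks, but both are adjustable assumptions:

```python
from scipy import stats

def srm_check(observed: dict[str, int], expected_share: dict[str, float],
              alpha: float = 0.001) -> bool:
    """Return True if the observed split is consistent with the intended split."""
    total = sum(observed.values())
    variants = sorted(observed)
    obs = [observed[v] for v in variants]
    exp = [total * expected_share[v] for v in variants]
    _, p_value = stats.chisquare(obs, f_exp=exp)
    return p_value >= alpha

# A 50/50 split that drifted: prints False, i.e. flagged as SRM.
print(srm_check({"control": 50_000, "treatment": 48_900},
                {"control": 0.5, "treatment": 0.5}))
```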
2) Novelty + learning effects
Short experiments can lie:
- Users react to "newness".
- Models may need time to warm up (exploration, caches, feedback loops).
Heuristic: run long enough to cover at least one full behavior cycle (often a weekly pattern).
3) Interference + network effects
One user’s treatment affects another’s outcome (marketplaces, social, auctions).
- Consider cluster randomization, geo splits, or careful scoping.
4) Metric invalidity
If the metric is wrong, the experiment is theatre.
- Validate metric computation with replay/backfill
- Ensure consistent definition across variants
5) Peeking without sequential methods
Stopping when p-value “looks good” inflates false positives.
- Either commit to fixed-horizon or use sequential testing correctly.
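For the fixed-horizon route, the analysis is a single test at the planned sample size. A minimal two-proportion z-test sketch; run it once at the horizon, and if you must monitor continuously, swap in a proper sequential procedure instead of re-running this on every peek:

```python
from statistics import NormalDist

def two_proportion_pvalue(conv_c: int, n_c: int, conv_t: int, n_t: int) -> float:
    """Two-sided p-value for a difference in conversion rates,
    evaluated once at the planned sample size (fixed-horizon)."""
    p_c, p_t = conv_c / n_c, conv_t / n_t
    p_pool = (conv_c + conv_t) / (n_c + n_t)          # pooled rate under H0
    se = (p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t)) ** 0.5
    z = (p_t - p_c) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))
```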
5) Metrics: how to structure them like a pro
Metric taxonomy (use this structure)
- Primary: single north-star for the experiment
- Secondary: supporting signals (diagnostics, segmentation)
- Guardrails: must-not-break (latency, error rate, safety, cost)
Slicing rules (where wins/losses hide)
Always slice results by:
- new vs returning users
- geo/device/app version
- latency buckets
- head vs tail queries (search/reco)
- cold-start vs warm-cache
- key cohorts relevant to fairness/quality
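A quick way to surface per-slice lifts, assuming a pandas DataFrame with one row per exposed unit and illustrative column names (`variant`, `converted`):

```python
import pandas as pd

def lift_by_slice(df: pd.DataFrame, slice_col: str) -> pd.DataFrame:
    """Per-slice conversion by variant, pivoted so wins/losses sit side by side."""
    rates = (df.groupby([slice_col, "variant"])["converted"]
               .mean()
               .unstack("variant"))
    rates["abs_lift"] = rates["treatment"] - rates["control"]
    return rates

# e.g. lift_by_slice(events, "device"), lift_by_slice(events, "user_tenure")
```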
6) Production rollout strategy (recommended default)
Default rollout recipe (safe + fast)
- Shadow: verify parity, latency, resource usage, and failure modes
- Canary (1–5%): validate stability + guardrails
- A/B (10–50%): measure causal impact
- Ramp: 50% → 100% with guardrail monitoring
- Holdout (optional): keep small control % long-term to detect drift/regressions
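As a sketch, this recipe can be encoded as a ramp schedule that gating logic reads; the stages and percentages mirror the list above, and all field names are illustrative:

```python
# Advance a stage only if guardrails held for the whole min_days window.
ROLLOUT_PLAN = [
    {"stage": "shadow",  "traffic_pct": 0,   "serves_users": False, "min_days": 3},
    {"stage": "canary",  "traffic_pct": 5,   "serves_users": True,  "min_days": 2},
    {"stage": "ab_test", "traffic_pct": 50,  "serves_users": True,  "min_days": 14},
    {"stage": "ramp",    "traffic_pct": 100, "serves_users": True,  "min_days": 7},
]
# An optional long-term holdout keeps a small % on control to catch drift.
```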
7) Reference architectures (Mermaid)
A) Experiment lifecycle
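A minimal sketch of the lifecycle, following the checklist in section 3 and the rollout recipe in section 6:

```mermaid
flowchart LR
    A["Write experiment contract"] --> B["Instrument assignment, exposure, outcome logging"]
    B --> C["Shadow / canary guardrail checks"]
    C --> D["Launch A/B at planned split"]
    D --> E["Monitor guardrails + SRM continuously"]
    E --> F["Analyze at planned horizon"]
    F --> G{"Decision"}
    G -->|win| H["Ramp to 100%, log decision"]
    G -->|loss or invalid| I["Rollback, capture learning"]
```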

B) Traffic routing + measurement plane
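A sketch of the routing and measurement plane: deterministic bucketing feeding an event log, which feeds metrics and SRM checks:

```mermaid
flowchart LR
    U["User request"] --> R["Deterministic bucketing: hash(unit_id, experiment_id)"]
    R -->|control| C["Current model / feature"]
    R -->|treatment| T["Candidate model / feature"]
    C --> L["Event log: assignment, exposure, outcomes"]
    T --> L
    L --> M["Metrics pipeline + automated SRM checks"]
    M --> D["Guardrail dashboards + analysis"]
```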

C) Choosing the method (fast decision map)
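A sketch of the decision heuristic from section 2:

```mermaid
flowchart TD
    Q{"What do you need?"} -->|causal impact on business metrics| AB["A/B test"]
    Q -->|safety and latency confidence first| S["Shadow, then canary"]
    Q -->|ranking comparison with fast feedback| I["Interleaving"]
    Q -->|minimize opportunity cost while learning| B["Bandit"]
```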

8) Operational best practices (what seasoned teams do)
“Always-on” experiment platform capabilities
- Deterministic bucketing (stable assignment; see the sketch after this list)
- Feature flags + config decoupled from deploy
- Real-time guardrail dashboards
- Automated SRM detection
- Kill switch (instant disable treatment)
- Replay/backfill for delayed labels (fraud, churn, refunds)
- Decision log (why shipped, what learned)
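A minimal deterministic-bucketing sketch: hashing the unit together with the experiment id keeps assignment stable within an experiment and independent across experiments. Function and parameter names are illustrative:

```python
import hashlib

def bucket(unit_id: str, experiment_id: str, variants: dict[str, float]) -> str:
    """Deterministically assign a unit to a variant by hashing into [0, 1)."""
    digest = hashlib.sha256(f"{experiment_id}:{unit_id}".encode()).hexdigest()
    point = int(digest[:15], 16) / 16**15   # uniform in [0, 1)
    cumulative = 0.0
    for variant, share in variants.items():
        cumulative += share
        if point < cumulative:
            return variant
    return variant  # guard against float rounding on the last bucket

# Same inputs always yield the same assignment.
assert (bucket("user-42", "ranker-v2", {"control": 0.5, "treatment": 0.5})
        == bucket("user-42", "ranker-v2", {"control": 0.5, "treatment": 0.5}))
```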
Practical guardrail thresholds (how to think)
Instead of one hard global number:
- use SLOs (p95 latency, error budgets)
- use cost guardrails (GPU hours / request, cache hit rate)
- use safety guardrails (policy violations per 10k requests)
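One way to operationalize this is a guardrail spec of SLO-style per-metric limits that an automated check evaluates; the metric names and limits below are illustrative:

```python
GUARDRAILS = {
    "p95_latency_ms":            {"limit": 250.0, "direction": "max"},
    "error_rate":                {"limit": 0.001, "direction": "max"},
    "gpu_hours_per_1k_req":      {"limit": 0.5,   "direction": "max"},
    "policy_violations_per_10k": {"limit": 2.0,   "direction": "max"},
    "cache_hit_rate":            {"limit": 0.85,  "direction": "min"},
}

def guardrails_ok(observed: dict[str, float]) -> bool:
    """True only if every observed guardrail metric respects its limit."""
    for name, spec in GUARDRAILS.items():
        value = observed[name]
        if spec["direction"] == "max" and value > spec["limit"]:
            return False
        if spec["direction"] == "min" and value < spec["limit"]:
            return False
    return True
```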
9) “Ship/no-ship” decision checklist (copy/paste)
- No SRM / assignment anomalies
- Guardrails within limits (latency/errors/safety/cost)
- Primary metric effect is statistically valid per planned method
- Results consistent across key slices (or you understand why not)
- No major logging gaps / exposure mismatch
- Rollback plan tested; kill switch works
- Learning captured (what worked, what didn’t, next bet)