Chapter 5: A/B Testing for ML Models
Practical playbook for running experiments on production ML systems: design, randomization, analysis, and common pitfalls
An online ML A/B test is causal inference with operational constraints:
- Randomization creates comparable groups (causality)
- Sizing controls false alarms + misses (statistics)
- Instrumentation + invariants prevent “invalid wins” (systems)
Golden rule: randomize at a unit coarse enough to contain interference (usually the user), and analyze at that same unit.
1) Design & sizing
1.1 Define the decision first
Write down:
- Primary metric (one)
- Guardrails (a few)
- Ship rule (example): “Ship if the primary metric improves by at least the MDE, with a 95% CI excluding 0, and no guardrail regresses beyond tolerance.”
This prevents p-hacking and metric shopping.
1.2 MDE, α, power → sample size
Definitions
- MDE: minimum detectable effect you care about (practical).
- α: false positive tolerance (often 0.05).
- Power (1−β): probability of detecting the MDE (often 0.8 or 0.9).
Sizing intuition
Sample size grows fast when:
- baseline rate is low (rare conversions)
- variance is high (AOV, latency)
- MDE is tiny (trying to detect 0.1% lift)
Heuristic: decide MDE from business value and guardrail risk; don’t size for “whatever is detectable.”
Approx sizing formulas (useful in conversation)
Two-proportion (CTR/conversion). Let p be the baseline rate and Δ the target absolute lift. Roughly:
n ∝ p(1 − p) / Δ²
(per arm; the constant is 2·(z_{1−α/2} + z_{1−β})², ≈ 15.7 for α = 0.05 and 80% power)
Mean metrics (AOV, latency)
n ∝ σ² / Δ²
where σ is the std dev of the per-unit metric (same constant applies).
Key production point: compute σ on the right unit (user-level aggregates), not raw events.
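A minimal sizing sketch of these formulas (function names are mine, normal approximation only):

```python
from scipy.stats import norm

def n_per_arm_proportion(p, delta, alpha=0.05, power=0.80):
    """Per-arm n for a two-proportion test (normal approximation)."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return 2 * z**2 * p * (1 - p) / delta**2

def n_per_arm_mean(sigma, delta, alpha=0.05, power=0.80):
    """Per-arm n for a difference in means; sigma must be the std dev
    of the per-user aggregate, not of raw events."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return 2 * z**2 * sigma**2 / delta**2

# e.g. 5% baseline conversion, detect +0.5pp absolute:
# n_per_arm_proportion(0.05, 0.005) ≈ 29,800 users per arm
```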
1.3 Duration
Duration isn’t just “n / traffic”. You must cover:
- day-of-week cycles
- seasonality/campaigns
- delayed outcomes (conversion lag)
Heuristic: minimum 1–2 full weekly cycles for consumer products; more if strong seasonality or delayed conversion.
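A sketch of that heuristic (the round-up-to-full-weeks rule is my assumption, not a fixed standard):

```python
import math

def duration_days(n_per_arm, daily_users_per_arm, min_weeks=2):
    """Required days, rounded up to full weekly cycles so day-of-week
    effects are covered; add slack separately for conversion lag."""
    raw_days = math.ceil(n_per_arm / daily_users_per_arm)
    weeks = max(min_weeks, math.ceil(raw_days / 7))
    return weeks * 7
```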
1.4 Variance estimation & CUPED
CUPED (variance reduction)
Use a pre-experiment covariate correlated with the metric (e.g., user’s past 7-day spend) to reduce variance:
- Adjust the metric by subtracting its predictable part: Y_adj = Y − θ·(X − mean(X)), with θ = cov(Y, X) / var(X).
- Same mean effect, lower variance → smaller sample required.
Heuristic: CUPED is high ROI if metric is noisy and you have good pre-period covariates.
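A minimal CUPED sketch using the standard θ = cov(Y, X)/var(X) estimator (function name is mine):

```python
import numpy as np

def cuped_adjust(y, x):
    """y: in-experiment metric per user; x: pre-period covariate per user
    (e.g., past 7-day spend). Returns the variance-reduced metric."""
    y, x = np.asarray(y, float), np.asarray(x, float)
    theta = np.cov(y, x, ddof=1)[0, 1] / np.var(x, ddof=1)
    return y - theta * (x - x.mean())
```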
1.5 Randomization unit: user vs session vs request
- User-level: default for most ML changes (avoids contamination).
- Session-level: only if sessions are effectively independent and you can tolerate the same user seeing different variants across sessions.
- Request-level: generally risky for ranking/recs (user sees both variants → interference).
Rule: choose the unit where treatment does not spill over.
1.6 Stratification / blocking / cluster-randomization
Stratify when
You have known high-variance segments (geo/device/new vs returning). Stratifying ensures balance and improves power.
Cluster randomization
If interference exists within clusters (household, company account, marketplace network), randomize at cluster.
Heuristic: if users can influence each other’s experience, user-level randomization may still be invalid.
1.7 SRM (Sample Ratio Mismatch)
SRM = your traffic split is not what you intended (50/50 becomes 52/48). Often indicates:
- bucketing bug
- filtering differences
- instrumentation issues
- bots/routing differences
Practical
- Always run an SRM check early (chi-square test on counts).
- Treat SRM as “invalidate experiment until explained.”
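A minimal SRM check (the strict α = 0.001 threshold is a common convention, my assumption here, not from this text):

```python
from scipy.stats import chisquare

def srm_detected(n_control, n_treatment, expected_share=0.5, alpha=0.001):
    """Chi-square test of observed arm counts against the intended split.
    True means: stop and explain the mismatch before trusting results."""
    total = n_control + n_treatment
    expected = [total * expected_share, total * (1 - expected_share)]
    _, p = chisquare([n_control, n_treatment], f_exp=expected)
    return p < alpha
```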
2) Metric choice & estimation
2.1 Guardrails vs success metrics
- Success metric: what you want to improve (CTR, retention, revenue, relevance).
- Guardrails: must not regress (latency p95, error rate, diversity, safety complaints, cost).
Heuristic: 1 primary, 3–7 guardrails. Too many = paralysis and multiple-testing hell.
2.2 Ratio metrics (CTR, AOV per user) — the classic pitfall
CTR can be defined as:
- Global ratio: total clicks / total impressions
- Mean of user CTRs: average(clicks/impressions per user)
They are not equivalent. Pick based on what you want to optimize.
Heuristic: analyze at the randomization unit (user). Compute per-user numerator/denominator, then aggregate.
Delta method vs bootstrap
- Delta method: analytic approximation for ratio variance; fast, common.
- Bootstrap: robust and flexible; slower but safer for complex metrics.
Rule: if metric is non-linear/ugly or heavy-tailed → bootstrap.
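A bootstrap sketch for the global-ratio CTR difference that resamples users, keeping the analysis unit equal to the randomization unit (inputs are hypothetical per-user numpy arrays of aggregates):

```python
import numpy as np

def bootstrap_ctr_diff_ci(clicks_t, imps_t, clicks_c, imps_c,
                          n_boot=10_000, seed=0):
    """Percentile CI for global CTR(treatment) − CTR(control),
    resampling users, not raw impressions."""
    rng = np.random.default_rng(seed)
    nt, nc = len(clicks_t), len(clicks_c)
    diffs = np.empty(n_boot)
    for b in range(n_boot):
        it = rng.integers(0, nt, nt)   # resample treatment users
        ic = rng.integers(0, nc, nc)   # resample control users
        diffs[b] = (clicks_t[it].sum() / imps_t[it].sum()
                    - clicks_c[ic].sum() / imps_c[ic].sum())
    return np.percentile(diffs, [2.5, 97.5])
```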
2.3 Heavy-tailed metrics (AOV, time spent, latency)
Typical strategies:
- Log transform (model multiplicative effects)
- Trimmed mean (drop extreme tails)
- Winsorization (cap extremes)
- Report quantiles (p50/p95/p99) rather than only means
Heuristic: for spend/time, use robust methods + bootstrap CIs.
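A sketch of those robust summaries (cutoffs are illustrative, not prescribed):

```python
import numpy as np
from scipy import stats

def robust_view(x, winsor_pct=1.0, trim_frac=0.01):
    """Winsorized mean, trimmed mean, and tail quantiles for a
    heavy-tailed per-user metric like spend or time."""
    x = np.asarray(x, float)
    lo, hi = np.percentile(x, [winsor_pct, 100 - winsor_pct])
    return {
        "winsorized_mean": np.clip(x, lo, hi).mean(),
        "trimmed_mean": stats.trim_mean(x, trim_frac),
        "p50_p95_p99": np.percentile(x, [50, 95, 99]),
    }
```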
2.4 Sensitivity analysis
Before shipping:
- check lift consistency across key slices (new/returning, geo, device)
- check metric definition variants (global ratio vs per-user ratio)
- check effect across time (novelty fade)
Heuristic: if lift only exists in one slice, treat it as hypothesis → rerun with targeted stratification.
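A pandas sketch for the slice check (column names are hypothetical; one row per user):

```python
import pandas as pd

def lift_by_slice(df, slice_col, metric_col="metric", arm_col="arm"):
    """Relative lift per slice: treatment mean / control mean − 1.
    Assumes arm values 'control' and 'treatment'."""
    means = df.groupby([slice_col, arm_col])[metric_col].mean().unstack(arm_col)
    return means["treatment"] / means["control"] - 1
```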
3) Analysis: which test for which metric
3.1 Default: estimate effect + CI
Avoid “p-value only.” Provide:
- point estimate (absolute + relative)
- 95% CI
- decision against MDE
3.2 Test selection map (practical)
Binary outcomes (conversion)
- Two-proportion z-test / chi-square
- If counts are small → Fisher exact
- Better practice at scale: compute per-user conversion and analyze it (robust/bootstrap methods work too), keeping the analysis unit consistent with randomization
Means (AOV, revenue per user, latency after transform)
- Welch’s t-test (default)
- If heavy-tailed → bootstrap CI or permutation test
Quantiles (p95/p99 latency)
- Bootstrap CI for quantile difference (quantile regression is also possible but heavier)
Ranking metrics (NDCG, MRR) in online setting
- Often computed per-user/per-session; use bootstrap or permutation at that unit.
Heuristic: permutation/bootstrap is the “universal solvent” when assumptions are unclear.
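A sketch pairing the default Welch test with a permutation test as a cross-check (n_perm is illustrative):

```python
import numpy as np
from scipy import stats

def welch_plus_permutation(a, b, n_perm=10_000, seed=0):
    """Returns (Welch p-value, permutation p-value) for a mean difference."""
    _, p_welch = stats.ttest_ind(a, b, equal_var=False)
    rng = np.random.default_rng(seed)
    obs = abs(np.mean(a) - np.mean(b))
    pooled = np.concatenate([a, b])
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # relabel arms under the null
        if abs(pooled[:len(a)].mean() - pooled[len(a):].mean()) >= obs:
            hits += 1
    return p_welch, (hits + 1) / (n_perm + 1)
```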
3.3 Multiple metrics correction
If you look at many metrics and pick winners, you inflate false positives.
- Base the ship decision on the pre-registered primary metric.
- For many secondary metrics: control FDR (BH) or treat as exploratory.
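Benjamini–Hochberg via statsmodels (the p-values here are made up):

```python
from statsmodels.stats.multitest import multipletests

pvals = [0.003, 0.04, 0.20, 0.01, 0.76]  # secondary metrics (illustrative)
reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
print(list(zip(pvals, p_adj, reject)))
```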
4) Real-world gotchas (the stuff that breaks experiments)
4.1 Interference & network effects
- Recs/search can change marketplace dynamics, inventory, seller behavior.
- One user’s treatment can affect another user’s outcomes.
Mitigations:
- cluster randomization
- geo experiments
- switchback designs (time-based) for platform-wide changes
4.2 Novelty effects
Users react to change initially, then revert.
Mitigation:
- run long enough to see stabilization
- analyze time-sliced effects (day 1 vs day 7)
4.3 Caching, non-independence, repeated exposure
- CDN/app caches can cause one variant’s results to leak to another.
- Users can see both variants if bucketing isn’t sticky.
Mitigation:
- sticky assignment, cache keys include variant, analyze at user level.
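A minimal sticky-bucketing sketch (the hashing scheme is a common pattern, my assumption, not prescribed by the text):

```python
import hashlib

def assign(user_id: str, experiment_id: str, treatment_share: float = 0.5) -> str:
    """Deterministic: the same user always lands in the same arm,
    and different experiments get independent splits."""
    digest = hashlib.sha256(f"{experiment_id}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # map hash to [0, 1]
    return "treatment" if bucket < treatment_share else "control"
```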
4.4 Logging changes mid-experiment
Any instrumentation change can create fake lift.
Mitigation:
- freeze logging; use invariants (event counts, schema checks).
4.5 Drift during experiment (seasonality, campaigns)
If baseline shifts while running, naive analysis misleads.
Mitigation:
- run full cycles; stratify by time; CUPED; switchback if needed.
4.6 Delayed feedback + missingness
Conversions happen days later; labels are censored.
- naive “conversions so far” is biased whenever a variant shifts conversion timing (delaying or accelerating it).
Mitigation:
- choose a fixed attribution window
- survival analysis / delay modeling for serious cases
- report both “early” and “matured” metrics
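A sketch of the fixed-window “matured” metric (pandas; the column schema is hypothetical and timestamps are assumed tz-naive):

```python
import pandas as pd

def matured_conversion_rate(df, window_days=7, now=None):
    """Conversion within a fixed attribution window, computed only on
    users whose window has fully elapsed (avoids right-censoring bias).
    Expects columns: exposure_ts, conversion_ts (NaT if no conversion)."""
    now = now or pd.Timestamp.now()
    mature = df[df["exposure_ts"] <= now - pd.Timedelta(days=window_days)]
    lag = (mature["conversion_ts"] - mature["exposure_ts"]).dt.days
    return lag.between(0, window_days).mean()  # NaT lags count as False
```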
5) ML-specific concerns
5.1 Feedback loops
In ranking/recs, new model changes what users see → changes logged data. This can amplify or hide effects.
Mitigation:
- guardrails on diversity/coverage
- monitor distribution shifts in served items
- consider interleaving (for ranking) when appropriate
5.2 Model + policy coupling
If you change threshold, you change base rates and downstream workloads. Example: fraud threshold changes alert volume.
Mitigation:
- evaluate policy impact: precision/recall at operating points, capacity constraints
- cost-based evaluation, not just AUC
5.3 Offline-online mismatch
Offline metrics may not predict online outcomes. Common in recsys/search, and with human feedback loops.
Mitigation:
- treat offline as filter; online as truth
- maintain an offline–online correlation tracker over time
A concrete “step-by-step runbook”
- Define primary metric, guardrails, MDE, α, power, duration
- Pick unit of randomization (usually user) + ensure sticky assignment
- Instrument invariants (SRM, event counts, latency, errors)
- Estimate variance on historical data (per-unit)
- Size sample + duration (+ plan for lag)
- Launch with ramp-up (1% → 10% → 50%) while watching guardrails/SRM
- Analyze using effect + CI; use robust/bootstrap for heavy-tailed/complex metrics
- Decide against MDE + guardrails; correct for multiple comparisons if exploring many
- Post-mortem: did assumptions hold? update playbook + variance estimates
Tiny “code pointer” (no long code)
In Python, you’ll commonly use:
- statsmodels.stats.proportion.proportions_ztest(count, nobs) (two-proportion z-test)
- scipy.stats.ttest_ind(..., equal_var=False) (Welch)
- scipy.stats.mannwhitneyu (rank-based)
- bootstrap via numpy.random.choice or scipy.stats.bootstrap
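For instance, a bootstrap CI on a mean difference via scipy.stats.bootstrap (toy data, percentile method):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.exponential(1.00, 5000)  # control: fake heavy-tailed metric
b = rng.exponential(1.05, 5000)  # treatment

def mean_diff(x, y, axis):
    return np.mean(y, axis=axis) - np.mean(x, axis=axis)

res = stats.bootstrap((a, b), mean_diff, n_resamples=2000,
                      method="percentile", random_state=0)
print(res.confidence_interval)
```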