Chapter 2: Core Distributions for MLOps
Essential probability distributions in production ML: Normal, Student-t, Bernoulli/Binomial, Poisson, Exponential/Gamma, Beta, Dirichlet, Lognormal, heavy-tailed families, Negative Binomial, mixtures, Categorical/Multinomial, and when to use each
Distributions
Normal (Gaussian, "z")
What it models
- Measurement noise, aggregated effects, residual errors.
- The “default” approximation for averages, via the Central Limit Theorem (CLT).
Where it appears in ML systems
- Regression residual assumptions (even if imperfect).
- Error distributions for monitoring (e.g., prediction error aggregated over time).
- Many statistical tests/intervals assume approximate normality of an estimator.
How to recognize
- Symmetric, bell-shaped; tails not too wild.
- Histogram of residuals roughly symmetric; Q-Q plot close to the reference line.
Practical notes
- Normality of raw data is often false; normality of mean / estimator can still hold.
- If you see heavy tails → switch to robust stats or transform.
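A minimal sketch of the Q-Q check on residuals, using synthetic data as a stand-in (the array name and the use of Shapiro-Wilk here are illustrative assumptions, not a fixed recipe):

```python
# Minimal sketch: eyeballing normality of residuals (synthetic stand-in data).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
residuals = rng.normal(loc=0.0, scale=1.0, size=500)  # stand-in for real model residuals

# Q-Q statistics: how well do sample quantiles track the Normal reference line?
(osm, osr), (slope, intercept, r) = stats.probplot(residuals, dist="norm")
print(f"Q-Q fit r = {r:.3f}  (close to 1.0 -> approximately Normal)")

# A formal test as a hint, not a verdict (it rejects trivially at large N):
stat, p = stats.shapiro(residuals)
print(f"Shapiro-Wilk p = {p:.3f}")
```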
Student-t
What it models
- Uncertainty in the mean when variance is unknown and sample size is small.
- Like Normal but heavier tails (more conservative).
Where it appears
- A/B tests on means (average order value, time-on-site) when N is not huge.
- Confidence intervals for mean with unknown σ.
Recognize
- Same shape as Normal but with heavier tails; converges to Normal as df↑.
Rule
- If you’re doing “mean difference” with unknown variance, you’re in t-world (or Welch’s t if variances differ).
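A minimal sketch of the Welch's t case described above; the group names and effect sizes are made up:

```python
# Minimal sketch: Welch's t-test for a mean difference with unknown, unequal variances.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
control = rng.normal(10.0, 2.0, size=40)    # e.g., AOV in the control arm (synthetic)
treatment = rng.normal(10.8, 2.5, size=45)  # e.g., AOV in the treatment arm (synthetic)

# equal_var=False selects Welch's t (the safe default when variances may differ)
t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```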
Bernoulli / Binomial
Bernoulli
- Single trial 0/1 event: click, convert, success/failure label.
Binomial
- Sum of n independent Bernoulli trials: number of conversions out of n visits.
Where it appears
- CTR, conversion rate, precision@k (binary outcomes), label prevalence.
- Online monitoring of success rates.
Recognize
- Data is 0/1 or counts of successes.
- Variance tied to mean: p(1-p) per trial.
What you do
- CIs for proportions (Wilson often beats naive Wald).
- Hypothesis tests: z-test / chi-square / Fisher exact.
- Bayesian: Beta posterior over p (see the Beta section below).
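A minimal sketch of the Wilson interval mentioned above, implemented from the standard score-interval formula (the function name and example counts are invented):

```python
# Minimal sketch: Wilson confidence interval for a conversion rate.
from math import sqrt
from scipy.stats import norm

def wilson_ci(successes: int, n: int, alpha: float = 0.05) -> tuple[float, float]:
    """Wilson score interval; better coverage than naive Wald for small n or extreme p."""
    z = norm.ppf(1 - alpha / 2)
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

print(wilson_ci(12, 200))  # e.g., 12 conversions out of 200 visits
```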
Poisson (counts per interval)
What it models
- Counts of events in fixed time/space when events happen independently at a constant rate.
Where it appears
- Request counts per second/minute, errors per minute, clicks per hour.
- Incident rates.
Recognize
- Nonnegative integers; often variance ≈ mean (key property).
- Inter-arrival times ~ Exponential (dual relationship).
What breaks it
- Over-dispersion (variance >> mean) from burstiness, heterogeneity, seasonality.
What you do
- If variance >> mean → Negative Binomial (or mixture models).
- For monitoring: often use Poisson process intuition + change-point detection.
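A minimal sketch of the variance ≈ mean check on synthetic per-minute counts; the one-day window and the data are arbitrary stand-ins:

```python
# Minimal sketch: checking the Poisson "variance ~ mean" property on event counts.
import numpy as np

rng = np.random.default_rng(2)
counts = rng.poisson(lam=7.0, size=1440)  # stand-in for errors per minute over a day

mean, var = counts.mean(), counts.var(ddof=1)
print(f"mean={mean:.2f}, var={var:.2f}, dispersion={var / mean:.2f}")
# dispersion ~ 1 -> Poisson plausible; >> 1 -> over-dispersed, consider Negative Binomial
```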
Exponential / Gamma
Exponential
What it models
- Waiting time between Poisson events (“time to next request/click”).
Where it appears
- Inter-arrival times, simplified failure times.
Recognize
- Positive continuous; memoryless property.
Gamma
What it models
- Sum of exponentials (waiting time to the k-th event).
- Positive skewed continuous values.
Where it appears
- Latency/service time modeling (sometimes), durations, time-to-complete tasks.
- Bayesian conjugate prior for Poisson rate (Gamma–Poisson).
Recognize
- Positive, right-skewed; can be flexible (shape controls skew).
Heuristic
- If it’s a positive skewed duration and lognormal isn’t perfect, Gamma is often a good candidate.
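A minimal sketch of fitting a Gamma to positive skewed durations with SciPy; pinning loc at 0 assumes the support genuinely starts at zero:

```python
# Minimal sketch: fitting a Gamma to positive, right-skewed durations (floc=0 pins support at zero).
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
durations = rng.gamma(shape=2.0, scale=1.5, size=1000)  # stand-in for task durations (seconds)

shape, loc, scale = stats.gamma.fit(durations, floc=0)
print(f"shape={shape:.2f}, scale={scale:.2f}")

# Sanity-check the fit on a few quantiles rather than the mean alone:
for q in (0.5, 0.9, 0.99):
    print(q, np.quantile(durations, q), stats.gamma.ppf(q, shape, loc=loc, scale=scale))
```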
Beta
What it models
- A distribution over a probability p ∈ [0,1].
Where it appears
- Bayesian modeling of CTR/conversion.
- Uncertainty over rates, especially with small sample sizes.
Why it’s loved
- Conjugate with Bernoulli/Binomial → posterior update is simple:
  prior Beta(α, β) + data (successes, failures) → posterior Beta(α + successes, β + failures)
Recognize
- Values in [0,1], can be uniform, U-shaped, skewed, peaked depending on α,β.
Heuristic
- When you want “probability with uncertainty” for rates, Beta is the natural language.
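A minimal sketch of the conjugate update for a CTR; the Beta(1, 1) prior and the counts are illustrative choices:

```python
# Minimal sketch: Beta-Binomial posterior update for a CTR, with a weak Beta(1, 1) prior.
from scipy import stats

alpha_prior, beta_prior = 1.0, 1.0     # uniform prior over the click rate
clicks, impressions = 12, 200          # observed data (synthetic)

posterior = stats.beta(alpha_prior + clicks, beta_prior + (impressions - clicks))
print(f"posterior mean = {posterior.mean():.4f}")
print(f"95% credible interval = {posterior.interval(0.95)}")
```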
Dirichlet
What it models
- Distribution over a probability vector (multiclass proportions), generalization of Beta.
Where it appears
- Class distribution drift monitoring (multinomial outcomes).
- Bayesian smoothing for categorical frequencies (tokens/categories).
Practical use
- Helps avoid zero-probabilities (smoothing) when computing divergences like KL/JS.
- Conjugate prior for Multinomial.
Heuristic
- If you have K categories and want a principled “prior + counts → posterior proportions,” Dirichlet.
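A minimal sketch of Dirichlet (add-α) smoothing before a JS computation; α = 1 and the count vectors are arbitrary stand-ins:

```python
# Minimal sketch: Dirichlet(alpha=1) smoothing of category counts before JS divergence,
# so zero counts don't blow up the KL terms.
import numpy as np
from scipy.spatial.distance import jensenshannon

def smoothed_props(counts, alpha=1.0):
    counts = np.asarray(counts, dtype=float)
    return (counts + alpha) / (counts.sum() + alpha * len(counts))  # posterior-mean proportions

ref = smoothed_props([500, 300, 200, 0])   # reference window (note the zero count)
cur = smoothed_props([450, 280, 240, 30])  # current window
print(f"JS distance = {jensenshannon(ref, cur):.4f}")
```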
Lognormal
What it models
- Positive quantities where the log is approximately normal.
- Multiplicative processes (many small multiplicative factors).
Where it appears
- Latency, response sizes, spend/AOV, time-on-task.
Recognize
- Positive, strong right tail; log(x) looks more normal.
- Mean >> median (common sign).
What you do
- Analyze on log scale (t-tests on log values often behave better).
- Report medians/quantiles; use trimmed means for heavy tails.
Heuristic
- For latency/spend: try log transform early.
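A minimal sketch of the “analyze on the log scale” advice, on synthetic lognormal latencies:

```python
# Minimal sketch: comparing latency between two variants on the log scale.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
a = rng.lognormal(mean=3.0, sigma=0.6, size=500)  # stand-in latencies (ms), variant A
b = rng.lognormal(mean=3.1, sigma=0.6, size=500)  # variant B

# Heavy right tail -> a t-test on log(x) behaves far better than on raw values
t_stat, p = stats.ttest_ind(np.log(a), np.log(b), equal_var=False)
print(f"p = {p:.3f}; median A = {np.median(a):.1f} ms, median B = {np.median(b):.1f} ms")
```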
Pareto / heavy-tailed families
What it models
- “A few huge values dominate” (power-law behavior).
Where it appears
- Spend/user, session durations, content popularity, tail latencies, long-tail item interactions.
Why it matters in MLOps
- Means become unstable; variance may be enormous.
- A/B tests and monitoring get noisy unless you use robust methods.
Recognize
- Extreme outliers; log-log plots of the tail look roughly linear; the top 1% contributes a huge fraction.
What you do
- Use robust stats: median, trimmed mean, winsorization.
- Consider quantile-based metrics; bootstrap.
- Sometimes model the tail separately (extreme value theory) if it’s core.
Heuristic
- If one user can swing your metric, you’re in heavy-tail land.
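A minimal sketch of the robust toolkit on a synthetic heavy-tailed spend metric; the trim fraction and bootstrap settings are arbitrary, and stats.bootstrap needs SciPy ≥ 1.7:

```python
# Minimal sketch: robust summaries for a heavy-tailed metric (e.g., spend per user).
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
spend = rng.pareto(a=1.5, size=5_000) * 10  # stand-in heavy-tailed spend

print(f"mean    = {spend.mean():.1f}   (dominated by a few whales)")
print(f"median  = {np.median(spend):.1f}")
print(f"trimmed = {stats.trim_mean(spend, 0.05):.1f}  (5% trimmed each side)")

# Bootstrap CI for the median (SciPy >= 1.7)
res = stats.bootstrap((spend,), np.median, confidence_level=0.95, n_resamples=1000)
print(f"median 95% CI = {res.confidence_interval}")
```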
Negative Binomial (over-dispersed counts)
What it models
- Counts when variance > mean (Poisson is too “tight”).
- Often: a Poisson rate that varies across users/time (Gamma–Poisson mixture).
Where it appears
- Click counts per user, errors per service with burstiness, events per session.
Recognize
- Over-dispersion: sample variance >> sample mean.
- Many zeros + occasional big counts.
What you do
- Use NB for modeling; for tests, use robust/permutation methods if assumptions are shaky.
Heuristic
- If Poisson underestimates spikes, switch to Negative Binomial.
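A minimal sketch of detecting over-dispersion and moment-matching a Negative Binomial; the Gamma-Poisson simulation stands in for real per-user click counts:

```python
# Minimal sketch: method-of-moments Negative Binomial fit when counts are over-dispersed.
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
# Gamma-Poisson mixture: each user has their own rate -> over-dispersed counts
rates = rng.gamma(shape=2.0, scale=2.0, size=5000)
clicks = rng.poisson(rates)

m, v = clicks.mean(), clicks.var(ddof=1)
print(f"mean={m:.2f}, var={v:.2f}  (var >> mean -> Poisson too tight)")

# Moment-match NB: var = m + m^2/n  ->  n = m^2 / (v - m), p = n / (n + m)
n = m**2 / (v - m)
p = n / (n + m)
print(f"NB(n={n:.2f}, p={p:.2f}); P(X >= 20) ~ {stats.nbinom.sf(19, n, p):.4f}")
```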
Mixture models (multi-modal behavior)
What it models
- Data generated by multiple latent subpopulations.
Where it appears
- New vs returning users, bots vs humans, enterprise vs consumer traffic.
- Latency: cache hit vs miss.
- CTR: different intent cohorts.
Recognize
- Histogram shows multiple peaks; metrics differ wildly across segments.
What you do
- Segment explicitly (preferred in production).
- Or fit mixture (Gaussian mixture, etc.) for analysis.
- Monitor per-segment; global metrics can hide failures.
Heuristic
- If “overall average” doesn’t describe any real user group, suspect a mixture.
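A minimal sketch of fitting a two-component Gaussian mixture to log-latency, assuming scikit-learn is available; the hit/miss populations are simulated:

```python
# Minimal sketch: 2-component Gaussian mixture on log-latency (cache hit vs miss).
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(7)
# Two latent populations: cache hits (fast) and misses (slow)
log_latency = np.concatenate([
    rng.normal(2.0, 0.3, size=700),   # hits
    rng.normal(4.0, 0.5, size=300),   # misses
]).reshape(-1, 1)

gm = GaussianMixture(n_components=2, random_state=0).fit(log_latency)
print("weights:", gm.weights_.round(2))
print("means:  ", gm.means_.ravel().round(2))
# In production, prefer segmenting on the actual cache-hit flag if you have it.
```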
Categorical / Multinomial
Categorical
- Single draw from K categories.
Multinomial
- Counts over K categories across N draws.
Where it appears
- Label distribution, token frequencies, error codes, country/device distributions.
- Drift detection: “did category mix change?”
Recognize
- Finite set of labels/categories; you track proportions.
What you do
- Chi-square tests / G-tests; JS divergence; Dirichlet smoothing.
- Watch for rare categories (small expected counts → Fisher/Monte Carlo).
Heuristic
- For categorical drift, start with simple proportion monitoring + chi-square/JS; don’t overcomplicate.
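A minimal sketch of the chi-square drift check on two count windows; the category counts are invented:

```python
# Minimal sketch: chi-square test for "did the category mix change?" between two windows.
import numpy as np
from scipy import stats

ref = np.array([900, 600, 300, 200])  # reference window counts per category
cur = np.array([850, 640, 330, 180])  # current window counts

table = np.vstack([ref, cur])
chi2, p, dof, expected = stats.chi2_contingency(table)
print(f"chi2={chi2:.2f}, p={p:.3f}")
# Watch expected counts: if any cell is small (< ~5), prefer Fisher / Monte Carlo methods.
print("min expected cell:", expected.min().round(1))
```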
A few cross-cutting heuristics (high leverage)
- Check support first: is it {0,1}, nonnegative integers, positive reals, [0,1], or the simplex? That narrows choices fast (a toy dispatcher encoding this is sketched below).
- Mean vs variance relationship:
  - Poisson: var ≈ mean
  - Negative Binomial: var > mean
- Tail behavior decides your stats:
  - Heavy tails → quantiles/robust/bootstrap
- Heterogeneity beats “fancy fitting”:
  - If a mixture is suspected, segment first.
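A toy sketch that encodes the support-first heuristic as code; the dispersion cutoff (1.5×) and the returned labels are arbitrary assumptions, and this is a first-pass hint, not a classifier:

```python
# Toy sketch: support + dispersion heuristics as a quick first-pass diagnostic.
import numpy as np

def suggest_family(x) -> str:
    """Rough hint only; look at the data (tails, modality) before trusting it."""
    x = np.asarray(x)
    is_int = np.issubdtype(x.dtype, np.integer)
    if is_int and set(np.unique(x)) <= {0, 1}:
        return "binary -> Bernoulli/Binomial"
    if is_int and (x >= 0).all():
        m, v = x.mean(), x.var(ddof=1)
        # 1.5x is an arbitrary over-dispersion cutoff for illustration
        return "counts -> Poisson (var ~ mean)" if v <= 1.5 * m else "counts -> Negative Binomial (var >> mean)"
    if ((x >= 0) & (x <= 1)).all():
        return "rates in [0,1] -> Beta territory"
    if (x > 0).all():
        return "positive continuous -> try Gamma / Lognormal; check the tail"
    return "real-valued -> Normal/Student-t; check tails and modality"

rng = np.random.default_rng(8)
print(suggest_family(rng.poisson(3.0, size=1000)))  # counts -> Poisson (var ~ mean)
```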