Chapter 4: Statistical Distance Measures

Methods for quantifying differences between distributions: KL divergence, KS test, Wasserstein, PSI, and when to use each

0) Two axes for choosing a distance

A) What kind of data?

  • Categorical / discrete histogram → KL/JS/TV/PSI
  • Continuous 1D → Wasserstein (often best), KS distance (as a simple max-CDF gap)
  • High-D vectors/embeddings → cosine / Mahalanobis / MMD/energy (distance between distributions), or summary stats + slices

B) What failure mode matters?

  • Tail changes (p99 latency drift) vs center shift vs mixture shift.
  • Interpretability vs sensitivity vs stability.

1) KL divergence

[ D_{KL}(P\|Q)=\sum_x P(x)\log\frac{P(x)}{Q(x)} ]

What it measures: “How inefficient it is to code samples from P using a model optimized for Q.” Not symmetric, not bounded.

Use when

  • You have probability distributions (often histograms) and care about direction: “production looks unlike training.”
  • You want something sensitive to support mismatches (Q puts near-zero where P has mass).

Gotchas

  • If Q(x) = 0 where P(x) > 0 → KL = ∞. In practice, this happens constantly with finite samples.
  • Overly sensitive to tiny probabilities and binning.

Heuristic: Use KL only with smoothing (add ε) and careful binning; otherwise prefer JS.
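
A minimal sketch of smoothed KL on binned data, assuming P and Q are histograms over the same bins; the function name, the ε value, and the toy counts are illustrative, not a standard API:

```python
import numpy as np

def kl_divergence(p_counts, q_counts, eps=1e-6):
    """KL(P || Q) from histogram counts over identical bins, with additive smoothing."""
    p = np.asarray(p_counts, dtype=float) + eps
    q = np.asarray(q_counts, dtype=float) + eps
    p /= p.sum()
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)))

# P = production, Q = training: "how unlike training does production look?"
train = np.array([400, 300, 200, 100, 0])   # empty bin -> infinite KL without smoothing
prod  = np.array([350, 310, 220,  90, 30])
print(kl_divergence(prod, train))
```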


2) Jensen–Shannon (JS) divergence

[ JS(P,Q)=\tfrac{1}{2}KL(P\|M)+\tfrac{1}{2}KL(Q\|M), \quad M=\tfrac{P+Q}{2} ]

Why people like it

  • Symmetric
  • Bounded (between 0 and log 2 with natural logs, or 0 and 1 with base-2 logs); sqrt(JS) is a proper metric
  • More stable than KL when supports don’t match perfectly.

Use when

  • You need a robust distance for histogram drift in monitoring.
  • You want symmetry: “distance between train and prod” regardless of direction.

Gotcha: Still binning-sensitive; for continuous features you’re really measuring “binned distribution difference.”

Heuristic: If you’re picking one histogram-based divergence for drift dashboards, JS is a great default.
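
A minimal sketch using SciPy, assuming both histograms share the same bins; note that scipy.spatial.distance.jensenshannon returns the JS distance (the square root of the divergence), so square it if you want the divergence itself:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

train_hist = np.array([0.40, 0.30, 0.20, 0.10, 0.00])
prod_hist  = np.array([0.35, 0.31, 0.22, 0.09, 0.03])

js_dist = jensenshannon(train_hist, prod_hist, base=2)  # in [0, 1] with base-2 logs
js_div  = js_dist ** 2                                  # the divergence itself
print(js_dist, js_div)
```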


3) PSI (Population Stability Index)

Industry standard in credit risk.

For bins i: [ PSI=\sum_i (p_i - q_i)\log\frac{p_i}{q_i} ] where p_i is the baseline proportion and q_i is the current proportion in bin i.

Interpretation: A symmetric measure of shift in binned distributions; it is exactly the symmetrized KL divergence (Jeffreys divergence) computed on the bins.

Use when

  • You need an easy-to-explain drift score for regulators/ops.
  • You have numeric features and can define bins (often based on training quantiles).

Typical rule-of-thumb thresholds (common in practice; treat as heuristics):

  • <0.1: small shift
  • 0.1–0.25: moderate
  • >0.25: large

Gotchas

  • Extremely sensitive to binning and how you handle zeros.
  • PSI can be inflated by tiny baseline bin probabilities.

Heuristic: Use training-quantile bins (equal-frequency) + epsilon smoothing for empty bins. Always pair PSI with a plot (CDF/hist).
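
A minimal sketch of PSI with training-quantile bins and ε smoothing; the function name, bin count, ε, and synthetic data are assumptions for illustration:

```python
import numpy as np

def psi(baseline, current, n_bins=10, eps=1e-4):
    # Bin edges come from the *baseline* (training) quantiles and are reused as-is.
    edges = np.quantile(baseline, np.linspace(0, 1, n_bins + 1))
    p = np.histogram(baseline, bins=edges)[0].astype(float)
    # Clip current values into the training range so nothing falls outside the edges.
    q = np.histogram(np.clip(current, edges[0], edges[-1]), bins=edges)[0].astype(float)
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()
    return float(np.sum((p - q) * np.log(p / q)))

rng = np.random.default_rng(0)
train_scores = rng.normal(0.0, 1.0, 10_000)
prod_scores  = rng.normal(0.3, 1.1, 10_000)    # shifted and slightly wider
print(psi(train_scores, prod_scores))           # compare against the ~0.1 / 0.25 heuristics
```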


4) Wasserstein / EMD (Earth Mover’s Distance)

For 1D continuous distributions:

  • Intuition: minimum “work” to move probability mass from P to Q.

Why it’s great

  • Highly interpretable: measured in the same units as the variable (e.g., milliseconds).
  • Works naturally for continuous variables and is meaningful even when supports differ.
  • Sensitive to shifts without being as brittle as KL.

Use when

  • Continuous 1D drift: latency, scores, numeric features.
  • You want a scalar that stakeholders understand (“the distribution moved by ~X units”).

Gotchas

  • In higher dimensions, Wasserstein is expensive and can be sample-hungry; 1D is the sweet spot.
  • A single scalar can hide what changed (center shift vs. variance vs. tail behavior); always inspect the shape (CDFs, quantile deltas) as well.

Heuristic: For continuous features, start with Wasserstein + quantile deltas (p50/p95/p99).
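
A minimal sketch for continuous 1D drift, pairing Wasserstein (in the variable’s own units) with quantile deltas and a KS test; the synthetic latency data is illustrative:

```python
import numpy as np
from scipy.stats import wasserstein_distance, ks_2samp

rng = np.random.default_rng(1)
train_ms = rng.lognormal(mean=3.0, sigma=0.4, size=20_000)
prod_ms  = rng.lognormal(mean=3.1, sigma=0.5, size=20_000)   # slower, heavier tail

w = wasserstein_distance(train_ms, prod_ms)                  # in milliseconds
deltas = {q: np.percentile(prod_ms, q) - np.percentile(train_ms, q) for q in (50, 95, 99)}
ks_stat, ks_p = ks_2samp(train_ms, prod_ms)

print(f"Wasserstein ~= {w:.1f} ms, quantile deltas {deltas}, KS statistic {ks_stat:.3f}")
```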


5) Total Variation (TV) distance

For discrete: [ TV(P,Q)=\tfrac{1}{2}\sum_x |P(x)-Q(x)| ]

What it measures: The largest difference in probability that P and Q assign to any event; very interpretable (“how much mass differs”).

Use when

  • You want a simple, bounded difference score for categorical distributions.
  • Useful for monitoring label mix drift.

Gotcha: Ignores geometry; there is no notion of similarity between categories, only frequency mismatch.

Heuristic: TV is half the L1 distance between histograms. Great baseline.
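
A minimal sketch of TV as half the L1 distance between normalized histograms; the category counts are made up:

```python
import numpy as np

def total_variation(p_counts, q_counts):
    p = np.asarray(p_counts, dtype=float); p /= p.sum()
    q = np.asarray(q_counts, dtype=float); q /= q.sum()
    return 0.5 * float(np.abs(p - q).sum())

# Label mix in training vs production, same category order.
print(total_variation([900, 80, 20], [800, 150, 50]))   # bounded in [0, 1]
```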


6) Hellinger distance

[ H(P,Q)=\frac{1}{\sqrt{2}}\sqrt{\sum_x (\sqrt{P(x)}-\sqrt{Q(x)})^2} ]

Why it’s useful

  • Symmetric, bounded [0,1]
  • Less sensitive to tiny probabilities than KL
  • Nice mathematical properties

Use when

  • You want a stable histogram divergence that’s not as spiky as KL.
  • Often good for probability distributions with many small bins.

Heuristic: If JS feels too sensitive, Hellinger is a strong alternative.
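
A minimal sketch of Hellinger on two binned distributions (the toy histograms are illustrative):

```python
import numpy as np

def hellinger(p, q):
    p = np.asarray(p, dtype=float); p /= p.sum()
    q = np.asarray(q, dtype=float); q /= q.sum()
    return float(np.sqrt(np.sum((np.sqrt(p) - np.sqrt(q)) ** 2) / 2))

print(hellinger([0.40, 0.30, 0.20, 0.10, 0.00],
                [0.35, 0.31, 0.22, 0.09, 0.03]))   # bounded in [0, 1]
```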


7) Cosine distance (for embeddings / high-D vectors)

[ \mathrm{cosine\_sim}(u,v)=\frac{u\cdot v}{\|u\|\,\|v\|} ] Cosine distance = 1 − cosine similarity.

Use when

  • You compare individual embeddings (semantic similarity).
  • Many embedding spaces are trained so angle matters more than magnitude → cosine is natural.

For drift: You don’t usually cosine-compare distributions directly. Common approaches:

  • Compare centroids (mean embedding) by cosine
  • Track mean pairwise cosine to a reference set
  • Monitor changes in nearest-neighbor structure

Heuristic: Normalize embeddings (L2) and treat cosine as the default similarity unless you know magnitudes carry meaning.
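
A minimal sketch of the “compare centroids” option from the list above: L2-normalize embeddings, then take the cosine distance between mean embeddings. The arrays and dimensions are placeholders:

```python
import numpy as np

def centroid_cosine_distance(ref_emb, prod_emb):
    ref  = ref_emb  / np.linalg.norm(ref_emb,  axis=1, keepdims=True)
    prod = prod_emb / np.linalg.norm(prod_emb, axis=1, keepdims=True)
    c_ref, c_prod = ref.mean(axis=0), prod.mean(axis=0)
    cos_sim = c_ref @ c_prod / (np.linalg.norm(c_ref) * np.linalg.norm(c_prod))
    return 1.0 - float(cos_sim)

rng = np.random.default_rng(2)
reference  = rng.normal(size=(5_000, 384))
production = rng.normal(size=(5_000, 384)) + 0.05   # small systematic shift
print(centroid_cosine_distance(reference, production))
```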


8) Mahalanobis distance (for “how far from baseline”)

For a vector x with baseline mean μ and covariance Σ: [ d_M(x)=\sqrt{(x-\mu)^T\Sigma^{-1}(x-\mu)} ]

What it’s good at

  • Detecting outliers / “distance from normal” while accounting for feature correlations.
  • Useful for feature drift as “how abnormal is this point vs training distribution?”

Use when

  • Features are roughly elliptical/normal-ish in a representation space (often after standardization/PCA).
  • You want a single anomaly score per sample or per batch.

Gotchas

  • Covariance estimation is brittle in high dimensions (needs lots of data or shrinkage).
  • If distribution is non-Gaussian / multimodal, Mahalanobis can be misleading.

Heuristic: Use Mahalanobis after the following preprocessing (sketched in code below):

  • standardization
  • dimensionality reduction (PCA)
  • covariance shrinkage (Ledoit–Wolf) if high-d.
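
A minimal sketch of that recipe (standardize → PCA → Ledoit–Wolf covariance → Mahalanobis score per sample); the dimensions and synthetic data are assumptions:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.covariance import LedoitWolf

rng = np.random.default_rng(3)
X_train = rng.normal(size=(10_000, 50))
X_prod  = rng.normal(loc=0.2, size=(1_000, 50))       # mildly shifted batch

scaler = StandardScaler().fit(X_train)
pca = PCA(n_components=10).fit(scaler.transform(X_train))
Z_train = pca.transform(scaler.transform(X_train))
Z_prod  = pca.transform(scaler.transform(X_prod))

cov = LedoitWolf().fit(Z_train)                       # shrunk covariance of the baseline
scores = np.sqrt(cov.mahalanobis(Z_prod))             # .mahalanobis returns squared distances
print(scores.mean(), np.percentile(scores, 99))       # per-sample "how far from baseline"
```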

Practical section: binning, smoothing, and “when distances lie”

1) Binning choices (critical)

  • Equal-width bins: intuitive but bad if data is skewed (most bins empty).
  • Quantile bins (equal-frequency): great for stability and PSI/JS; preserves resolution where data exists.
  • Domain bins: best when semantics matter (latency buckets, price ranges).

Heuristic: For monitoring numeric features, use training quantile bins with fixed edges (compute the edges once on training data and reuse them for every later window).
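
A minimal sketch of the fixed-edges idea: compute quantile edges once on training data, then reuse exactly the same edges for every production window. Data and bin count are illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)
train_values = rng.exponential(scale=100.0, size=50_000)     # skewed feature

edges = np.quantile(train_values, np.linspace(0, 1, 11))     # 10 equal-frequency bins
train_hist = np.histogram(train_values, bins=edges)[0]

# Later, for each production window, reuse the SAME edges (clip to the training range).
prod_values = rng.exponential(scale=120.0, size=5_000)
prod_hist = np.histogram(np.clip(prod_values, edges[0], edges[-1]), bins=edges)[0]
print(train_hist / train_hist.sum())
print(prod_hist / prod_hist.sum())
```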

2) Smoothing / zero handling

For KL/JS/PSI, empty bins cause blow-ups.

  • Add small ε (Laplace/Dirichlet smoothing)
  • Or merge rare bins into “other”

Heuristic: Always smooth before computing divergences, and log the fraction of mass in “other/unknown.”
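
A minimal sketch of the “merge rare bins into other” option for categorical features; the function name and the 1% cutoff are arbitrary choices for illustration:

```python
from collections import Counter

def merge_rare(counts, min_share=0.01):
    total = sum(counts.values())
    merged = Counter()
    for category, n in counts.items():
        merged[category if n / total >= min_share else "other"] += n
    return merged

baseline = Counter({"chrome": 7000, "safari": 2000, "firefox": 900, "opera": 60, "lynx": 40})
merged = merge_rare(baseline)
print(merged, merged["other"] / sum(merged.values()))   # log the mass landing in "other"
```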

3) Sample-size sensitivity

  • With huge samples, tiny distribution changes look “large enough” in some metrics.
  • With small samples, metrics are noisy.

Heuristic: Track uncertainty via a bootstrap CI of the distance (yes, bootstrap your drift metric).
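
A minimal sketch of bootstrapping a drift metric (here Wasserstein) to get a confidence interval instead of a single noisy number; the resample count and data are illustrative:

```python
import numpy as np
from scipy.stats import wasserstein_distance

def bootstrap_ci(a, b, metric=wasserstein_distance, n_boot=500, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    stats = [metric(rng.choice(a, size=len(a), replace=True),
                    rng.choice(b, size=len(b), replace=True))
             for _ in range(n_boot)]
    return np.quantile(stats, [alpha / 2, 1 - alpha / 2])

rng = np.random.default_rng(5)
train = rng.normal(0.0, 1.0, 2_000)
prod  = rng.normal(0.1, 1.0, 500)
print(bootstrap_ci(train, prod))   # e.g. a 95% interval for the distance
```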

4) High-dimensional drift: distances can lie

In high-D, distances concentrate (everything looks equally far). Practical alternatives:

  • Compare 1D projections (PCA components) with Wasserstein/KS
  • Classifier-based drift (train a model to distinguish “train vs prod”; AUC close to 0.5 means no drift); see the sketch after this list
  • MMD/Energy as distribution tests rather than naive vector distances
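
A minimal sketch of the classifier-based check: label training rows 0 and production rows 1, fit a simple classifier, and read the cross-validated AUC (near 0.5 means the two sets are hard to tell apart). The model choice and data are illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(6)
X_train = rng.normal(size=(5_000, 20))
X_prod  = rng.normal(loc=0.1, size=(5_000, 20))

X = np.vstack([X_train, X_prod])
y = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_prod))])

proba = cross_val_predict(LogisticRegression(max_iter=1000), X, y,
                          cv=5, method="predict_proba")[:, 1]
print(roc_auc_score(y, proba))   # ~0.5 -> no detectable drift; closer to 1.0 -> drift
```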

Quick “which to use when” cheat map

  • Categorical mix drift → TV / JS / Chi-square (test)
  • Numeric 1D drift → Wasserstein + quantile deltas; KS/AD as tests
  • Need explainability / business-friendly → PSI (with quantile bins)
  • Histogram divergence default → JS (stable + symmetric)
  • Embeddings drift → centroid cosine + PCA-projection Wasserstein; optionally Mahalanobis in reduced space