Chapter 7: Uncertainty and Decision-Making
Calibration, confidence intervals, and statistical decision frameworks for production ML systems
Mental model
A model output is useful only when you know (a) how uncertain it is and (b) how to act on it given costs.
Think of 3 layers:
- Score quality (ranking / discrimination: AUC, NDCG)
- Probability quality (calibration: reliability, Brier, log loss)
- Decision quality (utility under costs/constraints: thresholding, policies)
1) Predictive uncertainty: aleatoric vs epistemic
Aleatoric uncertainty = irreducible noise
Uncertainty from the world/data itself:
- blurry image, ambiguous text, inherently stochastic outcomes
- even a perfect model can’t remove it
How it shows up
- consistent ambiguity across models
- persists with more training data
How to handle
- improve measurement/features, collect better labels
- model it explicitly (e.g., heteroscedastic regression predicting variance)
Epistemic uncertainty = model ignorance
Uncertainty from lack of knowledge:
- out-of-distribution inputs
- not enough data in certain regions of feature space
How it shows up
- reduces with more data
- spikes on new regions, new cohorts, drift
How to estimate (practical options)
- Deep ensembles: train multiple models; disagreement = epistemic signal (strong baseline; see the sketch after this list)
- MC dropout: approximate Bayesian; multiple stochastic forward passes
- Bayesian linear / GP: more principled but less common at scale
- Distance-to-training signals: Mahalanobis in representation space, kNN density
Heuristic: if you need uncertainty in production tomorrow, start with ensembles + OOD heuristics.
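A minimal sketch of the ensemble-disagreement idea, assuming a list of already-trained binary classifiers exposing predict_proba (all names here are placeholders, not a fixed API):

```python
import numpy as np

def ensemble_uncertainty(models, X):
    """Estimate epistemic uncertainty as disagreement across ensemble members.

    `models` is assumed to be a list of trained binary classifiers with
    predict_proba (e.g., the same architecture trained with different seeds).
    """
    # Stack member probabilities: shape (n_members, n_samples, n_classes)
    probs = np.stack([m.predict_proba(X) for m in models])

    mean_probs = probs.mean(axis=0)                  # ensemble prediction
    # Disagreement: variance of the positive-class probability across members
    epistemic = probs[:, :, 1].var(axis=0)
    # Total predictive uncertainty: entropy of the averaged distribution
    total = -(mean_probs * np.log(mean_probs + 1e-12)).sum(axis=1)
    return mean_probs, epistemic, total

# Usage: route high-disagreement inputs to review or flag them as possible OOD.
# mean_p, epi, tot = ensemble_uncertainty(trained_models, X_new)
# flag = epi > np.quantile(epi, 0.95)
```

Member disagreement tends to rise on inputs far from the training data, which is exactly the signal you want for routing and abstention.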
2) Calibration: turning scores into trustworthy probabilities
What calibration means
If the model predicts 0.8, then among such cases, ~80% should be positive.
Reliability diagram
- Bin predictions into buckets (e.g., 0–0.1, …, 0.9–1.0)
- For each bin, plot:
  - x = mean predicted probability
  - y = empirical positive rate
Perfect calibration: points lie on diagonal.
Gotcha: choose bins carefully and ensure enough samples per bin; otherwise it’s noisy.
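A quick way to get the curve, assuming held-out labels y_val and predicted probabilities p_val (a sketch using scikit-learn's calibration_curve; quantile binning addresses the samples-per-bin gotcha):

```python
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve

# y_val: true binary labels, p_val: predicted probabilities (assumed given)
prob_true, prob_pred = calibration_curve(
    y_val, p_val, n_bins=10, strategy="quantile"  # quantile bins keep counts roughly equal
)

plt.plot([0, 1], [0, 1], "--", label="perfect calibration")
plt.plot(prob_pred, prob_true, marker="o", label="model")
plt.xlabel("mean predicted probability (bin)")
plt.ylabel("empirical positive rate (bin)")
plt.legend()
plt.show()
```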
ECE (Expected Calibration Error)
A weighted average of bin gaps: ECE = sum_b (n_b / n) * |acc(b) - conf(b)|
- acc(b): empirical positive rate in bin b
- conf(b): mean predicted probability in bin b
Pros: single number. Cons: depends heavily on binning; can hide where miscalibration occurs.
Heuristic: use ECE as a monitoring scalar, but keep the reliability curve for diagnosis.
Brier score
Brier = (1 / n) * sum_i (p_i - y_i)^2
Measures probability accuracy (lower is better); it combines calibration and sharpness.
Heuristic: Brier is a good default for “probabilities matter.”
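A small sketch computing both metrics on a held-out set (y_val and p_val are assumed arrays of labels and predicted probabilities):

```python
import numpy as np
from sklearn.metrics import brier_score_loss

def expected_calibration_error(y_true, p_pred, n_bins=10):
    """ECE = sum_b (n_b / n) * |acc(b) - conf(b)| with equal-width bins."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.digitize(p_pred, edges[1:-1])   # bin index 0..n_bins-1 for each prediction
    ece = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.sum() == 0:
            continue
        acc = y_true[mask].mean()            # empirical positive rate in bin b
        conf = p_pred[mask].mean()           # mean predicted probability in bin b
        ece += (mask.sum() / len(p_pred)) * abs(acc - conf)
    return ece

# ece = expected_calibration_error(y_val, p_val)
# brier = brier_score_loss(y_val, p_val)    # (1/n) * sum (p_i - y_i)^2
```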
Calibration methods
Temperature scaling (multiclass, deep nets)
Scale logits by T: p = softmax(z / T)
- T>1 softens; T<1 sharpens
- Preserves ranking (AUC unchanged)
- Usually very effective, low risk
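A minimal sketch of fitting T on held-out logits by minimizing NLL (val_logits and val_labels are assumed arrays; this uses scipy rather than any particular deep learning framework):

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import softmax

def fit_temperature(logits_val, y_val):
    """Find T minimizing NLL of softmax(logits / T) on a held-out set.

    logits_val: (n, k) pre-softmax outputs; y_val: (n,) integer class labels.
    """
    def nll(T):
        probs = softmax(logits_val / T, axis=1)
        return -np.log(probs[np.arange(len(y_val)), y_val] + 1e-12).mean()

    res = minimize_scalar(nll, bounds=(0.05, 10.0), method="bounded")
    return res.x

# T = fit_temperature(val_logits, val_labels)
# calibrated = softmax(test_logits / T, axis=1)  # ranking/argmax unchanged
```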
Platt scaling (binary)
Fit a logistic regression on the model score/logit to map to calibrated probability.
Isotonic regression
Non-parametric monotonic mapping.
- Very flexible (can fix weird calibration curves)
- Risk of overfitting with small calibration sets
Heuristic
- Large validation set → isotonic can be great
- Smaller set → temperature/Platt is safer
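A sketch of both mappings with scikit-learn, assuming model scores on a held-out calibration split (scores_cal, y_cal, scores_test are placeholder names; never calibrate on the training data):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.isotonic import IsotonicRegression

# Platt scaling: logistic regression on the raw score
platt = LogisticRegression()
platt.fit(scores_cal.reshape(-1, 1), y_cal)
p_platt = platt.predict_proba(scores_test.reshape(-1, 1))[:, 1]

# Isotonic regression: non-parametric monotone mapping (wants more data)
iso = IsotonicRegression(out_of_bounds="clip")
iso.fit(scores_cal, y_cal)
p_iso = iso.predict(scores_test)

# sklearn.calibration.CalibratedClassifierCV wraps both (method="sigmoid" / "isotonic")
# if you prefer to calibrate an estimator end to end.
```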
3) Decision thresholds: from probability to action
Bayes-optimal threshold (cost-sensitive)
If a false positive costs C_FP and a false negative (a missed positive) costs C_FN, predict positive when:
p(y = 1 | x) >= C_FP / (C_FP + C_FN)
(assuming equal benefit scaling)
Key point: thresholds should come from costs + constraints, not “0.5”.
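A tiny worked example with hypothetical costs; the asymmetry is what pulls the threshold away from 0.5:

```python
# Hypothetical costs: a missed fraud case hurts far more than a wasted review.
C_FP = 1.0     # cost of acting on a negative (false positive)
C_FN = 20.0    # cost of missing a positive (false negative)

# Act when expected cost of acting < expected cost of not acting:
#   (1 - p) * C_FP  <  p * C_FN   =>   p >= C_FP / (C_FP + C_FN)
threshold = C_FP / (C_FP + C_FN)   # ~0.048, nowhere near 0.5

def decide(p):
    """Predict positive when the calibrated probability clears the cost ratio."""
    return p >= threshold
```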
Constraints change everything
Real systems often have:
- review capacity (only N cases/day)
- latency budgets
- risk limits
So you choose thresholds like:
- “Top-K by score”
- “Maximize recall subject to precision ≥ P”
- “Keep FPR below X”
Heuristic: treat thresholding as an optimization problem, not a magic constant.
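A sketch of the "maximize recall subject to precision ≥ P" variant, using scikit-learn's precision_recall_curve on a validation split (y_val and p_val are assumed):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def max_recall_at_precision(y_val, p_val, min_precision=0.90):
    """Pick the threshold that maximizes recall subject to precision >= min_precision."""
    precision, recall, thresholds = precision_recall_curve(y_val, p_val)
    # precision/recall have one more entry than thresholds; drop the final point
    feasible = np.where(precision[:-1] >= min_precision)[0]
    if len(feasible) == 0:
        return None  # constraint infeasible on this validation set
    best = feasible[np.argmax(recall[:-1][feasible])]
    return thresholds[best]

# t = max_recall_at_precision(y_val, p_val, min_precision=0.95)
```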
4) Bayesian vs frequentist framing for risk decisions
Frequentist view
- CIs, p-values, error rates over repeated trials
- Great for A/B testing and reporting
Bayesian view
- Posterior distribution over uncertainty
- Natural for decision-making: choose action maximizing expected utility: a* = argmax_a E[U(a, θ) | D]
Practical heuristic
- Use frequentist tools for controlled experiments/reporting.
- Use Bayesian reasoning when you must make decisions under uncertainty with asymmetric costs (risk systems, medical triage, fraud review).
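A toy illustration of the expected-utility rule, with entirely made-up counts and utilities: put a Beta posterior over an unknown fraud rate, then pick the action whose expected utility under that posterior is highest.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical historical counts for a traffic segment
successes, failures = 12, 488                                # fraud / non-fraud cases
theta = rng.beta(1 + successes, 1 + failures, size=100_000)  # posterior samples of the rate

def expected_utility(action, theta):
    """Average utility over posterior samples (utilities are illustrative only)."""
    if action == "auto_approve":
        return (-200.0 * theta).mean()        # loss from fraud that slips through
    if action == "send_to_review":
        return (-5.0 + 180.0 * theta).mean()  # pay review cost, recover most fraud loss
    raise ValueError(action)

best = max(["auto_approve", "send_to_review"],
           key=lambda a: expected_utility(a, theta))
```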
5) Expected utility / decision curves
Instead of optimizing AUC, optimize business outcome: EU(t) = E[Benefit - Cost | threshold t]
Decision curve analysis (common framing):
- Compare net benefit across thresholds, relative to “treat all” and “treat none.”
- Helps communicate to stakeholders: where the model is actually beneficial.
Heuristic: if you can’t map metric lift to dollars/harms, you’ll argue forever about thresholds.
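A sketch of the standard net-benefit calculation behind decision curves, NB(t) = TP/N - (FP/N) * t/(1 - t), on a validation split (y_val and p_val are assumed):

```python
import numpy as np

def net_benefit(y_true, p_pred, t):
    """Net benefit at threshold t: NB(t) = TP/N - FP/N * (t / (1 - t))."""
    n = len(y_true)
    pred_pos = p_pred >= t
    tp = np.sum(pred_pos & (y_true == 1))
    fp = np.sum(pred_pos & (y_true == 0))
    return tp / n - (fp / n) * (t / (1 - t))

# Compare against the two trivial policies across thresholds:
# "treat none" has net benefit 0; "treat all" flags every case as positive.
# thresholds = np.linspace(0.05, 0.5, 10)
# model_nb = [net_benefit(y_val, p_val, t) for t in thresholds]
# all_nb   = [net_benefit(y_val, np.ones_like(p_val), t) for t in thresholds]
```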
6) Conformal prediction (coverage guarantees)
What it gives you
A principled way to produce:
- Prediction sets for classification (set of labels)
- Prediction intervals for regression
With a guarantee:
Under exchangeability, marginal coverage is at least 1 − α. E.g., with α = 0.1, ~90% of future points will have the true label inside the predicted set.
How it works (high level; see the sketch after these steps)
- Split data: train model + calibration set
- Compute nonconformity scores on calibration points (how “strange” each example is)
- Choose a quantile threshold from these scores
- For a new point, output set/interval of outcomes whose score is below threshold
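A minimal split-conformal sketch for classification, using nonconformity score 1 - p(true class) and assuming calibration-set probabilities probs_cal with labels y_cal (placeholder names):

```python
import numpy as np

def conformal_threshold(probs_cal, y_cal, alpha=0.1):
    """Split conformal: calibrate a score threshold for (1 - alpha) coverage."""
    n = len(y_cal)
    scores = 1.0 - probs_cal[np.arange(n), y_cal]     # nonconformity of each calibration point
    # Finite-sample-corrected quantile level; "higher" keeps it conservative
    q_level = np.ceil((n + 1) * (1 - alpha)) / n
    return np.quantile(scores, min(q_level, 1.0), method="higher")

def prediction_set(probs_new, qhat):
    """All labels whose score 1 - p(label) stays below the calibrated threshold."""
    return [np.where(1.0 - p <= qhat)[0] for p in probs_new]

# qhat = conformal_threshold(probs_cal, y_cal, alpha=0.1)
# sets = prediction_set(probs_test, qhat)   # large sets => route to human review
```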
Variants you’ll hear
- Split conformal: simplest, most used
- Mondrian conformal: conditional coverage by groups (e.g., per class/segment)
- Conformalized quantile regression: strong for regression intervals
Where it shines in MLOps
- Abstention / human-in-the-loop: if prediction set is large/uncertain → route to review.
- Safety: “only auto-act when confidence set is single label / narrow interval.”
- Works even when model is miscalibrated, because it calibrates coverage.
Gotchas
- Requires approximate exchangeability; drift breaks guarantees.
- Gives marginal coverage, not necessarily per-slice (use Mondrian or monitor).
- Coverage can be achieved with very large sets if the model is weak.
Heuristic: conformal is a great “safety wrapper” around any model—monitor set size as a health signal.
Practical summary
- Uncertainty splits into aleatoric (irreducible) and epistemic (lack of knowledge / OOD); ensembles are a strong production baseline for epistemic.
- Calibration is about probability correctness (reliability curves, ECE, Brier). Use temperature/Platt/isotonic depending on data size and risk.
- Decisions should be driven by expected utility under costs/constraints; optimal thresholds come from costs, not 0.5.
- Conformal prediction provides coverage guarantees and is very practical for abstention and safety.