Calibration Guide#
Model calibration ensures that predicted confidence scores accurately reflect empirical accuracy. A well-calibrated model’s confidence estimates are trustworthy: if it predicts 80% confidence, it should be correct 80% of the time.
Why Calibration Matters#
Modern neural networks are often overconfident: they produce high confidence scores even for incorrect predictions. This is especially problematic in:
- Safety-critical applications: medical diagnosis, autonomous vehicles
- Decision making: where confidence guides resource allocation
- Human-AI collaboration: users need trustworthy uncertainty estimates
Example of miscalibration: a classifier that predicts 95% confidence but is only correct 70% of the time.
Calibration Metrics#
Expected Calibration Error (ECE)#
The most common calibration metric. Divides predictions into bins by confidence and measures the gap between confidence and accuracy:
```python
from incerto.calibration import ece_score

# logits: model outputs (N, num_classes)
# labels: ground truth (N,)
ece = ece_score(logits, labels, n_bins=15)
print(f"ECE: {ece:.4f}")  # Lower is better; 0 = perfect calibration
```
- Interpretation:
  - ECE < 0.05: well calibrated
  - ECE 0.05-0.15: moderate miscalibration
  - ECE > 0.15: severe miscalibration
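To make the binning concrete, here is a minimal hand-rolled ECE, assuming top-1 softmax confidences and equal-width bins; it illustrates the definition rather than incerto's exact implementation:

```python
import torch

def manual_ece(logits, labels, n_bins=15):
    """Equal-width-bin ECE: weighted average of |accuracy - confidence| per bin."""
    probs = torch.softmax(logits, dim=-1)
    confidences, predictions = probs.max(dim=-1)
    accuracies = predictions.eq(labels).float()

    edges = torch.linspace(0, 1, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = (accuracies[in_bin].mean() - confidences[in_bin].mean()).abs()
            ece += in_bin.float().mean().item() * gap.item()
    return ece
```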
Maximum Calibration Error (MCE)#
The worst-case calibration error across all bins:
```python
from incerto.calibration import mce_score

mce = mce_score(logits, labels, n_bins=15)
# MCE >= ECE always
```
Adaptive ECE#
Uses adaptive binning (equal mass per bin) instead of uniform bins:
```python
from incerto.calibration import adaptive_ece_score

adaptive_ece = adaptive_ece_score(logits, labels, n_bins=15)
# More stable for imbalanced datasets
```
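For intuition, equal-mass bin edges can be obtained from quantiles of the confidence distribution; a minimal sketch (assuming top-1 softmax confidences, not tied to incerto's internals):

```python
import torch

# Equal-mass bin edges: each bin holds roughly the same number of predictions.
confidences = torch.softmax(logits, dim=-1).max(dim=-1).values
bin_edges = torch.quantile(confidences, torch.linspace(0, 1, 15 + 1))
# Use these in place of the uniform torch.linspace(0, 1, 16) edges.
```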
Brier Score#
A proper scoring rule that measures both calibration and accuracy:
```python
from incerto.calibration import brier_score

bs = brier_score(logits, labels)
# Lower is better, ranges from 0 to 1
```
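For reference, the usual multi-class definition is the squared difference between the predicted probability vector and the one-hot target, averaged over samples; incerto's normalization may differ, so treat this as a sketch:

```python
import torch
import torch.nn.functional as F

# Multi-class Brier score: squared error between predicted probabilities and
# one-hot targets, summed over classes and averaged over samples.
probs = torch.softmax(logits, dim=-1)
one_hot = F.one_hot(labels, num_classes=probs.shape[-1]).float()
brier = ((probs - one_hot) ** 2).sum(dim=-1).mean()
```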
Negative Log-Likelihood (NLL)#
Standard loss function, sensitive to miscalibration:
```python
from incerto.calibration import nll

loss = nll(logits, labels)
# Lower is better
```
Post-Hoc Calibration Methods#
Post-hoc methods calibrate a trained model using a held-out calibration set (typically part of the validation data).
Temperature Scaling#
Best for: Most scenarios, especially with good validation data
The simplest and most effective method. It divides the logits by a single learned temperature T, i.e. softmax(z / T):
```python
from incerto.calibration import TemperatureScaling

# Fit on validation data
calibrator = TemperatureScaling()
calibrator.fit(val_logits, val_labels)

# Apply to test data
calibrated_dist = calibrator.predict(test_logits)
calibrated_probs = calibrated_dist.probs

# Save for later use
calibrator.save('temperature_calibrator.pt')

print(f"Learned temperature: {calibrator.temperature.item():.4f}")
# T > 1: model is overconfident (common)
# T < 1: model is underconfident (rare)
```
- Advantages:
  - Single parameter (very stable)
  - Preserves predicted class (argmax unchanged)
  - Fast to optimize
- Disadvantages:
  - Same scaling for all classes
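Conceptually, fitting T amounts to minimizing NLL over one scalar on held-out logits. The sketch below shows that idea in plain PyTorch; it is an illustration, not incerto's implementation, and assumes `val_logits`, `val_labels`, and `test_logits` tensors are in scope:

```python
import torch
import torch.nn.functional as F

# Fit a single temperature by minimizing NLL on the calibration set.
log_T = torch.zeros(1, requires_grad=True)  # optimize log(T) so T stays positive
optimizer = torch.optim.LBFGS([log_T], lr=0.1, max_iter=50)

def closure():
    optimizer.zero_grad()
    loss = F.cross_entropy(val_logits / log_T.exp(), val_labels)
    loss.backward()
    return loss

optimizer.step(closure)
T = log_T.exp().item()
calibrated_probs = torch.softmax(test_logits / T, dim=-1)
```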
Vector Scaling#
Best for: Class imbalance, when different classes have different confidence patterns
Learns a different temperature for each class:
```python
from incerto.calibration import VectorScaling

calibrator = VectorScaling(n_classes=10)
calibrator.fit(val_logits, val_labels, max_iters=50)
calibrated_dist = calibrator.predict(test_logits)

print(repr(calibrator))
# VectorScaling(n_classes=10, temperature_range=[0.8, 2.1])
```
- Advantages:
  - More flexible than temperature scaling
  - Can handle class-specific miscalibration
- Disadvantages:
  - More parameters (more validation data needed)
  - May not preserve argmax predictions
Matrix Scaling#
Best for: When you have lots of validation data and need maximum flexibility
Most general affine transformation: \(z_{calibrated} = W z + b\)
```python
from incerto.calibration import MatrixScaling

calibrator = MatrixScaling(n_classes=10)
calibrator.fit(val_logits, val_labels, max_iters=50)
calibrated_dist = calibrator.predict(test_logits)
```
- Advantages:
  - Most flexible parametric method
  - Can fix complex miscalibration patterns
- Disadvantages:
  - Many parameters (n_classes²)
  - Risk of overfitting
  - Does not preserve argmax
Dirichlet Calibration#
Best for: Research settings, when you need distribution-level calibration
Applies a matrix-and-bias transformation like matrix scaling, but is derived from a Dirichlet model of the predicted probability vectors and is typically fitted with regularization, which makes it better grounded theoretically and less prone to overfitting:
```python
from incerto.calibration import DirichletCalibrator

# With L2 regularization
calibrator = DirichletCalibrator(n_classes=10, mu=0.01)
calibrator.fit(val_logits, val_labels, max_iters=100)
calibrated_dist = calibrator.predict(test_logits)
```
Reference: Kull et al., “Beyond temperature scaling” (NeurIPS 2019)
Isotonic Regression#
Best for: Non-parametric, flexible calibration
Fits a monotonic (isotonic) mapping per class:
```python
from incerto.calibration import IsotonicRegressionCalibrator

calibrator = IsotonicRegressionCalibrator(out_of_bounds='clip')
calibrator.fit(val_logits, val_labels)
calibrated_dist = calibrator.predict(test_logits)
```
- Advantages:
  - Very flexible (non-parametric)
  - Can capture complex patterns
- Disadvantages:
  - Needs more validation data
  - Risk of overfitting
  - Can change the predicted class
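For intuition, per-class isotonic calibration fits one monotone map from each class's predicted probability to an empirical frequency (one-vs-rest) and then renormalizes. A rough sketch with scikit-learn's IsotonicRegression, not incerto's implementation:

```python
import numpy as np
import torch
import torch.nn.functional as F
from sklearn.isotonic import IsotonicRegression

probs = torch.softmax(val_logits, dim=-1).numpy()
targets = F.one_hot(val_labels, probs.shape[1]).numpy()

# One isotonic map per class (one-vs-rest), clipping out-of-range inputs.
calibrators = []
for c in range(probs.shape[1]):
    iso = IsotonicRegression(out_of_bounds='clip')
    iso.fit(probs[:, c], targets[:, c])
    calibrators.append(iso)

test_probs = torch.softmax(test_logits, dim=-1).numpy()
calibrated = np.stack(
    [calibrators[c].predict(test_probs[:, c]) for c in range(test_probs.shape[1])],
    axis=1,
)
calibrated /= np.clip(calibrated.sum(axis=1, keepdims=True), 1e-12, None)
```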
Histogram Binning#
Best for: Research baselines
Bins predictions and uses empirical frequencies:
```python
from incerto.calibration import HistogramBinningCalibrator

calibrator = HistogramBinningCalibrator(n_bins=10)
calibrator.fit(val_logits, val_labels)
calibrated_dist = calibrator.predict(test_logits)
```
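The idea, roughly: replace each prediction's top-1 confidence with the empirical accuracy of the validation bin it falls into. A minimal sketch (equal-width bins, top-1 confidence only, not incerto's implementation):

```python
import torch

n_bins = 10
edges = torch.linspace(0, 1, n_bins + 1)

# Empirical accuracy per confidence bin on the calibration set.
val_conf, val_pred = torch.softmax(val_logits, dim=-1).max(dim=-1)
val_acc = val_pred.eq(val_labels).float()
bin_acc = torch.zeros(n_bins)
for b in range(n_bins):
    in_bin = (val_conf > edges[b]) & (val_conf <= edges[b + 1])
    if in_bin.any():
        bin_acc[b] = val_acc[in_bin].mean()

# Replace test confidences with their bin's empirical accuracy.
test_conf = torch.softmax(test_logits, dim=-1).max(dim=-1).values
calibrated_conf = bin_acc[torch.bucketize(test_conf, edges[1:-1])]
```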
Platt Scaling#
Best for: Binary classification
Logistic regression per class (one-vs-rest for multiclass):
```python
from incerto.calibration import PlattScalingCalibrator

calibrator = PlattScalingCalibrator()
calibrator.fit(val_logits, val_labels)
calibrated_dist = calibrator.predict(test_logits)
```
Beta Calibration#
Best for: Binary classification
More flexible than Platt scaling, uses Beta distribution:
```python
import torch

from incerto.calibration import BetaCalibrator

# For binary classification
logits_binary = torch.randn(100, 2)
labels_binary = torch.randint(0, 2, (100,))
test_logits_binary = torch.randn(50, 2)

calibrator = BetaCalibrator(method='mle')
calibrator.fit(logits_binary, labels_binary)
calibrated_dist = calibrator.predict(test_logits_binary)
```
Reference: Kull et al., “Beta calibration” (AISTATS 2017)
Training-Time Calibration#
These methods train models to be calibrated from the start.
Label Smoothing#
Prevents overconfidence by softening one-hot labels:
```python
from incerto.calibration import LabelSmoothingLoss

criterion = LabelSmoothingLoss(smoothing=0.1)

# Training loop
for epoch in range(epochs):
    for inputs, targets in train_loader:
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()
```
Effect: Targets become (1-ε) for correct class, ε/(C-1) for others
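A minimal sketch of that target construction, assuming the ε/(C-1) formulation stated above:

```python
import torch

def smooth_targets(targets, n_classes, smoothing=0.1):
    """One-hot targets softened to (1 - eps) for the true class, eps/(C - 1) elsewhere."""
    smoothed = torch.full((targets.size(0), n_classes), smoothing / (n_classes - 1))
    smoothed.scatter_(1, targets.unsqueeze(1), 1.0 - smoothing)
    return smoothed
```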
Focal Loss#
Focuses on hard examples, reduces overconfidence on easy examples:
```python
from incerto.calibration import FocalLoss

criterion = FocalLoss(gamma=2.0)  # gamma=0 recovers standard cross-entropy

# Use in training loop
loss = criterion(outputs, targets)
```
Reference: Lin et al., “Focal Loss” (ICCV 2017)
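For reference, focal loss down-weights the cross-entropy of well-classified examples by a factor (1 - p_t)^γ, where p_t is the probability of the true class. A minimal sketch of that computation, not necessarily matching FocalLoss's options:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    """FL = -(1 - p_t)^gamma * log(p_t), averaged over the batch."""
    log_pt = F.log_softmax(logits, dim=-1).gather(1, targets.unsqueeze(1)).squeeze(1)
    pt = log_pt.exp()
    return (-(1.0 - pt) ** gamma * log_pt).mean()
```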
Confidence Penalty#
Explicitly penalizes overconfident predictions:
```python
from incerto.calibration import ConfidencePenalty

criterion = ConfidencePenalty(beta=0.1)
loss = criterion(outputs, targets)
```
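The usual form of this penalty subtracts β times the predictive entropy from the cross-entropy, so low-entropy (overconfident) outputs are discouraged; a sketch under that assumption, not necessarily ConfidencePenalty's exact formulation:

```python
import torch
import torch.nn.functional as F

def confidence_penalty_loss(logits, targets, beta=0.1):
    """Cross-entropy minus beta * predictive entropy (penalizes low-entropy outputs)."""
    ce = F.cross_entropy(logits, targets)
    probs = F.softmax(logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1).mean()
    return ce - beta * entropy
```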
Complete Calibration Workflow#
Here’s a complete example showing the recommended workflow:
```python
import torch
from torch.utils.data import DataLoader

from incerto.calibration import (
    TemperatureScaling,
    ece_score,
    nll,
    brier_score,
)

# 1. Train your model normally
model = YourModel()
# ... training code ...

# 2. Split validation data: 50% for model selection, 50% for calibration
val_loader_1, val_loader_2 = split_validation_data()
# Use val_loader_1 for early stopping, model selection, etc.

# 3. Get predictions on the calibration set (val_loader_2)
model.eval()
all_logits, all_labels = [], []
with torch.no_grad():
    for inputs, labels in val_loader_2:
        logits = model(inputs)
        all_logits.append(logits)
        all_labels.append(labels)
cal_logits = torch.cat(all_logits)
cal_labels = torch.cat(all_labels)

# 4. Measure calibration before
ece_before = ece_score(cal_logits, cal_labels)
print(f"ECE before calibration: {ece_before:.4f}")

# 5. Fit calibrator
calibrator = TemperatureScaling()
calibrator.fit(cal_logits, cal_labels)

# 6. Save calibrator
calibrator.save('calibrator.pt')

# 7. Measure calibration after
calibrated = calibrator.predict(cal_logits)
ece_after = ece_score(calibrated.logits, cal_labels)
print(f"ECE after calibration: {ece_after:.4f}")

# 8. Use on the test set
test_logits = model(test_inputs)
test_calibrated = calibrator.predict(test_logits)
test_probs = test_calibrated.probs  # Use these for decisions
```
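`YourModel()` and `split_validation_data()` above are placeholders. As one possible implementation of the 50/50 split, here is a sketch that assumes a `val_dataset` and `batch_size` are already defined:

```python
import torch
from torch.utils.data import DataLoader, random_split

def split_validation_data():
    # First half for model selection / early stopping, second half for calibration.
    n_total = len(val_dataset)
    n_select = n_total // 2
    select_set, cal_set = random_split(
        val_dataset,
        [n_select, n_total - n_select],
        generator=torch.Generator().manual_seed(0),
    )
    return (DataLoader(select_set, batch_size=batch_size),
            DataLoader(cal_set, batch_size=batch_size))
```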
Visualization#
Reliability diagrams show calibration visually:
```python
import matplotlib.pyplot as plt

from incerto.utils.visualization import plot_reliability_diagram

# Create reliability diagrams before and after calibration
fig, ax = plt.subplots(1, 2, figsize=(12, 5))

# Before calibration
plot_reliability_diagram(logits, labels, n_bins=10, ax=ax[0])
ax[0].set_title('Before Calibration')

# After calibration
calibrated = calibrator.predict(logits)
plot_reliability_diagram(calibrated.logits, labels, n_bins=10, ax=ax[1])
ax[1].set_title('After Calibration')

plt.tight_layout()
plt.savefig('calibration_comparison.png')
```
A well-calibrated model shows points close to the diagonal, where bin confidence equals bin accuracy.
Best Practices#
- Always use a separate calibration set: never calibrate on training data or on the same validation data used for early stopping.
- Start with temperature scaling: it is simple, effective, and rarely overfits.
- Use enough calibration data:
  - Temperature scaling: 500+ samples minimum
  - Vector scaling: 1000+ samples
  - Matrix scaling: 5000+ samples
  - Isotonic regression: 2000+ samples
- Evaluate on a held-out test set: calibration can overfit just like accuracy.
- Monitor multiple metrics: use ECE, MCE, and NLL together for a complete picture.
- Save calibrators with your models: always deploy calibrated predictions to production.
- Re-calibrate when data shifts: if deployment data differs from training data, re-fit the calibrator.
Common Pitfalls#
- ❌ Calibrating on training data: it will appear perfect but won't generalize.
- ❌ Using overly complex calibrators with little data: matrix scaling with 500 samples will overfit.
- ❌ Ignoring class imbalance: use classwise ECE to detect per-class issues.
- ❌ Not saving calibrators: the model and calibrator should be versioned together.
- ❌ Expecting perfect calibration: ECE < 0.05 is excellent; exactly zero ECE is not achievable in practice.
Comparison of Methods#
| Method | Parameters | Data Needed | Preserves Argmax? | Best Use Case |
|---|---|---|---|---|
| Temperature Scaling | 1 | 500+ | ✅ Yes | Default choice, most scenarios |
| Vector Scaling | C | 1000+ | ❌ No | Class imbalance |
| Matrix Scaling | C² | 5000+ | ❌ No | Lots of data, complex patterns |
| Dirichlet | C² | 5000+ | ❌ No | Research, distribution calibration |
| Isotonic | Non-param | 2000+ | ❌ No | Flexible, non-linear patterns |
| Platt Scaling | 2C | 1000+ | ❌ No | Binary classification |
| Beta | Non-param | 1000+ | ❌ No | Binary classification |
References#
Guo et al., “On Calibration of Modern Neural Networks” (ICML 2017)
Kull et al., “Beyond temperature scaling” (NeurIPS 2019)
Kull et al., “Beta calibration” (AISTATS 2017)
Müller et al., “When does label smoothing help?” (NeurIPS 2019)
Nixon et al., “Measuring Calibration in Deep Learning” (CVPR 2019)
See Also#
Calibration - Complete API reference
Quick Start - Quick start guide
Conformal Prediction Guide - Prediction sets with coverage guarantees
Selective Prediction Guide - Selective prediction with abstention