LLM Uncertainty#
The LLM module provides uncertainty quantification methods specifically designed for large language models.
Token-level Uncertainty#
- Compute predictive entropy at each token position.
- Maximum softmax probability at each token position.
- Perplexity at each token position.
- Surprisal (negative log-probability) of generated tokens.
- Confidence based on probability mass in the top-k tokens.
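The token-level scores above all derive from the model's next-token distribution. A minimal sketch, assuming a normalized probability vector over the vocabulary is available at each position (the function name and toy distributions are illustrative, not part of the module's API):

```python
import math

def token_uncertainties(probs):
    """Per-token uncertainty scores from one next-token distribution.

    probs: list of probabilities over the vocabulary at a single
    position (assumed normalized). Returns predictive entropy (nats),
    maximum softmax probability, and surprisal of the top token.
    """
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    msp = max(probs)              # maximum softmax probability
    surprisal = -math.log(msp)    # -log p of the most likely token
    return entropy, msp, surprisal

# A peaked distribution is low-uncertainty; a flat one is maximal.
peaked = [0.97, 0.01, 0.01, 0.01]
flat = [0.25, 0.25, 0.25, 0.25]
```

On `flat`, entropy reaches its maximum of log(4) and the maximum softmax probability drops to 0.25; on `peaked`, entropy is much lower.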
Sequence-level Uncertainty#
- Joint probability of the entire sequence.
- Mean log-probability across the sequence.
- Length-normalized sequence probability.
- Aggregated entropy over the sequence.
- Perplexity of the entire sequence.
- Variance of token probabilities across the sequence.
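The sequence-level scores are different aggregations of the same per-token log-probabilities. A sketch of the standard definitions, assuming the per-token log-probs of the generated tokens are available (the function name is illustrative):

```python
import math

def sequence_scores(token_logprobs):
    """Aggregate per-token log-probabilities into sequence-level scores."""
    n = len(token_logprobs)
    joint_logprob = sum(token_logprobs)    # log P(sequence)
    mean_logprob = joint_logprob / n       # mean log-probability
    perplexity = math.exp(-mean_logprob)   # exp of mean surprisal
    norm_prob = math.exp(mean_logprob)     # length-normalized probability
    return joint_logprob, mean_logprob, perplexity, norm_prob
```

For example, a sequence whose every token has probability 0.5 has perplexity 2.0 regardless of its length, which is what length normalization is for: the raw joint probability shrinks with length even when the model is equally confident at every step.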
Sampling-based Uncertainty#
- Self-consistency via majority voting across samples.
- Lexical similarity across samples.
- Variance ratio for classification/multiple choice.
- Predictive entropy across multiple sampled sequences.
- Mutual information between predictions and model (separating aleatoric from epistemic uncertainty).
- Semantic entropy: entropy over semantically clustered responses.
- Disagreement rate across an ensemble of models or sampling strategies.
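Self-consistency and sample-based entropy both start from a set of answers sampled at nonzero temperature. A minimal sketch of majority voting with an agreement score and the entropy of the empirical answer distribution (function name and inputs are illustrative; real usage would first normalize or semantically cluster the sampled answers):

```python
import math
from collections import Counter

def self_consistency(samples):
    """Majority-vote answer plus uncertainty over sampled responses.

    samples: list of final answers extracted from sampled generations.
    Returns (majority answer, agreement fraction, entropy in nats).
    """
    counts = Counter(samples)
    n = len(samples)
    answer, freq = counts.most_common(1)[0]
    agreement = freq / n  # fraction of samples backing the winner
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return answer, agreement, entropy
```

High agreement (entropy near zero) suggests the model is consistent; a flat answer distribution signals high sampling-based uncertainty.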
Generation Methods#
- Uncertainty estimation from beam search scores.
- Uncertainty for nucleus (top-p) sampling.
- Detect when the model is expressing uncertainty verbally.
- Uncertainty from contrastive decoding (comparing expert vs. amateur models).
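One simple uncertainty signal connected to nucleus sampling is the size of the top-p set itself: how many tokens are needed to cover probability mass p. A sketch under that assumption (the function name is illustrative):

```python
def nucleus_size(probs, p=0.9):
    """Number of highest-probability tokens needed to reach mass p.

    A larger nucleus means a flatter, more uncertain next-token
    distribution; a nucleus of size 1 means one token dominates.
    """
    cum = 0.0
    for k, q in enumerate(sorted(probs, reverse=True), start=1):
        cum += q
        if cum >= p:
            return k
    return len(probs)  # distribution sums to < p (shouldn't happen if normalized)
```

A peaked distribution like `[0.97, 0.01, 0.01, 0.01]` has a nucleus of size 1 at p=0.9, while a uniform distribution over four tokens needs all four.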
Verbalized Uncertainty#
- Ask the model to verbalize its confidence.
- P(True): ask the model for the probability that its answer is correct.
- Multi-turn self-critique for uncertainty.
- Check consistency by asking the question in different ways.
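Verbalized confidence ultimately arrives as free text, so some parsing step is needed to turn a reply like "I am about 80% confident" into a number. A hypothetical helper, not part of the module's API, showing one way to extract a stated confidence:

```python
import re

def parse_verbalized_confidence(text):
    """Extract a self-reported confidence from a model response.

    Looks for a percentage ("80% confident") or a decimal followed by
    'confidence'/'probability'. Returns a value in [0, 1], or None if
    no confidence statement is found. Purely illustrative parsing.
    """
    m = re.search(r"(\d{1,3})\s*%", text)
    if m:
        return min(int(m.group(1)), 100) / 100.0
    m = re.search(r"\b(0?\.\d+)\s*(?:probability|confidence)", text)
    if m:
        return float(m.group(1))
    return None
```

In practice the prompt should constrain the answer format, since free-form hedging ("fairly sure") carries no parsable number at all.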
Calibration#
- Temperature scaling for token-level probabilities.
- Calibration for length bias in sequence probabilities, correcting the model's tendency to be more confident on verbose outputs.
- Histogram binning calibration for LLM confidence scores.
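Temperature scaling divides the logits by a single scalar T before the softmax: T > 1 flattens the distribution (less confident), T < 1 sharpens it. A minimal sketch (the function name is illustrative; in practice T is fit on held-out data by minimizing negative log-likelihood):

```python
import math

def softmax_with_temperature(logits, T):
    """Temperature-scaled softmax over a list of logits.

    Subtracts the max for numerical stability; T=1 recovers the
    unscaled distribution.
    """
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

Because a single T rescales every position identically, it changes confidence without changing the argmax, so accuracy is untouched while calibration improves.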
Metrics#
- Compute accuracy on high-confidence predictions.
- Compute Expected Calibration Error (ECE) and Maximum Calibration Error (MCE).
- Compute Brier score for binary correctness prediction.
- Area under the risk-coverage curve.
- AUC for using uncertainty to filter incorrect predictions.
- Compute token-level accuracy.
- Compute sequence-level exact match accuracy.
- Compute precision, recall, and F1 at the token level.
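ECE is the workhorse calibration metric: predictions are binned by confidence, and the gaps between each bin's mean confidence and its accuracy are averaged, weighted by bin size. A minimal sketch (the function name is illustrative):

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE over equal-width confidence bins.

    confidences: predicted probabilities in [0, 1].
    correct: matching booleans (was the prediction right?).
    Returns the bin-size-weighted mean |confidence - accuracy| gap.
    """
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # clamp conf == 1.0
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        acc = sum(1 for _, ok in b if ok) / len(b)
        ece += (len(b) / n) * abs(avg_conf - acc)
    return ece
```

MCE is the same construction with `max` over the per-bin gaps instead of the weighted sum. A model that reports 95% confidence but is always right has an ECE of 0.05: underconfidence is penalized just like overconfidence.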
Visualization#
- Plot uncertainty as a heatmap over the token sequence.
- Plot calibration (reliability) diagram showing confidence vs. accuracy.
- Visualize diversity of generated responses.
- Visualize semantic clustering of responses.
- Plot risk-coverage curve for selective prediction.
- Plot distribution of uncertainty scores.
- Plot relationship between sequence length and confidence.
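The risk-coverage curve underlying the selective-prediction plot is straightforward to compute: sort predictions from most to least confident, then record the error rate (risk) among the predictions kept at each coverage level. A sketch of the data behind the plot, leaving the actual drawing to any plotting library (the function name is illustrative):

```python
def risk_coverage_points(uncertainties, correct):
    """(coverage, risk) points for a risk-coverage curve.

    Predictions are admitted in order of increasing uncertainty;
    risk at coverage k/n is the error rate among the k most
    confident predictions.
    """
    order = sorted(range(len(uncertainties)), key=lambda i: uncertainties[i])
    points, errors = [], 0
    for k, i in enumerate(order, start=1):
        errors += 0 if correct[i] else 1
        points.append((k / len(order), errors / k))
    return points
```

If uncertainty ranks errors well, the curve stays near zero risk at low coverage and only rises as the least confident (and most error-prone) predictions are admitted.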