LLM Uncertainty#

The LLM module provides uncertainty quantification methods specifically designed for large language models.

Token-level Uncertainty#

TokenEntropy()

Compute predictive entropy at each token position.

TokenConfidence()

Maximum softmax probability at each token position.

TokenPerplexity()

Perplexity at each token position.

SurprisalScore()

Surprisal (negative log-probability) of generated tokens.

TopKConfidence()

Confidence based on probability mass in top-k tokens.
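The token-level quantities above are all simple functions of the model's next-token distribution. A minimal from-scratch sketch of the two most common ones, `TokenEntropy` and `TokenConfidence`, computed over a toy logits matrix (the function names and shapes here are illustrative, not the library's API):

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax along the vocabulary axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def token_entropy(logits):
    # Predictive entropy H = -sum_v p(v) log p(v) at each token position.
    p = softmax(logits)
    return -(p * np.log(p + 1e-12)).sum(axis=-1)

def token_confidence(logits):
    # Maximum softmax probability at each token position.
    return softmax(logits).max(axis=-1)

# Toy example: 3 token positions over a 4-word vocabulary.
logits = np.array([[4.0, 0.0, 0.0, 0.0],   # peaked -> low entropy
                   [1.0, 1.0, 1.0, 1.0],   # uniform -> max entropy
                   [2.0, 1.0, 0.0, 0.0]])
entropies = token_entropy(logits)
confidences = token_confidence(logits)
```

For a uniform distribution over V tokens the entropy is log V, which gives a natural upper bound for normalizing the scores.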

Sequence-level Uncertainty#

SequenceProbability()

Joint probability of the entire sequence.

AverageLogProb()

Mean log-probability across the sequence.

NormalizedSequenceProb()

Length-normalized sequence probability.

SequenceEntropy()

Aggregated entropy over the sequence.

SequencePerplexity()

Perplexity of the entire sequence.

VarianceOfTokenProbs()

Variance of token probabilities across the sequence.
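All of the sequence-level scores above derive from the per-token log-probabilities of the generated text. A hedged sketch of how `SequenceProbability`, `AverageLogProb`, `NormalizedSequenceProb`, and `SequencePerplexity` relate (the helper name and return convention are assumptions, not the library's signature):

```python
import numpy as np

def sequence_stats(token_logprobs):
    # token_logprobs: log p(token_t | prefix) for each generated token.
    lp = np.asarray(token_logprobs, dtype=float)
    joint_logprob = lp.sum()            # log of the joint sequence probability
    avg_logprob = lp.mean()             # mean log-probability across the sequence
    norm_prob = np.exp(avg_logprob)     # length-normalized probability (geometric mean)
    perplexity = np.exp(-avg_logprob)   # sequence perplexity
    return joint_logprob, avg_logprob, norm_prob, perplexity

# Two tokens each generated with probability 0.5:
j, a, p, ppl = sequence_stats([np.log(0.5), np.log(0.5)])
```

Length normalization matters because the joint probability decays with sequence length, which is exactly the bias `SequenceLengthCalibration` targets.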

Sampling-based Uncertainty#

SelfConsistency()

Self-consistency via majority voting across samples.

LexicalSimilarity()

Measure lexical similarity across samples.

VarianceRatio()

Variance ratio for classification and multiple-choice tasks.

PredictiveEntropy()

Predictive entropy across multiple sampled sequences.

MutualInformation()

Mutual information between predictions and the model, decomposing uncertainty into aleatoric and epistemic components.

SemanticEntropy()

Entropy computed over semantically clustered responses.

EnsembleDisagreement()

Disagreement rate across an ensemble of models or sampling strategies.
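Sampling-based scores only need the set of sampled answers, not token probabilities. A minimal sketch of the ideas behind `SelfConsistency` (majority vote, with the winner's vote share as confidence) and `PredictiveEntropy` (entropy of the empirical answer distribution); the function names are illustrative:

```python
import math
from collections import Counter

def self_consistency(samples):
    # Majority vote over sampled answers; confidence = winner's vote share.
    counts = Counter(samples)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(samples)

def predictive_entropy(samples):
    # Entropy of the empirical distribution over distinct answers.
    counts = Counter(samples)
    n = len(samples)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

samples = ["42", "42", "42", "41", "42"]
answer, confidence = self_consistency(samples)
entropy = predictive_entropy(samples)
```

`SemanticEntropy` follows the same recipe but first clusters paraphrases ("four" and "4") into semantic equivalence classes before computing the entropy.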

Generation Methods#

BeamSearchUncertainty()

Uncertainty estimation from beam search scores.

NucleusSamplingUncertainty()

Uncertainty for nucleus (top-p) sampling.

IDontKnowDetection()

Detect when the model is expressing uncertainty verbally.

ContrastiveDecoding()

Uncertainty from contrastive decoding (comparing expert vs amateur models).
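Of the generation methods, `IDontKnowDetection` is the simplest to illustrate: scan the output for verbal hedges. A toy sketch with a small hypothetical phrase list (a real detector would use a larger lexicon or a trained classifier):

```python
import re

# Hypothetical hedge lexicon; illustrative only.
HEDGE_PATTERNS = [
    r"\bi don'?t know\b",
    r"\bi'?m not sure\b",
    r"\bcannot (be )?determine",
    r"\bunclear\b",
]

def detect_verbal_uncertainty(text):
    # Flag responses where the model verbally expresses uncertainty.
    lower = text.lower()
    return any(re.search(p, lower) for p in HEDGE_PATTERNS)
```

This complements probability-based scores: a model can emit "I don't know" with high token-level confidence, so the two signals catch different failure modes.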

Verbalized Uncertainty#

VerbalizedConfidence()

Ask the model to verbalize its confidence.

PTrue()

Ask the model for the probability that its answer is correct (P(True)).

SelfEvaluation()

Multi-turn self-critique for uncertainty.

BidirectionalConsistency()

Check consistency by asking the question in different ways.
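The verbalized methods work by prompting rather than by inspecting logits. A hedged sketch of a P(True)-style flow, split into building the prompt and parsing a numeric confidence out of the model's free-text reply (the prompt wording and parsing rule are assumptions; the library's actual implementation may differ):

```python
import re

def p_true_prompt(question, answer):
    # Hypothetical prompt wording for eliciting P(True).
    return (
        f"Question: {question}\n"
        f"Proposed answer: {answer}\n"
        "Is the proposed answer correct? "
        "Reply with a probability between 0 and 1."
    )

def parse_probability(reply, default=0.5):
    # Extract the first number from the reply, clamped to [0, 1];
    # fall back to `default` if the model gave no number.
    m = re.search(r"\d*\.?\d+", reply)
    if m is None:
        return default
    return min(max(float(m.group()), 0.0), 1.0)
```

In practice the raw verbalized numbers are often poorly calibrated, which is why the Calibration section below exists.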

Calibration#

TokenTemperatureScaling([init_temp])

Temperature scaling for token-level probabilities.

SequenceLengthCalibration([alpha])

Calibrate for length bias in sequence probabilities.

VerbosityBiasCorrection()

Correct for the model's tendency to be more confident on verbose outputs.

HistogramBinning([n_bins])

Histogram binning calibration for LLM confidence scores.
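Temperature scaling divides logits by a single learned scalar T before the softmax; T > 1 softens overconfident predictions. A minimal sketch that fits T by grid search over held-out labels (a stand-in for the gradient-based fit an implementation like `TokenTemperatureScaling` would typically use; names and the grid are assumptions):

```python
import numpy as np

def fit_temperature(logits, labels, grid=np.linspace(0.25, 4.0, 76)):
    # Pick the temperature minimizing negative log-likelihood on held-out data.
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels)
    best_t, best_nll = 1.0, np.inf
    for t in grid:
        z = logits / t
        z = z - z.max(axis=1, keepdims=True)
        logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        nll = -logp[np.arange(len(labels)), labels].mean()
        if nll < best_nll:
            best_t, best_nll = t, nll
    return best_t

# Overconfident model: ~100% confidence but only 3/4 correct -> T > 1.
logits = [[10.0, 0.0], [10.0, 0.0], [10.0, 0.0], [10.0, 0.0]]
labels = [0, 0, 0, 1]
t_fit = fit_temperature(logits, labels)
```

Because T rescales all classes identically, temperature scaling changes confidence without changing the argmax prediction.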

Metrics#

selective_accuracy(predictions, targets, ...)

Compute accuracy on high-confidence predictions.

calibration_error(confidences, correctness)

Compute Expected Calibration Error (ECE) and Maximum Calibration Error (MCE).

brier_score(confidences, correctness)

Compute Brier score for binary correctness prediction.

aur_c(confidences, correctness)

Compute the area under the risk-coverage curve (AURC).

uncertainty_auc(uncertainties, correctness)

Compute AUC for using uncertainty to discriminate correct from incorrect predictions.

token_level_accuracy(pred_tokens, true_tokens)

Compute token-level accuracy.

sequence_level_accuracy(pred_sequences, ...)

Compute sequence-level exact match accuracy.

f1_score_tokens(pred_tokens, true_tokens[, mask])

Compute precision, recall, and F1 at token level.
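The metric most of the calibration tools target is ECE: bin predictions by confidence, then average the gap between each bin's accuracy and its mean confidence, weighted by bin size. A minimal sketch (illustrative, not the library's `calibration_error` signature):

```python
import numpy as np

def expected_calibration_error(confidences, correctness, n_bins=10):
    # ECE = sum over bins of (bin weight) * |bin accuracy - bin confidence|.
    conf = np.asarray(confidences, dtype=float)
    corr = np.asarray(correctness, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Half-open bins (lo, hi], with the first bin closed at 0.
        mask = (conf > lo) & (conf <= hi) if lo > 0 else (conf >= lo) & (conf <= hi)
        if mask.any():
            ece += mask.mean() * abs(corr[mask].mean() - conf[mask].mean())
    return ece

# Perfectly calibrated: 80% confidence, 80% accuracy -> ECE = 0.
ece_good = expected_calibration_error([0.8] * 10, [1, 1, 1, 1, 1, 1, 1, 1, 0, 0])
# Overconfident: 90% confidence, 50% accuracy -> ECE = 0.4.
ece_bad = expected_calibration_error([0.9] * 10, [1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
```

MCE is the same computation with `max` over bins instead of the weighted sum.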

Visualization#

plot_token_uncertainty(tokens, uncertainties)

Plot uncertainty as a heatmap over token sequence.

plot_confidence_vs_correctness(confidences, ...)

Plot calibration diagram showing confidence vs. correctness.

plot_generation_diversity(responses[, ...])

Visualize diversity of generated responses.

plot_semantic_clusters(responses, clusters)

Visualize semantic clustering of responses.

plot_risk_coverage_llm(confidences, correctness)

Plot risk-coverage curve for selective prediction.

plot_uncertainty_distribution(uncertainties)

Plot distribution of uncertainty scores.

plot_length_vs_confidence(lengths, confidences)

Plot relationship between sequence length and confidence.
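The curve behind `plot_risk_coverage_llm` is straightforward to compute: sort predictions by confidence (most confident first); at each coverage level, risk is the error rate of the predictions kept so far. A minimal sketch of that underlying computation (the helper name is illustrative):

```python
import numpy as np

def risk_coverage_points(confidences, correctness):
    # Keep the most confident predictions first; at each coverage level,
    # risk is the error rate of the kept set.
    order = np.argsort(-np.asarray(confidences, dtype=float))
    errors = 1.0 - np.asarray(correctness, dtype=float)[order]
    n = len(errors)
    coverage = np.arange(1, n + 1) / n
    risk = np.cumsum(errors) / np.arange(1, n + 1)
    return coverage, risk

# Confidence separates correct from incorrect -> risk stays low until
# the low-confidence error is admitted at full coverage.
cov, risk = risk_coverage_points([0.9, 0.8, 0.1], [1, 1, 0])
```

Integrating this curve over coverage gives the AURC reported by `aur_c`, and a lower curve means better selective prediction.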