LLM Uncertainty Quantification Guide#
Large Language Models (LLMs) need special uncertainty quantification methods due to their autoregressive nature and scale.
Why LLM Uncertainty Matters#
- LLMs can be confidently wrong:
  - Hallucinations: plausible but false information generated fluently
  - Factual errors: incorrect facts stated with high confidence
  - Out-of-domain queries: behavior is unreliable when a query falls outside the training distribution
- Uncertainty estimates help to:
  - Detect hallucinations
  - Trigger fact-checking
  - Decide when to abstain
Unique Challenges#
- Autoregressive generation: uncertainty compounds across the generated sequence (see the note after this list)
- No explicit probabilities: many models, especially those served behind APIs, don't expose logits
- Scale: billions of parameters make repeated forward passes expensive
- Calibration: raw model confidence is often poorly calibrated
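To see why uncertainty compounds (first item above), recall the autoregressive factorization of the sequence probability:
$$\log p(y \mid x) = \sum_{t=1}^{T} \log p(y_t \mid y_{<t}, x)$$
A high-entropy token early in the generation changes the conditioning context for every later token, which is why token-level scores are usually aggregated over the whole sequence (see Sequence-Level Metrics below).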
Methods#
Token-Level Entropy#
Best for: Simple baseline whenever token logits are available
Measure entropy at each token:
from incerto.llm import TokenEntropy
# Get logits from model
logits = model(input_ids).logits # (batch, seq_len, vocab) for a HF-style causal LM
# Compute token-level entropy
token_entropy = TokenEntropy.compute(logits)
# Average entropy per sequence
avg_entropy = token_entropy.mean(dim=1)
print(f"Average entropy: {avg_entropy}")
# Higher entropy = more uncertain
- Interpretation:
  - Low entropy: the model is confident about the next token
  - High entropy: the model is uncertain about the next token
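If you prefer to see the quantity spelled out, the same per-token entropy can be computed directly from the logits with plain PyTorch. This is a minimal sketch of the formula, not the incerto implementation:
import torch
import torch.nn.functional as F
def token_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Shannon entropy (in nats) of the next-token distribution; logits: (batch, seq_len, vocab)."""
    log_probs = F.log_softmax(logits, dim=-1)
    return -(log_probs.exp() * log_probs).sum(dim=-1)  # (batch, seq_len)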
Self-Consistency#
Best for: Question answering, factual queries
Sample multiple responses and measure agreement:
from incerto.llm import SelfConsistency
# Generate multiple responses
responses = [
    model.generate(prompt, do_sample=True, temperature=1.0)
    for _ in range(10)
]
# Compute self-consistency metrics
result = SelfConsistency.compute(responses)
print(f"Agreement rate: {result['agreement_rate']:.2f}")
print(f"Unique answers: {result['num_unique']}")
print(f"Most common: {result['most_common']}")
# High agreement = high confidence
if result['agreement_rate'] > 0.7:
    print("Model is confident")
else:
    print("Model is uncertain - consider fact-checking")
- Advantages:
  - Model-agnostic
  - Captures semantic uncertainty
  - Effective for factual queries
- Disadvantages:
  - Expensive (multiple generations)
  - Requires sampling capability
Reference: Wang et al., “Self-Consistency Improves Chain of Thought Reasoning in Language Models” (ICLR 2023)
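The core quantity behind these metrics is easy to state: normalize each sampled response to an answer string, then measure how dominant the most common answer is. A minimal sketch (extract_answer is a hypothetical placeholder for task-specific answer parsing, not part of incerto):
from collections import Counter
def agreement_stats(responses, extract_answer=lambda r: r.strip().lower()):
    """Majority-vote agreement over sampled responses."""
    answers = [extract_answer(r) for r in responses]
    counts = Counter(answers)
    most_common, freq = counts.most_common(1)[0]
    return {
        'agreement_rate': freq / len(answers),  # share of samples giving the top answer
        'num_unique': len(counts),
        'most_common': most_common,
    }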
Semantic Entropy#
Best for: Measuring meaning-level uncertainty
Group semantically equivalent responses:
from incerto.llm import SemanticEntropy
responses = generate_multiple_responses(prompt, n=10)  # e.g. sampled generations, as in the Self-Consistency example
# Compute semantic entropy (groups semantically similar answers)
semantic_ent = SemanticEntropy.compute(
    responses,
    similarity_threshold=0.8
)
print(f"Semantic entropy: {semantic_ent:.4f}")
# Accounts for paraphrasing
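Conceptually, the method clusters the sampled responses into groups with (approximately) the same meaning and computes entropy over the group frequencies. Below is a rough sketch under illustrative assumptions: it decides equivalence with sentence-transformers embeddings and greedy threshold clustering, whereas Kuhn et al. use bidirectional NLI entailment; it is not the incerto implementation.
import math
from sentence_transformers import SentenceTransformer, util  # assumed extra dependency
def semantic_entropy(responses, similarity_threshold=0.8):
    """Entropy (in nats) over clusters of semantically similar responses."""
    encoder = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = encoder.encode(responses, convert_to_tensor=True, normalize_embeddings=True)
    clusters = []  # each cluster is a list of response indices
    for i in range(len(responses)):
        for cluster in clusters:
            if util.cos_sim(embeddings[i], embeddings[cluster[0]]).item() >= similarity_threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    n = len(responses)
    return -sum((len(c) / n) * math.log(len(c) / n) for c in clusters)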
Verbalized Uncertainty#
Best for: Modern chat models with instruction following
Ask the model to express its own uncertainty:
from incerto.llm import VerbalizedUncertainty
# Prompt model to express uncertainty
prompt = """
Answer the question and express your confidence (0-100%).
Question: {question}
Answer with format: [Answer] | Confidence: [0-100]
"""
response = model.generate(prompt.format(question=query))
# Parse confidence
uncertainty = VerbalizedUncertainty.parse(response)
print(f"Model confidence: {uncertainty['confidence']}")
print(f"Answer: {uncertainty['answer']}")
- Advantages:
  - Natural for chat models
  - Fast (one generation instead of many samples)
- Disadvantages:
  - Requires an instruction-tuned model
  - May be poorly calibrated
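Parsing the verbalized confidence is typically a small string-matching step. A minimal sketch for the prompt format above (illustrative only, not the incerto parser):
import re
def parse_verbalized(response):
    """Parse '<answer> | Confidence: <0-100>' style responses."""
    match = re.search(r"(.*?)\|\s*Confidence:\s*(\d{1,3})", response, flags=re.S)
    if match is None:
        return {'answer': response.strip(), 'confidence': None}  # format not followed
    return {
        'answer': match.group(1).strip(),
        'confidence': min(int(match.group(2)), 100) / 100.0,  # map 0-100 to a probability
    }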
Sequence-Level Metrics#
Aggregate token probabilities across the whole generated sequence:
from incerto.llm import SequenceLevelUncertainty
# Generate with log probabilities
output = model.generate(
    input_ids,
    return_dict_in_generate=True,
    output_scores=True
)
# Compute sequence-level metrics
metrics = SequenceLevelUncertainty.compute(output.scores)
print(f"Mean log probability: {metrics['mean_logprob']:.4f}")
print(f"Perplexity: {metrics['perplexity']:.4f}")
print(f"Entropy: {metrics['entropy']:.4f}")
Complete Example#
Uncertainty-aware generation:
from incerto.llm import SelfConsistency, TokenEntropy
import torch
def generate_with_uncertainty(model, tokenizer, prompt, n_samples=10):
    """Generate an answer together with uncertainty estimates."""
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    # 1. Generate multiple sampled responses
    responses = []
    all_entropies = []
    for _ in range(n_samples):
        output = model.generate(
            input_ids,
            do_sample=True,
            temperature=1.0,
            return_dict_in_generate=True,
            output_scores=True
        )
        response = tokenizer.decode(output.sequences[0], skip_special_tokens=True)
        responses.append(response)
        # Compute token entropy over the generated steps
        logits = torch.stack(output.scores, dim=1)
        entropy = TokenEntropy.compute(logits).mean()
        all_entropies.append(entropy.item())
    # 2. Measure self-consistency across the samples
    consistency = SelfConsistency.compute(responses)
    # 3. Aggregate uncertainty signals
    uncertainty_score = {
        'agreement_rate': consistency['agreement_rate'],
        'num_unique': consistency['num_unique'],
        'avg_entropy': sum(all_entropies) / len(all_entropies),
        'most_common_answer': consistency['most_common']
    }
    return uncertainty_score
# Use it
result = generate_with_uncertainty(model, tokenizer, "What is the capital of France?")
if result['agreement_rate'] > 0.8:
    print(f"High confidence answer: {result['most_common_answer']}")
else:
    print(f"Low confidence - {result['num_unique']} different answers")
    print("Consider fact-checking or abstaining")
Hallucination Detection#
Combine uncertainty signals to detect hallucinations:
from incerto.llm import SelfConsistency
def detect_hallucination(model, prompt, claim):
    """Check whether the model's claim is likely hallucinated."""
    # 1. Generate multiple sampled responses
    responses = [
        model.generate(prompt, do_sample=True)
        for _ in range(10)
    ]
    # 2. Check consistency across the samples
    consistency = SelfConsistency.compute(responses)
    # 3. Check how often the claim appears in the responses
    claim_freq = sum(claim in r for r in responses) / len(responses)
    # Decision logic
    if consistency['agreement_rate'] < 0.3:
        return "LIKELY HALLUCINATION - Low consistency"
    if claim_freq < 0.5:
        return "POSSIBLE HALLUCINATION - Claim not robust"
    return "LIKELY CORRECT - High confidence"
# Example
prompt = "Who won the Nobel Prize in Physics in 2099?"
claim = "Alice Johnson" # Hallucinated future event
result = detect_hallucination(model, prompt, claim)
print(result) # "LIKELY HALLUCINATION"
Best Practices#
- Use multiple uncertainty signals: combine token entropy, self-consistency, and perplexity (see the sketch after this list)
- Calibrate on a validation set: LLM uncertainties are often miscalibrated out of the box
- Consider computational cost: self-consistency requires multiple generations per query
- Use domain-specific thresholds: factual queries and creative tasks need different cut-offs
- Validate with human evaluation: check whether uncertainty actually correlates with correctness
- Use uncertainty for selective generation: abstain or request human review when the model is uncertain
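A simple way to act on the first two practices is to fold the signals into one score and tune the abstention threshold on labeled validation data. The weights, normalization constants, and threshold below are illustrative placeholders, not recommended values:
def combined_uncertainty(agreement_rate, avg_entropy, perplexity,
                         w_agree=0.6, w_entropy=0.25, w_ppl=0.15):
    """Heuristic 0-1 uncertainty score; higher means less reliable."""
    disagreement = 1.0 - agreement_rate
    entropy_term = min(avg_entropy / 5.0, 1.0)    # assumes entropy rarely exceeds ~5 nats
    ppl_term = min(perplexity / 100.0, 1.0)       # assumes perplexity rarely exceeds ~100
    return w_agree * disagreement + w_entropy * entropy_term + w_ppl * ppl_term
# Tune the abstention threshold on a held-out labeled set, e.g. the value that
# maximizes accuracy on answered queries at your target coverage.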
Applications#
- Question Answering: detect when the model doesn't know the answer
- Fact Checking: flag claims for verification
- Content Moderation: review uncertain generations manually
- Interactive Systems: ask clarifying questions when uncertain
- Retrieval-Augmented Generation: retrieve more context when uncertain (see the sketch below)
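As a concrete pattern for the retrieval-augmented case, uncertainty can gate the retrieval step: answer directly when self-consistency is high, otherwise fetch supporting context and regenerate. In this sketch, generate_answer and retrieve are hypothetical helpers standing in for your generation and retrieval code:
from incerto.llm import SelfConsistency
def answer_with_retrieval_fallback(question, threshold=0.7):
    """Uncertainty-gated RAG: only retrieve when the model looks unsure."""
    responses = [generate_answer(question) for _ in range(5)]  # generate_answer: hypothetical sampling helper
    stats = SelfConsistency.compute(responses)
    if stats['agreement_rate'] >= threshold:
        return stats['most_common']
    # Low agreement: fetch supporting documents and regenerate with added context.
    context = retrieve(question)  # retrieve: hypothetical retriever
    return generate_answer(question, context=context)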
References#
Wang et al., “Self-Consistency Improves Chain of Thought Reasoning in Language Models” (ICLR 2023)
Kadavath et al., “Language Models (Mostly) Know What They Know” (2022)
Kuhn et al., “Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation” (ICLR 2023)
Lin et al., “Teaching Models to Express Their Uncertainty in Words” (2022)
See Also#
LLM Uncertainty - Complete API reference
Calibration Guide - Calibration for language models
Selective Prediction Guide - Selective generation