LLM Uncertainty Quantification Guide#

Large Language Models (LLMs) require specialized uncertainty quantification methods because of their autoregressive generation and sheer scale.

Why LLM Uncertainty Matters#

LLMs can be confidently wrong:
  • Hallucinations: Generate plausible but false information

  • Factual errors: Incorrect facts stated with high confidence

  • Out-of-domain: Unreliable behavior when a query falls outside the training distribution

Uncertainty helps:
  • Detect hallucinations

  • Trigger fact-checking

  • Decide when to abstain

Unique Challenges#

Autoregressive generation:

Uncertainty compounds over the generated sequence, because each sampled token conditions everything that follows

No explicit probabilities:

Many hosted models do not expose logits or token-level probabilities

Scale:

With billions of parameters, running the model many times is expensive

Calibration:

Raw model confidence is often poorly calibrated

Methods#

Token-Level Entropy#

Best for: Simple baseline, available for any model that exposes logits

Measure entropy at each token:

from incerto.llm import TokenEntropy

# Get logits from model
logits = model(input_ids)  # (batch, seq_len, vocab)

# Compute token-level entropy
token_entropy = TokenEntropy.compute(logits)

# Average entropy per sequence
avg_entropy = token_entropy.mean(dim=1)

print(f"Average entropy: {avg_entropy}")
# Higher entropy = more uncertain

Interpretation:
  • Low entropy: Model confident in next token

  • High entropy: Model uncertain
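
For intuition, the same quantity can be computed directly from the logits; here is a minimal PyTorch sketch of the standard formula (TokenEntropy.compute may differ internally):

import torch
import torch.nn.functional as F

def token_entropy_from_logits(logits: torch.Tensor) -> torch.Tensor:
    """Per-token predictive entropy (in nats) from raw logits.

    logits: (batch, seq_len, vocab) unnormalized scores.
    Returns a (batch, seq_len) tensor of entropies.
    """
    log_probs = F.log_softmax(logits, dim=-1)   # normalize over the vocabulary
    probs = log_probs.exp()
    return -(probs * log_probs).sum(dim=-1)     # H = -sum_v p_v * log p_v

# Random logits stand in for real model output
logits = torch.randn(2, 5, 32000)
token_entropy = token_entropy_from_logits(logits)   # shape (2, 5)
avg_entropy = token_entropy.mean(dim=1)             # one score per sequence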

Self-Consistency#

Best for: Question answering, factual queries

Sample multiple responses and measure agreement:

from incerto.llm import SelfConsistency

# Generate multiple responses
responses = [
    model.generate(prompt, do_sample=True, temperature=1.0)
    for _ in range(10)
]

# Compute self-consistency metrics
result = SelfConsistency.compute(responses)

print(f"Agreement rate: {result['agreement_rate']:.2f}")
print(f"Unique answers: {result['num_unique']}")
print(f"Most common: {result['most_common']}")

# High agreement = high confidence
if result['agreement_rate'] > 0.7:
    print("Model is confident")
else:
    print("Model is uncertain - consider fact-checking")

Advantages:
  • Model-agnostic

  • Captures semantic uncertainty

  • Effective for factual queries

Disadvantages:
  • Expensive (multiple generations)

  • Requires sampling capability

Reference: Wang et al., “Self-Consistency Improves Chain of Thought Reasoning in Language Models” (ICLR 2023)
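
For intuition about what the agreement metrics capture, a simple majority vote over lightly normalized answer strings looks like the sketch below (SelfConsistency.compute may normalize or match answers differently):

from collections import Counter

def agreement_metrics(responses):
    """Majority-vote agreement over a list of answer strings."""
    # Light normalization so trivial formatting differences don't split votes
    normalized = [r.strip().lower() for r in responses]
    counts = Counter(normalized)
    most_common, freq = counts.most_common(1)[0]
    return {
        "agreement_rate": freq / len(normalized),
        "num_unique": len(counts),
        "most_common": most_common,
    }

print(agreement_metrics(["Paris", "paris", "Paris.", "Lyon"]))
# agreement_rate = 0.5, since "paris." and "lyon" remain distinct strings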

Semantic Entropy#

Best for: Measuring meaning-level uncertainty

Group semantically equivalent responses:

from incerto.llm import SemanticEntropy

responses = generate_multiple_responses(prompt, n=10)  # e.g., sampled as in the Self-Consistency example

# Compute semantic entropy (groups similar answers)
semantic_ent = SemanticEntropy.compute(
    responses,
    similarity_threshold=0.8
)

print(f"Semantic entropy: {semantic_ent:.4f}")
# Accounts for paraphrasing
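
The reference method of Kuhn et al. clusters responses by bidirectional entailment; a simpler embedding-similarity variant that matches the similarity_threshold argument above, assuming a hypothetical embed() helper that maps text to unit-norm vectors, might look like this:

import math

def semantic_entropy_sketch(responses, embed, similarity_threshold=0.8):
    """Greedily cluster responses by cosine similarity, then take the entropy
    over cluster frequencies.

    embed: hypothetical helper mapping a string to a unit-norm vector.
    """
    clusters = []  # list of (representative_vector, count)
    for r in responses:
        v = embed(r)
        for i, (rep, count) in enumerate(clusters):
            cosine = sum(a * b for a, b in zip(rep, v))  # unit-norm => dot product
            if cosine >= similarity_threshold:
                clusters[i] = (rep, count + 1)
                break
        else:
            clusters.append((v, 1))

    total = sum(count for _, count in clusters)
    probs = [count / total for _, count in clusters]
    return -sum(p * math.log(p) for p in probs)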

Verbalized Uncertainty#

Best for: Modern chat models with instruction following

Ask model to express uncertainty:

from incerto.llm import VerbalizedUncertainty

# Prompt model to express uncertainty
prompt = """
Answer the question and express your confidence (0-100%).

Question: {question}
Answer with format: [Answer] | Confidence: [0-100]
"""

response = model.generate(prompt.format(question=query))

# Parse confidence
uncertainty = VerbalizedUncertainty.parse(response)

print(f"Model confidence: {uncertainty['confidence']}")
print(f"Answer: {uncertainty['answer']}")

Advantages:
  • Natural for chat models

  • Fast (single forward pass)

Disadvantages:
  • Requires instruction-tuned model

  • May be miscalibrated
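
For intuition, the "Answer | Confidence" format requested by the prompt above can be extracted with a regular expression; this is only a sketch, and VerbalizedUncertainty.parse is assumed to handle malformed responses more robustly:

import re

def parse_verbalized(response: str) -> dict:
    """Parse '<answer> | Confidence: <0-100>' into an answer and a confidence in [0, 1]."""
    match = re.search(r"^(.*?)\|\s*Confidence:\s*(\d{1,3})", response, re.DOTALL)
    if match is None:
        # Model did not follow the format; treat confidence as unknown
        return {"answer": response.strip(), "confidence": None}
    answer = match.group(1).strip()
    confidence = min(int(match.group(2)), 100) / 100.0
    return {"answer": answer, "confidence": confidence}

print(parse_verbalized("Paris | Confidence: 95"))
# {'answer': 'Paris', 'confidence': 0.95}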

Sequence-Level Metrics#

Aggregate token probabilities across the sequence:

from incerto.llm import SequenceLevelUncertainty

# Generate with log probabilities
output = model.generate(
    input_ids,
    return_dict_in_generate=True,
    output_scores=True
)

# Compute sequence-level metrics
metrics = SequenceLevelUncertainty.compute(output.scores)

print(f"Mean log probability: {metrics['mean_logprob']:.4f}")
print(f"Perplexity: {metrics['perplexity']:.4f}")
print(f"Entropy: {metrics['entropy']:.4f}")

Complete Example#

Uncertainty-aware generation:

from incerto.llm import SelfConsistency, TokenEntropy
import torch

def generate_with_uncertainty(model, tokenizer, prompt, n_samples=10):
    """Generate an answer together with an uncertainty estimate."""

    # 1. Tokenize the prompt and generate multiple responses
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    responses = []
    all_entropies = []

    for _ in range(n_samples):
        output = model.generate(
            input_ids,
            do_sample=True,
            temperature=1.0,
            return_dict_in_generate=True,
            output_scores=True
        )

        response = tokenizer.decode(output.sequences[0], skip_special_tokens=True)
        responses.append(response)

        # Compute token entropy
        logits = torch.stack(output.scores, dim=1)
        entropy = TokenEntropy.compute(logits).mean()
        all_entropies.append(entropy.item())

    # 2. Measure self-consistency
    consistency = SelfConsistency.compute(responses)

    # 3. Aggregate uncertainties
    uncertainty_score = {
        'agreement_rate': consistency['agreement_rate'],
        'num_unique': consistency['num_unique'],
        'avg_entropy': sum(all_entropies) / len(all_entropies),
        'most_common_answer': consistency['most_common']
    }

    return uncertainty_score

# Use it
result = generate_with_uncertainty(model, tokenizer, "What is the capital of France?")

if result['agreement_rate'] > 0.8:
    print(f"High confidence answer: {result['most_common_answer']}")
else:
    print(f"Low confidence - {result['num_unique']} different answers")
    print("Consider fact-checking or abstaining")

Hallucination Detection#

Combine uncertainty signals to detect hallucinations:

def detect_hallucination(model, prompt, claim):
    """Check if model's claim is likely hallucinated."""

    # 1. Generate multiple decoded responses (as in the Self-Consistency example)
    responses = [
        model.generate(prompt, do_sample=True)
        for _ in range(10)
    ]

    # 2. Check consistency
    consistency = SelfConsistency.compute(responses)

    # 3. Check if claim appears in responses
    claim_freq = sum(claim in r for r in responses) / len(responses)

    # Decision logic
    if consistency['agreement_rate'] < 0.3:
        return "LIKELY HALLUCINATION - Low consistency"

    if claim_freq < 0.5:
        return "POSSIBLE HALLUCINATION - Claim not robust"

    return "LIKELY CORRECT - High confidence"

# Example
prompt = "Who won the Nobel Prize in Physics in 2099?"
claim = "Alice Johnson"  # Hallucinated future event

result = detect_hallucination(model, prompt, claim)
print(result)  # "LIKELY HALLUCINATION"

Best Practices#

  1. Use multiple uncertainty signals

    Combine token entropy, self-consistency, perplexity

  2. Calibrate on validation set

    LLM uncertainty estimates are often miscalibrated; tune decision thresholds on held-out data (see the sketch after this list)

  3. Consider computational cost

    Self-consistency requires multiple generations

  4. Domain-specific thresholds

    Factual queries vs. creative tasks need different thresholds

  5. Validate with human evaluation

    Check if uncertainty correlates with correctness

  6. Use for selective generation

    Abstain or request human review when uncertain
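
A minimal sketch of practices 2 and 6 combined: pick an agreement-rate threshold on a labeled validation set, then abstain below it at inference time (the validation data here is hypothetical):

def calibrate_threshold(val_scores, val_correct, target_accuracy=0.9):
    """Return the lowest agreement-rate threshold whose answered subset
    reaches the target accuracy on validation data.

    val_scores:  agreement rate per validation question
    val_correct: whether the majority answer was correct
    """
    for threshold in sorted(set(val_scores)):
        answered = [c for s, c in zip(val_scores, val_correct) if s >= threshold]
        if answered and sum(answered) / len(answered) >= target_accuracy:
            return threshold
    return None  # no threshold reaches the target; abstain more or improve the model

# Hypothetical validation data
scores = [0.9, 0.8, 0.6, 0.4, 0.3]
correct = [True, True, True, False, False]
print(calibrate_threshold(scores, correct))  # 0.6: answer only when agreement >= 0.6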

Applications#

Question Answering:

Detect when model doesn’t know answer

Fact Checking:

Flag claims for verification

Content Moderation:

Review uncertain generations manually

Interactive Systems:

Ask clarifying questions when uncertain

Retrieval-Augmented Generation:

Retrieve more context when uncertain

References#

  1. Wang et al., “Self-Consistency Improves Chain of Thought Reasoning in Language Models” (ICLR 2023)

  2. Kadavath et al., “Language Models (Mostly) Know What They Know” (2022)

  3. Kuhn et al., “Semantic Uncertainty: Linguistic Invariances for Uncertainty Estimation in Natural Language Generation” (ICLR 2023)

  4. Lin et al., “Teaching Models to Express Their Uncertainty in Words” (2022)

See Also#