Active Learning Guide#
Active learning reduces labeling costs by strategically selecting which samples to label. Instead of random sampling, query the most informative examples.
Why Active Learning#
- Labeling is expensive:
  - Medical image annotation requires expert radiologists
  - NLP tasks need careful human review
  - Robotics needs real-world interaction
Depending on the task, active learning can reach the same performance as full supervision with 10-100x less labeled data.
Core Idea#
1. Train a model on a small labeled set
2. Query strategy: select the most informative unlabeled samples
3. Get labels for the selected samples (human annotation)
4. Add them to the training set and retrain
5. Repeat until the budget is exhausted or performance is adequate
Acquisition Functions#
Acquisition functions score how informative each unlabeled sample is.
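To make these scores concrete, here is a plain-NumPy illustration of the three classic acquisition scores covered below (entropy, least confidence, and margin). The probability vectors are made up, and nothing here uses the incerto API:

```python
import numpy as np

# Two illustrative predictive distributions over 3 classes
confident = np.array([0.90, 0.07, 0.03])
uncertain = np.array([0.40, 0.35, 0.25])

def entropy(p):
    # Shannon entropy: high when probability mass is spread out
    return -np.sum(p * np.log(p + 1e-12))

def least_confidence(p):
    # 1 - max probability: high when the top prediction is weak
    return 1.0 - p.max()

def margin(p):
    # Gap between the top-2 probabilities: a SMALL margin means the
    # model can barely separate its two best guesses
    top2 = np.sort(p)[-2:]
    return top2[1] - top2[0]

for name, p in [("confident", confident), ("uncertain", uncertain)]:
    print(f"{name}: entropy={entropy(p):.3f}, "
          f"least_conf={least_confidence(p):.3f}, margin={margin(p):.3f}")
```

All three scores agree that the second distribution is the more informative sample to label: it has higher entropy, a higher least-confidence score, and a smaller margin.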
Entropy Acquisition#
Best for: Starting point, simple and effective
Query samples where model is most uncertain:
```python
from incerto.active import EntropyAcquisition, UncertaintySampling

# Create acquisition function
acquisition = EntropyAcquisition()

# Use with uncertainty sampling strategy
strategy = UncertaintySampling(
    acquisition_fn=acquisition,
    batch_size=100,
)

# Query most uncertain samples
query_indices = strategy.query(model, unlabeled_data)

# Label these samples
samples_to_label = unlabeled_data[query_indices]
```
Least Confidence#
Query samples where model is least confident in its prediction:
```python
from incerto.active import LeastConfidenceAcquisition, UncertaintySampling

acquisition = LeastConfidenceAcquisition()
strategy = UncertaintySampling(acquisition, batch_size=100)
query_indices = strategy.query(model, unlabeled_data)
```
Margin Sampling#
Query samples with smallest margin between top-2 predictions:
```python
from incerto.active import MarginAcquisition, UncertaintySampling

acquisition = MarginAcquisition()
strategy = UncertaintySampling(acquisition, batch_size=100)
query_indices = strategy.query(model, unlabeled_data)
```
BALD (Bayesian Active Learning by Disagreement)#
Best for: When using Bayesian methods (MC Dropout, ensembles)
Query samples with highest mutual information:
```python
from incerto.active import BALDAcquisition, UncertaintySampling

# BALD uses multiple forward passes for MC Dropout
acquisition = BALDAcquisition(num_samples=10)
strategy = UncertaintySampling(acquisition, batch_size=100)

# Model should have dropout enabled
query_indices = strategy.query(model, unlabeled_data)
```
Intuition: query where samples of the model weights disagree most (high epistemic uncertainty).
Reference: Houlsby et al., “Bayesian Active Learning for Classification and Preference Learning” (arXiv 2011)
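The BALD score is the mutual information between the prediction and the model weights: the entropy of the averaged prediction minus the average entropy of the individual stochastic passes. A small NumPy sketch (independent of incerto's `BALDAcquisition`) shows why confident disagreement between passes, rather than mere uncertainty, drives the score:

```python
import numpy as np

def bald_score(probs):
    """probs: (T, C) softmax outputs from T stochastic forward passes
    (e.g. MC Dropout). Returns the mutual information between the
    prediction and the model weights."""
    eps = 1e-12
    mean_p = probs.mean(axis=0)
    predictive_entropy = -np.sum(mean_p * np.log(mean_p + eps))
    expected_entropy = -np.sum(probs * np.log(probs + eps), axis=1).mean()
    return predictive_entropy - expected_entropy

# Passes agree on an uncertain answer: aleatoric noise, BALD ~ 0
agree = np.array([[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]])
# Passes disagree confidently: epistemic uncertainty, BALD is high
disagree = np.array([[0.95, 0.05], [0.05, 0.95], [0.95, 0.05]])

print(bald_score(agree), bald_score(disagree))
```

Only the second sample is worth querying: its uncertainty is the kind that shrinks once the model sees the label.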
Query Strategies#
Uncertainty Sampling#
Simple top-k selection based on acquisition scores:
```python
from incerto.active import EntropyAcquisition, UncertaintySampling

acquisition = EntropyAcquisition()
strategy = UncertaintySampling(acquisition, batch_size=100)

# Returns indices of the top 100 most uncertain samples
indices = strategy.query(model, unlabeled_data)
```
Diversity Sampling#
Balance uncertainty with diversity to avoid redundant samples:
```python
from incerto.active import EntropyAcquisition, DiversitySampling

acquisition = EntropyAcquisition()
strategy = DiversitySampling(
    acquisition_fn=acquisition,
    batch_size=100,
    diversity_weight=0.5,  # balance uncertainty and diversity
)
indices = strategy.query(model, unlabeled_data)
```
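The exact scoring rule inside `DiversitySampling` is not documented here, but one plausible greedy scheme (an assumption, for intuition only) blends each candidate's uncertainty with its normalized distance to the samples already picked:

```python
import numpy as np

def diverse_greedy(uncertainty, features, batch_size, w=0.5):
    # uncertainty: (N,) acquisition scores in [0, 1]; features: (N, D)
    picked = [int(np.argmax(uncertainty))]  # seed with the most uncertain
    min_dist = np.linalg.norm(features - features[picked[0]], axis=1)
    for _ in range(batch_size - 1):
        # Blend uncertainty with distance to the current batch
        score = (1 - w) * uncertainty + w * min_dist / (min_dist.max() + 1e-12)
        score[picked] = -np.inf  # never repick
        idx = int(np.argmax(score))
        picked.append(idx)
        min_dist = np.minimum(
            min_dist, np.linalg.norm(features - features[idx], axis=1)
        )
    return picked

rng = np.random.default_rng(0)
features = rng.normal(size=(50, 4))
uncertainty = rng.random(50)
picked = diverse_greedy(uncertainty, features, batch_size=10)
print(picked)
```

With `w=0` this reduces to plain top-k uncertainty sampling; with `w=1` it ignores uncertainty and only spreads out in feature space.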
CoreSet Selection#
Select samples that best cover the feature space:
```python
from incerto.active import CoreSetSelection

strategy = CoreSetSelection(batch_size=100)

# Requires features (can extract from model)
indices = strategy.query(
    features_unlabeled,
    features_labeled,
)
```
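Coreset selection is commonly implemented as the greedy k-center heuristic: repeatedly pick the unlabeled point farthest from everything already labeled or selected. A self-contained NumPy sketch of that heuristic (not incerto's implementation):

```python
import numpy as np

def greedy_k_center(unlabeled, labeled, k):
    # Distance from each unlabeled point to its nearest labeled point
    dists = np.linalg.norm(
        unlabeled[:, None, :] - labeled[None, :, :], axis=-1
    ).min(axis=1)
    chosen = []
    for _ in range(k):
        idx = int(np.argmax(dists))  # biggest coverage gap
        chosen.append(idx)
        # The chosen point becomes a center; update nearest distances
        new_d = np.linalg.norm(unlabeled - unlabeled[idx], axis=1)
        dists = np.minimum(dists, new_d)
    return chosen

rng = np.random.default_rng(0)
features_unlabeled = rng.normal(size=(200, 8))
features_labeled = rng.normal(size=(20, 8))
picked = greedy_k_center(features_unlabeled, features_labeled, k=5)
print(picked)
```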
BADGE Sampling#
Diverse gradients for batch selection:
```python
from incerto.active import BadgeSampling

strategy = BadgeSampling(batch_size=100)
indices = strategy.query(model, unlabeled_data)
```
Reference: Ash et al., “Deep Batch Active Learning by Diverse, Uncertain Gradient Lower Bounds” (ICLR 2020)
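BADGE scores each sample by the gradient its pseudo-labeled loss would induce at the last linear layer (the magnitude captures uncertainty), then picks a diverse batch via k-means++ seeding on those gradient embeddings. A simplified NumPy sketch of the idea; the helper names are hypothetical, not incerto internals:

```python
import numpy as np

def badge_embeddings(penultimate, probs):
    # Gradient of cross-entropy wrt the last linear layer, using the
    # model's own argmax prediction as the pseudo-label.
    # penultimate: (N, D) features; probs: (N, C) softmax outputs.
    pseudo = probs.argmax(axis=1)
    one_hot = np.eye(probs.shape[1])[pseudo]
    grad = (probs - one_hot)[:, :, None] * penultimate[:, None, :]  # (N, C, D)
    return grad.reshape(len(probs), -1)

def kmeanspp_indices(X, k, seed=0):
    # k-means++ seeding: sample proportionally to squared distance from
    # the chosen set, yielding batches that are large-gradient AND diverse
    rng = np.random.default_rng(seed)
    chosen = [int(rng.integers(len(X)))]
    d2 = np.sum((X - X[chosen[0]]) ** 2, axis=1)
    for _ in range(k - 1):
        idx = int(rng.choice(len(X), p=d2 / d2.sum()))
        chosen.append(idx)
        d2 = np.minimum(d2, np.sum((X - X[idx]) ** 2, axis=1))
    return chosen

rng = np.random.default_rng(0)
feats = rng.normal(size=(100, 16))
logits = rng.normal(size=(100, 5))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
emb = badge_embeddings(feats, probs)   # (100, 80) gradient embeddings
picked = kmeanspp_indices(emb, k=10)
print(picked)
```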
Complete Active Learning Loop#
```python
import torch

from incerto.active import EntropyAcquisition, UncertaintySampling

# Manual loop example; assumes initial_labeled_indices, train_model,
# evaluate, and test_loader are defined elsewhere
labeled_indices = initial_labeled_indices
unlabeled_indices = torch.arange(len(dataset))
unlabeled_indices = unlabeled_indices[~torch.isin(unlabeled_indices, labeled_indices)]

acquisition = EntropyAcquisition()
strategy = UncertaintySampling(acquisition, batch_size=100)

for round_num in range(10):
    # Train model on labeled data
    model = train_model(dataset, labeled_indices)

    # Get unlabeled samples
    unlabeled_data = dataset[unlabeled_indices][0]  # features only

    # Query most informative samples
    query_local_indices = strategy.query(model, unlabeled_data)

    # Map back to global indices
    query_global_indices = unlabeled_indices[query_local_indices]

    # Update labeled/unlabeled sets
    labeled_indices = torch.cat([labeled_indices, query_global_indices])
    unlabeled_mask = ~torch.isin(unlabeled_indices, query_global_indices)
    unlabeled_indices = unlabeled_indices[unlabeled_mask]

    # Evaluate
    accuracy = evaluate(model, test_loader)
    print(f"Round {round_num + 1}: {len(labeled_indices)} labeled, accuracy={accuracy:.2%}")
```
You can also use the utility function:
```python
from incerto.active import active_learning_loop

# Run active learning with custom training function
results = active_learning_loop(
    model=model,
    train_loader=train_loader,
    unlabeled_loader=unlabeled_loader,
    acquisition=EntropyAcquisition(),
    train_fn=train_fn,
    n_rounds=10,
    samples_per_round=100,
)
```
Practical Tips#
- Batch mode: query multiple samples at once for efficiency

```python
# Configure batch size in strategy
strategy = UncertaintySampling(acquisition, batch_size=100)
```
- Diversity: use DiversitySampling to avoid querying similar samples

```python
from incerto.active import DiversitySampling, EntropyAcquisition

strategy = DiversitySampling(
    acquisition_fn=EntropyAcquisition(),
    batch_size=100,
    diversity_weight=0.5,  # 0.5 = equal weight to uncertainty and diversity
)
```
- Cold start: begin with random or stratified sampling

```python
# Initial random sample
initial_size = 100
initial_indices = torch.randperm(len(dataset))[:initial_size]
```
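If labels (or class metadata) are available for a seed pool, a stratified cold start avoids missing rare classes entirely. `stratified_initial` below is a hypothetical helper, not part of incerto:

```python
import torch

def stratified_initial(labels, per_class):
    # Sample an equal number of seed indices from each class
    indices = []
    for c in labels.unique():
        pool = torch.nonzero(labels == c, as_tuple=True)[0]
        indices.append(pool[torch.randperm(len(pool))[:per_class]])
    return torch.cat(indices)

labels = torch.tensor([0, 0, 1, 1, 1, 2, 2, 2, 2, 0])
seed = stratified_initial(labels, per_class=2)
print(seed)  # two indices from each of the three classes
```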
- Stopping criteria: stop when performance plateaus or the budget is exhausted

```python
if accuracy > target_accuracy:
    print("Target accuracy reached!")
    break
if len(labeled_indices) >= max_budget:
    print("Budget exhausted!")
    break
```
Evaluation#
- Learning curve: plot accuracy vs. number of labeled samples

```python
import matplotlib.pyplot as plt

plt.plot(n_labeled_samples, accuracies, label='Active')
plt.plot(n_labeled_samples, random_accuracies, label='Random')
plt.xlabel('Number of Labeled Samples')
plt.ylabel('Test Accuracy')
plt.legend()
```
- Area Under Learning Curve (AULC): higher is better
- Reduction ratio: how much labeling is saved to reach a target accuracy
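Both metrics can be computed straight from the learning-curve arrays. The accuracies below are illustrative placeholders, not measured results:

```python
import numpy as np

n_labeled = np.array([100, 200, 300, 400, 500])
active_acc = np.array([0.62, 0.74, 0.80, 0.84, 0.86])  # made-up numbers
random_acc = np.array([0.60, 0.68, 0.73, 0.77, 0.80])

def aulc(y, x):
    # Trapezoidal area under the learning curve, normalized by the x-range
    return float(np.sum((y[1:] + y[:-1]) / 2 * np.diff(x)) / (x[-1] - x[0]))

aulc_active = aulc(active_acc, n_labeled)
aulc_random = aulc(random_acc, n_labeled)

# Reduction ratio: labels needed to first reach a target accuracy
target = 0.80
n_active = n_labeled[np.argmax(active_acc >= target)]
n_random = n_labeled[np.argmax(random_acc >= target)]
print(f"AULC: active={aulc_active:.3f}, random={aulc_random:.3f}")
print(f"labels to reach {target:.0%}: active={n_active}, random={n_random}")
```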
Best Practices#
- Start with uncertainty sampling: a simple, effective baseline
- Use batch queries: query 50-100 samples at a time for efficiency
- Consider diversity: prevent querying redundant samples
- Retrain frequently: the model needs to adapt to new labels
- Use Bayesian methods when possible: BALD often outperforms simple uncertainty
- Compare to a random baseline: always benchmark against random sampling
Common Pitfalls#
- Querying only the hardest samples: can lead to noisy/outlier labels
- Not using diversity: queries may be redundant
- Infrequent retraining: the model doesn't benefit from new labels
- Wrong initial set: the cold start matters; use stratified sampling
Advanced Topics#
- Query by committee: use ensemble disagreement

```python
from incerto.active import QueryByCommittee

# Committee of models
committee = [model1, model2, model3]
strategy = QueryByCommittee(committee, batch_size=100)
indices = strategy.query(unlabeled_data, model=committee[0])
```
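A standard disagreement measure for query-by-committee is vote entropy over the members' hard predictions. This NumPy sketch is independent of the `QueryByCommittee` API:

```python
import numpy as np

def vote_entropy(votes, n_classes):
    # votes: (M, N) hard predictions from M committee members for N samples.
    # High entropy of the vote distribution = strong disagreement.
    M, N = votes.shape
    scores = np.empty(N)
    for i in range(N):
        p = np.bincount(votes[:, i], minlength=n_classes) / M
        nz = p[p > 0]
        scores[i] = -np.sum(nz * np.log(nz))
    return scores

# Three members vote on four samples
votes = np.array([
    [0, 1, 2, 0],
    [0, 1, 0, 1],
    [0, 1, 1, 2],
])
scores = vote_entropy(votes, n_classes=3)
print(scores)  # samples 0-1: unanimous (0.0); samples 2-3: maximal disagreement
```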
References#
Settles, “Active Learning Literature Survey” (2009)
Houlsby et al., “Bayesian Active Learning for Classification and Preference Learning” (arXiv 2011)
Gal et al., “Deep Bayesian Active Learning with Image Data” (ICML 2017)
Ash et al., “Deep Batch Active Learning by Diverse, Uncertain Gradient Lower Bounds” (ICLR 2020)
See Also#
Active Learning - Complete API reference
Bayesian Deep Learning Guide - Bayesian uncertainty for BALD
Selective Prediction Guide - Selective prediction