incerto.llm.f1_score_tokens

incerto.llm.f1_score_tokens(pred_tokens, true_tokens, mask=None)[source]

Compute precision, recall, and F1 at token level.

This treats token prediction as a retrieval problem where:

  • True Positive: correct token at a valid (masked-in) position

  • False Positive: wrong token at a valid position

  • False Negative: true token at a masked-out position (not predicted)

When the mask covers all positions (the default), FN=0 and recall=1.0, so F1 reduces to 2*precision/(1+precision). In that case, consider using token_level_accuracy() instead.

Parameters:
  • pred_tokens (Tensor) – Predicted token IDs (batch, seq_len)

  • true_tokens (Tensor) – True token IDs (batch, seq_len)

  • mask (Optional[Tensor]) – Optional mask for valid positions. Positions where mask=0 contribute to false negatives (tokens that should have been predicted but weren’t).

Return type:

dict

Returns:

Dictionary with precision, recall, F1, and the true-positive, false-positive, and false-negative token counts
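
The semantics above can be sketched as follows. This is a minimal reimplementation based on this docstring, not the library source; the exact return keys are an assumption, and torch is assumed from the Tensor parameter types:

```python
import torch

def f1_score_tokens_sketch(pred_tokens, true_tokens, mask=None):
    """Sketch of the documented semantics (hypothetical, not incerto's source).

    TP: correct token at a masked-in position.
    FP: wrong token at a masked-in position.
    FN: every masked-out position (should have been predicted but wasn't).
    """
    if mask is None:
        # Default: all positions are valid, so FN = 0 and recall = 1.0.
        mask = torch.ones_like(true_tokens, dtype=torch.bool)
    mask = mask.bool()
    correct = pred_tokens == true_tokens
    tp = (correct & mask).sum().item()
    fp = (~correct & mask).sum().item()
    fn = (~mask).sum().item()
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1,
            "tp": tp, "fp": fp, "fn": fn}

pred = torch.tensor([[1, 2, 3, 4]])
true = torch.tensor([[1, 2, 0, 4]])
out = f1_score_tokens_sketch(pred, true)
# With a full mask, recall is 1.0 and f1 equals 2*precision/(1+precision):
# precision = 3/4, f1 = 2*(3/4)/(1 + 3/4) = 6/7
```

With three of four tokens correct under a full mask, this illustrates the note above: recall is forced to 1.0 and F1 is determined entirely by precision.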