Evaluating token-specific performance

Evaluating token-specific performance#

Sometimes, you might be interested in seeing how the model performs on a specific subset of characters. Stringalign provides a couple of useful utilities for this.

import stringalign

references = [
    "Snekkermester Thor Bjørklund fra Øvre Smestad i Fåberg patenterte Ostehøvelen i 1925.",
    "Snøen smeltet i vårsola.",
    "Ved å blande blått og gult kan du få grønt.",
    "Det var et ærlig forsøk",
]
predictions = [
    "Snekkermester Thor Bjorklund fra Ovre Smestad i Faberg patenterte Ostehovelen i 1925.",
    "Snoen smeltet i varsola",
    "Ved a blande blatt og gult kan du fa grønt.",
    "Det var et aerlig forsok",
]

tokenizer = stringalign.tokenize.GraphemeClusterTokenizer()
analyzer = stringalign.evaluate.MultiAlignmentAnalyzer.from_strings(references, predictions, tokenizer)
cm = analyzer.confusion_matrix

cer = cm.compute_token_error_rate()
print(f"The CER is {cer}")
The CER is 0.07428571428571429

Next, we can compute token-specific statistics to see how the model performs on specific characters:

sensitivity = cm.compute_sensitivity()
precision = cm.compute_precision()
f1_scores = cm.compute_f1_score()

for character in "æøå":
    print(f"Statistics for {character}:")
    print(f"Sensitivity: {sensitivity[character]}")
    print(f"Precision:   {precision[character]}")
    print(f"F1 score:    {f1_scores[character]}")
    print()
Statistics for æ:
Sensitivity: 0.0
Precision:   nan
F1 score:    0.0

Statistics for ø:
Sensitivity: 0.2
Precision:   1.0
F1 score:    0.33333333333333337

Statistics for å:
Sensitivity: 0.0
Precision:   nan
F1 score:    0.0

We see that the precision can be nan, this happens for all tokens not in the predicted string, as the precision is defined by the number of times a given token was correctly identified (true_positives) divided by the number of times the token was predicted (true_positives + false_positives). If a token never occurs in the predicted string, then this quantity is ill-defined, and becomes nan.

Aggregating the summary statistics#

We can also aggregate the number of true positives, false positives and false negatives for multiple tokens to get an overall measure for Norwegian special characters

overall_sensitivity = cm.compute_sensitivity(aggregate_over="æøå")
overall_precision = cm.compute_precision(aggregate_over="æøå")
overall_f1 = cm.compute_f1_score(aggregate_over="æøå")

print(f"The overall sensitivity for æ, ø and å is: {overall_sensitivity}")
print(f"The overall precision for æ, ø and å is:   {overall_precision}")
print(f"The overall F1 score for æ, ø and å is:    {overall_f1}")
The overall sensitivity for æ, ø and å is: 0.09090909090909091
The overall precision for æ, ø and å is:   1.0
The overall F1 score for æ, ø and å is:    0.16666666666666669

Total running time of the script: (0 minutes 0.006 seconds)

Gallery generated by Sphinx-Gallery