Worked example: Inspecting an HTR model#
In this example, we will see how Stringalign can be used to inspect the performance of a handwritten text recognition (HTR) model (Sprakbanken/TrOCR-norhand-v3) on the test set of the Teklia/NorHand-v3 dataset.
This analysis will focus on word accuracies, as that’s where we’ve found the most interesting results for this model, but you can use the exact same strategy to investigate character errors as well.
Imports and loading the data#
[1]:
from collections import Counter
import pandas as pd
import stringalign
from IPython.display import HTML, display
from ipywidgets import interact
from stringalign.evaluate import MultiAlignmentAnalyzer
[2]:
dataset = pd.read_json("data/transcription.json")
Create the multi-alignment analyzer#
First, we create the multi-alignment analyzer. Notice that we also include metadata this time. This makes it easy to cross-reference the alignment analyzer instances with the rows in the dataset.
[3]:
multi_analyzer = MultiAlignmentAnalyzer.from_strings(
    references=dataset["reference"],
    predictions=dataset["predicted"],
    metadata=[{"img_path": row.img, "index": row.Index} for row in dataset.itertuples()],
    tokenizer=stringalign.tokenize.UnicodeWordTokenizer(),
)
Compute WER#
The first thing we do is compute the token error rate, which in this case equates to the WER.
[4]:
print(f"The WER is: {multi_analyzer.compute_ter():.2%}")
The WER is: 15.05%
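To make the metric concrete, here is a minimal sketch, independent of Stringalign and using made-up strings, of how a word-level error rate can be computed as an edit distance over word lists (substitutions + deletions + insertions, divided by the number of reference words):

```python
def word_edit_distance(reference: list[str], prediction: list[str]) -> int:
    """Minimum number of word-level substitutions, deletions and insertions
    needed to turn the prediction into the reference (Levenshtein distance)."""
    prev = list(range(len(prediction) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, pred_word in enumerate(prediction, start=1):
            cost = 0 if ref_word == pred_word else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


reference = "jeg har et svar".split()
prediction = "jej har svar".split()
# One substitution ('jeg' -> 'jej') and one deletion ('et') over four reference words
wer = word_edit_distance(reference, prediction) / len(reference)
print(f"WER: {wer:.2%}")  # → WER: 50.00%
```

Stringalign computes this for us across the whole dataset; the sketch above only illustrates the definition behind the 15.05% figure.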
Find most common edits#
After computing the WER, it can be useful to inspect the most common edits, i.e. the model’s most frequent errors.
[5]:
for edit, count in multi_analyzer.edit_counts["raw"].most_common(5):
    print(f"Edit {edit} occurred {count} times.")
Edit REPLACED 'jej' -> 'jeg' occurred 7 times.
Edit REPLACED 'Ivar' -> 'Svar' occurred 7 times.
Edit REPLACED 'eg' -> 'og' occurred 7 times.
Edit REPLACED 'De' -> 'de' occurred 4 times.
Edit DELETED '1' occurred 3 times.
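The idea behind this tally can be sketched without Stringalign: each aligned line yields a sequence of edit operations, and counting them across all lines is a job for `collections.Counter`. The tuples below are made-up stand-ins for the library’s edit objects:

```python
from collections import Counter

# Hypothetical edit operations per line, standing in for what the alignment
# analyzer extracts: (operation, reference token, predicted token)
edits_per_line = [
    [("REPLACED", "jej", "jeg"), ("DELETED", "1", None)],
    [("REPLACED", "jej", "jeg")],
    [("REPLACED", "Ivar", "Svar")],
]

# Tally every edit across all lines, mirroring edit_counts["raw"] above
edit_counts = Counter(edit for line in edits_per_line for edit in line)
for edit, count in edit_counts.most_common(2):
    print(edit, count)
```

This is only a conceptual sketch; in the notebook, `multi_analyzer.edit_counts["raw"]` already holds such a `Counter` for the whole dataset.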
Accounting for multiple possible optimal alignments#
As discussed in Sequence alignment, the optimal alignment is not unique, so we can double-check the stability of these counts by randomising the selected optimal alignments and computing average counts.
Ideally, we would compute the standard deviation of the counts as well, but we’ll skip that here. See e.g. Uncertainty metrics for token-specific statistics for an example where we compute the standard deviation for edit operation counts.
[6]:
num_random_samples = 10
multi_analyzers = [
    MultiAlignmentAnalyzer.from_strings(
        references=dataset["reference"],
        predictions=dataset["predicted"],
        metadata=[{"img_path": row.img, "index": row.Index} for row in dataset.itertuples()],
        tokenizer=stringalign.tokenize.UnicodeWordTokenizer(),
        randomize_alignment=True,
        random_state=i,
    )
    for i in range(num_random_samples)
]
edit_counts = sum((ma.edit_counts["raw"] for ma in multi_analyzers), start=Counter())
for edit, count in edit_counts.most_common(5):
    print(f"Edit {edit} occurred on average {count / num_random_samples} times.")
Edit REPLACED 'Ivar' -> 'Svar' occurred on average 7.0 times.
Edit REPLACED 'eg' -> 'og' occurred on average 7.0 times.
Edit REPLACED 'jej' -> 'jeg' occurred on average 6.3 times.
Edit REPLACED 'De' -> 'de' occurred on average 4.0 times.
Edit DELETED 'er' occurred on average 3.1 times.
We see that 'Ivar' -> 'Svar' and 'eg' -> 'og' are still the most common replacements, while 'jej' -> 'jeg' has slightly fewer occurrences.
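If you do want the standard deviation that we skipped above, it can be computed per edit from one `Counter` per random sample. The sketch below uses synthetic counters in place of `[ma.edit_counts["raw"] for ma in multi_analyzers]`, so the edit labels and numbers are illustrative only:

```python
import statistics
from collections import Counter

# Synthetic stand-ins for the per-sample edit counters from the randomized analyzers
per_sample_counts = [
    Counter({"jej -> jeg": 6, "eg -> og": 7}),
    Counter({"jej -> jeg": 7, "eg -> og": 7}),
    Counter({"jej -> jeg": 6, "eg -> og": 7}),
]

# Union of all edits seen in any sample; missing edits count as zero
all_edits = set().union(*per_sample_counts)
for edit in sorted(all_edits):
    counts = [sample[edit] for sample in per_sample_counts]
    print(f"{edit}: mean={statistics.mean(counts):.1f}, std={statistics.stdev(counts):.2f}")
```

A small standard deviation indicates that an edit count is stable across the randomised optimal alignments, while a large one means the count depends on which optimal alignment happened to be picked.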
Visualising the most common edits#
Let’s now use Stringalign’s visualisation tools to look at the three most common edits. We’ll do that with an interactive plot with the edit operations in sorted order.
[7]:
for edit, _count in edit_counts.most_common(3):
    display(HTML(f"<h3>{edit}</h3>"))
    for aa in multi_analyzer.alignment_operator_index["raw"][edit]:
        image = stringalign.visualize.create_html_image(f"data/images/{aa.metadata['img_path']}")
        # Each token is a word without spaces, so we add a space between each alignment operator
        alignment_html = aa.visualize(space_alignment_ops=True)
        display(HTML(image + alignment_html))
REPLACED 'Ivar' -> 'Svar'
REPLACED 'eg' -> 'og'
REPLACED 'jej' -> 'jeg'
When we inspect, for example, the errors where 'Ivar' is transcribed as 'Svar', we see that the 'I' very often looks like an 'S'. Moreover, “Ivar” is a name, while “svar” is the Norwegian word for “reply”, which could indicate that the model struggles with names.
We also see that while the 'eg' -> 'og' replacement can make visual sense in some cases, the handwriting often resembles 'eg' more than 'og'. Furthermore, both 'eg' and 'jej' are synonymous with 'jeg', but 'eg' is Nynorsk and “jej” is a non-standard spelling. Thus, the model may struggle with Norwegian written varieties other than Bokmål.
Inspecting the lines with the highest WER#
Another thing that can be very useful is inspecting the lines with the highest error rate to see if there is any pattern.
[8]:
# Look at the five lines with the highest WER
alignment_analyzers = sorted(multi_analyzer.alignment_analyzers, key=lambda aa: aa.compute_ter(), reverse=True)
for aa in alignment_analyzers[:5]:
    image = stringalign.visualize.create_html_image(f"data/images/{aa.metadata['img_path']}")
    alignment_html = aa.visualize(space_alignment_ops=True)
    display(HTML(image + alignment_html))
In this case, we see that most of the lines with the highest error rate consist of printed text, so if we want to improve the model, including printed text in the training data could be a useful path to consider.
Conclusion#
We have seen that Stringalign can be useful for finding what errors an automatic transcription model makes, which can help us judge whether a given model is suited for a task and how we could improve its performance.