Worked example: Inspecting an HTR model#

In this example, we will see how Stringalign can be used to inspect the performance of a handwritten text recognition (HTR) model (Sprakbanken/TrOCR-norhand-v3) on the test set of the Teklia/NorHand-v3 dataset.

This analysis will focus on word accuracies, as that's where we've found the most interesting results for this model, but you can use the exact same strategy to investigate character errors as well.

Imports and loading the data#

[1]:
from collections import Counter

import pandas as pd
import stringalign
from IPython.display import HTML, display
from ipywidgets import interact
from stringalign.evaluate import MultiAlignmentAnalyzer
[2]:
dataset = pd.read_json("data/transcription.json")

Create the multi-alignment analyzer#

First, we create the multi-alignment analyzer. Notice that we also include metadata this time. This makes it easy to cross reference between the alignment analyzer instances and the rows in the dataset.
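To see why this is convenient, here is a minimal, self-contained sketch (with made-up data) of how a metadata dict built this way lets you jump from an analyzer back to the corresponding dataset row:

```python
import pandas as pd

# Toy stand-in for the dataset and for the analyzers' metadata dicts
# (all values here are made up for illustration).
dataset = pd.DataFrame({
    "img": ["line_01.jpg", "line_02.jpg"],
    "reference": ["Herr Ivar Aasen", "som eg gjerne"],
    "predicted": ["Herr Svar Aasen", "som og gjerne"],
})
metadata = [{"img_path": row.img, "index": row.Index} for row in dataset.itertuples()]

# Cross-reference: from an analyzer's metadata back to the original dataset row
row = dataset.loc[metadata[1]["index"]]
print(row["img"], "->", row["predicted"])
```

Because each metadata dict stores the DataFrame index, any alignment analyzer can be traced back to its line image and transcription with a single `loc` lookup.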

[3]:
multi_analyzer = MultiAlignmentAnalyzer.from_strings(
    references=dataset["reference"],
    predictions=dataset["predicted"],
    metadata=[{"img_path": row.img, "index": row.Index} for row in dataset.itertuples()],
    tokenizer=stringalign.tokenize.UnicodeWordTokenizer(),
)

Compute WER#

The first thing we do is compute the token error rate, which in this case equates to the word error rate (WER).

[4]:
print(f"The WER is: {multi_analyzer.compute_ter():.2%}")
The WER is: 15.05%
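For intuition, the token error rate is the total number of word-level edit operations (substitutions, insertions, and deletions) divided by the total number of reference words. Here is a minimal, Stringalign-independent sketch of that computation (function names are illustrative, not part of the library):

```python
def word_edit_distance(reference: str, prediction: str) -> int:
    """Levenshtein distance between two lines, computed over whitespace-separated words."""
    ref, pred = reference.split(), prediction.split()
    dist = list(range(len(pred) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        prev, dist[0] = dist[0], i
        for j, p in enumerate(pred, start=1):
            prev, dist[j] = dist[j], min(
                prev + (r != p),  # substitution (free if the words match)
                dist[j] + 1,      # deletion of a reference word
                dist[j - 1] + 1,  # insertion of a predicted word
            )
    return dist[-1]


def wer(references, predictions):
    """Micro-averaged word error rate: total edits divided by total reference words."""
    errors = sum(word_edit_distance(r, p) for r, p in zip(references, predictions))
    return errors / sum(len(r.split()) for r in references)
```

Note that this is a micro-average over the whole corpus, not a mean of per-line error rates, which matches how a single corpus-level WER is usually reported.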

Find most common edits#

After computing the WER, it can be useful to see the most common edits, i.e. the model's most frequent mistakes.

[5]:
for edit, count in multi_analyzer.edit_counts["raw"].most_common(5):
    print(f"Edit {edit} occurred {count} times.")
Edit REPLACED 'jej' -> 'jeg' occurred 7 times.
Edit REPLACED 'Ivar' -> 'Svar' occurred 7 times.
Edit REPLACED 'eg' -> 'og' occurred 7 times.
Edit REPLACED 'De' -> 'de' occurred 4 times.
Edit DELETED  '1' occurred 3 times.

Accounting for multiple possible optimal alignments#

As discussed in Sequence alignment, the optimal alignment is not unique, so we can double-check the stability of these counts by randomising the selected optimal alignments and computing average counts.

In reality, it would be nice to compute the standard deviation of the counts as well, but we’ll skip that here. See e.g. Uncertainty metrics for token-specific statistics for an example where we compute the standard deviation for edit operation counts.
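To see why the optimal alignment is not unique, here is a small, Stringalign-independent sketch that enumerates every minimum-cost edit script between two token lists. For a pair like `["jeg", "ser"]` vs `["ser", "jeg"]`, three different scripts all achieve the optimal cost of 2, yet they contain different operations, which is exactly why the edit counts above can vary with the chosen alignment:

```python
def optimal_edit_scripts(ref, pred):
    """Enumerate every minimum-cost edit script between two token lists."""
    n, m = len(ref), len(pred)
    # Standard Levenshtein table over tokens
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = ref[i - 1] != pred[j - 1]
            dist[i][j] = min(dist[i - 1][j - 1] + cost, dist[i - 1][j] + 1, dist[i][j - 1] + 1)

    def walk(i, j):
        """Backtrack along every optimal path, yielding the non-match operations in order."""
        if i == 0 and j == 0:
            yield ()
            return
        if i and j and dist[i][j] == dist[i - 1][j - 1] + (ref[i - 1] != pred[j - 1]):
            op = () if ref[i - 1] == pred[j - 1] else (f"replace {ref[i - 1]}->{pred[j - 1]}",)
            for path in walk(i - 1, j - 1):
                yield path + op
        if i and dist[i][j] == dist[i - 1][j] + 1:
            for path in walk(i - 1, j):
                yield path + (f"delete {ref[i - 1]}",)
        if j and dist[i][j] == dist[i][j - 1] + 1:
            for path in walk(i, j - 1):
                yield path + (f"insert {pred[j - 1]}",)

    return sorted(set(walk(n, m)))


# Three distinct optimal scripts, each with two operations but different operation types
for script in optimal_edit_scripts(["jeg", "ser"], ["ser", "jeg"]):
    print(script)
```

The total cost is always the same, but whether that cost is spent on two replacements or on an insert/delete pair depends on which optimal path is selected, which is what the randomisation below averages over.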

[6]:
num_random_samples = 10
multi_analyzers = [
    MultiAlignmentAnalyzer.from_strings(
        references=dataset["reference"],
        predictions=dataset["predicted"],
        metadata=[{"img_path": row.img, "index": row.Index} for row in dataset.itertuples()],
        tokenizer=stringalign.tokenize.UnicodeWordTokenizer(),
        randomize_alignment=True,
        random_state=i,
    )
    for i in range(num_random_samples)
]
edit_counts = sum((ma.edit_counts["raw"] for ma in multi_analyzers), start=Counter())

for edit, count in edit_counts.most_common(5):
    print(f"Edit {edit} occurred on average {count / num_random_samples} times.")
Edit REPLACED 'Ivar' -> 'Svar' occurred on average 7.0 times.
Edit REPLACED 'eg' -> 'og' occurred on average 7.0 times.
Edit REPLACED 'jej' -> 'jeg' occurred on average 6.3 times.
Edit REPLACED 'De' -> 'de' occurred on average 4.0 times.
Edit DELETED  'er' occurred on average 3.1 times.

We see that 'Ivar' -> 'Svar' and 'eg' -> 'og' are still the most common replacements, while 'jej' -> 'jeg' has slightly fewer occurrences.
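If we did want the standard deviation mentioned above, it can be computed directly from the per-sample counters with the standard library. A minimal sketch with made-up counts:

```python
from collections import Counter
from statistics import mean, stdev

# Hypothetical edit counts from three randomised alignment runs (made-up numbers)
samples = [
    Counter({"'eg' -> 'og'": 7, "'De' -> 'de'": 4}),
    Counter({"'eg' -> 'og'": 7, "'De' -> 'de'": 3}),
    Counter({"'eg' -> 'og'": 6, "'De' -> 'de'": 5}),
]

# Collect every edit seen in any run; Counter lookups default to 0 for missing edits
for edit in sorted(set().union(*samples)):
    counts = [sample[edit] for sample in samples]
    print(f"{edit}: mean={mean(counts):.1f}, std={stdev(counts):.2f}")
```

A large standard deviation for an edit would indicate that its count depends heavily on which optimal alignment was chosen, so it should be interpreted with more caution.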

Visualising the most common edits#

Let’s now use Stringalign’s visualisation tools to look at the three most common edits. For each edit, we display the line image together with the word-level alignment, with the edit operations in sorted order.

[7]:
for edit, _count in edit_counts.most_common(3):
    display(HTML(f"<h3>{edit}</h3>"))
    for aa in multi_analyzer.alignment_operator_index["raw"][edit]:
        image = stringalign.visualize.create_html_image(f"data/images/{aa.metadata['img_path']}")

        # Each token is a word without spaces, so we add a space between each alignment operator
        alignment_html = aa.visualize(space_alignment_ops=True)
        display(HTML(image + alignment_html))

REPLACED 'Ivar' -> 'Svar'

Reference:Predicted:
Hr Hr
Ivar Svar
Aasen Aasen
Reference:Predicted:
Herr Herr
Ivar Svar
Aasen Aasen
Reference:Predicted:
Herr Herr
filol fuld
Ivar Svar
Aasen Aasen
Reference:Predicted:
Ivar Svar
Aasen Aasen
Reference:Predicted:
Hr Hr
Ivar Svar
Aasen stasen
Reference:Predicted:
Ivar Svar
Aasen Aasen
Reference:Predicted:
Hr Hr
Ivar Svar
Aasen Aasen

REPLACED 'eg' -> 'og'

Reference:Predicted:
Det Det
var var
ein sin
liten liten
ting ting
eg og
kom kom
til til
aa aa
tenkje hurtige
paa paa
Reference:Predicted:
og og
som som
eg og
gjerne gjerne
vilde vilde
nemne henvæ
for for
for for
Reference:Predicted:
De Da
har har
fått fitt
ein in
ansvarsfull ansvarsfall
post part
paa
Noregs Norge
største støgete
talarstol tilestol
eg og
vil vil
Reference:Predicted:
mi nu
nokra noksa
Kronur Kronur
so so
vart vart
eg og
Reference:Predicted:
heil heil
brote brate
bøker bake
kvart kraft
aar over
som dem
eg og
bryr bryt
Reference:Predicted:
sård sért
nu
ber led
eg og
um min
tilgjeving tilgjæring
Og Og
eg jeg
vilde vilde
vere verre
takksam takksom
um nu
De de
Reference:Predicted:
vilde vilde
take tale
meg meg
ordlyda ordbyd
av av
det det
eg og
skreiv skreiv

REPLACED 'jej' -> 'jeg'

Reference:Predicted:
rede rede
tidligere tidligere
erklæret erklæret
Tønsberg Tønsberg
at at
jej jeg
under under
ingen ingen
om om
Reference:Predicted:
Hvad Hvad
derimod derimod
angår angår
det det
andet andet
vil
vilkår kår
da da
har har
jej jeg
alle alle
Reference:Predicted:
Efter Efter
aftale aftale
har har
jej jeg
været været
hos hos
konsul konsul
Tønsberg Tønsberg
for for
at at
Reference:Predicted:
hvorpå hvorpå
jej jeg
erklærede erklærede
ham ham
at at
isåfald isåfald
måtte måtte
sagen sagen
betrag betrag
Reference:Predicted:
gjøre gjøre
det det
fornødne fornødne
med med
Finne Tinne
og og
2
at sat
jej jeg
skulde skulde
overta overta
Reference:Predicted:
ste ste
vilkår vilkår
angår angår
da da
går går
Finnes Finnes
forlangende forlangende
såvidt såvidt
jej jeg
Reference:Predicted:
rer rer
O O
Bakke Bakke
der der
også også
efter efter
hvad hvad
jej jeg
har har
hørt hørt
dertil dertil
skal skal

When we inspect the errors where 'Ivar' is transcribed as 'Svar', for example, we see that the 'I' very often looks like an 'S'. Moreover, “Ivar” is a name, while “svar” is the Norwegian word for “reply”, which could indicate that the model struggles with names.

We also see that while the 'eg' -> 'og' replacement can make visual sense in some cases, the handwriting often resembles 'eg' more than 'og'. Furthermore, both 'eg' and 'jej' correspond to 'jeg' (“I”): 'eg' is Nynorsk, while “jej” is a non-standard spelling. Thus, the model may struggle with Norwegian written languages other than Bokmål.

Inspecting the lines with the highest WER#

Another thing that can be very useful is inspecting the lines with the highest error rate to see if there is any pattern.

[8]:
# Look at the five lines with the highest WER
alignment_analyzers = sorted(multi_analyzer.alignment_analyzers, key=lambda aa: aa.compute_ter(), reverse=True)

for aa in alignment_analyzers[:5]:
    image = stringalign.visualize.create_html_image(f"data/images/{aa.metadata['img_path']}")
    alignment_html = aa.visualize(space_alignment_ops=True)
    display(HTML(image + alignment_html))
Reference:Predicted:
TEATAere
R
b
THEATERCHEFEN EN
Reference:Predicted:
K
TJESeJSK
AKTIESELSKABET sRET
Reference:Predicted:
NORSES
FJSKERe
BJØR
NORGES Ke
FISKERISTYRELSE BLE
Reference:Predicted:
FORSø
FORSØGSSTATIONEN GSTATIONALONEN
Reference:Predicted:
21.7
21.7.1882 1882

In this case, we see that most of the lines with the highest error rate consist of printed text, so if we want to improve the model, including printed text in the training data could be a useful path to consider.

Conclusion#

We see that Stringalign can be useful for finding what errors an automatic transcription model makes, which can help us judge whether a given model is suited for a task and how we could improve its performance.