Worked example: Inspecting an HTR model#
In this example, we will see how Stringalign can be used to inspect the performance of a handwritten text recognition (HTR) model (Sprakbanken/TrOCR-norhand-v3) on the test set of the Teklia/NorHand-v3 dataset.
This analysis will focus on word accuracies, as that’s where we’ve found the most interesting results for this model, but you can use the exact same strategy to investigate character errors as well.
Imports and loading the data#
[1]:
from collections import Counter
import pandas as pd
import stringalign
from IPython.display import HTML, display
from ipywidgets import interact
from stringalign.evaluate import MultiAlignmentAnalyzer
[2]:
dataset = pd.read_json("data/transcription.json")
Create the multi-alignment analyzer#
First, we create the multi-alignment analyzer. Notice that we also include metadata this time. This makes it easy to cross-reference the alignment analyzer instances with the rows in the dataset.
[3]:
multi_analyzer = MultiAlignmentAnalyzer.from_strings(
    references=dataset["reference"],
    predictions=dataset["predicted"],
    metadata=[{"img_path": row.img, "index": row.Index} for row in dataset.itertuples()],
    tokenizer=stringalign.tokenize.UnicodeWordTokenizer(),
)
Compute WER#
The first thing we do is compute the token error rate, which in this case equates to the WER.
[4]:
print(f"The WER is: {multi_analyzer.compute_ter():.2%}")
The WER is: 15.05%
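To make the metric concrete, here is a minimal sketch, independent of Stringalign and using made-up strings, of how a word-level error rate can be computed as an edit distance over word lists (substitutions + deletions + insertions, divided by the number of reference words):

```python
def word_edit_distance(reference: list[str], prediction: list[str]) -> int:
    """Minimum number of word-level substitutions, deletions and insertions
    needed to turn the prediction into the reference (Levenshtein distance)."""
    prev = list(range(len(prediction) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, pred_word in enumerate(prediction, start=1):
            cost = 0 if ref_word == pred_word else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


reference = "jeg har et svar".split()
prediction = "jej har svar".split()
# One substitution ('jeg' -> 'jej') and one deletion ('et') over four reference words
wer = word_edit_distance(reference, prediction) / len(reference)
print(f"WER: {wer:.2%}")  # → WER: 50.00%
```

Stringalign computes this for us across the whole dataset; the sketch above only illustrates the definition behind the 15.05% figure.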
Find most common edits#
After computing the WER, it can be useful to inspect the most common edits, i.e. the model’s most frequent errors.
[5]:
for edit, count in multi_analyzer.edit_counts["raw"].most_common(5):
    print(f"Edit {edit} occurred {count} times.")
Edit REPLACED 'jej' -> 'jeg' occurred 7 times.
Edit REPLACED 'Ivar' -> 'Svar' occurred 7 times.
Edit REPLACED 'eg' -> 'og' occurred 7 times.
Edit REPLACED 'De' -> 'de' occurred 4 times.
Edit DELETED '1' occurred 3 times.
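The idea behind this tally can be sketched without Stringalign: each aligned line yields a sequence of edit operations, and counting them across all lines is a job for `collections.Counter`. The tuples below are made-up stand-ins for the library’s edit objects:

```python
from collections import Counter

# Hypothetical edit operations per line, standing in for what the alignment
# analyzer extracts: (operation, reference token, predicted token)
edits_per_line = [
    [("REPLACED", "jej", "jeg"), ("DELETED", "1", None)],
    [("REPLACED", "jej", "jeg")],
    [("REPLACED", "Ivar", "Svar")],
]

# Tally every edit across all lines, mirroring edit_counts["raw"] above
edit_counts = Counter(edit for line in edits_per_line for edit in line)
for edit, count in edit_counts.most_common(2):
    print(edit, count)
```

This is only a conceptual sketch; in the notebook, `multi_analyzer.edit_counts["raw"]` already holds such a `Counter` for the whole dataset.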
Accounting for multiple possible optimal alignments#
As discussed in Sequence alignment, the optimal alignment is not unique, so we can double-check the stability of these counts by randomising the selected optimal alignments and computing average counts.
Ideally, we would compute the standard deviation of the counts as well, but we’ll skip that here. See e.g. Uncertainty metrics for token-specific statistics for an example where we compute the standard deviation for edit operation counts.
[6]:
num_random_samples = 10
multi_analyzers = [
    MultiAlignmentAnalyzer.from_strings(
        references=dataset["reference"],
        predictions=dataset["predicted"],
        metadata=[{"img_path": row.img, "index": row.Index} for row in dataset.itertuples()],
        tokenizer=stringalign.tokenize.UnicodeWordTokenizer(),
        randomize_alignment=True,
        random_state=i,
    )
    for i in range(num_random_samples)
]
edit_counts = sum((ma.edit_counts["raw"] for ma in multi_analyzers), start=Counter())
for edit, count in edit_counts.most_common(5):
    print(f"Edit {edit} occurred on average {count / num_random_samples} times.")
Edit REPLACED 'Ivar' -> 'Svar' occurred on average 7.0 times.
Edit REPLACED 'eg' -> 'og' occurred on average 7.0 times.
Edit REPLACED 'jej' -> 'jeg' occurred on average 6.3 times.
Edit REPLACED 'De' -> 'de' occurred on average 4.0 times.
Edit DELETED 'er' occurred on average 3.1 times.
We see that 'Ivar' -> 'Svar' and 'eg' -> 'og' are still the most common replacements, while 'jej' -> 'jeg' has slightly fewer occurrences.
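If you do want the standard deviation that we skipped above, it can be computed per edit from one `Counter` per random sample. The sketch below uses synthetic counters in place of `[ma.edit_counts["raw"] for ma in multi_analyzers]`, so the edit labels and numbers are illustrative only:

```python
import statistics
from collections import Counter

# Synthetic stand-ins for the per-sample edit counters from the randomized analyzers
per_sample_counts = [
    Counter({"jej -> jeg": 6, "eg -> og": 7}),
    Counter({"jej -> jeg": 7, "eg -> og": 7}),
    Counter({"jej -> jeg": 6, "eg -> og": 7}),
]

# Union of all edits seen in any sample; missing edits count as zero
all_edits = set().union(*per_sample_counts)
for edit in sorted(all_edits):
    counts = [sample[edit] for sample in per_sample_counts]
    print(f"{edit}: mean={statistics.mean(counts):.1f}, std={statistics.stdev(counts):.2f}")
```

A small standard deviation indicates that an edit count is stable across the randomised optimal alignments, while a large one means the count depends on which optimal alignment happened to be picked.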
Visualising the most common edits#
Let’s now use Stringalign’s visualisation tools to look at the three most common edits. We’ll do that with an interactive plot with the edit operations in sorted order.
[7]:
for edit, _count in edit_counts.most_common(3):
    display(HTML(f"<h3>{edit}</h3>"))
    for aa in multi_analyzer.alignment_operator_index["raw"][edit]:
        image = stringalign.visualize.create_html_image(f"data/images/{aa.metadata['img_path']}")
        # Each token is a word without spaces, so we add a space between each alignment operator
        alignment_html = aa.visualize(space_alignment_ops=True)
        display(HTML(image + alignment_html))
REPLACED 'Ivar' -> 'Svar'
REPLACED 'eg' -> 'og'
REPLACED 'jej' -> 'jeg'
When we inspect, for example, the errors where 'Ivar' is transcribed as 'Svar', we see that the 'I' very often looks like an 'S'. Moreover, “Ivar” is a name, while “svar” is the Norwegian word for “reply”, which could indicate that the model struggles with names.
We also see that while the 'eg' -> 'og' replacement can make visual sense in some cases, the handwriting often resembles 'eg' more than 'og'. Furthermore, both 'eg' and 'jej' are synonymous with 'jeg', but 'eg' is Nynorsk and “jej” is a non-standard spelling. Thus, the model may struggle with Norwegian written varieties other than Bokmål.
Inspecting the lines with the highest WER#
Another thing that can be very useful is inspecting the lines with the highest error rate to see if there is any pattern.
[8]:
# Look at the five lines with the highest WER
alignment_analyzers = sorted(multi_analyzer.alignment_analyzers, key=lambda aa: aa.compute_ter(), reverse=True)
for aa in alignment_analyzers[:5]:
    image = stringalign.visualize.create_html_image(f"data/images/{aa.metadata['img_path']}")
    alignment_html = aa.visualize(space_alignment_ops=True)
    display(HTML(image + alignment_html))
In this case, we see that most of the lines with the highest error rate consist of printed text, so if we want to improve the model, including printed text in the training data could be a useful path to consider.
Conclusion#
We have seen that Stringalign can be useful for finding what errors an automatic transcription model makes, which can help us judge whether a given model is suited for a task and how we could improve its performance.