Exploring common errors in a transcription

It is often useful to look not only at the error rate but also at the types of errors a model makes. Patterns in the errors can, for example, suggest how to improve the model's performance. This example shows how to use Stringalign to quickly get an overview of the 10 most common errors in a transcription. We start by loading some toy data based on an excerpt from a digitized copy of Lærebog i de forskjellige Grene af Huusholdningen by Hanna Winsnes (1846).

from pathlib import Path

import stringalign
from stringalign.visualize import HtmlString, create_html_image

data_path = Path("synthetic_transcription_data")
predictions_path = data_path / "predicted.txt"
reference_path = data_path / "reference.txt"

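# Read as UTF-8 explicitly, since the text contains characters such as æ and ø.
predictions = predictions_path.read_text(encoding="utf-8").splitlines()
references = reference_path.read_text(encoding="utf-8").splitlines()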
image_paths = data_path.glob("line*.jpg")
metadata = [{"image_path": data_path / f"line{line_num:02d}.jpg"} for line_num in range(14)]

After loading the data, we can create a stringalign.evaluate.MultiAlignmentAnalyzer which aligns all reference/prediction-pairs and makes it easy for us to explore common transcription errors.

analyzer = stringalign.evaluate.MultiAlignmentAnalyzer.from_strings(
    predictions=predictions,
    references=references,
    metadata=metadata,
)

We start by looking at the raw edit counts, which correspond to single-token alignment operations.

counts = analyzer.edit_counts["raw"]
for operation, count in counts.most_common(10):
    print(f"{operation:21}: {count}")

REPLACED 's' -> 'f'  : 10
REPLACED 'k' -> 't'  : 8
REPLACED 'd' -> 'b'  : 6
INSERTED 'a'         : 3
REPLACED 'æ' -> 'e'  : 3
INSERTED 'r'         : 3
REPLACED 'm' -> 'n'  : 3
REPLACED 'ø' -> 'o'  : 3
DELETED  'n'         : 2
REPLACED 'n' -> 'm'  : 2

We see that the most common transcription error is transcribing s as f. We also see three æ -> e replacements and three insertions of a. It can also be useful to consider combined edit operation counts, as that allows us to find common multi-token replacements, insertions and deletions.

counts = analyzer.edit_counts["combined"]
for operation, count in counts.most_common(10):
    print(f"{operation:21}: {count}")

REPLACED 's' -> 'f'  : 8
REPLACED 'd' -> 'b'  : 6
REPLACED 'æ' -> 'ae' : 3
REPLACED 'm' -> 'rn' : 3
REPLACED 'ø' -> 'o'  : 3
REPLACED 'nn' -> 'm' : 2
REPLACED 'kk' -> 'tt': 2
REPLACED 'sk' -> 'ft': 2
REPLACED 'Q' -> 'D'  : 2
REPLACED 'y' -> 'n'  : 1

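When we inspect the combined edit operation counts, we see that there are some common multi-token replacements, such as æ -> ae, m -> rn and nn -> m. These also help explain the raw counts: the three æ -> e replacements and three insertions of a appear to be the single-token parts of the three æ -> ae replacements.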
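To make the difference between raw and combined counts concrete, here is a minimal sketch that uses Python's standard-library difflib instead of Stringalign. It is not Stringalign's alignment algorithm and may pair characters differently. The helper name edit_operations and the tuple format of the operations are assumptions made for illustration only.

```python
from collections import Counter
from difflib import SequenceMatcher


def edit_operations(reference: str, predicted: str) -> tuple[Counter, Counter]:
    """Count raw (single-character) and combined (whole-block) edits.

    Hypothetical helper for illustration; not part of Stringalign.
    """
    raw, combined = Counter(), Counter()
    matcher = SequenceMatcher(a=reference, b=predicted, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        ref_part, pred_part = reference[i1:i2], predicted[j1:j2]
        # Combined: each differing block counts as one operation.
        combined[(tag, ref_part, pred_part)] += 1
        # Raw: pair characters one-to-one from the left, then treat the
        # leftover characters as single deletions or insertions.
        paired = min(len(ref_part), len(pred_part))
        for ref_char, pred_char in zip(ref_part, pred_part):
            raw[("replace", ref_char, pred_char)] += 1
        for ref_char in ref_part[paired:]:
            raw[("delete", ref_char, "")] += 1
        for pred_char in pred_part[paired:]:
            raw[("insert", "", pred_char)] += 1
    return raw, combined


raw, combined = edit_operations("samme", "famrne")
print(combined.most_common())
print(raw.most_common())
```

In this sketch the m -> rn confusion is a single combined operation but splits into one replacement and one insertion in the raw counts, which is why common multi-character confusions are easier to spot in the combined view.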

Visualising the alignments

It can also be useful to visualise the data to get a better understanding of why the model makes the mistakes it does. In particular, we can use the MultiAlignmentAnalyzer.alignment_operator_index attribute to look up, for a given edit operation, the AlignmentAnalyzer instances whose transcriptions contain that edit, and then iterate over them.

most_common_error = counts.most_common(1)[0][0]

# We create a long HTML string to display the visualisation in Sphinx.
# If you're using Jupyter, then you can use ``IPython.display.display`` in each iteration instead.
table_html = ""
for line_analyzer in analyzer.alignment_operator_index["combined"][most_common_error]:
    image_html = create_html_image(line_analyzer.metadata["image_path"])
    alignment_html = line_analyzer.visualize(which="combined")
    table_html += image_html + alignment_html

HtmlString(table_html)

[Rendered HTML: each line image followed by its reference/predicted alignment, with differing segments highlighted.]

Reference: beleiligste; thi da kan man faae dem rene ved Skrubning,
Predicted: beleiligfte; thi da tan man faae dem rene ved Otrubning,

Reference: og slippe at skrælle dem, og om denne Gnidning ikke er saa
Predicted: og flippe at ftraelle dern, og om deme Gnidning itte er fan

Reference: at have en Potetesqværn at male dem paa, da det sparer
Predicted: at have en Potetesqvaern at male dem paa, ba bet fparer

Reference: om Aaret. Man faaer mere Meel af samme Qvantum
Predicted: om Uaret. Man faaer mere Meel af famrne Dvantum

Reference: Tid; men i en mindre Huusholdning lønner det sig ikke at
Predicted: Tid; men i en mindre Huusholdning lømer det fig itte at

Reference: an skaffe en Ting, der sjelden bliver brugt mere end en Gang
Predicted: an ftaffe en Ting, der fjelden bliver brugt mere end en Oang

Reference: man. De tørre meelagtige Potetes give meest Meel, og det
Predicted: man. De torre meelagtige Potetes give meeft Meel, og det

We see that the OCR model struggles particularly with the long s of the Fraktur typeface, which it transcribes as f, so a natural step to improve performance could be to train the model on more Fraktur text.
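The lookup from an edit operation to the lines that contain it can be pictured as an inverted index. The following is a minimal sketch of that idea using only the standard library. It does not reproduce how Stringalign builds alignment_operator_index, and the function name build_edit_index and the (reference segment, predicted segment) key format are assumptions made for illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def build_edit_index(references, predictions):
    """Map each (reference segment, predicted segment) edit to line numbers.

    Hypothetical helper for illustration; not part of Stringalign.
    """
    if len(references) != len(predictions):
        raise ValueError("references and predictions must have the same length")
    index = defaultdict(list)
    for line_number, (ref, pred) in enumerate(zip(references, predictions)):
        matcher = SequenceMatcher(a=ref, b=pred, autojunk=False)
        # Collect the distinct edits in this line, so that an edit occurring
        # several times in one line registers that line only once.
        edits = {
            (ref[i1:i2], pred[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"
        }
        for edit in edits:
            index[edit].append(line_number)
    return dict(index)


# Short excerpts taken from the transcriptions shown above.
references = ["det sig ikke", "der sjelden", "meest Meel"]
predictions = ["det fig itte", "der fjelden", "meeft Meel"]

index = build_edit_index(references, predictions)
print(index[("s", "f")])    # lines containing the s -> f confusion
print(index[("kk", "tt")])  # lines containing the kk -> tt confusion
```

With such an index, finding every line affected by a given confusion is a single dictionary lookup rather than a scan over all alignments.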
