Resolving confusables and ligatures with custom lists#

Some historical documents contain ligatures and symbols that are not part of Unicode. To represent them, several projects (e.g. MUFI) assign code points in Unicode’s Private Use Area. Different datasets may also follow different annotation guidelines, e.g. regarding how to transcribe archaic ligatures. Many of these transcription differences can be reconciled by applying a task-specific list of confusables before the strings are compared.

import stringalign
from stringalign.evaluate import AlignmentAnalyzer

Input data#

We examine a line from the IMPACT dataset [PPCA13], taken from the sample with PRIMA ID 00046895. We compare it with the corresponding line predicted by a Tesseract model trained on the GT4Hist dataset [SRDB18] [1]. This particular example was originally used in [NBG+21].

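# The reference encodes the ligatures ſſ, ſi and ch as private-use code points
# (U+EBA6, U+EBA2 and U+F502); they are written as escapes so they stay visible.
reference = "eingeri\ueba6en / \ueba2\uf502 viel guter Leu⸗"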
predicted = "eingeriſſan/ ſich viel guter Leü⸗"
print(f"Reference: {reference}")
print(f"Predicted: {predicted}")

Reference: eingerien /  viel guter Leu⸗
Predicted: eingeriſſan/ ſich viel guter Leü⸗

Compute the CER without resolving confusables#

tokenizer_default = stringalign.tokenize.GraphemeClusterTokenizer()

alignment_analyzer_default = AlignmentAnalyzer.from_strings(reference, predicted, tokenizer=tokenizer_default)

cer_default = alignment_analyzer_default.compute_ter()
print(f"The character error rate (without resolving confusables) is {cer_default:.2f}")

The character error rate (without resolving confusables) is 0.29
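To see where this number comes from, the following is a minimal plain-Python sketch that does not use stringalign. It assumes that the character error rate is the Levenshtein distance divided by the number of reference characters, which is consistent with the value printed above. It counts code points rather than grapheme clusters, which makes no difference for these two strings. The helper `levenshtein` is a hypothetical name introduced only for this illustration.

```python
def levenshtein(ref, hyp):
    """Minimum number of insertions, deletions and substitutions turning ref into hyp."""
    previous = list(range(len(hyp) + 1))
    for i, ref_char in enumerate(ref, start=1):
        current = [i]
        for j, hyp_char in enumerate(hyp, start=1):
            current.append(
                min(
                    previous[j] + 1,  # delete ref_char
                    current[j - 1] + 1,  # insert hyp_char
                    previous[j - 1] + (ref_char != hyp_char),  # match or substitute
                )
            )
        previous = current
    return previous[-1]


# Private-use ligatures written as escapes: U+EBA6 (ſſ), U+EBA2 (ſi), U+F502 (ch).
reference = "eingeri\ueba6en / \ueba2\uf502 viel guter Leu⸗"
# \u00fc is the precomposed ü, so it counts as a single code point.
predicted = "eingeriſſan/ ſich viel guter Le\u00fc⸗"

distance = levenshtein(reference, predicted)
cer = distance / len(reference)
print(f"{distance} edits / {len(reference)} reference characters = {cer:.2f}")
# 9 edits / 31 reference characters = 0.29
```

Eight of the nine edits fall in the stretch that contains the three ligatures and the slash; the ninth is the u/ü substitution.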

Set up confusable mapping#

The OCR evaluation tool Dinglehopper has a list of confusables that it resolves by default. We have copied that list (with comments) from Dinglehopper’s source code [2].

confusable_map = {
    "\ueba6": "ſſ",
    "\ueba7": "ſſi",  # MUFI: LATIN SMALL LIGATURE LONG S LONG S I
    "\uf502": "ch",
    "": "ck",
    "": "ll",
    "\ueba2": "ſi",
    "": "ſt",
    "ﬁ": "fi",
    "ﬀ": "ff",
    "ﬂ": "fl",
    "ﬃ": "ffi",
    "": "ct",
    "": "tz",  # MUFI: LATIN SMALL LIGATURE TZ
    "\uf532": "as",  # eMOP: Latin small ligature as
    "\uf533": "is",  # eMOP: Latin small ligature is
    "\uf534": "us",  # eMOP: Latin small ligature us
    "\uf535": "Qu",  # eMOP: Latin ligature capital Q small u
    "ĳ": "ij",  # U+0133 LATIN SMALL LIGATURE IJ
    "\ue8bf": "q&",
    # MUFI: LATIN SMALL LETTER Q LIGATED WITH FINAL ET
    # XXX How to replace this correctly?
    "\ueba5": "ſp",  # MUFI: LATIN SMALL LIGATURE LONG S P
    "ﬆ": "st",  # U+FB06 LATIN SMALL LIGATURE ST
} | {
    "": "ü",
    "": "ä",
    "==": "–",  # → en-dash
    "—": "–",  # em-dash → en-dash
    "": "ö",
    "’": "'",
    "⸗": "-",
    "aͤ": "ä",  # LATIN SMALL LETTER A, COMBINING LATIN SMALL LETTER E
    "oͤ": "ö",  # LATIN SMALL LETTER O, COMBINING LATIN SMALL LETTER E
    "uͤ": "ü",  # LATIN SMALL LETTER U, COMBINING LATIN SMALL LETTER E
    "\uf50e": "q́",  # U+F50E LATIN SMALL LETTER Q WITH ACUTE ACCENT
}
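To make concrete what resolving confusables means, here is a minimal plain-Python sketch of the substitution step. It is not stringalign’s implementation: the function `resolve_confusables` and the dictionary `example_map` are hypothetical names, and the longest-key-first ordering is our own assumption about a sensible way to handle overlapping keys. The three private-use code points are the ones that appear in the tokenized reference shown later in this example.

```python
import re

# A small subset of the mapping, with private-use keys written as escapes.
example_map = {
    "\ueba6": "ſſ",
    "\ueba2": "ſi",
    "\uf502": "ch",
    "⸗": "-",
    "—": "–",
    "==": "–",
}


def resolve_confusables(text, mapping):
    """Replace every occurrence of a key in `mapping` with its value."""
    # Try longer keys first so that a multi-character key such as "=="
    # is preferred over any shorter key that happens to be its prefix.
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(key) for key in keys))
    return pattern.sub(lambda match: mapping[match.group(0)], text)


reference = "eingeri\ueba6en / \ueba2\uf502 viel guter Leu⸗"
print(resolve_confusables(reference, example_map))
# eingeriſſen / ſich viel guter Leu-
```

A single regular-expression pass replaces each position at most once, so a replacement value is never itself rewritten by another rule.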

Compute the CER while resolving confusables#

tokenizer_confusables = stringalign.tokenize.GraphemeClusterTokenizer(
    pre_tokenization_normalizer=stringalign.normalize.StringNormalizer(resolve_confusables=confusable_map)
)

alignment_analyzer_confusables = AlignmentAnalyzer.from_strings(reference, predicted, tokenizer=tokenizer_confusables)

cer_confusables = alignment_analyzer_confusables.compute_ter()
print(f"The character error rate (after resolving confusables) is {cer_confusables:.2f}")

The character error rate (after resolving confusables) is 0.09

Look at strings after resolving confusables#

print("Reference:")
print(f"without resolving confusables and tokenizing: {tokenizer_default(reference)}")
print(f"  after resolving confusables and tokenizing: {tokenizer_confusables(reference)}")

print("Predicted:")
print(f"without resolving confusables and tokenizing: {tokenizer_default(predicted)}")
print(f"  after resolving confusables and tokenizing: {tokenizer_confusables(predicted)}")

Reference:
without resolving confusables and tokenizing: ['e', 'i', 'n', 'g', 'e', 'r', 'i', '\ueba6', 'e', 'n', ' ', '/', ' ', '\ueba2', '\uf502', ' ', 'v', 'i', 'e', 'l', ' ', 'g', 'u', 't', 'e', 'r', ' ', 'L', 'e', 'u', '⸗']
  after resolving confusables and tokenizing: ['e', 'i', 'n', 'g', 'e', 'r', 'i', 'ſ', 'ſ', 'e', 'n', ' ', '/', ' ', 'ſ', 'i', 'c', 'h', ' ', 'v', 'i', 'e', 'l', ' ', 'g', 'u', 't', 'e', 'r', ' ', 'L', 'e', 'u', '-']
Predicted:
without resolving confusables and tokenizing: ['e', 'i', 'n', 'g', 'e', 'r', 'i', 'ſ', 'ſ', 'a', 'n', '/', ' ', 'ſ', 'i', 'c', 'h', ' ', 'v', 'i', 'e', 'l', ' ', 'g', 'u', 't', 'e', 'r', ' ', 'L', 'e', 'ü', '⸗']
  after resolving confusables and tokenizing: ['e', 'i', 'n', 'g', 'e', 'r', 'i', 'ſ', 'ſ', 'a', 'n', '/', ' ', 'ſ', 'i', 'c', 'h', ' ', 'v', 'i', 'e', 'l', ' ', 'g', 'u', 't', 'e', 'r', ' ', 'L', 'e', 'ü', '-']
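The differences that remain after resolving confusables can also be listed with the standard library. The sketch below uses `difflib.SequenceMatcher` on the two resolved strings from the output above; it is only an illustration, since `difflib` is not guaranteed to produce a minimal alignment and is not what stringalign uses internally.

```python
from difflib import SequenceMatcher

# The two strings as they look after resolving confusables.
reference_resolved = "eingeriſſen / ſich viel guter Leu-"
predicted_resolved = "eingeriſſan/ ſich viel guter Le\u00fc-"  # \u00fc is the precomposed ü

matcher = SequenceMatcher(None, reference_resolved, predicted_resolved, autojunk=False)

# Keep only the spans where the two strings disagree.
remaining = [
    (tag, reference_resolved[i1:i2], predicted_resolved[j1:j2])
    for tag, i1, i2, j1, j2 in matcher.get_opcodes()
    if tag != "equal"
]

for tag, ref_part, pred_part in remaining:
    print(f"{tag:8} {ref_part!r} -> {pred_part!r}")
# replace  'e' -> 'a'
# delete   ' ' -> ''
# replace  'u' -> 'ü'
```

Three single-character differences over 34 reference characters give 3/34 ≈ 0.09, which agrees with the error rate computed after resolving confusables.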

Footnotes
