Toy emoji OCR example

This example demonstrates how Stringalign accurately computes evaluation metrics even for complex inputs like emojis, where other tools may return misleading results by default.

The default behaviour of Jiwer, for example, is not to tokenize based on grapheme clusters, so if we compute the CER for strings containing ZWJ emoji sequences, we can get surprising results:

import jiwer
import stringalign

jiwer_cer = jiwer.cer("🐈‍⬛", "🐦‍⬛")
stringalign_cer, _analyzer = stringalign.evaluate.compute_cer("🐈‍⬛", "🐦‍⬛")
print("Jiwer:", jiwer_cer)
print("Stringalign:", stringalign_cer)
Jiwer: 0.3333333333333333
Stringalign: 1.0

We see that Jiwer reports a CER of only 1/3, even though 100% of the characters are wrong. This artificially low error rate arises because Jiwer tokenizes (and therefore aligns) based on code points, so 🐈‍⬛ and 🐦‍⬛ are treated as 🐈[ZWJ]⬛ and 🐦[ZWJ]⬛. Stringalign, on the other hand, tokenizes based on grapheme clusters, so 🐈‍⬛ and 🐦‍⬛ are correctly treated as two emojis rather than six code points. (See Grapheme clusters for an introduction to grapheme clusters.)
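To see the difference concretely, we can inspect how Python itself splits such a ZWJ sequence. This uses only the standard library, independently of either tool:

```python
# The black cat emoji is a ZWJ sequence: CAT + ZERO WIDTH JOINER + BLACK LARGE SQUARE.
black_cat = "\U0001F408\u200D\u2B1B"  # 🐈‍⬛

# Iterating over a Python string yields code points, not grapheme clusters,
# so the single visible emoji splits into three elements.
print(list(black_cat))
print(len(black_cat))  # 3 -- this is what a code-point-based tokenizer sees
```

A tokenizer that aligns these three code points one by one will count the invisible joiner and the shared black square as "correct", which is exactly how the 1/3 CER above arises.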

Let's see how we can use Stringalign to accurately calculate the CER for a synthetic dataset of toy OCR transcriptions containing emojis.

import io
import json
from pathlib import Path

import PIL.Image
import stringalign


def load_image(path: Path | str) -> PIL.Image.Image:
    """Load an image from the dataset directory."""
    path = data_path / path
    return PIL.Image.open(io.BytesIO(path.read_bytes()))


data_path = Path("emoji_ocr_evaluation_data")
dataset = json.loads((data_path / "lines.json").read_text())

Look at one sample

print(f"Gold standard:\n{dataset['samples'][0]['gold_standard']}\n")
print(f"Transcription:\n{dataset['samples'][0]['transcription']}\n")
load_image(dataset["samples"][0]["image"])
Gold standard:
๐Ÿช„๐Ÿˆโ€โฌ›๐ŸŽƒ
๐Ÿปโ€โ„๏ธ๐ŸŽฟ๐Ÿฅถ

Transcription:
๐Ÿช„๐Ÿฆโ€โฌ›๐ŸŽƒ
๐Ÿปโ›ธ๏ธ๐Ÿฅถ


<PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=400x314 at 0x7F331BE6C6E0>

Evaluate transcriptions

references = [sample["gold_standard"] for sample in dataset["samples"]]
predictions = [sample["transcription"] for sample in dataset["samples"]]

tokenizer = stringalign.tokenize.GraphemeClusterTokenizer()  # This is the default, but it's still nice to be explicit
evaluator = stringalign.evaluate.MultiAlignmentAnalyzer.from_strings(
    references=references, predictions=predictions, tokenizer=tokenizer
)

cer = evaluator.confusion_matrix.compute_token_error_rate()
print(f"The overall CER is {cer}")
The overall CER is 0.10619469026548672
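The token error rate reported here follows the usual edit-distance definition: the minimum number of insertions, deletions, and substitutions, divided by the number of reference tokens. As a rough illustration of that formula, here is a minimal code-point-level sketch; it is not Stringalign's implementation, which tokenizes into grapheme clusters before aligning:

```python
def levenshtein(ref: str, pred: str) -> int:
    """Minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(pred) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, p in enumerate(pred, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a reference token
                curr[j - 1] + 1,         # insert a predicted token
                prev[j - 1] + (r != p),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def naive_cer(ref: str, pred: str) -> float:
    """Edit distance normalized by the reference length."""
    return levenshtein(ref, pred) / len(ref)


print(naive_cer("kitten", "sitting"))  # 3 edits / 6 reference characters = 0.5
```

Swapping the character-level tokens for grapheme clusters is precisely what makes Stringalign's numbers robust to ZWJ sequences.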

Look at the performance on the individual lines

for analyzer in evaluator.alignment_analyzers:
    sample_cer = analyzer.confusion_matrix.compute_token_error_rate()
    jiwer_cer = jiwer.cer(analyzer.reference, analyzer.predicted)

    print(f"Reference:\n{analyzer.reference}\n")
    print(f"Predicted:\n{analyzer.predicted}\n")
    print(f"CER: {sample_cer:3.2%}, Jiwer CER: {jiwer_cer:3.2%}\n\n")
Reference:
๐Ÿช„๐Ÿˆโ€โฌ›๐ŸŽƒ
๐Ÿปโ€โ„๏ธ๐ŸŽฟ๐Ÿฅถ

Predicted:
๐Ÿช„๐Ÿฆโ€โฌ›๐ŸŽƒ
๐Ÿปโ›ธ๏ธ๐Ÿฅถ

CER: 42.86%, Jiwer CER: 33.33%


Reference:
Message
Lorem Ipsum
Hope you feel better soon!❤️‍🩹

Predicted:
Massage
Lorem lpsum
Hope you feel better soon!❤️

CER: 6.38%, Jiwer CER: 8.00%


Reference:
What a great idea😑

Predicted:
What a grea t idea🙂

CER: 11.11%, Jiwer CER: 11.11%


Reference:
Happy pride month! 🏳️‍🌈🌈🎉

Predicted:
Happy pride moth! 🏳️‍🌈🌈🎉

CER: 4.55%, Jiwer CER: 4.00%


Reference:
That was
close! 😮‍💨

Predicted:
That was
close!

CER: 11.76%, Jiwer CER: 21.05%


Reference:
1๐Ÿปโ€โ„๏ธ

Predicted:
๐Ÿปโ€โ„๏ธ

CER: 50.00%, Jiwer CER: 20.00%
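The last sample shows the discrepancy most clearly. The reference "1🐻‍❄️" is two grapheme clusters (the digit and the polar bear ZWJ sequence) but five code points, so deleting the digit costs 1/2 = 50% at the grapheme level and only 1/5 = 20% at the code-point level. We can verify the code-point count with the standard library alone:

```python
# "1🐻‍❄️" is DIGIT ONE + BEAR + ZWJ + SNOWFLAKE + VARIATION SELECTOR-16.
reference = "1\U0001F43B\u200D\u2744\uFE0F"

print(len(reference))      # 5 code points, but only 2 grapheme clusters
print(1 / len(reference))  # one deletion at the code-point level: 0.2
```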

We see that Stringalign computes the CER correctly, even for text containing emojis.

Total running time of the script: (0 minutes 0.010 seconds)
