Toy emoji OCR example

This example demonstrates how Stringalign accurately computes evaluation metrics even for complex inputs like emojis, where other tools may return misleading results by default.

The default behaviour of Jiwer, for example, is not to tokenize based on grapheme clusters, so if we compute the CER for strings containing ZWJ emoji sequences, we can get surprising results:

import jiwer
import stringalign

jiwer_cer = jiwer.cer("🐈‍⬛", "🐦‍⬛")
stringalign_cer, _analyzer = stringalign.evaluate.compute_cer("🐈‍⬛", "🐦‍⬛")
print("Jiwer:", jiwer_cer)
print("Stringalign:", stringalign_cer)
Jiwer: 0.3333333333333333
Stringalign: 1.0

We see that Jiwer reports a CER of only 1/3, even though 100% of the characters are wrong. This artificially low error rate arises because Jiwer tokenizes (and therefore aligns) based on code points, so 🐈‍⬛ and 🐦‍⬛ are treated as 🐈[ZWJ]⬛ and 🐦[ZWJ]⬛. Stringalign, on the other hand, tokenizes based on grapheme clusters, so 🐈‍⬛ and 🐦‍⬛ are correctly treated as two emojis rather than six code points. (See Grapheme clusters for an introduction to grapheme clusters.)
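To see the difference concretely, we can inspect how Python itself splits such a ZWJ sequence. This uses only the standard library, independently of either tool:

```python
# The black cat emoji is a ZWJ sequence: CAT + ZERO WIDTH JOINER + BLACK LARGE SQUARE.
black_cat = "\U0001F408\u200D\u2B1B"  # 🐈‍⬛

# Iterating over a Python string yields code points, not grapheme clusters,
# so the single visible emoji splits into three elements.
print(list(black_cat))
print(len(black_cat))  # 3 -- this is what a code-point-based tokenizer sees
```

A tokenizer that aligns these three code points one by one will count the invisible joiner and the shared black square as "correct", which is exactly how the 1/3 CER above arises.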

Let's see how we can use Stringalign to accurately calculate the CER for a synthetic dataset of toy OCR transcriptions containing emojis.

import io
import json
from pathlib import Path

import PIL.Image
import stringalign


def load_image(path: Path | str) -> PIL.Image.Image:
    """Load an image from the dataset directory."""
    path = data_path / path
    return PIL.Image.open(io.BytesIO(path.read_bytes()))


data_path = Path("emoji_ocr_evaluation_data")
dataset = json.loads((data_path / "lines.json").read_text())

Look at one sample

print(f"Gold standard:\n{dataset['samples'][0]['gold_standard']}\n")
print(f"Transcription:\n{dataset['samples'][0]['transcription']}\n")
load_image(dataset["samples"][0]["image"])
Gold standard:
๐Ÿช„๐Ÿˆโ€โฌ›๐ŸŽƒ
๐Ÿปโ€โ„๏ธ๐ŸŽฟ๐Ÿฅถ

Transcription:
๐Ÿช„๐Ÿฆโ€โฌ›๐ŸŽƒ
๐Ÿปโ›ธ๏ธ๐Ÿฅถ


<PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=400x314 at 0x7F331BE6C6E0>

Evaluate transcriptions

references = [sample["gold_standard"] for sample in dataset["samples"]]
predictions = [sample["transcription"] for sample in dataset["samples"]]

tokenizer = stringalign.tokenize.GraphemeClusterTokenizer()  # This is the default, but it's still nice to be explicit
evaluator = stringalign.evaluate.MultiAlignmentAnalyzer.from_strings(
    references=references, predictions=predictions, tokenizer=tokenizer
)

cer = evaluator.confusion_matrix.compute_token_error_rate()
print(f"The overall CER is {cer}")
The overall CER is 0.10619469026548672
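The token error rate reported here follows the usual edit-distance definition: the minimum number of insertions, deletions, and substitutions, divided by the number of reference tokens. As a rough illustration of that formula, here is a minimal code-point-level sketch; it is not Stringalign's implementation, which tokenizes into grapheme clusters before aligning:

```python
def levenshtein(ref: str, pred: str) -> int:
    """Minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(pred) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, p in enumerate(pred, start=1):
            curr.append(min(
                prev[j] + 1,             # delete a reference token
                curr[j - 1] + 1,         # insert a predicted token
                prev[j - 1] + (r != p),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def naive_cer(ref: str, pred: str) -> float:
    """Edit distance normalized by the reference length."""
    return levenshtein(ref, pred) / len(ref)


print(naive_cer("kitten", "sitting"))  # 3 edits / 6 reference characters = 0.5
```

Swapping the character-level tokens for grapheme clusters is precisely what makes Stringalign's numbers robust to ZWJ sequences.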

Look at the performance on the individual lines

for analyzer in evaluator.alignment_analyzers:
    sample_cer = analyzer.confusion_matrix.compute_token_error_rate()
    jiwer_cer = jiwer.cer(analyzer.reference, analyzer.predicted)

    print(f"Reference:\n{analyzer.reference}\n")
    print(f"Predicted:\n{analyzer.predicted}\n")
    print(f"CER: {sample_cer:3.2%}, Jiwer CER: {jiwer_cer:3.2%}\n\n")
Reference:
๐Ÿช„๐Ÿˆโ€โฌ›๐ŸŽƒ
๐Ÿปโ€โ„๏ธ๐ŸŽฟ๐Ÿฅถ

Predicted:
๐Ÿช„๐Ÿฆโ€โฌ›๐ŸŽƒ
๐Ÿปโ›ธ๏ธ๐Ÿฅถ

CER: 42.86%, Jiwer CER: 33.33%


Reference:
Message
Lorem Ipsum
Hope you feel better soon!❤️‍🩹

Predicted:
Massage
Lorem lpsum
Hope you feel better soon!❤️

CER: 6.38%, Jiwer CER: 8.00%


Reference:
What a great idea😑

Predicted:
What a grea t idea🙂

CER: 11.11%, Jiwer CER: 11.11%


Reference:
Happy pride month! 🏳️‍🌈🌈🎉

Predicted:
Happy pride moth! 🏳️‍🌈🌈🎉

CER: 4.55%, Jiwer CER: 4.00%


Reference:
That was
close! 😮‍💨

Predicted:
That was
close!

CER: 11.76%, Jiwer CER: 21.05%


Reference:
1๐Ÿปโ€โ„๏ธ

Predicted:
๐Ÿปโ€โ„๏ธ

CER: 50.00%, Jiwer CER: 20.00%
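The last sample shows the discrepancy most clearly. The reference "1🐻‍❄️" is two grapheme clusters (the digit and the polar bear ZWJ sequence) but five code points, so deleting the digit costs 1/2 = 50% at the grapheme level and only 1/5 = 20% at the code-point level. We can verify the code-point count with the standard library alone:

```python
# "1🐻‍❄️" is DIGIT ONE + BEAR + ZWJ + SNOWFLAKE + VARIATION SELECTOR-16.
reference = "1\U0001F43B\u200D\u2744\uFE0F"

print(len(reference))      # 5 code points, but only 2 grapheme clusters
print(1 / len(reference))  # one deletion at the code-point level: 0.2
```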

We see that Stringalign computes the CER correctly, even for text containing emojis.

Total running time of the script: (0 minutes 0.010 seconds)
