Toy emoji OCR example
This example demonstrates how Stringalign accurately computes evaluation metrics even for complex inputs like emojis, where other tools may return misleading results by default.
Jiwer, for example, does not tokenize based on grapheme clusters by default, so if we compute the CER for strings containing ZWJ emoji sequences, we can get surprising results:
import io
import json
from pathlib import Path
import jiwer
import PIL.Image
import stringalign
jiwer_cer = jiwer.cer("🐈‍⬛", "🐦‍⬛")
stringalign_cer, _analyzer = stringalign.evaluate.compute_cer("🐈‍⬛", "🐦‍⬛")
print("Jiwer:", jiwer_cer)
print("Stringalign:", stringalign_cer)
Jiwer: 0.3333333333333333
Stringalign: 1.0
We see that Jiwer reports a CER of only 1/3, even though 100% of the visible characters are wrong. This artificially low error is caused by Jiwer tokenizing (and therefore aligning) based on code points, so 🐈‍⬛ and 🐦‍⬛ are treated as 🐈[ZWJ]⬛ and 🐦[ZWJ]⬛. Stringalign, on the other hand, tokenizes based on grapheme clusters, so 🐈‍⬛ and 🐦‍⬛ are correctly treated as two single emojis rather than six code points. (See Grapheme clusters for an introduction to grapheme clusters.)
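The difference is easy to see by inspecting the strings directly. The following is a minimal standard-library sketch of how this emoji pair looks at the code-point level:

```python
# The black-cat and black-bird emojis are ZWJ sequences: three code points
# that render as a single grapheme cluster.
black_cat = "\U0001F408\u200d\u2b1b"   # CAT + ZERO WIDTH JOINER + BLACK LARGE SQUARE
black_bird = "\U0001F426\u200d\u2b1b"  # BIRD + ZERO WIDTH JOINER + BLACK LARGE SQUARE

print(len(black_cat), len(black_bird))  # 3 3 -- three code points each

# Aligned per code point, only the first position differs (the ZWJ and the
# black square match), which is where the misleading 1/3 CER comes from.
matches = sum(a == b for a, b in zip(black_cat, black_bird))
print(matches)  # 2
```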
Let's see how we can use Stringalign to accurately compute the CER for a synthetic dataset of toy OCR transcriptions containing emojis.
data_path = Path("emoji_ocr_evaluation_data")

def load_image(path: Path | str) -> PIL.Image.Image:
    path = data_path / path
    return PIL.Image.open(io.BytesIO(path.read_bytes()))
dataset = json.loads((data_path / "lines.json").read_text())
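The contents of lines.json are not shown on this page, but judging from how the samples are accessed later, each entry contains at least a gold_standard and a transcription string. A hypothetical miniature of that structure might look like this:

```python
# Hypothetical miniature of the dataset structure; the field names
# "gold_standard" and "transcription" are taken from how the samples are
# used later, everything else here is illustrative.
example_dataset = {
    "samples": [
        {
            "gold_standard": "\U0001F408\u200d\u2b1b",  # black cat
            "transcription": "\U0001F426\u200d\u2b1b",  # misread as black bird
        },
    ]
}
print(example_dataset["samples"][0]["gold_standard"])
```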
Look at one sample
Gold standard:
๐ช🐈‍⬛๐
🐻‍❄️🎿🥶
Transcription:
๐ช🐦‍⬛๐
🐻⛸️🥶
<PIL.JpegImagePlugin.JpegImageFile image mode=RGB size=400x314 at 0x7F331BE6C6E0>
Evaluate transcriptions
references = [sample["gold_standard"] for sample in dataset["samples"]]
predictions = [sample["transcription"] for sample in dataset["samples"]]
tokenizer = stringalign.tokenize.GraphemeClusterTokenizer() # This is the default, but it's still nice to be explicit
evaluator = stringalign.evaluate.MultiAlignmentAnalyzer.from_strings(
    references=references, predictions=predictions, tokenizer=tokenizer
)
cer = evaluator.confusion_matrix.compute_token_error_rate()
print(f"The overall CER is {cer}")
The overall CER is 0.10619469026548672
Look at the performance for the different lines
for analyzer in evaluator.alignment_analyzers:
    sample_cer = analyzer.confusion_matrix.compute_token_error_rate()
    jiwer_cer = jiwer.cer(analyzer.reference, analyzer.predicted)
    print(f"Reference:\n{analyzer.reference}\n")
    print(f"Predicted:\n{analyzer.predicted}\n")
    print(f"CER: {sample_cer:3.2%}, Jiwer CER: {jiwer_cer:3.2%}\n\n")
Reference:
๐ช🐈‍⬛๐
🐻‍❄️🎿🥶
Predicted:
๐ช🐦‍⬛๐
🐻⛸️🥶
CER: 42.86%, Jiwer CER: 33.33%
Reference:
Message
Lorem Ipsum
Hope you feel better soon!❤️‍🩹
Predicted:
Massage
Lorem lpsum
Hope you feel better soon!❤️
CER: 6.38%, Jiwer CER: 8.00%
Reference:
What a great idea๐
Predicted:
What a grea t idea๐
CER: 11.11%, Jiwer CER: 11.11%
Reference:
Happy pride month! 🏳️‍🌈๐๐
Predicted:
Happy pride moth! 🏳️‍🌈๐๐
CER: 4.55%, Jiwer CER: 4.00%
Reference:
That was
close! 😮‍💨
Predicted:
That was
close!
CER: 11.76%, Jiwer CER: 21.05%
Reference:
1🐻‍❄️
Predicted:
🐻‍❄️
CER: 50.00%, Jiwer CER: 20.00%
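The last line shows the discrepancy most clearly. A sketch of the arithmetic, assuming Jiwer counts code points while Stringalign counts grapheme clusters:

```python
# "1🐻‍❄️": the polar-bear emoji alone is four code points
# (BEAR + ZWJ + SNOWFLAKE + VARIATION SELECTOR-16) but one grapheme cluster.
reference = "1\U0001F43B\u200d\u2744\ufe0f"
predicted = "\U0001F43B\u200d\u2744\ufe0f"  # the leading "1" was missed

# Per code point: one deletion out of five symbols.
print(1 / len(reference))  # 0.2 -> Jiwer's 20.00%

# Per grapheme cluster: one deletion out of two symbols ("1" and the emoji).
print(1 / 2)  # 0.5 -> Stringalign's 50.00%
```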
We see that Stringalign computes the CER correctly, even for text containing emojis.