Reading PAGE- and ALTO-XML
Here, we show how to load PAGE XML and ALTO XML files and evaluate their contents with Stringalign. As before, we use the example from Lærebog i de forskjellige Grene af Huusholdningen by Hanna Winsnes (1846), this time with PAGE XML and ALTO XML files generated by OCR-D.
import xml.etree.ElementTree as ET
from pathlib import Path
import stringalign
Set parameters and load the references from a text file
data_path = Path("synthetic_transcription_data")
predictions_page = data_path / "predicted.page.xml"
predictions_alto = data_path / "predicted.alto.xml"
reference_path = data_path / "reference.txt"
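# Assumes the reference file is UTF-8 encoded; being explicit avoids platform-dependent defaults.
references = reference_path.read_text(encoding="utf-8").splitlines()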
Loading PAGE XML
You can use the following function to parse PAGE XML files.
Note that the namespace (ns) must match the namespace declared in the PAGE XML file.
You can find it by searching for xmlns in the file.
def get_page_text(file: Path) -> list[str]:
    root = ET.parse(file)
    # Must match the xmlns declared in the PAGE XML file.
    ns = {"PAGE": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"}
    unicode_tags = root.findall(".//PAGE:TextRegion/PAGE:TextLine/PAGE:TextEquiv/PAGE:Unicode", ns)
    # An empty Unicode element has text None, so fall back to "" before stripping.
    texts = [(unicode_tag.text or "").strip() for unicode_tag in unicode_tags]
    return [text for text in texts if text]
page_predictions = get_page_text(predictions_page)
analyzer = stringalign.evaluate.MultiAlignmentAnalyzer.from_strings(
    predictions=page_predictions, references=references, tokenizer=stringalign.tokenize.GraphemeClusterTokenizer()
)
print(analyzer.compute_ter())
0.047748976807639835
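If you would rather not hard-code the namespace, one option is to read it from the root element's tag, which ElementTree stores as `{namespace}localname`. The following is a minimal sketch under that assumption: `detect_namespace` is a hypothetical helper (not part of Stringalign), and the XML fragment is hand-written for illustration rather than taken from the example data.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written PAGE XML fragment used only for illustration.
SAMPLE_PAGE = """
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page>
    <TextRegion>
      <TextLine><TextEquiv><Unicode>Første linje</Unicode></TextEquiv></TextLine>
      <TextLine><TextEquiv><Unicode/></TextEquiv></TextLine>
    </TextRegion>
  </Page>
</PcGts>
"""


def detect_namespace(root: ET.Element) -> str:
    """Return the namespace URI of the root element, or "" if it has none."""
    if root.tag.startswith("{"):
        return root.tag[1:].split("}", 1)[0]
    return ""


page_root = ET.fromstring(SAMPLE_PAGE)
detected_ns = {"PAGE": detect_namespace(page_root)}
unicode_tags = page_root.findall(
    ".//PAGE:TextRegion/PAGE:TextLine/PAGE:TextEquiv/PAGE:Unicode", detected_ns
)
# Empty Unicode elements have text None, so they are skipped here.
page_lines = [(tag.text or "").strip() for tag in unicode_tags if (tag.text or "").strip()]
print(page_lines)
```

This only works when the root element itself is in the PAGE namespace, which is the usual case; searching the file for xmlns remains the more explicit check.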
Loading ALTO XML
ALTO XML is a bit more complicated than PAGE XML. We need to find the text line elements, iterate over their children and collect the content of each child in a list. Still, you can copy this function (updating the namespace if necessary) and use it in your own code as needed.
def get_alto_text(file: Path) -> list[str]:
    root = ET.parse(file)
    # Must match the xmlns declared in the ALTO XML file.
    ns = "http://www.loc.gov/standards/alto/ns-v4#"
    text_line_tags = root.findall(".//ALTO:TextLine", {"ALTO": ns})
    out = []
    for text_line_tag in text_line_tags:
        line = []
        for tag in text_line_tag:
            # Qualified tags look like "{namespace}Name"; compare them case-insensitively.
            tag_type = tag.tag.casefold()
            if tag_type == f"{{{ns}}}string":
                line.append(tag.attrib["CONTENT"])
            elif tag_type == f"{{{ns}}}sp":
                line.append(" ")
            elif tag_type == f"{{{ns}}}hyp":
                line.append(tag.attrib["CONTENT"])
            elif tag_type == f"{{{ns}}}shape":
                continue
            else:
                raise ValueError(tag)
        # Skip lines that are empty after stripping whitespace.
        if text := "".join(line).strip():
            out.append(text)
    return out
alto_predictions = get_alto_text(predictions_alto)
analyzer = stringalign.evaluate.MultiAlignmentAnalyzer.from_strings(
    predictions=alto_predictions, references=references, tokenizer=stringalign.tokenize.GraphemeClusterTokenizer()
)
print(analyzer.compute_ter())
0.047748976807639835
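To see what the ALTO traversal produces, it can help to run it on a tiny in-memory document. The sketch below is a simplified variant of the function above, not a replacement for it: the fragment is hand-written for illustration, `lines_from_alto` is a hypothetical name, and unlike `get_alto_text` it silently skips unknown child elements instead of raising.

```python
import xml.etree.ElementTree as ET

ALTO_NS = "http://www.loc.gov/standards/alto/ns-v4#"

# Hypothetical, hand-written ALTO fragment; real files carry many more attributes.
SAMPLE_ALTO = f"""
<alto xmlns="{ALTO_NS}">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine>
      <String CONTENT="Lærebog"/><SP/><String CONTENT="i"/><SP/><String CONTENT="Huus"/><HYP CONTENT="-"/>
    </TextLine>
    <TextLine>
      <String CONTENT="holdningen"/>
    </TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>
"""


def lines_from_alto(root: ET.Element) -> list[str]:
    """Collect one string per TextLine from String, SP and HYP children."""
    lines = []
    for text_line in root.iter(f"{{{ALTO_NS}}}TextLine"):
        parts = []
        for child in text_line:
            # Drop the "{namespace}" prefix to get the local element name.
            local_name = child.tag.rsplit("}", 1)[-1]
            if local_name in ("String", "HYP"):
                parts.append(child.attrib["CONTENT"])
            elif local_name == "SP":
                parts.append(" ")
        if text := "".join(parts).strip():
            lines.append(text)
    return lines


demo_lines = lines_from_alto(ET.fromstring(SAMPLE_ALTO))
print(demo_lines)
```

Note that the hyphen from the HYP element stays at the end of the first line, so a word split across two lines is not rejoined; the reference lines need to follow the same convention for the comparison to be fair.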