Reading PAGE- and ALTO-XML

Here, we show how to load PAGE- and ALTO-XML files and use their text in Stringalign. As before, we use the example Lærebog i de forskjellige Grene af Huusholdningen by Hanna Winsnes (1846), this time with PAGE- and ALTO-XML files generated by OCR-D.

import xml.etree.ElementTree as ET
from pathlib import Path

import stringalign

Set the file paths and load the reference transcriptions from a text file.

data_path = Path("synthetic_transcription_data")
predictions_page = data_path / "predicted.page.xml"
predictions_alto = data_path / "predicted.alto.xml"
reference_path = data_path / "reference.txt"


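# Assumes the reference file is UTF-8 encoded; without an explicit encoding,
# read_text falls back to the platform default, which can garble characters like "æ"
references = reference_path.read_text(encoding="utf-8").splitlines()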

Loading PAGE XML

You can use the following function to parse PAGE XML files. Note that the namespace (ns) must match the namespace URI declared in the PAGE XML file. You can find it by searching for xmlns in the file.

def get_page_text(file: Path) -> list[str]:
    tree = ET.parse(file)
    ns = {"PAGE": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"}
    unicode_tags = tree.findall(".//PAGE:TextRegion/PAGE:TextLine/PAGE:TextEquiv/PAGE:Unicode", ns)

    # Empty <Unicode/> elements have text=None, so fall back to "" before stripping
    lines = [(unicode_tag.text or "").strip() for unicode_tag in unicode_tags]
    return [line for line in lines if line]


page_predictions = get_page_text(predictions_page)

analyzer = stringalign.evaluate.MultiAlignmentAnalyzer.from_strings(
    predictions=page_predictions, references=references, tokenizer=stringalign.tokenize.GraphemeClusterTokenizer()
)
print(analyzer.compute_ter())
0.047748976807639835
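
If you would rather not hard-code the namespace, one option is to read it from the root element, since ElementTree stores qualified tag names as {namespace}localname. The following is a minimal sketch under that assumption. get_xml_namespace is a hypothetical helper name (it is not part of Stringalign), and the inline XML is a made-up minimal document used only for illustration.

```python
import io
import xml.etree.ElementTree as ET


def get_xml_namespace(source) -> str:
    """Return the namespace URI of the root element, or "" if it has none.

    Hypothetical helper for illustration; `source` is a path or a file object.
    """
    root = ET.parse(source).getroot()
    # ElementTree stores qualified tags as "{namespace-uri}localname"
    if root.tag.startswith("{"):
        return root.tag[1:].split("}", 1)[0]
    return ""


# Made-up minimal PAGE-like document, only for illustration
example_xml = (
    '<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">'
    "<Page><TextRegion><TextLine><TextEquiv><Unicode>Hei</Unicode></TextEquiv></TextLine></TextRegion></Page>"
    "</PcGts>"
)
detected_ns = get_xml_namespace(io.BytesIO(example_xml.encode("utf-8")))
print(detected_ns)
```

The returned string could then be used as the value in the ns dictionary passed to findall, in place of the hard-coded URI.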

Loading ALTO XML

ALTO XML is a bit more complicated than PAGE XML: the text of a line is stored in the CONTENT attributes of the child elements of each TextLine rather than as element text. We need to find the TextLine elements, iterate over their children and collect the content in a list. Still, you can copy this function (updating the namespace if necessary) and use it in your code as needed.

def get_alto_text(file: Path) -> list[str]:
    root = ET.parse(file)
    ns = "http://www.loc.gov/standards/alto/ns-v4#"
    text_line_tags = root.findall(".//ALTO:TextLine", {"ALTO": ns})

    out = []
    for text_line_tag in text_line_tags:
        line = []
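        for tag in text_line_tag:
            # Compare case-insensitively; this relies on the namespace URI being lowercase
            tag_type = tag.tag.casefold()
            if tag_type == f"{{{ns}}}string":
                # A word (or other token) on the line
                line.append(tag.attrib["CONTENT"])
            elif tag_type == f"{{{ns}}}sp":
                # Whitespace between two String elements
                line.append(" ")
            elif tag_type == f"{{{ns}}}hyp":
                # Hyphenation character at the end of a line
                line.append(tag.attrib["CONTENT"])
            elif tag_type == f"{{{ns}}}shape":
                # Geometry only, no text
                continue
            else:
                raise ValueError(tag)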

        if text := "".join(line).strip():
            out.append(text)

    return out


alto_predictions = get_alto_text(predictions_alto)

analyzer = stringalign.evaluate.MultiAlignmentAnalyzer.from_strings(
    predictions=alto_predictions, references=references, tokenizer=stringalign.tokenize.GraphemeClusterTokenizer()
)
print(analyzer.compute_ter())
0.047748976807639835
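
To see how the children of a TextLine turn into a single string, here is a small self-contained sketch on a made-up ALTO fragment. It is a variation on the function above rather than a replacement: it uses the {*} namespace wildcard that ElementTree supports from Python 3.8 onwards, so it does not depend on the ALTO version, and it silently skips unknown child elements instead of raising. join_alto_line is a hypothetical name chosen for this example.

```python
import xml.etree.ElementTree as ET

# Made-up ALTO fragment, only for illustration
example_alto = """
<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine>
      <String CONTENT="Lærebog"/><SP/><String CONTENT="i"/><SP/><String CONTENT="Huus"/><HYP CONTENT="-"/>
    </TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>
"""


def join_alto_line(text_line: ET.Element) -> str:
    """Join the String, SP and HYP children of one TextLine into a string."""
    parts = []
    for child in text_line:
        # Drop the "{namespace}" prefix and compare only the local tag name
        local_name = child.tag.rsplit("}", 1)[-1].casefold()
        if local_name in ("string", "hyp"):
            parts.append(child.attrib["CONTENT"])
        elif local_name == "sp":
            parts.append(" ")
    return "".join(parts).strip()


root = ET.fromstring(example_alto)
# "{*}" matches TextLine elements in any namespace (Python 3.8+)
lines = [join_alto_line(text_line) for text_line in root.findall(".//{*}TextLine")]
print(lines)  # ['Lærebog i Huus-']
```

Matching on the local name trades strictness for convenience: the stricter function above raises on unexpected elements, which is useful for catching files that do not follow the expected layout.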
