stringalign.align#
- class stringalign.align.AlignmentOperation(*args, **kwargs)[source]#
This class is used as a union of
Deleted, Inserted, Replaced, and Kept. We define it like this instead of using an explicit union type to make the Sphinx documentation more readable.
- simplify() AlignmentOperation[source]#
- class stringalign.align.Deleted(substring: str)[source]#
Class representing tokens that are deleted from the reference.
For example, if the reference text is
'hello' and the predicted text is 'helo', then one 'l' is deleted (if we’re using character tokens).
- class stringalign.align.Inserted(substring: str)[source]#
Class representing tokens that are inserted into the predicted text.
For example, if the reference text is
'hello' and the predicted text is 'helloo', then one 'o' is inserted (if we’re using character tokens).
- class stringalign.align.Kept(substring: str)[source]#
Class representing tokens that are kept from the reference.
For example, if the reference text is
'hi' and the predicted text is 'h', then the 'h' is kept while the 'i' is not (if we’re using character tokens).
- class stringalign.align.Replaced(reference: str, predicted: str)[source]#
Class representing tokens that are replaced from the reference.
For example, if the reference text is
'hello' and the predicted text is 'hellø', then one 'o' is replaced with a 'ø' (if we’re using character tokens).
- simplify() AlignmentOperation[source]#
- stringalign.align.align_strings(reference: str, predicted: str, tokenizer: Tokenizer | None = None, randomize_alignment: bool = False, random_state: Generator | int | None = None) tuple[tuple[AlignmentOperation, ...], bool][source]#
Find one optimal alignment for the two strings and report whether the alignment is unique.
It uses the Needleman-Wunsch algorithm for optimal string alignment [NW70], which is a dynamic programming algorithm with \(O(mn)\) time and memory complexity, where \(m\) and \(n\) are the lengths of the reference and predicted strings. This algorithm has been discovered many times; for a more thorough description, see e.g. [Nav01].
- Parameters:
reference – The reference string, also known as gold standard or ground truth.
predicted – The string to align with the reference.
tokenizer (optional) – A tokenizer that turns a string into an iterable of tokens; for this function, any callable with that signature suffices. If not provided, then
stringalign.tokenize.DEFAULT_TOKENIZER is used instead, which by default is a grapheme cluster (character) tokenizer.
randomize_alignment – If True, then a random optimal alignment is chosen (slightly slower if enabled).
random_state – The NumPy RNG, or a seed used to create a NumPy RNG, for picking the optimal alignment. If
None, then the default RNG is used instead.
- Returns:
alignment (AlignmentTuple) – A tuple of alignment operations.
unique (bool) – A boolean flag that is True if alignment is unique and False otherwise.
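To make the returned operations concrete, here is a minimal, self-contained sketch of the Needleman-Wunsch procedure described above (not the library's implementation) that builds the edit-cost table and backtracks one optimal alignment as operation labels. The tie-breaking order in the backtrack is an arbitrary choice here:

```python
def align(reference: str, predicted: str) -> tuple[list[str], int]:
    """Sketch of Needleman-Wunsch over character tokens.

    Returns one optimal alignment as operation labels, plus the edit distance.
    """
    m, n = len(reference), len(predicted)
    # cost[i][j] = cost of aligning the first i reference characters
    # with the first j predicted characters.
    cost = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        cost[i][0] = i  # delete i reference characters
    for j in range(1, n + 1):
        cost[0][j] = j  # insert j predicted characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if reference[i - 1] == predicted[j - 1] else 1
            cost[i][j] = min(
                cost[i - 1][j - 1] + sub,  # keep or replace
                cost[i - 1][j] + 1,        # delete
                cost[i][j - 1] + 1,        # insert
            )

    # Backtrack one optimal path, preferring diagonal moves on ties.
    ops: list[str] = []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = 0 if reference[i - 1] == predicted[j - 1] else 1
            if cost[i][j] == cost[i - 1][j - 1] + sub:
                ops.append("Kept" if sub == 0 else "Replaced")
                i, j = i - 1, j - 1
                continue
        if i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            ops.append("Deleted")
            i -= 1
        else:
            ops.append("Inserted")
            j -= 1
    ops.reverse()
    return ops, cost[m][n]

print(align("hello", "helo"))
# (['Kept', 'Kept', 'Deleted', 'Kept', 'Kept'], 1)
```

The real function returns operation objects rather than labels, and additionally reports whether the alignment is unique, but the dynamic programme and backtrack follow the same shape.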
- stringalign.align.combine_alignment_ops(alignment: Iterable[AlignmentOperation], tokenizer: stringalign.tokenize.Tokenizer | None = None) Generator[AlignmentOperation, None, None][source]#
Combine alignment operations to cover multiple tokens where possible.
Sometimes, it can be useful to combine multiple alignment operations into single ones, e.g. to find common multi-token insertions, deletions or replacements. For example, in handwritten text recognition, the letters
'll' might be replaced with a 'u', or 'rn' might be replaced with an 'm'. Such replacements are easily missed if we only consider single-token replacements.
This generator combines contiguous
Kept operations and contiguous “edit” operations (Deleted, Inserted and Replaced) into single operations.
- Parameters:
alignment – Iterable of single-token alignment operations to combine.
tokenizer (optional) – A tokenizer. The
stringalign.tokenize.Tokenizer.join() method is used to combine tokens to create multi-token alignment operations. If not provided, then stringalign.tokenize.DEFAULT_TOKENIZER is used instead, which by default is a grapheme cluster (character) tokenizer.
- Yields:
alignment_operation (AlignmentOperation) – Each alignment operation represents a contiguous block of either
Kept operations or “edit” operations.
Examples
Contiguous
Kept and edit operations are combined into single operations:
>>> alignment = [Kept("h"), Kept("e"), Replaced("l", "u"), Deleted("l"), Kept("o")]
>>> tuple(combine_alignment_ops(alignment))
(Kept(substring='he'), Replaced(reference='ll', predicted='u'), Kept(substring='o'))
Contiguous
Deleted and Inserted operations keep their semantics when merged:
>>> alignment = [Kept("h"), Kept("e"), Deleted("l"), Deleted("l"), Kept("o"), Inserted("!")]
>>> tuple(combine_alignment_ops(alignment))
(Kept(substring='he'), Deleted(substring='ll'), Kept(substring='o'), Inserted(substring='!'))
- stringalign.align.compute_levenshtein_distance_from_alignment(alignment: Iterable[AlignmentOperation]) int[source]#
Compute the Levenshtein distance between two strings based on an optimal alignment between them.
See Levenshtein distance for more information about the Levenshtein distance.
- Parameters:
alignment – An iterable representing the optimal alignment of two strings. Typically a tuple returned by
align_strings().
- Returns:
distance – The Levenshtein distance between the two strings.
- Return type:
int
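With single-token operations, the distance is simply the number of operations that are not Kept. The sketch below illustrates this counting rule; the Kept, Deleted and Replaced classes here are hypothetical stand-ins for the stringalign classes, not imports from the library:

```python
# Sketch of the counting rule, not the library's code. These dataclasses
# are simplified stand-ins for the stringalign operation classes.
from dataclasses import dataclass


@dataclass(frozen=True)
class Kept:
    substring: str


@dataclass(frozen=True)
class Deleted:
    substring: str


@dataclass(frozen=True)
class Replaced:
    reference: str
    predicted: str


def distance_from_alignment(alignment) -> int:
    # Every non-Kept operation costs exactly one edit, assuming each
    # operation covers a single token (i.e. before any combining).
    return sum(1 for op in alignment if not isinstance(op, Kept))


alignment = [Kept("h"), Kept("e"), Replaced("l", "u"), Deleted("l"), Kept("o")]
print(distance_from_alignment(alignment))  # 2
```

Note that combined multi-token operations would have to count their tokens instead, which is why the real function expects the alignment as produced by align_strings().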
- stringalign.align.create_cost_matrix(reference_tokens: Iterable[str], predicted_tokens: Iterable[str]) np.ndarray[source]#
Create the alignment cost matrix for the reference tokens and predicted tokens.
Element (i, j) of this matrix corresponds to the cost of aligning the token with index i in the reference string with the token with index j in the predicted string. For more information, see e.g. [Nav01] or [NW70].
This is an internal function used by
align_strings(), so you should probably not call this function directly.
- Parameters:
reference_tokens – Iterable of tokens to align the predicted tokens against.
predicted_tokens – Iterable of tokens to align against the reference tokens.
- Returns:
cost_matrix – Two-dimensional NumPy array of ints with shape (len(reference_tokens), len(predicted_tokens)).
- Return type:
np.ndarray
- stringalign.align.find_all_alignments(reference: str, predicted: str, tokenizer: stringalign.tokenize.Tokenizer | None = None) Generator[AlignmentTuple, None, None][source]#
Works similarly to align_strings, but returns all possible alignments.
It’s implemented as a generator that yields all possible alignments. It holds a queue of alignments, and every time the dynamic programming backtracking encounters a branching point, it adds the new branches to the queue.
The backtracking is completed for one alignment before the next is started (with no caching, so the same subpaths might be traversed multiple times).
This function has exponential worst-case time complexity, since the number of possible string alignments grows exponentially with the length of the strings.
- Parameters:
reference – The reference string, also known as gold standard or ground truth.
predicted – The string to align with the reference.
tokenizer (optional) – A tokenizer that turns a string into an iterable of tokens; for this function, any callable with that signature suffices. If not provided, then
stringalign.tokenize.DEFAULT_TOKENIZER is used instead, which by default is a grapheme cluster (character) tokenizer.
- Yields:
alignment (AlignmentTuple) – A tuple of alignment operations.
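As a rough illustration of why enumeration can blow up, the sketch below (pure Python, not the library's generator) counts every optimal path through the edit-cost table for character tokens. Already 'aaa' versus 'aa' has three distinct optimal alignments, since the single deletion can fall on any of the three 'a's:

```python
def count_optimal_alignments(reference: str, predicted: str) -> int:
    """Count all optimal alignments via an uncached recursive backtrack."""
    m, n = len(reference), len(predicted)
    # Standard edit-cost table.
    cost = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        cost[i][0] = i
    for j in range(1, n + 1):
        cost[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if reference[i - 1] == predicted[j - 1] else 1
            cost[i][j] = min(
                cost[i - 1][j - 1] + sub,
                cost[i - 1][j] + 1,
                cost[i][j - 1] + 1,
            )

    def count(i: int, j: int) -> int:
        # Sum over every predecessor cell that lies on an optimal path.
        if i == 0 and j == 0:
            return 1
        total = 0
        if i > 0 and j > 0:
            sub = 0 if reference[i - 1] == predicted[j - 1] else 1
            if cost[i][j] == cost[i - 1][j - 1] + sub:
                total += count(i - 1, j - 1)  # keep or replace
        if i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            total += count(i - 1, j)  # delete
        if j > 0 and cost[i][j] == cost[i][j - 1] + 1:
            total += count(i, j - 1)  # insert
        return total

    return count(m, n)


print(count_optimal_alignments("aaa", "aa"))  # 3
```

Like the generator described above, the backtrack does no caching, so shared subpaths are re-traversed; for long, repetitive strings the count (and the work) grows very quickly.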
- stringalign.align.levenshtein_distance(reference: str, predicted: str, tokenizer: Tokenizer | None = None) int[source]#
Compute the Levenshtein distance between two strings given a tokenizer.
See Levenshtein distance for more information about the Levenshtein distance.
Note
This function will first align the strings and then compute the Levenshtein distance. If you have already computed the alignment, you can use
compute_levenshtein_distance_from_alignment() instead.
- Parameters:
reference – The reference string, also known as gold standard or ground truth.
predicted – The string to align with the reference.
tokenizer – A tokenizer that turns a string into an iterable of tokens; for this function, any callable with that signature suffices.
- Returns:
distance – The Levenshtein distance between the two strings.
- Return type:
int
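The choice of tokenizer changes the distance. Below is a minimal sketch (not the library's implementation) of the same dynamic programme run over two hypothetical tokenizations, characters versus whitespace-separated words, showing that a transposed pair of letters costs two character edits but only one word edit:

```python
def levenshtein(reference_tokens, predicted_tokens) -> int:
    """Token-level Levenshtein distance with a rolling single-row table."""
    ref, pred = list(reference_tokens), list(predicted_tokens)
    m, n = len(ref), len(pred)
    cost = list(range(n + 1))  # row 0: insert j tokens
    for i in range(1, m + 1):
        prev_diag, cost[0] = cost[0], i  # cost[0] is "delete i tokens"
        for j in range(1, n + 1):
            sub = 0 if ref[i - 1] == pred[j - 1] else 1
            # RHS reads the old cost[j] (the cell above) before updating.
            prev_diag, cost[j] = cost[j], min(
                prev_diag + sub,   # keep or replace
                cost[j - 1] + 1,   # insert
                cost[j] + 1,       # delete
            )
    return cost[n]


reference, predicted = "the cat", "teh cat"
print(levenshtein(reference, predicted))                  # 2 (character tokens)
print(levenshtein(reference.split(), predicted.split()))  # 1 (word tokens)
```

A string passed directly iterates as characters, so the first call mimics a character tokenizer, while .split() stands in for a hypothetical word tokenizer.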