stringalign.tokenize#

class stringalign.tokenize.GraphemeClusterTokenizer(pre_tokenization_normalizer: StringNormalizer | None = None, post_tokenization_normalizer: StringNormalizer | None = None)[source]#

Turn a string into a list of extended grapheme clusters [Con25e].

This code uses the unicode_segmentation Rust crate to split the text string into extended grapheme clusters.

Parameters:
  • pre_tokenization_normalizer (optional) – StringNormalizer applied to the input string before tokenization. Defaults to None.

  • post_tokenization_normalizer (optional) – StringNormalizer applied after tokenization. Defaults to None.

Examples

>>> tokenizer = GraphemeClusterTokenizer()
>>> tokenizer("abcπŸ³οΈβ€πŸŒˆπŸ³οΈβ€βš§οΈβ€οΈβ€πŸ”₯")
['a', 'b', 'c', 'πŸ³οΈβ€πŸŒˆ', 'πŸ³οΈβ€βš§οΈ', '❀️‍πŸ”₯']
join(tokens: Iterable[str]) → str[source]#
class stringalign.tokenize.SplitAtWhitespaceTokenizer(pre_tokenization_normalizer: StringNormalizer | None = None, post_tokenization_normalizer: StringNormalizer | None = None)[source]#

Turn a text string into a list of words by splitting at whitespace characters.

This tokenizer splits at any whitespace character, including spaces, tabs, newlines, and any other Unicode whitespace character (as well as a few additional characters). See the Python documentation for str.isspace() for details.

Parameters:
  • pre_tokenization_normalizer (optional) – StringNormalizer applied to the input string before tokenization. Defaults to None.

  • post_tokenization_normalizer (optional) – StringNormalizer applied after tokenization. Defaults to None.

Examples

>>> tokenizer = SplitAtWhitespaceTokenizer()
>>> tokenizer("Hello World")
['Hello', 'World']
>>> tokenizer("'Hello', (World)!")
["'Hello',", '(World)!']
join(tokens: Iterable[str]) → str[source]#
class stringalign.tokenize.SplitAtWordBoundaryTokenizer(pre_tokenization_normalizer: StringNormalizer | None = None, post_tokenization_normalizer: StringNormalizer | None = None, remove_whitespace: bool = False)[source]#

Turn a text string into a list of tokens by splitting at word boundaries as described in [Con25e].

This code uses the unicode_segmentation Rust crate to split the text string at word boundaries.

Parameters:
  • pre_tokenization_normalizer (optional) – StringNormalizer applied to the input string before tokenization. Defaults to None.

  • post_tokenization_normalizer (optional) – StringNormalizer applied after tokenization. Defaults to None.

  • remove_whitespace (bool, optional) – If True, whitespace-only tokens are removed from the output. Defaults to False.

Examples

>>> tokenizer = SplitAtWordBoundaryTokenizer()
>>> tokenizer("Hello World")
['Hello', ' ', 'World']
>>> tokenizer("'Hello', (World)!")
["'", 'Hello', "'", ',', ' ', '(', 'World', ')', '!']
>>> tokenizer("Hello  World!")
['Hello', '  ', 'World', '!']
>>> tokenizer = SplitAtWordBoundaryTokenizer(remove_whitespace=True)
>>> tokenizer("Hello  World!")
['Hello', 'World', '!']
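The effect of remove_whitespace can be emulated by filtering out whitespace-only tokens after tokenization; a minimal sketch (not stringalign's implementation):

```python
def drop_whitespace_tokens(tokens: list[str]) -> list[str]:
    # Keep only tokens containing at least one non-whitespace character.
    return [token for token in tokens if not token.isspace()]

# The token list from the example above, with whitespace removed.
tokens = ['Hello', '  ', 'World', '!']
print(drop_whitespace_tokens(tokens))  # ['Hello', 'World', '!']
```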
join(tokens: Iterable[str]) → str[source]#
class stringalign.tokenize.Tokenizer(*args, **kwargs)[source]#

Protocol for callables that convert a string into a list of tokens (represented as strings) and provide a join method for reassembling tokens into a string.

join(text: Iterable[str]) → str[source]#

Join an iterable of tokens into a string. This is used to create combined alignment operations.

It is important that tokenizer(tokenizer.join(tokenizer(text))) == tokenizer(text); otherwise, other logic in stringalign (notably the error-classification heuristics) may not work as expected.
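This invariant can be checked for any tokenizer. Here is a minimal, self-contained illustration using a toy whitespace tokenizer (a hypothetical example, not one of stringalign's classes):

```python
def tokenize(text: str) -> list[str]:
    # Toy tokenizer: split on runs of whitespace.
    return text.split()

def join(tokens: list[str]) -> str:
    # Rejoin tokens with single spaces.
    return " ".join(tokens)

text = "Hello   World"
# Re-tokenizing the joined tokens must reproduce the original tokens,
# even though join() collapsed the run of spaces.
assert tokenize(join(tokenize(text))) == tokenize(text)
```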

class stringalign.tokenize.TokenizerReprMixin[source]#
class stringalign.tokenize.UnicodeWordTokenizer(pre_tokenization_normalizer: StringNormalizer | None = None, post_tokenization_normalizer: StringNormalizer | None = None)[source]#

Turn a text string into a list of extracted words as described in [Con25e].

This code uses the unicode_segmentation Rust crate to split the text string into words. Note that all punctuation is removed.

Parameters:
  • pre_tokenization_normalizer (optional) – StringNormalizer applied to the input string before tokenization. Defaults to None.

  • post_tokenization_normalizer (optional) – StringNormalizer applied after tokenization. Defaults to None.

Examples

>>> tokenizer = UnicodeWordTokenizer()
>>> tokenizer("Hello World")
['Hello', 'World']
>>> tokenizer("'Hello', (World)!")
['Hello', 'World']
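For simple inputs, a rough approximation of this behavior (not the Unicode word-boundary algorithm used by the Rust crate) is to extract runs of word characters with a regular expression:

```python
import re

def approximate_word_tokenize(text: str) -> list[str]:
    # \w+ matches runs of letters, digits, and underscores, so
    # punctuation is dropped. This only approximates proper
    # Unicode word segmentation.
    return re.findall(r"\w+", text)

print(approximate_word_tokenize("'Hello', (World)!"))  # ['Hello', 'World']
```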
join(tokens: Iterable[str]) → str[source]#
stringalign.tokenize.add_join(sep: str = ' ') → Callable[[Callable[[str], list[str]]], Tokenizer][source]#

Decorator that adds a join method to a tokenizer function. This allows the tokenizer to be used with the Tokenizer protocol.

Parameters:
  • tokenizer – A tokenizer function that takes a string and returns a list of tokens.

  • sep (optional) – The separator to use when joining tokens. Defaults to a single space.

Returns:

A wrapped tokenizer that has a join method.

Return type:

Tokenizer
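A sketch of how such a decorator could work (an illustration of the idea, not stringalign's actual implementation):

```python
from typing import Callable


def add_join(sep: str = " "):
    """Attach a ``join`` method to a plain tokenizer function."""

    def decorator(tokenize: Callable[[str], list[str]]):
        # Functions are objects, so we can attach the method directly.
        tokenize.join = lambda tokens: sep.join(tokens)
        return tokenize

    return decorator


@add_join()
def whitespace_tokenizer(text: str) -> list[str]:
    # A hypothetical tokenizer function used for illustration.
    return text.split()


tokens = whitespace_tokenizer("Hello World")
print(tokens)                             # ['Hello', 'World']
print(whitespace_tokenizer.join(tokens))  # Hello World
```

With the decorator applied, the function satisfies the round-trip requirement described under Tokenizer.join above.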