stringalign.tokenize#

class stringalign.tokenize.GraphemeClusterTokenizer(pre_tokenization_normalizer: StringNormalizer | None = None, post_tokenization_normalizer: StringNormalizer | None = None)[source]#

Turn a string into a list of extended grapheme clusters [Con25e].

This code uses the unicode_segmentation Rust crate to split the text string into extended grapheme clusters.

Parameters:
  • pre_tokenization_normalizer (optional) – StringNormalizer applied to the input string before tokenization. Defaults to None.

  • post_tokenization_normalizer (optional) – StringNormalizer applied after tokenization. Defaults to None.

Examples

>>> tokenizer = GraphemeClusterTokenizer()
>>> tokenizer("abcπŸ³οΈβ€πŸŒˆπŸ³οΈβ€βš§οΈβ€οΈβ€πŸ”₯")
['a', 'b', 'c', 'πŸ³οΈβ€πŸŒˆ', 'πŸ³οΈβ€βš§οΈ', '❀️‍πŸ”₯']
join(tokens: Iterable[str]) → str[source]#
class stringalign.tokenize.SplitAtWhitespaceTokenizer(pre_tokenization_normalizer: StringNormalizer | None = None, post_tokenization_normalizer: StringNormalizer | None = None)[source]#

Turn a text string into a list of words by splitting at whitespace characters.

This tokenizer splits at any whitespace character, including spaces, tabs, newlines, and any other Unicode whitespace character (as well as a few additional characters). See the Python documentation for str.isspace() for details.

Parameters:
  • pre_tokenization_normalizer (optional) – StringNormalizer applied to the input string before tokenization. Defaults to None.

  • post_tokenization_normalizer (optional) – StringNormalizer applied after tokenization. Defaults to None.

Examples

>>> tokenizer = SplitAtWhitespaceTokenizer()
>>> tokenizer("Hello World")
['Hello', 'World']
>>> tokenizer("'Hello', (World)!")
["'Hello',", '(World)!']
join(tokens: Iterable[str]) → str[source]#
class stringalign.tokenize.SplitAtWordBoundaryTokenizer(pre_tokenization_normalizer: StringNormalizer | None = None, post_tokenization_normalizer: StringNormalizer | None = None, remove_whitespace: bool = False)[source]#

Turn a text string into a list of tokens by splitting at word boundaries as described in [Con25e].

This code uses the unicode_segmentation Rust crate to split the text string at word boundaries.

Parameters:
  • pre_tokenization_normalizer (optional) – StringNormalizer applied to the input string before tokenization. Defaults to None.

  • post_tokenization_normalizer (optional) – StringNormalizer applied after tokenization. Defaults to None.

  • remove_whitespace (bool, optional) – If True, whitespace-only tokens are removed from the output. Defaults to False.

Examples

>>> tokenizer = SplitAtWordBoundaryTokenizer()
>>> tokenizer("Hello World")
['Hello', ' ', 'World']
>>> tokenizer("'Hello', (World)!")
["'", 'Hello', "'", ',', ' ', '(', 'World', ')', '!']
>>> tokenizer("Hello  World!")
['Hello', '  ', 'World', '!']
>>> tokenizer = SplitAtWordBoundaryTokenizer(remove_whitespace=True)
>>> tokenizer("Hello  World!")
['Hello', 'World', '!']
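The effect of remove_whitespace can be emulated by filtering out whitespace-only tokens after tokenization; a minimal sketch (not stringalign's implementation):

```python
def drop_whitespace_tokens(tokens: list[str]) -> list[str]:
    # Keep only tokens containing at least one non-whitespace character.
    return [token for token in tokens if not token.isspace()]

# The token list from the example above, with whitespace removed.
tokens = ['Hello', '  ', 'World', '!']
print(drop_whitespace_tokens(tokens))  # ['Hello', 'World', '!']
```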
join(tokens: Iterable[str]) → str[source]#
class stringalign.tokenize.Tokenizer(*args, **kwargs)[source]#

Protocol for callables that convert a string into a list of tokens (represented as strings) and provide a join method for reassembling tokens into a string.

join(text: Iterable[str]) → str[source]#

Join an iterable of tokens into a string. This is used to create combined alignment operations.

It is important that tokenizer(tokenizer.join(tokenizer(text))) == tokenizer(text); otherwise, other logic in stringalign (notably the error-classification heuristics) may not work as expected.
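This invariant can be checked for any tokenizer. Here is a minimal, self-contained illustration using a toy whitespace tokenizer (a hypothetical example, not one of stringalign's classes):

```python
def tokenize(text: str) -> list[str]:
    # Toy tokenizer: split on runs of whitespace.
    return text.split()

def join(tokens: list[str]) -> str:
    # Rejoin tokens with single spaces.
    return " ".join(tokens)

text = "Hello   World"
# Re-tokenizing the joined tokens must reproduce the original tokens,
# even though join() collapsed the run of spaces.
assert tokenize(join(tokenize(text))) == tokenize(text)
```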

class stringalign.tokenize.TokenizerReprMixin[source]#
class stringalign.tokenize.UnicodeWordTokenizer(pre_tokenization_normalizer: StringNormalizer | None = None, post_tokenization_normalizer: StringNormalizer | None = None)[source]#

Turn a text string into a list of extracted words as described in [Con25e].

This code uses the unicode_segmentation Rust crate to split the text string into words. Note that all punctuation is removed.

Parameters:
  • pre_tokenization_normalizer (optional) – StringNormalizer applied to the input string before tokenization. Defaults to None.

  • post_tokenization_normalizer (optional) – StringNormalizer applied after tokenization. Defaults to None.

Examples

>>> tokenizer = UnicodeWordTokenizer()
>>> tokenizer("Hello World")
['Hello', 'World']
>>> tokenizer("'Hello', (World)!")
['Hello', 'World']
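For simple inputs, a rough approximation of this behavior (not the Unicode word-boundary algorithm used by the Rust crate) is to extract runs of word characters with a regular expression:

```python
import re

def approximate_word_tokenize(text: str) -> list[str]:
    # \w+ matches runs of letters, digits, and underscores, so
    # punctuation is dropped. This only approximates proper
    # Unicode word segmentation.
    return re.findall(r"\w+", text)

print(approximate_word_tokenize("'Hello', (World)!"))  # ['Hello', 'World']
```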
join(tokens: Iterable[str]) → str[source]#
stringalign.tokenize.add_join(sep: str = ' ') → Callable[[Callable[[str], list[str]]], Tokenizer][source]#

Decorator that adds a join method to a tokenizer function. This allows the tokenizer to be used with the Tokenizer protocol.

Parameters:
  • tokenizer – A tokenizer function that takes a string and returns a list of tokens.

  • sep (optional) – The separator to use when joining tokens. Defaults to a single space.

Returns:

A wrapped tokenizer that has a join method.

Return type:

Tokenizer
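A sketch of how such a decorator could work (an illustration of the idea, not stringalign's actual implementation):

```python
from typing import Callable


def add_join(sep: str = " "):
    """Attach a ``join`` method to a plain tokenizer function."""

    def decorator(tokenize: Callable[[str], list[str]]):
        # Functions are objects, so we can attach the method directly.
        tokenize.join = lambda tokens: sep.join(tokens)
        return tokenize

    return decorator


@add_join()
def whitespace_tokenizer(text: str) -> list[str]:
    # A hypothetical tokenizer function used for illustration.
    return text.split()


tokens = whitespace_tokenizer("Hello World")
print(tokens)                             # ['Hello', 'World']
print(whitespace_tokenizer.join(tokens))  # Hello World
```

With the decorator applied, the function satisfies the round-trip requirement described under Tokenizer.join above.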