stringalign.tokenize#
- class stringalign.tokenize.GraphemeClusterTokenizer(pre_tokenization_normalizer: StringNormalizer | None = None, post_tokenization_normalizer: StringNormalizer | None = None)[source]#
Turn a string into a list of extended grapheme clusters [Con25e].
This code uses the unicode_segmentation Rust crate to split the text string into extended grapheme clusters.
- Parameters:
  - pre_tokenization_normalizer – An optional stringalign.normalize.StringNormalizer to apply before splitting into extended grapheme clusters.
  - post_tokenization_normalizer – An optional stringalign.normalize.StringNormalizer to apply to each token after splitting.
Examples
>>> tokenizer = GraphemeClusterTokenizer()
>>> tokenizer("abc🏳️‍🌈🏳️‍⚧️❤️‍🔥")
['a', 'b', 'c', '🏳️‍🌈', '🏳️‍⚧️', '❤️‍🔥']
- class stringalign.tokenize.SplitAtWhitespaceTokenizer(pre_tokenization_normalizer: StringNormalizer | None = None, post_tokenization_normalizer: StringNormalizer | None = None)[source]#
Turn a text string into a list of words by splitting at whitespace characters.
This tokenizer will split at any whitespace character, including spaces, tabs, newlines, any other Unicode whitespace character, and a few additional characters that Python treats as whitespace. See the Python documentation for str.isspace() for more information.
- Parameters:
  - pre_tokenization_normalizer – An optional stringalign.normalize.StringNormalizer to apply before splitting at whitespace.
  - post_tokenization_normalizer – An optional stringalign.normalize.StringNormalizer to apply to each token after splitting.
Examples
>>> tokenizer = SplitAtWhitespaceTokenizer()
>>> tokenizer("Hello World")
['Hello', 'World']
>>> tokenizer("'Hello', (World)!")
["'Hello',", '(World)!']
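As a point of reference (and not a substitute for the tokenizer itself), Python's built-in str.split() with no arguments also splits at any run of whitespace recognised by str.isspace(), so it reproduces the token lists in the examples above for plain text:

```python
# str.split() with no arguments splits at any run of whitespace
# (spaces, tabs, newlines, other Unicode whitespace), like the examples above.
text = "Hello\tWorld\nand goodbye"
tokens = text.split()
print(tokens)
# ['Hello', 'World', 'and', 'goodbye']
```

Note that SplitAtWhitespaceTokenizer additionally supports the optional normalizers, which plain str.split() does not.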
- class stringalign.tokenize.SplitAtWordBoundaryTokenizer(pre_tokenization_normalizer: StringNormalizer | None = None, post_tokenization_normalizer: StringNormalizer | None = None, remove_whitespace: bool = False)[source]#
Turn a text string into a list of tokens by splitting at word boundaries as described in [Con25e].
This code uses the unicode_segmentation Rust crate to split the text string at word boundaries.
- Parameters:
  - pre_tokenization_normalizer – An optional stringalign.normalize.StringNormalizer to apply before splitting at word boundaries.
  - post_tokenization_normalizer – An optional stringalign.normalize.StringNormalizer to apply to each token after splitting.
  - remove_whitespace – If True, remove tokens that are only whitespace after splitting.
Examples
>>> tokenizer = SplitAtWordBoundaryTokenizer()
>>> tokenizer("Hello World")
['Hello', ' ', 'World']
>>> tokenizer("'Hello', (World)!")
["'", 'Hello', "'", ',', ' ', '(', 'World', ')', '!']
>>> tokenizer("Hello World!")
['Hello', ' ', 'World', '!']
>>> tokenizer = SplitAtWordBoundaryTokenizer(remove_whitespace=True)
>>> tokenizer("Hello World!")
['Hello', 'World', '!']
- class stringalign.tokenize.Tokenizer(*args, **kwargs)[source]#
Callable that converts a string into a list of tokens (represented as strings), with a method to join tokens back into a string.
- join(text: Iterable[str]) → str[source]#
Join an iterable of tokens into a string. This is used to create combined alignment operations.
It is important that tokenizer(tokenizer.join(tokenizer(text))) == tokenizer(text); otherwise, other logic in stringalign (namely the error classification heuristics) may not work as expected.
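The round-trip property above can be checked directly for any tokenizer. A minimal sketch using a toy whitespace tokenizer (not the stringalign implementation) to illustrate the requirement:

```python
# Toy tokenizer used only to illustrate the round-trip property
# tokenizer(tokenizer.join(tokenizer(text))) == tokenizer(text).
class ToyTokenizer:
    def __call__(self, text: str) -> list[str]:
        return text.split()

    def join(self, tokens) -> str:
        return " ".join(tokens)

tokenizer = ToyTokenizer()
text = "Hello   World"  # joining collapses the extra spaces,
once = tokenizer(text)  # but the token list is unchanged
roundtrip = tokenizer(tokenizer.join(once))
assert roundtrip == once  # the property holds for this tokenizer
```

Note that the property does not require join to reconstruct the original string exactly (whitespace may be normalised), only that re-tokenizing the joined string yields the same tokens.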
- class stringalign.tokenize.UnicodeWordTokenizer(pre_tokenization_normalizer: StringNormalizer | None = None, post_tokenization_normalizer: StringNormalizer | None = None)[source]#
Turn a text string into a list of extracted words as described in [Con25e].
This code uses the unicode_segmentation Rust crate to split the text string into words. Note that all punctuation is removed.
- Parameters:
  - pre_tokenization_normalizer – An optional stringalign.normalize.StringNormalizer to apply before splitting into words.
  - post_tokenization_normalizer – An optional stringalign.normalize.StringNormalizer to apply to each token after splitting.
Examples
>>> tokenizer = UnicodeWordTokenizer()
>>> tokenizer("Hello World")
['Hello', 'World']
>>> tokenizer("'Hello', (World)!")
['Hello', 'World']
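For simple ASCII text, the punctuation-stripping behaviour can be approximated with a regular expression. This is only a rough sketch, not the UAX #29 word segmentation that the unicode_segmentation crate implements:

```python
import re

# Rough ASCII-oriented approximation of word extraction: keep runs of
# word characters, drop punctuation and whitespace. This is NOT the
# UAX #29 algorithm used by UnicodeWordTokenizer.
def approx_words(text: str) -> list[str]:
    return re.findall(r"\w+", text)

print(approx_words("'Hello', (World)!"))
# ['Hello', 'World']
```

The real tokenizer handles cases the regex does not, such as words containing internal apostrophes and non-Latin scripts.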
- stringalign.tokenize.add_join(sep: str = ' ') → Callable[[Callable[[str], list[str]]], Tokenizer][source]#
Decorator that adds a join method to a tokenizer function. This allows the tokenizer to be used with the Tokenizer protocol.
- Parameters:
  - tokenizer – A tokenizer function that takes a string and returns a list of tokens.
  - sep (optional) – The separator to use when joining tokens. Defaults to a single space.
- Returns:
  A wrapped tokenizer that has a join method.
- Return type:
  Tokenizer
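Based on the signature and description above, add_join is presumably used as a decorator factory. The following is a self-contained sketch of that pattern (not the stringalign source) showing how a plain tokenizer function gains a join method:

```python
from typing import Callable, Iterable

# Sketch of the add_join pattern (not the stringalign implementation):
# wrap a plain tokenizer function so the result also carries a join method,
# satisfying the Tokenizer protocol.
def add_join(sep: str = " "):
    def decorator(tokenize: Callable[[str], list[str]]):
        class Wrapped:
            def __call__(self, text: str) -> list[str]:
                return tokenize(text)

            def join(self, tokens: Iterable[str]) -> str:
                return sep.join(tokens)

        return Wrapped()

    return decorator

@add_join(sep=" ")
def my_tokenizer(text: str) -> list[str]:
    return text.split()

print(my_tokenizer("Hello World"))            # ['Hello', 'World']
print(my_tokenizer.join(["Hello", "World"]))  # Hello World
```

With sep=" ", this wrapped tokenizer satisfies the round-trip requirement described under Tokenizer.join, since splitting a space-joined token list recovers the same tokens.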