Overview of tokenizations#
This example shows the different tokenization methods available in Stringalign and how to add your own custom tokenizer if you have specific needs.
These examples do not apply any string normalization. See the example that combines string normalization with a tokenizer, and the stringalign.normalize API documentation for an overview of the normalization options available in Stringalign.
import stringalign
example_sentence = "Hello World! This is fun (example) sentence no. 10 000.😶🌫️"
stringalign.tokenize.GraphemeClusterTokenizer#
tokenizer = stringalign.tokenize.GraphemeClusterTokenizer()
print(tokenizer(example_sentence))
['H', 'e', 'l', 'l', 'o', ' ', 'W', 'o', 'r', 'l', 'd', '!', ' ', 'T', 'h', 'i', 's', ' ', 'i', 's', ' ', 'f', 'u', 'n', ' ', '(', 'e', 'x', 'a', 'm', 'p', 'l', 'e', ')', ' ', 's', 'e', 'n', 't', 'e', 'n', 'c', 'e', ' ', 'n', 'o', '.', ' ', '1', '0', ' ', '0', '0', '0', '.', '😶\u200d🌫️']
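The final token illustrates why grapheme clusters matter: the emoji is a single user-perceived character built from several Unicode code points. A small standalone check (no Stringalign needed) makes this visible:

```python
# "😶‍🌫️" (face in clouds) is ONE grapheme cluster made of FOUR code points:
# U+1F636, U+200D (zero-width joiner), U+1F32B, U+FE0F (variation selector).
emoji = "\U0001F636\u200d\U0001F32B\ufe0f"
print(len(emoji))   # 4 — len() counts code points, not grapheme clusters
print(list(emoji))  # iterating a str yields code points, not graphemes
```

A grapheme-cluster tokenizer keeps such sequences together as one token, whereas naive per-character iteration would split them apart.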
stringalign.tokenize.SplitAtWhitespaceTokenizer#
tokenizer = stringalign.tokenize.SplitAtWhitespaceTokenizer()
print(tokenizer(example_sentence))
['Hello', 'World!', 'This', 'is', 'fun', '(example)', 'sentence', 'no.', '10', '000.😶\u200d🌫️']
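This behavior resembles Python's built-in str.split(), which also splits on runs of whitespace and drops the separators. A minimal sketch (the actual implementation may differ):

```python
s = "Hello World! This is fun (example) sentence no. 10 000."
# str.split() with no arguments splits on runs of whitespace
# and discards the separators, so punctuation stays attached to words.
print(s.split())
```

Note that "10 000" is split into two tokens because it contains a regular space, and punctuation such as "!" and "." remains glued to the neighboring word.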
stringalign.tokenize.SplitAtWordBoundaryTokenizer#
tokenizer = stringalign.tokenize.SplitAtWordBoundaryTokenizer()
print(tokenizer(example_sentence))
['Hello', ' ', 'World', '!', ' ', 'This', ' ', 'is', ' ', 'fun', ' ', '(', 'example', ')', ' ', 'sentence', ' ', 'no', '.', ' ', '10', ' ', '000', '.', '😶\u200d🌫️']
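Unlike whitespace splitting, word-boundary splitting keeps the separators, so the original string can be reconstructed from the tokens. As a rough sketch of the idea (the real tokenizer implements the full Unicode word-boundary rules, which a regex like this does not):

```python
import re

s = "Hello World! This is fun (example) sentence no. 10 000."
# Each match is a run of word characters, a run of whitespace,
# or a single punctuation character — nothing is discarded.
tokens = re.findall(r"\w+|\s+|[^\w\s]", s)
print(tokens)
```

Because no characters are dropped, `"".join(tokens)` reproduces the input exactly, which is useful when alignments need to map back onto the original string.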
stringalign.tokenize.UnicodeWordTokenizer#
tokenizer = stringalign.tokenize.UnicodeWordTokenizer()
print(tokenizer(example_sentence))
['Hello', 'World', 'This', 'is', 'fun', 'example', 'sentence', 'no', '10', '000']
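Here, punctuation and whitespace are dropped entirely, keeping only the words. A rough standalone approximation with Python's re module (note that `\w+` is only a simplification of the Unicode word-segmentation algorithm the tokenizer follows):

```python
import re

s = "Hello World! This is fun (example) sentence no. 10 000."
# \w+ matches runs of word characters; punctuation and whitespace are dropped.
print(re.findall(r"\w+", s))
```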
Custom tokenizer using nb_tokenizer#
import nb_tokenizer
tokenizer = stringalign.tokenize.add_join()(nb_tokenizer.tokenize)
print(tokenizer(example_sentence))
['Hello', 'World', '!', 'This', 'is', 'fun', '(', 'example', ')', 'sentence', 'no.', '10 000', '.', '😶', '\u200d', '🌫', '️']
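In the same spirit, any callable that maps a string to a list of tokens can serve as the core of a custom tokenizer (and, as above, be wrapped with stringalign.tokenize.add_join). A minimal standalone sketch, where my_tokenize is a hypothetical example function rather than part of the library:

```python
import re


def my_tokenize(text: str) -> list[str]:
    """Hypothetical custom tokenizer: split on commas and whitespace,
    dropping empty tokens."""
    return [token for token in re.split(r"[,\s]+", text) if token]


print(my_tokenize("one, two,  three"))
```

This keeps the tokenization logic as an ordinary, easily testable Python function, so swapping in a domain-specific tokenizer (like nb_tokenizer above) requires no subclassing.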
Total running time of the script: (0 minutes 0.037 seconds)