Overview of tokenizations#
This example shows the different tokenization methods available in Stringalign and how to add your own custom tokenizer if you have specific needs.
These examples do not apply any string normalization. See the example that combines string normalization with a tokenizer, and the stringalign.normalize API documentation for an overview of the normalization options available in Stringalign.
import stringalign
example_sentence = "Hello World! This is fun (example) sentence no. 10 000.😶🌫️"
stringalign.tokenize.GraphemeClusterTokenizer#
tokenizer = stringalign.tokenize.GraphemeClusterTokenizer()
print(tokenizer(example_sentence))
['H', 'e', 'l', 'l', 'o', ' ', 'W', 'o', 'r', 'l', 'd', '!', ' ', 'T', 'h', 'i', 's', ' ', 'i', 's', ' ', 'f', 'u', 'n', ' ', '(', 'e', 'x', 'a', 'm', 'p', 'l', 'e', ')', ' ', 's', 'e', 'n', 't', 'e', 'n', 'c', 'e', ' ', 'n', 'o', '.', ' ', '1', '0', ' ', '0', '0', '0', '.', '😶\u200d🌫️']
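The final token illustrates why grapheme clusters matter: the emoji is a single user-perceived character built from several Unicode code points. A small standalone check (no Stringalign needed) makes this visible:

```python
# "😶‍🌫️" (face in clouds) is ONE grapheme cluster made of FOUR code points:
# U+1F636, U+200D (zero-width joiner), U+1F32B, U+FE0F (variation selector).
emoji = "\U0001F636\u200d\U0001F32B\ufe0f"
print(len(emoji))   # 4 — len() counts code points, not grapheme clusters
print(list(emoji))  # iterating a str yields code points, not graphemes
```

A grapheme-cluster tokenizer keeps such sequences together as one token, whereas naive per-character iteration would split them apart.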
stringalign.tokenize.SplitAtWhitespaceTokenizer#
tokenizer = stringalign.tokenize.SplitAtWhitespaceTokenizer()
print(tokenizer(example_sentence))
['Hello', 'World!', 'This', 'is', 'fun', '(example)', 'sentence', 'no.', '10', '000.😶\u200d🌫️']
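This behavior resembles Python's built-in str.split(), which also splits on runs of whitespace and drops the separators. A minimal sketch (the actual implementation may differ):

```python
s = "Hello World! This is fun (example) sentence no. 10 000."
# str.split() with no arguments splits on runs of whitespace
# and discards the separators, so punctuation stays attached to words.
print(s.split())
```

Note that "10 000" is split into two tokens because it contains a regular space, and punctuation such as "!" and "." remains glued to the neighboring word.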
stringalign.tokenize.SplitAtWordBoundaryTokenizer#
tokenizer = stringalign.tokenize.SplitAtWordBoundaryTokenizer()
print(tokenizer(example_sentence))
['Hello', ' ', 'World', '!', ' ', 'This', ' ', 'is', ' ', 'fun', ' ', '(', 'example', ')', ' ', 'sentence', ' ', 'no', '.', ' ', '10', ' ', '000', '.', '😶\u200d🌫️']
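Unlike whitespace splitting, word-boundary splitting keeps the separators, so the original string can be reconstructed from the tokens. As a rough sketch of the idea (the real tokenizer implements the full Unicode word-boundary rules, which a regex like this does not):

```python
import re

s = "Hello World! This is fun (example) sentence no. 10 000."
# Each match is a run of word characters, a run of whitespace,
# or a single punctuation character — nothing is discarded.
tokens = re.findall(r"\w+|\s+|[^\w\s]", s)
print(tokens)
```

Because no characters are dropped, `"".join(tokens)` reproduces the input exactly, which is useful when alignments need to map back onto the original string.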
stringalign.tokenize.UnicodeWordTokenizer#
tokenizer = stringalign.tokenize.UnicodeWordTokenizer()
print(tokenizer(example_sentence))
['Hello', 'World', 'This', 'is', 'fun', 'example', 'sentence', 'no', '10', '000']
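Here, punctuation and whitespace are dropped entirely, keeping only the words. A rough standalone approximation with Python's re module (note that `\w+` is only a simplification of the Unicode word-segmentation algorithm the tokenizer follows):

```python
import re

s = "Hello World! This is fun (example) sentence no. 10 000."
# \w+ matches runs of word characters; punctuation and whitespace are dropped.
print(re.findall(r"\w+", s))
```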
Custom tokenizer using nb_tokenizer#
import nb_tokenizer
tokenizer = stringalign.tokenize.add_join()(nb_tokenizer.tokenize)
print(tokenizer(example_sentence))
['Hello', 'World', '!', 'This', 'is', 'fun', '(', 'example', ')', 'sentence', 'no.', '10 000', '.', '😶', '\u200d', '🌫', '️']
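In the same spirit, any callable that maps a string to a list of tokens can serve as the core of a custom tokenizer (and, as above, be wrapped with stringalign.tokenize.add_join). A minimal standalone sketch, where my_tokenize is a hypothetical example function rather than part of the library:

```python
import re


def my_tokenize(text: str) -> list[str]:
    """Hypothetical custom tokenizer: split on commas and whitespace,
    dropping empty tokens."""
    return [token for token in re.split(r"[,\s]+", text) if token]


print(my_tokenize("one, two,  three"))
```

This keeps the tokenization logic as an ordinary, easily testable Python function, so swapping in a domain-specific tokenizer (like nb_tokenizer above) requires no subclassing.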
Total running time of the script: (0 minutes 0.037 seconds)