Grapheme clusters

When you measure the length of a string in Python using len(), what you get is the number of Unicode code points in the string. However, code points are not characters: what a user thinks of as one character can sometimes be made up of multiple Unicode code points. If you combine the letter a with a combining ring above (U+030A), for instance, you get å. Users will think of this as a single character, but it contains two Unicode code points under the hood.

for character in "a\u030a":  # å in decomposed form
    code_point = hex(ord(character))
    print(f"{character!r}: {code_point}")

Therefore, if you are interested in the user-perceived length of text, you shouldn't segment the text based on code points, but based on something called grapheme cluster boundaries. The Unicode standard defines a default algorithm for detecting grapheme cluster boundaries in Unicode strings that should work well in most cases [1]. The grapheme clusters we obtain with this algorithm are called extended grapheme clusters.
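Python's standard library has no grapheme cluster segmentation (the third-party regex module's \X pattern implements the full algorithm). As a rough sketch only, we can approximate segmentation by attaching combining marks to the preceding base character; note that this hypothetical helper ignores ZWJ sequences and the other extended grapheme cluster rules:

```python
import unicodedata


def approximate_clusters(text):
    """Rough grapheme cluster approximation.

    Attaches combining marks (combining class != 0) to the preceding
    base character. This is NOT the full Unicode algorithm: it does
    not handle ZWJ emoji sequences, regional indicators, Hangul, etc.
    """
    clusters = []
    for char in text:
        if clusters and unicodedata.combining(char):
            clusters[-1] += char
        else:
            clusters.append(char)
    return clusters


print(approximate_clusters("pa\u030a"))  # ['p', 'å'] -- the decomposed å stays one cluster
```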

Sometimes, we can normalize multi-code-point grapheme clusters into single code points. For example:

import unicodedata
len(unicodedata.normalize("NFC", "a\u030a"))  # å in decomposed form
1

However, that is not always possible. For example, g̈ (g with a combining diaeresis) stays at two code points:

import unicodedata
len(unicodedata.normalize("NFC", "g̈"))
2

Zero-width-joiner (ZWJ) emoji sequences [Con25b] cannot be composed into single code points either:

import unicodedata
len(unicodedata.normalize("NFC", "🏳️‍🌈"))
4
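To see why the rainbow flag counts as four code points, we can list the code points it is made of, together with their Unicode names:

```python
import unicodedata

# The rainbow flag is a ZWJ sequence: a white flag joined to a rainbow.
for char in "🏳️‍🌈":
    print(f"U+{ord(char):04X} {unicodedata.name(char, '<unnamed>')}")
# U+1F3F3 WAVING WHITE FLAG
# U+FE0F VARIATION SELECTOR-16
# U+200D ZERO WIDTH JOINER
# U+1F308 RAINBOW
```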

Grapheme clusters are important to consider when we compute string metrics, like the Levenshtein distance. If we naively compute string metrics over code points (as the Levenshtein and JiWER packages do by default), the metrics will be wrong for multi-code-point characters.

import Levenshtein
print(Levenshtein.distance("pa\u030a", "p"))  # på with a decomposed å
2

Therefore, Stringalign will, by default, start by tokenizing strings into extended grapheme clusters (characters), and compute the character edits required to transform one string into another:

import stringalign
print(stringalign.align.levenshtein_distance("pa\u030a", "p"))  # på with a decomposed å
1
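To illustrate why cluster-level tokenization changes the count, here is a minimal, self-contained sketch (not Stringalign's actual implementation): a classic dynamic-programming Levenshtein distance computed once over raw code points and once over approximate clusters, where the approximation simply attaches combining marks to the preceding base character:

```python
import unicodedata


def clusters(text):
    # Simplified segmentation sketch: attach combining marks to the
    # preceding base character (ignores ZWJ sequences and other
    # extended grapheme cluster rules).
    out = []
    for char in text:
        if out and unicodedata.combining(char):
            out[-1] += char
        else:
            out.append(char)
    return out


def levenshtein(a, b):
    # Classic dynamic-programming edit distance; works on any sequences,
    # so the same function handles both code points and cluster tokens.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            substitute = prev[j - 1] + (x != y)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitute))
        prev = curr
    return prev[-1]


decomposed = "pa\u030a"  # "på" with a decomposed å
print(levenshtein(decomposed, "p"))            # over code points: 2
print(levenshtein(clusters(decomposed), "p"))  # over clusters: 1
```

Tokenizing first means the decomposed å is a single token, so deleting it is a single edit, matching what a user perceives.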

If you’re interested in learning more about grapheme clusters, you can read the Unicode Technical Report #29 about segmenting Unicode text.

Footnotes