`stringalign.normalize`

class stringalign.normalize.StringNormalizer(normalization: Literal['NFC', 'NFD', 'NFKC', 'NFKD', None] = 'NFC', case_insensitive: bool = False, normalize_whitespace: bool = False, remove_whitespace: bool = False, remove_non_word_characters: bool = False, resolve_confusables: Literal['confusables', 'intentional', None] | dict[str, str] = None)

Simple string normalizer, used to remove “irrelevant” differences when comparing strings.
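As a rough illustration of what an “irrelevant” difference can look like, the sketch below uses only the Python standard library. It does not call stringalign, and `sketch_normalize` is a hypothetical helper that merely imitates three common steps (Unicode composition, case folding, whitespace collapsing); the library's actual behaviour may differ.

```python
import re
import unicodedata


def sketch_normalize(text: str) -> str:
    """Hypothetical stand-in for illustration; not stringalign's implementation."""
    text = unicodedata.normalize("NFC", text)  # compose base letters and combining marks
    text = text.casefold()                     # case-insensitive form of the text
    return re.sub(r"\s+", " ", text)           # collapse each whitespace run to one space


composed = "Caf\u00e9  au lait"     # "é" as one code point, followed by two spaces
decomposed = "cafe\u0301 au\tlait"  # "e" plus a combining acute accent, and a tab

print(composed == decomposed)                                      # False
print(sketch_normalize(composed) == sketch_normalize(decomposed))  # True
```

Both strings render almost identically, yet they only compare equal after normalization.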

Parameters:

normalization – Which Unicode normalization form to use
case_insensitive – If true, run str.casefold so that differences in letter case are ignored
normalize_whitespace – Turn any occurrence of one or more whitespace characters into exactly one regular space
remove_whitespace – Remove all whitespace characters
remove_non_word_characters – Remove all non-alphabetic and non-numeric Unicode characters, except spaces.
resolve_confusables –
How to resolve confusable characters. If resolve_confusables is a string, it selects which Unicode list to use: the confusables list ("confusables") or the intentional confusables list ("intentional"). If it’s a dictionary, then any occurrence of a key in the text will be replaced with its corresponding value (so {"a": "b"} will replace all occurrences of “a” with “b” in the text). If it’s None, then no confusable characters will be resolved.

Confusables are resolved in two passes. First, each single code point confusable is resolved, which has computational complexity \(O(n)\), where \(n\) is the string length. Then, we iterate over all confusables that span multiple code points and resolve them one by one, which has computational complexity \(O(mn)\), where \(m = |\{\text{conf} : \text{len}(\text{conf}) > 1\}|\) is the number of confusables with more than one code point.
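The two passes described above can be sketched with plain string methods. This is only an assumption about how such a resolution might be written, not the library's code; `resolve_confusables_sketch` and the example mapping are hypothetical and are not taken from the Unicode confusables data.

```python
def resolve_confusables_sketch(text: str, confusables: dict[str, str]) -> str:
    """Hypothetical two-pass resolution; not stringalign's implementation."""
    # Pass 1: all single code point keys are handled in one sweep over the text.
    single = {ord(key): value for key, value in confusables.items() if len(key) == 1}
    text = text.translate(single)

    # Pass 2: every multi code point key costs one additional scan of the text.
    for key, value in confusables.items():
        if len(key) > 1:
            text = text.replace(key, value)
    return text


# Illustrative mapping: Cyrillic "а" (U+0430) to Latin "a", and "rn" to "m".
mapping = {"\u0430": "a", "rn": "m"}
print(resolve_confusables_sketch("modern p\u0430nd\u0430", mapping))  # modem panda
```

The single code point pass stays linear regardless of how many such entries the mapping holds, while the second pass grows with the number of multi code point entries.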
