stringalign.normalize

class stringalign.normalize.StringNormalizer(normalization: Literal['NFC', 'NFD', 'NFKC', 'NFKD', None] = 'NFC', case_insensitive: bool = False, normalize_whitespace: bool = False, remove_whitespace: bool = False, remove_non_word_characters: bool = False, resolve_confusables: Literal['confusables', 'intentional', None] | dict[str, str] = None)

Simple string normalizer, used to remove “irrelevant” differences when comparing strings.

Parameters:
  • normalization – Which Unicode normalization form to apply. If it’s None, no Unicode normalization is applied.

  • case_insensitive – If true, apply str.casefold to the text so that differences in letter case are ignored

  • normalize_whitespace – Turn any occurrence of one or more whitespace characters into exactly one regular space

  • remove_whitespace – Remove all whitespace from the text

  • remove_non_word_characters – Remove all non-alphabetic and non-numeric Unicode characters, except spaces.

  • resolve_confusables – How to resolve confusable characters. If it’s a string, it selects whether the Unicode confusables list or the intentional confusables list is used. If it’s a dictionary, then any occurrence of a key in the text will be replaced with its corresponding value (so {"a": "b"} will replace all occurrences of “a” with “b” in the text). If it’s None, then no confusable characters will be resolved.
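
The effect of the normalization and case_insensitive options can be illustrated with the standard library alone. The sketch below is not stringalign's implementation: sketch_normalize is a hypothetical helper, and applying Unicode normalization before case folding is an assumption.

```python
import unicodedata
from typing import Optional


def sketch_normalize(
    text: str,
    normalization: Optional[str] = "NFC",
    case_insensitive: bool = False,
) -> str:
    """Hypothetical helper mirroring `normalization` and `case_insensitive`."""
    if normalization is not None:
        # Unify different code point sequences that represent the same text
        text = unicodedata.normalize(normalization, text)
    if case_insensitive:
        # casefold is a more aggressive variant of lower(), e.g. "ß" becomes "ss"
        text = text.casefold()
    return text


composed = "\u00e9"     # "é" as a single code point
decomposed = "e\u0301"  # "e" followed by a combining acute accent

print(composed == decomposed)                                      # False
print(sketch_normalize(composed) == sketch_normalize(decomposed))  # True
print(sketch_normalize("Straße", case_insensitive=True))           # strasse
```

Without normalization, the two spellings of "é" would count as different strings even though they render identically.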

stringalign.normalize.load_confusable_map(confusable_type: Literal['confusables', 'intentional']) → dict[str, str]

Load a confusable character mapping from a JSON file.

Can load either ‘confusable characters’ or ‘intentional confusables’.

Confusable characters are based on the official Unicode list of confusable characters, i.e. characters that often look visually similar (e.g. ρ (lowercase rho) and p). It is available at https://www.unicode.org/Public/security/latest/confusables.txt

Intentional confusables are based on the official list of characters that are probably designed to be identical when using a harmonized typeface design (e.g. а (cyrillic) and a (latin)). The list is available at https://www.unicode.org/Public/security/latest/intentional.txt

For more information, see Unicode Technical Standard #39 (UTS #39), Unicode Security Mechanisms, at https://www.unicode.org/reports/tr39/.

Parameters:

confusable_type – The type of confusable characters to load. Can be either “confusables” or “intentional”.

Returns:

A mapping of confusable characters to their corresponding counterparts.

Return type:

dict[str, str]
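
To give a feel for the shape of such a mapping, here is a minimal sketch that builds one from lines written in the layout of the upstream confusables.txt file (source ; target ; type # comment). The library itself loads a prepared JSON file, so this is an illustration only: parse_confusables and SAMPLE are hypothetical names, and the two sample lines are written for this example.

```python
SAMPLE = """\
# Sample entries in the confusables.txt layout
03C1 ; 0070 ; MA # GREEK SMALL LETTER RHO -> LATIN SMALL LETTER P
0441 ; 0063 ; MA # CYRILLIC SMALL LETTER ES -> LATIN SMALL LETTER C
"""


def hex_to_text(field: str) -> str:
    """Convert space-separated hexadecimal code points into a string."""
    return "".join(chr(int(code_point, 16)) for code_point in field.split())


def parse_confusables(data: str) -> dict[str, str]:
    """Hypothetical parser: map each source sequence to its target sequence."""
    mapping: dict[str, str] = {}
    for line in data.splitlines():
        # Drop comments and skip lines that carry no data
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        source, target = line.split(";")[:2]
        mapping[hex_to_text(source)] = hex_to_text(target)
    return mapping


print(parse_confusables(SAMPLE))  # {'ρ': 'p', 'с': 'c'}
```

The keys are the characters to replace and the values are their replacements, which matches the dict[str, str] return type described above.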

stringalign.normalize.normalize_whitespace(text: str) → str

Replace each run of one or more whitespace characters in the text with a single regular space.
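
A minimal sketch of this behaviour using a regular expression. It is an approximation, not the library's code: sketch_normalize_whitespace is a hypothetical name, and whether the real function also trims leading or trailing whitespace is not stated here, so the sketch does not trim.

```python
import re


def sketch_normalize_whitespace(text: str) -> str:
    """Hypothetical helper: collapse each whitespace run into one space."""
    # For str patterns, \s matches Unicode whitespace such as tabs,
    # newlines and the no-break space (U+00A0)
    return re.sub(r"\s+", " ", text)


print(repr(sketch_normalize_whitespace("a \t\n b\u00a0c")))  # 'a b c'
print(repr(sketch_normalize_whitespace("  a  ")))            # ' a '
```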

stringalign.normalize.remove_non_word_characters(text: str) → str

Remove all non-word characters from the text, except spaces.
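
As a rough illustration, the sketch below keeps only characters that are alphabetic, numeric, or a regular space. Using str.isalnum as the definition of a "word character" is an assumption, and sketch_remove_non_word_characters is a hypothetical name; the library may classify characters differently.

```python
def sketch_remove_non_word_characters(text: str) -> str:
    """Hypothetical helper: drop everything except letters, digits and spaces."""
    return "".join(ch for ch in text if ch.isalnum() or ch == " ")


print(sketch_remove_non_word_characters("Hello, wørld! 42"))  # Hello wørld 42
```

Note that str.isalnum is False for combining marks, so on decomposed (NFD) text this particular sketch would also strip accents.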

stringalign.normalize.remove_whitespace(text: str) → str

Remove all whitespace from the text.
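
A minimal sketch of the same idea with the standard library, assuming "whitespace" means whatever the regular expression class \s matches. The name sketch_remove_whitespace is hypothetical.

```python
import re


def sketch_remove_whitespace(text: str) -> str:
    """Hypothetical helper: delete every whitespace character."""
    return re.sub(r"\s+", "", text)


print(sketch_remove_whitespace("hello \t world\n"))  # helloworld
```

With whitespace removed, strings that differ only in where spaces or line breaks fall compare as equal.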

stringalign.normalize.resolve_confusables(text: str, confusable_map: dict[str, str]) → str

Resolve confusable characters in the text using the provided mapping.
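
One way such a replacement could work is sketched below. It makes a single pass over the text, so a replacement is never replaced again; that choice, like the name sketch_resolve_confusables, is an assumption rather than documented library behaviour.

```python
import re


def sketch_resolve_confusables(text: str, confusable_map: dict[str, str]) -> str:
    """Hypothetical helper: replace every key of the map with its value."""
    if not confusable_map:
        return text
    # Try longer keys first so a multi-character key wins over its prefix
    keys = sorted(confusable_map, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(key) for key in keys))
    return pattern.sub(lambda match: confusable_map[match.group(0)], text)


# Greek rho and Cyrillic a mapped to their Latin look-alikes
mapping = {"\u03c1": "p", "\u0430": "a"}
print(sketch_resolve_confusables("\u03c1\u0430y", mapping))  # pay
```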