Unicode normalization


Some characters can be represented in Unicode in several ways. Here is one example for the Norwegian letter å:

print("\u00e5" == "a\u030a")  # precomposed å vs. a + combining ring above
False

This can naturally lead to some ambiguities when comparing strings and calculating string metrics. To counteract these ambiguities, the Unicode Consortium has defined normalized forms for equivalent characters.
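As a minimal sketch of how normalization removes this ambiguity, the snippet below uses `unicodedata.normalize` from Python's standard library. The variable names (`precomposed`, `decomposed`, `nfc_equal`) are chosen here for illustration only, and the two strings are written with escape sequences so the underlying code points are visible.

```python
import unicodedata

precomposed = "\u00e5"   # å as a single code point
decomposed = "a\u030a"   # a followed by COMBINING RING ABOVE

# Inspect the code points behind each string.
print([f"{ord(ch):04X}" for ch in precomposed])  # ['00E5']
print([f"{ord(ch):04X}" for ch in decomposed])   # ['0061', '030A']

# After normalizing both strings to the same form, they compare equal.
nfc_equal = (
    unicodedata.normalize("NFC", precomposed)
    == unicodedata.normalize("NFC", decomposed)
)
print(nfc_equal)  # True
```

Any of the four forms would work for the comparison, as long as both strings are normalized to the same one.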

Equivalence

To understand the normalized forms, it’s useful to first look at what Unicode defines as equivalent characters. The Unicode standard defines two types of equivalence: canonical equivalence and compatibility equivalence [Con25a, Chapter 2.12], [Con25c]. If code points or sequences of code points represent the same abstract character and should always look the same, they are canonically equivalent. The "å" == "å" comparison above is an example of this type of equivalence. See Table 1 below for more examples.
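One way to test canonical equivalence in code is to compare the canonical decompositions (NFD) of two strings. The helper name `canonically_equivalent` is hypothetical and introduced only for this sketch; the normalization itself is done by the standard-library `unicodedata` module.

```python
import unicodedata


def canonically_equivalent(a: str, b: str) -> bool:
    """Return True if both strings have the same canonical decomposition."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# ANGSTROM SIGN (212B) vs. LATIN CAPITAL LETTER A WITH RING ABOVE (00C5)
print(canonically_equivalent("\u212b", "\u00c5"))  # True

# 1E69 vs. s + dot above + dot below: the marks are reordered canonically
print(canonically_equivalent("\u1e69", "s\u0307\u0323"))  # True

# The fi ligature (FB01) is only compatibility equivalent to "fi"
print(canonically_equivalent("\ufb01", "fi"))  # False
```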

The second type, compatibility equivalence, denotes characters or sequences of characters that represent the same abstract character but may have different visual appearances. The difference between compatibility-equivalent forms can be purely stylistic in some contexts, but not in others. For example, in mathematical notation they can carry different information, as with the superscript in 4². So we need to be careful when deciding whether using compatibility equivalence is appropriate. See Table 1 for some examples.
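The following sketch illustrates why compatibility normalization needs care. It assumes Python's `unicodedata` module; the helper `compatibility_equivalent` and the variable `four_squared` are names made up for this example.

```python
import unicodedata


def compatibility_equivalent(a: str, b: str) -> bool:
    """Return True if both strings have the same compatibility decomposition."""
    return unicodedata.normalize("NFKD", a) == unicodedata.normalize("NFKD", b)


four_squared = "4\u00b2"  # DIGIT FOUR followed by SUPERSCRIPT TWO

# Canonical normalization leaves the superscript untouched.
print(unicodedata.normalize("NFC", four_squared) == four_squared)  # True

# Compatibility normalization folds the superscript into a plain digit,
# so "four squared" becomes indistinguishable from "forty-two".
print(unicodedata.normalize("NFKC", four_squared))  # 42

# The compatibility mapping is tagged with its formatting kind.
print(unicodedata.decomposition("\u00b2"))  # <super> 0032

print(compatibility_equivalent("\ufb01", "fi"))  # True
```

For the fi ligature the folding is usually harmless, while for the superscript it changes the meaning of the text.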

Table 1: Equivalent code points

| Character | NFC | NFD | NFKC | NFKD |
|---|---|---|---|---|
| Å 212B | Å 00C5 | A ̊ 0041 030A | Å 00C5 | A ̊ 0041 030A |
| Ω 2126 | Ω 03A9 | Ω 03A9 | Ω 03A9 | Ω 03A9 |
| ñ 00F1 | ñ 00F1 | n ̃ 006E 0303 | ñ 00F1 | n ̃ 006E 0303 |
| ṩ 1E69 | ṩ 1E69 | s ̣ ̇ 0073 0323 0307 | ṩ 1E69 | s ̣ ̇ 0073 0323 0307 |
| ḍ̇ 1E0B 0323 | ḍ ̇ 1E0D 0307 | d ̣ ̇ 0064 0323 0307 | ḍ ̇ 1E0D 0307 | d ̣ ̇ 0064 0323 0307 |
| ﬁ FB01 | ﬁ FB01 | ﬁ FB01 | fi 0066 0069 | fi 0066 0069 |
| 4² 0034 00B2 | 4² 0034 00B2 | 4² 0034 00B2 | 42 0034 0032 | 42 0034 0032 |
| ſ 017F | ſ 017F | ſ 017F | s 0073 | s 0073 |
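The rows of Table 1 can be reproduced with a short script. This is only a sketch built on `unicodedata.normalize`; the names `FORMS`, `code_points`, and `normalization_row` are hypothetical helpers introduced here, not part of any library.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def code_points(text: str) -> str:
    """Format a string as space-separated hexadecimal code points."""
    return " ".join(f"{ord(ch):04X}" for ch in text)


def normalization_row(text: str) -> dict[str, str]:
    """Map each normalization form to the code points it produces for text."""
    return {form: code_points(unicodedata.normalize(form, text)) for form in FORMS}


samples = ["\u212b", "\u2126", "\u00f1", "\u1e69", "\u1e0b\u0323",
           "\ufb01", "4\u00b2", "\u017f"]

for sample in samples:
    row = normalization_row(sample)
    cells = " | ".join(f"{form}: {row[form]}" for form in FORMS)
    print(f"{code_points(sample)} | {cells}")
```

Printing code points rather than glyphs avoids relying on how a terminal or font renders combining marks.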

Note

Two characters can be distinct, even after normalization, and still look the same. See Confusables for more information.
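As a small illustration of this note, the sketch below compares the Latin and Cyrillic capital A, which look alike in most fonts. The pair is an example chosen here, and the variable names are illustrative; only `unicodedata.normalize` and `unicodedata.name` come from the standard library.

```python
import unicodedata

latin_a = "\u0041"     # LATIN CAPITAL LETTER A
cyrillic_a = "\u0410"  # CYRILLIC CAPITAL LETTER A

print(unicodedata.name(latin_a))
print(unicodedata.name(cyrillic_a))

# None of the four normalization forms maps one letter to the other.
still_distinct = all(
    unicodedata.normalize(form, latin_a) != unicodedata.normalize(form, cyrillic_a)
    for form in ("NFC", "NFD", "NFKC", "NFKD")
)
print(still_distinct)  # True
```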