Unicode normalization
Some characters can be represented in Unicode by more than one sequence of code points.
Here is one example for the Norwegian letter å:
print("å" == "å")
False
This can lead to ambiguities when comparing strings and calculating string metrics. To counteract these ambiguities, the Unicode Consortium has defined normalization forms that map equivalent sequences to a single representation.
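The two literals in the comparison above look identical on screen, so it can help to spell out the code points. The sketch below assumes the two strings are the precomposed U+00E5 and the sequence U+0061 U+030A, which is an assumption about what the example contains. It uses Python's standard `unicodedata` module and NFC, one of the normalization forms, to show that the strings compare equal once normalized.

```python
import unicodedata

# Assumed representations of the letter in the example above.
composed = "\u00e5"      # single precomposed code point
decomposed = "a\u030a"   # "a" followed by COMBINING RING ABOVE

print(composed == decomposed)           # False
print(len(composed), len(decomposed))   # 1 2

# After normalizing to the same form, the comparison succeeds.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```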
Equivalence
To understand the normalization forms, it’s useful to first look at what Unicode defines as equivalent characters.
The Unicode Standard defines two types of equivalence: canonical equivalence and compatibility equivalence [Con25a, Chapter 2.12], [Con25c].
Code points or sequences of code points are canonically equivalent if they represent the same abstract character and should always look the same.
The "å" == "å" equivalence mentioned above is an example of this type of equivalence.
See Table 1 below for more examples.
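One way to test canonical equivalence in code is to compare canonical decompositions (NFD). The following is a minimal sketch using Python's `unicodedata` module. The helper name `canonically_equivalent` is hypothetical and not part of any library.

```python
import unicodedata

def canonically_equivalent(a: str, b: str) -> bool:
    """Hypothetical helper: True if a and b share the same NFD form."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

angstrom_sign = "\u212b"   # ANGSTROM SIGN
a_with_ring = "\u00c5"     # LATIN CAPITAL LETTER A WITH RING ABOVE
a_plus_ring = "A\u030a"    # "A" followed by COMBINING RING ABOVE

print(canonically_equivalent(angstrom_sign, a_with_ring))  # True
print(canonically_equivalent(a_with_ring, a_plus_ring))    # True

# The order in which the two dots are attached does not matter either.
print(canonically_equivalent("\u1e0b\u0323", "\u1e0d\u0307"))  # True

# LATIN SMALL LETTER LONG S is not canonically equivalent to "s".
print(canonically_equivalent("\u017f", "s"))  # False
```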
The second type, compatibility equivalence, denotes characters or sequences of characters that represent the same abstract character but may differ in visual appearance. The difference between compatibility-equivalent forms is purely stylistic in some contexts, but not in others. For example, in mathematical notation such forms can carry different information. We therefore need to be careful when deciding whether compatibility equivalence is appropriate. See Table 1 for some examples.
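The loss of information can be seen directly in code. The sketch below, again using Python's `unicodedata` module, contrasts NFC (which applies canonical equivalence only) with NFKC (which also applies compatibility equivalence) on two illustrative inputs chosen here: a ligature and a superscript digit.

```python
import unicodedata

ligature = "\ufb01"    # LATIN SMALL LIGATURE FI
squared = "4\u00b2"    # "4" followed by SUPERSCRIPT TWO

for text in (ligature, squared):
    nfc = unicodedata.normalize("NFC", text)
    nfkc = unicodedata.normalize("NFKC", text)
    # NFC leaves both inputs untouched; NFKC replaces them with plain letters and digits.
    print(ascii(text), "NFC unchanged:", nfc == text, "NFKC:", ascii(nfkc))
```

For the ligature the change is stylistic, but for the second input "four squared" becomes "forty-two", which is why compatibility normalization should be a deliberate choice.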
| Character | NFC | NFD | NFKC | NFKD |
|---|---|---|---|---|
| Å (212B) | Å (00C5) | A ̊ (0041 030A) | Å (00C5) | A ̊ (0041 030A) |
| Ω (2126) | Ω (03A9) | Ω (03A9) | Ω (03A9) | Ω (03A9) |
| ñ (00F1) | ñ (00F1) | n ̃ (006E 0303) | ñ (00F1) | n ̃ (006E 0303) |
| ṩ (1E69) | ṩ (1E69) | s ̣ ̇ (0073 0323 0307) | ṩ (1E69) | s ̣ ̇ (0073 0323 0307) |
| ḍ̇ (1E0B 0323) | ḍ ̇ (1E0D 0307) | d ̣ ̇ (0064 0323 0307) | ḍ ̇ (1E0D 0307) | d ̣ ̇ (0064 0323 0307) |
| ﬁ (FB01) | ﬁ (FB01) | ﬁ (FB01) | fi (0066 0069) | fi (0066 0069) |
| 4² (0034 00B2) | 4² (0034 00B2) | 4² (0034 00B2) | 42 (0034 0032) | 42 (0034 0032) |
| ſ (017F) | ſ (017F) | ſ (017F) | s (0073) | s (0073) |
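The code points in the table can be reproduced with Python's `unicodedata.normalize`, which accepts the four form names used in the column headers. This is a small sketch. The helpers `code_points` and `table_row` are hypothetical names introduced for illustration.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

def code_points(text: str) -> str:
    """Render a string as space-separated hexadecimal code points."""
    return " ".join(f"{ord(ch):04X}" for ch in text)

def table_row(text: str) -> list:
    """Code points of text under each of the four normalization forms."""
    return [code_points(unicodedata.normalize(form, text)) for form in FORMS]

# Inputs from the first column of the table, written as escapes.
samples = ["\u212b", "\u2126", "\u00f1", "\u1e69",
           "\u1e0b\u0323", "\ufb01", "4\u00b2", "\u017f"]

for sample in samples:
    print(code_points(sample), "->", " | ".join(table_row(sample)))
```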
Note
Two characters can be distinct, even after normalization, and still look the same. See Confusables for more information.
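As a brief illustration of the note, the sketch below normalizes three capital letters that look alike in many fonts. The choice of Latin, Cyrillic, and Greek letters is an example picked here, not taken from the Confusables page. Even under NFKC they remain three different code points.

```python
import unicodedata

latin_a = "A"             # LATIN CAPITAL LETTER A
cyrillic_a = "\u0410"     # CYRILLIC CAPITAL LETTER A
greek_alpha = "\u0391"    # GREEK CAPITAL LETTER ALPHA

normalized = [unicodedata.normalize("NFKC", ch)
              for ch in (latin_a, cyrillic_a, greek_alpha)]

for ch in normalized:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# Normalization does not merge lookalike letters from different scripts.
print(len(set(normalized)))  # 3
```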