UTF-8 is a compromise character encoding that can be as compact as ASCII (if
the file is just plain English text) but can also contain any Unicode character
(with some increase in file size).
UTF stands for Unicode Transformation Format. The '8' means it uses 8-bit blocks
to represent a character. The number of blocks needed to represent a character
varies from 1 to 4.
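As a rough illustration of the variable length, the sketch below uses Python's
built-in str.encode to count the bytes for four sample code points. The samples
are arbitrary picks, not taken from any specification.

```python
# Sample code points: Latin capital A, e with acute, the euro sign, an emoji.
samples = [0x41, 0xE9, 0x20AC, 0x1F600]

# Map each code point to the number of bytes in its UTF-8 encoding.
lengths = {cp: len(chr(cp).encode("utf-8")) for cp in samples}

for cp, n in lengths.items():
    encoded = chr(cp).encode("utf-8")
    print(f"U+{cp:04X} -> {n} byte(s): {encoded.hex(' ')}")
```

The four samples take 1, 2, 3 and 4 bytes respectively.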
One of the really nice features of UTF-8 is that it is compatible with
nul-terminated strings. No character other than NUL itself (U+0000) will have a
nul (0) byte when encoded. This means that C code that deals with char[] will
"just work".
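The nul-byte property can be checked by brute force. This is only a sketch: the
function name zero_byte_code_points is made up for the example, and the
surrogate range is skipped because those code points cannot be encoded in UTF-8.

```python
def zero_byte_code_points():
    """Return the code points whose UTF-8 encoding contains a 0x00 byte."""
    found = []
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates are not encodable in UTF-8
        if 0 in chr(cp).encode("utf-8"):
            found.append(cp)
    return found

offenders = zero_byte_code_points()
print(offenders)  # [0]
```

Only U+0000 itself produces a zero byte, so a nul terminator can never be
mistaken for part of another character.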
You can try the UTF-8 Test Page to see how well your browser (and default font)
support UTF-8.
If you are an application developer, this Joel On Software article on Unicode
is a pretty good summary of all you need to know.
More links:
If you are into the gory details, the official spec is RFC 3629
Markus Kuhn's FAQ
Rob Pike's story about the invention of it
Detail
For any character equal to or below 127 (hex 0x7F), the UTF-8 representation is
one byte. It is just the lowest 7 bits of the full Unicode value. This is also
the same as the ASCII value.
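The one-byte rule is easy to confirm with a small check, given here as a sketch
that relies only on Python's built-in codecs. The variable name
ascii_compatible is just for illustration.

```python
# For every code point from 0 to 127, the UTF-8 form should be a single byte
# whose value equals the code point, identical to the ASCII encoding.
ascii_compatible = all(
    chr(cp).encode("utf-8") == chr(cp).encode("ascii") == bytes([cp])
    for cp in range(128)
)
print(ascii_compatible)  # True
```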
For characters from 128 up to and including 2047 (hex 0x07FF), the UTF-8
representation is spread across two bytes. The first byte will have the two
high bits set and the third bit clear (i.e. 0xC2 to 0xDF; 0xC0 and 0xC1 never
appear because they could only encode values below 128). The second byte will
have the top bit set and the second bit clear (i.e. 0x80 to 0xBF).
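The two-byte case can be worked by hand with a couple of shifts and masks. The
helper name encode_two_bytes is hypothetical; this is a minimal sketch of the
bit layout described above, not a replacement for a real codec.

```python
def encode_two_bytes(cp):
    """Encode a code point in the range 0x80..0x7FF as two UTF-8 bytes."""
    if not 0x80 <= cp <= 0x7FF:
        raise ValueError("code point does not fit the two-byte form")
    first = 0b11000000 | (cp >> 6)         # 110xxxxx: upper 5 of the 11 bits
    second = 0b10000000 | (cp & 0b111111)  # 10xxxxxx: lower 6 bits
    return bytes([first, second])

# U+00E9 (e with acute accent) is binary 00011 101001.
print(encode_two_bytes(0xE9).hex(" "))  # c3 a9
```

The smallest input, 0x80, gives C2 80 and the largest, 0x7FF, gives DF BF,
which matches the byte ranges quoted in the text.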
For all characters equal to or greater than 2048 but less than or equal to
65535 (0xFFFF), the UTF-8 representation is spread across three bytes.
Characters above 65535, up to 0x10FFFF, take four bytes.
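The three-byte case follows the same pattern with one more continuation byte.
Again the helper name encode_three_bytes is made up for this sketch. It rejects
the surrogate range 0xD800 to 0xDFFF, which RFC 3629 does not allow in UTF-8.

```python
def encode_three_bytes(cp):
    """Encode a code point in the range 0x800..0xFFFF as three UTF-8 bytes."""
    if not 0x800 <= cp <= 0xFFFF:
        raise ValueError("code point does not fit the three-byte form")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogates are not valid in UTF-8")
    first = 0xE0 | (cp >> 12)            # 1110xxxx: upper 4 of the 16 bits
    second = 0x80 | ((cp >> 6) & 0x3F)   # 10xxxxxx: middle 6 bits
    third = 0x80 | (cp & 0x3F)           # 10xxxxxx: lower 6 bits
    return bytes([first, second, third])

# U+20AC is the euro sign.
print(encode_three_bytes(0x20AC).hex(" "))  # e2 82 ac
```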
The following table shows the format of such UTF-8 byte sequences (where the
"free bits" shown by x's in the table are combined in the order shown, and
interpreted from most significant to least significant).
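Binary format of bytes in sequence           Number of      Maximum Expressible
1st Byte   2nd Byte   3rd Byte   4th Byte    Free Bits      Unicode Value
0xxxxxxx                                     7              007F hex (127)
110xxxxx   10xxxxxx                          (5+6)=11       07FF hex (2047)
1110xxxx   10xxxxxx   10xxxxxx               (4+6+6)=16     FFFF hex (65535)
11110xxx   10xxxxxx   10xxxxxx   10xxxxxx    (3+6+6+6)=21   10FFFF hex (1,114,111)

The value of each individual byte indicates its role in a sequence:

00 to 7F hex   first and only byte of a sequence
80 to BF hex   continuing byte in a multi-byte sequence
C2 to DF hex   first byte of a two-byte sequence
E0 to EF hex   first byte of a three-byte sequence
F0 to FF hex   first byte of a four-byte sequence (only F0 to F4 can occur,
               because the maximum value is 10FFFF hex)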
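The bit layouts and byte ranges tabulated above can be folded into one small
encoder and one byte classifier. Both names, utf8_encode and classify_byte, and
the label strings are hypothetical. This is a sketch for study, and production
code should use the language's own codec.

```python
def utf8_encode(cp):
    """Encode one Unicode code point as UTF-8, following the table's layout."""
    if not (0 <= cp <= 0x10FFFF) or (0xD800 <= cp <= 0xDFFF):
        raise ValueError("not an encodable code point")
    if cp <= 0x7F:
        return bytes([cp])        # 0xxxxxxx
    if cp <= 0x7FF:
        tail, lead = 1, 0xC0      # 110xxxxx plus 1 continuation byte
    elif cp <= 0xFFFF:
        tail, lead = 2, 0xE0      # 1110xxxx plus 2 continuation bytes
    else:
        tail, lead = 3, 0xF0      # 11110xxx plus 3 continuation bytes
    out = [lead | (cp >> (6 * tail))]
    for i in range(tail - 1, -1, -1):
        out.append(0x80 | ((cp >> (6 * i)) & 0x3F))  # 10xxxxxx
    return bytes(out)

def classify_byte(b):
    """Name the role of a single byte value (0..255) in a UTF-8 stream."""
    if b <= 0x7F:
        return "single"
    if b <= 0xBF:
        return "continuation"
    if b <= 0xC1:
        return "invalid"          # could only start an overlong sequence
    if b <= 0xDF:
        return "lead-2"
    if b <= 0xEF:
        return "lead-3"
    if b <= 0xF4:
        return "lead-4"
    return "invalid"              # would encode values above 0x10FFFF

roles = [classify_byte(b) for b in utf8_encode(0x20AC)]
print(roles)  # ['lead-3', 'continuation', 'continuation']
```

Because a lead byte and a continuation byte never share a value, a decoder
dropped into the middle of a stream can find the next character boundary by
skipping bytes that classify as "continuation".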