
Summary

UTF-8 is a compromise character encoding that can be as compact as ASCII (if the
file is just plain English text) but can also contain any Unicode character
(with some increase in file size).
UTF stands for Unicode Transformation Format. The '8' means it uses 8-bit blocks
to represent a character. The number of blocks needed to represent a character
varies from 1 to 4.
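As a quick illustration of the 1-to-4 range, the following sketch counts the bytes that Python's built-in UTF-8 codec produces for a few sample characters. The sample characters are arbitrary choices, not taken from this page.

```python
# Count the bytes UTF-8 needs for a few sample characters.
# The samples are arbitrary: U+0041 (A), U+00E9 (e with acute),
# U+20AC (euro sign), U+1F600 (an emoji).
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

byte_counts = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, count in byte_counts.items():
    print(f"U+{ord(ch):04X} -> {count} byte(s)")
```

Each sample falls in a different length class, so the counts come out as 1, 2, 3, and 4.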
One of the really nice features of UTF-8 is that it is compatible with
nul-terminated strings. No character other than NUL (U+0000) itself produces a
nul (0) byte when encoded. This means that C code that deals with char[] will
"just work".
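The nul-byte property can be checked with a short sketch. The helper name has_embedded_nul is hypothetical, and the sample strings are arbitrary. The one case to keep in mind is U+0000 itself, which encodes as a single zero byte.

```python
def has_embedded_nul(text: str) -> bool:
    """Return True if the UTF-8 encoding of text contains a 0x00 byte."""
    return b"\x00" in text.encode("utf-8")


# Arbitrary sample mixing 1-, 2-, 3-, and 4-byte characters.
sample = "na\u00efve caf\u00e9 \u20ac5 \U0001F600"

print(has_embedded_nul(sample))       # text without U+0000
print(has_embedded_nul("a\u0000b"))   # text that contains U+0000 itself
```

Only the second string yields a zero byte, because every byte of a multi-byte sequence has its top bit set.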
You can try the UTF-8 Test Page to see how well your browser (and default font)
support UTF-8.
If you are an application developer, this Joel On Software article on Unicode is
a pretty good summary of all you need to know.
More links:
If you are into the gory details, the official spec is RFC 3629
Markus Kuhn's FAQ
Rob Pike's story about the invention of it
Detail
For any character equal to or below 127 (hex 0x7F), the UTF-8 representation is
one byte. It is just the lowest 7 bits of the full Unicode value. This is also
the same as the ASCII value.
For characters from 128 up to 2047 (hex 0x0080 to 0x07FF), the UTF-8
representation is spread across two bytes. The first byte will have the two high
bits set and the third bit clear (i.e. 0xC2 to 0xDF). The second byte will have
the top bit set and the second bit clear (i.e. 0x80 to 0xBF).
For all characters from 2048 up to and including 65535 (hex 0x0800 to 0xFFFF),
the UTF-8 representation is spread across three bytes. Characters above 65535,
up to 1,114,111 (hex 0x10FFFF), take four bytes.
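The byte-count thresholds above can be sketched as a small encoder. This is an illustrative sketch, not a reference implementation: the function name encode_utf8 is hypothetical, and the four-byte branch and the rejection of surrogate code points (D800 to DFFF) follow RFC 3629 rather than anything stated in the paragraphs above.

```python
def encode_utf8(code_point: int) -> bytes:
    """Encode one Unicode code point as UTF-8 by hand (illustrative sketch)."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    if 0xD800 <= code_point <= 0xDFFF:
        # Assumption taken from RFC 3629: surrogates are not encodable.
        raise ValueError("surrogate code points are not valid in UTF-8")

    if code_point <= 0x7F:
        # One byte: the lowest 7 bits, identical to ASCII.
        return bytes([code_point])
    if code_point <= 0x7FF:
        # Two bytes: 110xxxxx 10xxxxxx (5 + 6 payload bits).
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:
        # Three bytes: 1110xxxx 10xxxxxx 10xxxxxx (4 + 6 + 6 payload bits).
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # Four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx (3 + 6 + 6 + 6 bits).
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


# Compare against Python's built-in codec for one code point per length class.
for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
    assert encode_utf8(cp) == chr(cp).encode("utf-8")
```

The comparison loop is only a sanity check against the built-in codec. Production code should use the codec directly rather than a hand-rolled encoder.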
The following table shows the format of such UTF-8 byte sequences (where the
"free bits" shown by x's in the table are combined in the order shown, and
interpreted from most significant to least significant).
Binary format of bytes in sequence

1st Byte   2nd Byte   3rd Byte   4th Byte   Number of Free Bits   Maximum Expressible Unicode Value
0xxxxxxx                                    7                     007F hex (127)
110xxxxx   10xxxxxx                         (5+6)=11              07FF hex (2047)
1110xxxx   10xxxxxx   10xxxxxx              (4+6+6)=16            FFFF hex (65535)
11110xxx   10xxxxxx   10xxxxxx   10xxxxxx   (3+6+6+6)=21          10FFFF hex (1,114,111)

The value of each individual byte indicates its UTF-8 function, as follows:

00 to 7F hex (0 to 127): first and only byte of a sequence.
80 to BF hex (128 to 191): continuing byte in a multi-byte sequence.
C2 to DF hex (194 to 223): first byte of a two-byte sequence.
E0 to EF hex (224 to 239): first byte of a three-byte sequence.
F0 to FF hex (240 to 255): first byte of a four-byte sequence (only F0 to F4
occur in valid UTF-8, given the 10FFFF limit).
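The byte ranges listed above can be turned into a small classifier. This is a sketch under stated assumptions: the function name classify_byte and its label strings are hypothetical. Bytes C0 and C1 fall outside every listed range and are reported as invalid. Bytes F5 to FF are also reported as invalid, because a four-byte sequence starting with them would exceed the 10FFFF limit set by RFC 3629.

```python
def classify_byte(value: int) -> str:
    """Describe the role a single byte value plays in UTF-8 (sketch)."""
    if not 0 <= value <= 0xFF:
        raise ValueError("not a byte value")
    if value <= 0x7F:
        return "single"          # 00 to 7F: first and only byte
    if value <= 0xBF:
        return "continuation"    # 80 to BF: continuing byte
    if value <= 0xC1:
        return "invalid"         # C0, C1: outside every listed range
    if value <= 0xDF:
        return "lead-2"          # C2 to DF: starts a two-byte sequence
    if value <= 0xEF:
        return "lead-3"          # E0 to EF: starts a three-byte sequence
    if value <= 0xF4:
        return "lead-4"          # F0 to F4: starts a four-byte sequence
    return "invalid"             # F5 to FF: assumption per RFC 3629


# Classify every byte of the euro sign (U+20AC), which encodes as E2 82 AC.
roles = [classify_byte(b) for b in "\u20ac".encode("utf-8")]
print(roles)
```

Because lead bytes and continuation bytes occupy disjoint ranges, a decoder that lands in the middle of a stream can skip continuation bytes until it finds the start of the next character.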

UTF-8 remains a simple, single-byte, ASCII-compatible encoding method, as long
as no characters greater than 127 are directly present. This means that an HTML
document technically declared to be encoded as UTF-8 can remain a normal
single-byte ASCII file. The document can remain so even though it may contain
Unicode characters above 127, as long as all characters above 127 are referred
to indirectly by ampersand entities.
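A minimal sketch of that idea, using Python's standard "xmlcharrefreplace" error handler to replace every character above 127 with a numeric character reference. The sample text is an arbitrary choice. The result is pure ASCII, so it is also valid UTF-8 byte for byte.

```python
import html

# Arbitrary sample containing two characters above 127.
text = "caf\u00e9 \u20ac5"

# Encode to ASCII, turning each non-ASCII character into a &#NNN; reference.
ascii_html = text.encode("ascii", errors="xmlcharrefreplace")
print(ascii_html)  # b'caf&#233; &#8364;5'

# Every byte is below 128, so decoding as UTF-8 gives the same markup,
# and resolving the references recovers the original text.
markup = ascii_html.decode("utf-8")
restored = html.unescape(markup)
```

The handler emits decimal references (233 for U+00E9, 8364 for U+20AC). Named entities such as the one for the euro sign would work equally well in HTML.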
Examples of encoded Unicode characters (in hexadecimal notation)

Unicode code point   UTF-8 Sequence
0001                 01
007F                 7F
0080                 C2 80
07FF                 DF BF
0800                 E0 A0 80
FFFF                 EF BF BF
010000               F0 90 80 80
10FFFF               F4 8F BF BF
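The example values above can be checked against Python's built-in UTF-8 codec. The helper name utf8_hex is hypothetical, and bytes.hex with a separator argument assumes Python 3.8 or later.

```python
# Code point -> expected UTF-8 sequence, copied from the examples above.
examples = {
    0x0001: "01",
    0x007F: "7F",
    0x0080: "C2 80",
    0x07FF: "DF BF",
    0x0800: "E0 A0 80",
    0xFFFF: "EF BF BF",
    0x010000: "F0 90 80 80",
    0x10FFFF: "F4 8F BF BF",
}


def utf8_hex(code_point: int) -> str:
    """Return the UTF-8 encoding of a code point as spaced uppercase hex."""
    return chr(code_point).encode("utf-8").hex(" ").upper()


for code_point, expected in examples.items():
    actual = utf8_hex(code_point)
    status = "ok" if actual == expected else "MISMATCH"
    print(f"U+{code_point:04X}: {actual} ({status})")
```

The listed code points sit on the boundaries between length classes, which makes them convenient test vectors for any encoder or decoder.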
