Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language. The Unicode codespace contains 1,114,112 code points, from 0x000000 to 0x10FFFF. (The idea that Unicode is a 16-bit encoding is completely wrong.) For maximum compatibility, individual Unicode values are usually passed around as 32-bit integers (4 bytes per character), even though this is more than necessary. The consensus is that storing four bytes per character is wasteful, so a variety of representations have sprung up for Unicode characters. The most interesting one for C programmers is called UTF-8.
UTF-8 is a "multi-byte" encoding scheme, meaning that it requires a variable number of bytes to represent a single Unicode value. Given a so-called "UTF-8 sequence", you can convert it to a Unicode value that refers to a character. http://www.cprogramming.com/tutorial/unicode.html
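To make the "variable number of bytes" idea concrete, here is a minimal Python sketch (Python is used only for brevity). The sample characters and the helper name utf8_length are my own choices for illustration, not something taken from the tutorial quoted above.

```python
# Sample characters chosen for illustration (assumption, not from the source):
# an ASCII letter, a Latin-1 letter, a CJK character and an emoji.
samples = ["\u0041", "\u00e9", "\u670d", "\U0001F600"]


def utf8_length(ch: str) -> int:
    """Return how many bytes the UTF-8 sequence for one character needs."""
    return len(ch.encode("utf-8"))


for ch in samples:
    # ord() gives the Unicode value (code point) of the character.
    print(f"U+{ord(ch):04X} needs {utf8_length(ch)} byte(s) in UTF-8")
```

The same Unicode value always maps to the same UTF-8 sequence, so the byte count depends only on how large the value is.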
There are 3 types of encoding in Unicode:
- UTF-8
- UTF-16
- UTF-32
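As a rough comparison of the three encodings, the sketch below encodes one character (U+670D, the character used later in this post) with each of them and prints the resulting bytes. The helper name encoded_hex is hypothetical, and the big-endian codec variants are used so that Python does not prepend a byte order mark.

```python
def encoded_hex(ch: str, encoding: str) -> str:
    """Encode one character and return its bytes as space-separated hex."""
    return " ".join(f"{b:02x}" for b in ch.encode(encoding))


ch = "\u670d"  # the character U+670D

# "utf-16-be" and "utf-32-be" are the big-endian codecs without a BOM.
for encoding in ("utf-8", "utf-16-be", "utf-32-be"):
    print(f"{encoding:10s} {encoded_hex(ch, encoding)}")
```

For this character the three encodings need 3, 2 and 4 bytes respectively.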
UTF-8
UTF-8 uses byte sequences of one to four bytes to represent the entire Unicode codespace. The number of bytes required depends on the range in which a code point lies.
Here is the UTF-8 mapping for Unicode code points:
00000000 -- 0000007F: 0xxxxxxx
00000080 -- 000007FF: 110xxxxx 10xxxxxx
00000800 -- 0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
00010000 -- 0010FFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
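The mapping above can be turned into code fairly directly. The following is a simplified sketch, not a production encoder: the function name encode_utf8 is my own, and it does not reject the surrogate range D800 to DFFF, which a strict encoder would.

```python
def encode_utf8(cp: int) -> bytes:
    """Encode one Unicode value as UTF-8, following the range table."""
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError("value is outside the Unicode codespace")
    if cp <= 0x7F:
        # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:
        # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


print(encode_utf8(0x670D).hex())  # e69c8d
```

Each continuation byte carries 6 bits of the value, which is why the shifts go down in steps of 6.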
Let's take an example and see how to extract the Unicode value. Consider a snippet from a UTF-8 encoded string that begins with the bytes E6 9C 8D.
Bytes whose hex value starts with "E" have the bit pattern 1110xxxx, so they are lead bytes for the Unicode range "00000800 -- 0000FFFF". That means any character in this range takes 3 bytes when converted to UTF-8.
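The observation about the leading "E" generalizes: the top bits of the first byte tell you how long the sequence is. Here is a small sketch of that check; the function name sequence_length is hypothetical.

```python
def sequence_length(lead: int) -> int:
    """Return the UTF-8 sequence length implied by a lead byte."""
    if (lead & 0x80) == 0x00:  # 0xxxxxxx
        return 1
    if (lead & 0xE0) == 0xC0:  # 110xxxxx
        return 2
    if (lead & 0xF0) == 0xE0:  # 1110xxxx, hex value starts with "E"
        return 3
    if (lead & 0xF8) == 0xF0:  # 11110xxx
        return 4
    # 10xxxxxx is a continuation byte, never the start of a sequence.
    raise ValueError("not a valid UTF-8 lead byte")


print(sequence_length(0xE6))  # 3
```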
- Let's take the first character and apply the mask:
UTF-8 mask:                  1110 xxxx  10xx xxxx  10xx xxxx
Bytes (hex):                 E6         9C         8D
Bytes (binary):              1110 0110  1001 1100  1000 1101
Extracted Unicode sequence:  0110       01 1100    00 1101
Regrouped into 4-bit groups: 0110 0111 0000 1101 = 670D
- U+670D is 服 http://unicode.scarfboy.com/?s=U%2B670D
- One more thing I would like to point out: a file encoded in UTF-8 may begin with a Byte Order Mark (BOM) as its first few bytes. The BOM is optional in UTF-8, but when present it is 0xEF,0xBB,0xBF. An application can read these bytes to help identify which encoding the file uses. (This is one way Notepad++ is able to show the Chinese characters correctly.) https://en.wikipedia.org/wiki/Byte_order_mark
- UTF-16 is a similar encoding, and you can read more about it here: https://en.wikipedia.org/wiki/UTF-16
- http://docs.oracle.com/cd/E23943_01/bi.1111/b32121/pbr_nls003.htm#RSPUB23729 (encoding details)
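The mask walkthrough for E6 9C 8D can also be sketched in code. This is a simplified decoder under my own naming (decode_utf8_char is hypothetical): it checks the sequence length and the 10xxxxxx continuation pattern, but it does not reject overlong encodings the way a strict decoder would.

```python
def decode_utf8_char(seq: bytes) -> int:
    """Decode one complete UTF-8 sequence into its Unicode value."""
    lead = seq[0]
    if lead < 0x80:                 # 0xxxxxxx
        length, cp = 1, lead
    elif (lead & 0xE0) == 0xC0:     # 110xxxxx
        length, cp = 2, lead & 0x1F
    elif (lead & 0xF0) == 0xE0:     # 1110xxxx
        length, cp = 3, lead & 0x0F
    elif (lead & 0xF8) == 0xF0:     # 11110xxx
        length, cp = 4, lead & 0x07
    else:
        raise ValueError("invalid lead byte")
    if len(seq) != length:
        raise ValueError("wrong number of bytes for this lead byte")
    for b in seq[1:]:
        if (b & 0xC0) != 0x80:      # continuation bytes must be 10xxxxxx
            raise ValueError("invalid continuation byte")
        cp = (cp << 6) | (b & 0x3F)  # append the 6 payload bits
    return cp


print(f"U+{decode_utf8_char(bytes([0xE6, 0x9C, 0x8D])):04X}")  # U+670D
print(f"U+{decode_utf8_char(bytes([0xEF, 0xBB, 0xBF])):04X}")  # U+FEFF
```

The second call shows that the BOM bytes 0xEF,0xBB,0xBF are themselves an ordinary 3-byte UTF-8 sequence, for the value U+FEFF.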