Unicode and UTF-8 Encoding

Unicode assigns a unique number (a code point) to every character, regardless of platform, program, or language. The Unicode codespace contains 1,114,112 code points, from 0x000000 to 0x10FFFF, though only a portion of them are currently assigned to characters. (The idea that Unicode is a 16-bit encoding is wrong.) For maximum compatibility, individual Unicode values are usually passed around as 32-bit integers (4 bytes per character), even though this is more than necessary.

Because storing four bytes per character is generally considered wasteful, a variety of more compact representations have sprung up for Unicode characters. The most interesting one for C programmers is UTF-8. UTF-8 is a "multi-byte" encoding scheme, meaning that it uses a variable number of bytes to represent a single Unicode value. Given a so-called "UTF-8 sequence", you can convert it to a Unicode value that refers to a character. (Source: http://www.cprogramming.com/tutorial/unicode.html)

Unicode defines three encoding forms:

UTF-8
UTF-16
UTF-32
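To make the difference between the three forms concrete, here is a small Python sketch that counts how many bytes each one needs for the same character. The helper name encoded_sizes and the sample characters are illustrative choices, not part of any standard API; only U+670D comes from the example later in this post.

```python
# Compare how many bytes each encoding form needs for one character.
# encoded_sizes is a hypothetical helper name chosen for this sketch.

def encoded_sizes(ch):
    """Return (utf8, utf16, utf32) byte counts for a single character."""
    # The "-be" codec variants are used so that Python does not prepend
    # a byte order mark, which would inflate the counts.
    return (
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-be")),
        len(ch.encode("utf-32-be")),
    )

# Sample characters (assumed for illustration): one from each UTF-8 length class.
samples = ["A", "\u00e9", "\u670d", "\U0001F600"]

for ch in samples:
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```

UTF-32 always needs four bytes, UTF-16 needs two or four, and UTF-8 needs anywhere from one to four depending on the code point.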

UTF-8

UTF-8 uses byte sequences of one to four bytes to represent the entire Unicode codespace. The number of bytes required depends upon the range in which a codepoint lies.

Here is the UTF-8 mapping for Unicode code points (ranges are in hexadecimal; the x bits carry the bits of the code point):

00000000 -- 0000007F: 0xxxxxxx

00000080 -- 000007FF: 110xxxxx 10xxxxxx

00000800 -- 0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx

00010000 -- 0010FFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

(The four-byte pattern has room for values up to 001FFFFF, but Unicode stops at 0010FFFF.)
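The mapping above can be turned into code fairly directly. The following is a minimal Python sketch, not a production encoder; the function name utf8_encode is made up for this post, and in real code you would simply call str.encode("utf-8").

```python
def utf8_encode(cp):
    """Encode one Unicode code point (an int) into UTF-8 bytes.

    Hypothetical helper for illustration; it follows the range table
    row by row.
    """
    # Reject values outside the codespace, and the surrogate range
    # D800-DFFF, which UTF-8 is not allowed to encode.
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp <= 0x7F:        # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:       # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:      # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


# Cross-check against Python's built-in encoder.
assert utf8_encode(0x670D) == "\u670d".encode("utf-8")
print(utf8_encode(0x670D).hex(" "))  # prints: e6 9c 8d
```

Each branch puts the fixed marker bits in place with | and fills the x positions six bits at a time, starting from the low end of the code point.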

Let's take an example and see how to extract the Unicode value. The following is a snippet from a UTF-8 encoded string:

In the above example, the bytes whose first hex digit is "E" (binary 1110) are lead bytes for the Unicode range "00000800 -- 0000FFFF", which means every character in this range takes 3 bytes when converted to UTF-8.
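The same reasoning works for any lead byte: its top bits tell you how long the sequence is. Here is a rough Python sketch of that check. The name utf8_sequence_length is hypothetical, and the function only looks at the bit pattern; it does not reject the few lead byte values (such as 0xC0 or 0xF5) that a strict decoder would refuse.

```python
def utf8_sequence_length(lead):
    """Return the sequence length implied by a lead byte (an int 0-255).

    Hypothetical helper for illustration; it classifies by bit pattern only.
    """
    if (lead & 0x80) == 0x00:   # 0xxxxxxx -> single byte (ASCII)
        return 1
    if (lead & 0xE0) == 0xC0:   # 110xxxxx -> two bytes
        return 2
    if (lead & 0xF0) == 0xE0:   # 1110xxxx -> three bytes (hex digit "E")
        return 3
    if (lead & 0xF8) == 0xF0:   # 11110xxx -> four bytes
        return 4
    # 10xxxxxx is a continuation byte and cannot start a sequence.
    raise ValueError(f"not a valid lead byte: {lead:#04x}")


print(utf8_sequence_length(0xE6))  # prints: 3
```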

Let's take the first character and apply the mask:

UTF-8 Mask	1110 xxxx 10xx xxxx 10xx xxxx
E6 9C 8D	1110 0110 1001 1100 1000 1101
Extracted Unicode Sequence	0110 011100 001101 = 0110 0111 0000 1101 = 670D

U+670D is 服 http://unicode.scarfboy.com/?s=U%2B670D
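
The mask-and-shift steps for E6 9C 8D can also be written out in code. This is a sketch in Python under the assumption that the input really is a three-byte sequence; decode_three_byte is a made-up name, and a real program would normally just call bytes.decode("utf-8").

```python
def decode_three_byte(b0, b1, b2):
    """Decode one 3-byte UTF-8 sequence into a code point (an int).

    Hypothetical helper for illustration; it handles only the
    1110xxxx 10xxxxxx 10xxxxxx pattern.
    """
    # Check the marker bits before trusting the payload bits.
    if (b0 & 0xF0) != 0xE0 or (b1 & 0xC0) != 0x80 or (b2 & 0xC0) != 0x80:
        raise ValueError("not a three-byte UTF-8 sequence")
    # Keep 4 payload bits from the lead byte and 6 from each
    # continuation byte, then join them into one number.
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


cp = decode_three_byte(0xE6, 0x9C, 0x8D)
# Cross-check against Python's built-in decoder.
assert chr(cp) == bytes([0xE6, 0x9C, 0x8D]).decode("utf-8")
print(f"U+{cp:04X}")  # prints: U+670D
```
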
One more thing worth pointing out: a UTF-8 encoded file may begin with a byte order mark (BOM). For UTF-8 the BOM is 0xEF, 0xBB, 0xBF. The BOM is optional, and many UTF-8 files omit it, but when it is present an application can read these bytes to identify the encoding the file uses. (This is one way Notepad++ is able to show the Chinese characters correctly.) https://en.wikipedia.org/wiki/Byte_order_mark
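
A program can check for those three bytes before decoding. The following Python sketch shows one way to do it; strip_utf8_bom is a hypothetical helper name, while codecs.BOM_UTF8 and the "utf-8-sig" codec are part of the Python standard library.

```python
import codecs


def strip_utf8_bom(data):
    """Return (had_bom, payload) for a bytes object.

    Hypothetical helper for illustration.
    """
    # codecs.BOM_UTF8 is the byte string EF BB BF.
    if data.startswith(codecs.BOM_UTF8):
        return True, data[len(codecs.BOM_UTF8):]
    return False, data


raw = b"\xef\xbb\xbf\xe6\x9c\x8d"   # BOM followed by the bytes for U+670D
had_bom, payload = strip_utf8_bom(raw)
print(had_bom, payload.hex(" "))    # prints: True e6 9c 8d
```

If all you need is the decoded text, decoding with the "utf-8-sig" codec drops a leading BOM when there is one and behaves like plain UTF-8 otherwise.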

UTF-16 is a similar variable-length encoding that uses one or two 16-bit units per character; you can read more about it here: https://en.wikipedia.org/wiki/UTF-16
http://docs.oracle.com/cd/E23943_01/bi.1111/b32121/pbr_nls003.htm#RSPUB23729 (encoding details)
