
Unicode and UTF-8 Encoding


Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language. The Unicode codespace contains 1,114,112 code points, from 0x000000 to 0x10FFFF; not all of them are assigned to characters. (The idea that Unicode is a 16-bit encoding is completely wrong.) For maximum compatibility, individual Unicode values are usually passed around as 32-bit integers (4 bytes per character), even though only 21 bits are needed. Storing four bytes per character is widely considered wasteful, so a variety of more compact representations have sprung up. The most interesting one for C programmers is UTF-8. UTF-8 is a "multi-byte" encoding scheme, meaning that it uses a variable number of bytes to represent a single Unicode value. Given a so-called "UTF-8 sequence", you can convert it back to the Unicode value that refers to a character. Source: http://www.cprogramming.com/tutorial/unicode.html

Unicode defines three encoding forms:
  1. UTF-8
  2. UTF-16
  3. UTF-32
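
To make the difference between the three forms concrete, here is a small sketch that counts how many bytes each one needs for a few sample characters. It relies only on Python's built-in codecs; the helper name `encoded_sizes` and the choice of sample characters are illustrative and not part of the original post.

```python
# Sample characters: "A", "e with acute accent", U+670D (the character decoded
# later in this post) and an emoji outside the Basic Multilingual Plane.
samples = ["A", "\u00e9", "\u670d", "\U0001f600"]

def encoded_sizes(ch):
    """Return (utf8, utf16, utf32) byte counts for a single character."""
    # The "-be" codec variants are used so that no byte order mark is prepended.
    return tuple(len(ch.encode(enc)) for enc in ("utf-8", "utf-16-be", "utf-32-be"))

for ch in samples:
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```

UTF-32 always uses four bytes, UTF-16 uses two or four, and UTF-8 uses one to four depending on the code point.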

UTF-8
UTF-8 uses byte sequences of one to four bytes to represent the entire Unicode codespace. The number of bytes required depends on the range in which a code point lies.
Here is the UTF-8 mapping for Unicode code points (ranges in hexadecimal, byte patterns in binary):

00000000 -- 0000007F:         0xxxxxxx
00000080 -- 000007FF:         110xxxxx 10xxxxxx
00000800 -- 0000FFFF:         1110xxxx 10xxxxxx 10xxxxxx
00010000 -- 0010FFFF:         11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
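
The mapping can be turned into code fairly directly. The following is a minimal sketch of an encoder, assuming the input is a single integer code point; the function name `utf8_encode` is hypothetical. It stops at U+10FFFF, the top of the Unicode codespace, and rejects surrogate values, since neither may be encoded in well-formed UTF-8.

```python
def utf8_encode(cp):
    """Encode one Unicode code point as UTF-8 bytes, one branch per table row."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a valid Unicode scalar value")
    if cp <= 0x7F:
        # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:
        # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

print(utf8_encode(0x670D).hex())
```

In real programs the built-in `str.encode("utf-8")` does the same job; the sketch only shows where each bit of the code point ends up.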

Let's take an example and see how to extract the Unicode value. The following is a snippet from a UTF-8 encoded string:




If you look at the example above, the bytes whose first hex digit is "E" are lead bytes of the form 1110xxxx. They belong to the Unicode range "00000800 -- 0000FFFF", which means each character in this range takes 3 bytes when encoded as UTF-8.
  1. Let's take the first character and apply the mask.
UTF-8 mask:
1110 xxxx   10xx xxxx   10xx xxxx
Bytes (hex):
E6          9C          8D
Bytes (binary):
1110 0110   1001 1100   1000 1101
Extracted Unicode value (the bits in the x positions):
0110   011100   001101  =  0110 0111 0000 1101  =  670D
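
The same masking and shifting can be sketched in code. This is an illustrative decoder for a single sequence, assuming the input is a `bytes` object; the name `utf8_decode_one` is hypothetical. It checks lead and continuation bytes but, to stay short, does not reject overlong encodings or surrogate values as a strict decoder would.

```python
def utf8_decode_one(data):
    """Decode the first UTF-8 sequence in `data`; return (code_point, length)."""
    lead = data[0]
    if lead < 0x80:                # 0xxxxxxx: plain ASCII
        return lead, 1
    if lead >> 5 == 0b110:         # 110xxxxx: 2-byte sequence
        length, cp = 2, lead & 0x1F
    elif lead >> 4 == 0b1110:      # 1110xxxx: 3-byte sequence
        length, cp = 3, lead & 0x0F
    elif lead >> 3 == 0b11110:     # 11110xxx: 4-byte sequence
        length, cp = 4, lead & 0x07
    else:
        raise ValueError("invalid lead byte")
    if len(data) < length:
        raise ValueError("truncated sequence")
    for b in data[1:length]:
        if b >> 6 != 0b10:         # every continuation byte must be 10xxxxxx
            raise ValueError("invalid continuation byte")
        cp = (cp << 6) | (b & 0x3F)  # append the 6 payload bits
    return cp, length

cp, n = utf8_decode_one(bytes([0xE6, 0x9C, 0x8D]))
print(f"U+{cp:04X} from {n} bytes")
```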

  1. U+670D is 服: http://unicode.scarfboy.com/?s=U%2B670D
  2. One more thing I would like to point out is that a file encoded with UTF-8 may begin with a Byte Order Mark (BOM). For UTF-8 the BOM is the three bytes 0xEF, 0xBB, 0xBF. It is optional, and many UTF-8 files are written without it, but when it is present an application can read these bytes to identify which encoding the file uses. (This is one way Notepad++ is able to show the Chinese characters correctly.) https://en.wikipedia.org/wiki/Byte_order_mark
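
As a rough illustration of the BOM check, the sketch below tests whether a byte string starts with the UTF-8 BOM and strips it if so. The helper name `strip_utf8_bom` is hypothetical; `codecs.BOM_UTF8` is the standard-library constant for the bytes 0xEF, 0xBB, 0xBF.

```python
import codecs

def strip_utf8_bom(data):
    """Return (has_bom, payload) for bytes that may start with a UTF-8 BOM."""
    if data.startswith(codecs.BOM_UTF8):
        return True, data[len(codecs.BOM_UTF8):]
    # No BOM: the data may still be UTF-8, since the mark is optional.
    return False, data

has_bom, payload = strip_utf8_bom(b"\xef\xbb\xbf\xe6\x9c\x8d")
print(has_bom, payload.decode("utf-8"))
```

When decoding text, Python's `utf-8-sig` codec handles this automatically: it skips a leading BOM if one is present and otherwise behaves like plain `utf-8`.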

  1. UTF-16 is a similar variable-length encoding; you can read more about it here: https://en.wikipedia.org/wiki/UTF-16
  2. http://docs.oracle.com/cd/E23943_01/bi.1111/b32121/pbr_nls003.htm#RSPUB23729  (encoding details)
