
Unicode and UTF-8 Encoding


Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language. The Unicode codespace has 1,114,112 code points, from 0x000000 to 0x10FFFF, although not all of them are assigned to characters. (The idea that Unicode is a 16-bit encoding is completely wrong.) For simplicity, individual Unicode values are often passed around as 32-bit integers (4 bytes per character), even though this is more than necessary. Because storing four bytes per character is wasteful for most text, a variety of more compact representations have sprung up for Unicode characters. The most interesting one for C programmers is called UTF-8. UTF-8 is a "multi-byte" encoding scheme, meaning that it uses a variable number of bytes to represent a single Unicode value. Given a so-called "UTF-8 sequence", you can convert it to a Unicode value that refers to a character. Source: http://www.cprogramming.com/tutorial/unicode.html

Unicode defines three encoding forms:
  1. UTF-8
  2. UTF-16
  3. UTF-32
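To make the difference between the three forms concrete, here is a small sketch that counts how many bytes one character needs in each of them. The helper name `encoded_sizes` and the sample characters are my own choices for illustration. The little-endian codec variants are used so that Python does not prepend a byte order mark to the result.

```python
def encoded_sizes(text):
    """Return the byte length of text in UTF-8, UTF-16 and UTF-32 (no BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }

# One sample character from each UTF-8 length class (assumed examples).
samples = ["A", "\u00e9", "\u670d", "\U0001F600"]

for ch in samples:
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```

UTF-32 always takes 4 bytes, UTF-16 takes 2 or 4, and UTF-8 takes anywhere from 1 to 4 depending on the code point.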

UTF-8
UTF-8 uses byte sequences of one to four bytes to represent the entire Unicode codespace. The number of bytes required depends upon the range in which a codepoint lies.
Here is the UTF-8 mapping for Unicode code points (ranges are in hexadecimal; each x is a bit taken from the code point):

00000000 -- 0000007F:         0xxxxxxx
00000080 -- 000007FF:         110xxxxx 10xxxxxx
00000800 -- 0000FFFF:         1110xxxx 10xxxxxx 10xxxxxx
00010000 -- 0010FFFF:         11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
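The mapping above can be turned into code almost line by line. The following is a minimal sketch, not a production encoder. The function name `utf8_encode` is hypothetical, and I assume the input should be limited to U+10FFFF, the top of the Unicode codespace, with the UTF-16 surrogate range excluded.

```python
def utf8_encode(cp):
    """Encode one Unicode code point as UTF-8 bytes, following the mapping table."""
    # Assumption: reject values outside the codespace and UTF-16 surrogates.
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a valid Unicode scalar value")
    if cp <= 0x7F:
        # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:
        # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])

print(utf8_encode(0x670D).hex(" "))  # e6 9c 8d
```

A quick sanity check is to compare the result with Python's built-in encoder, for example `utf8_encode(0x670D) == chr(0x670D).encode("utf-8")`.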

Let's take an example and see how to extract the Unicode value. The following is a snippet from a UTF-8 encoded string:




If you look at the above example, the bytes whose first hex digit is "E" are lead bytes of the form 1110xxxx. They mark code points in the Unicode range "00000800 -- 0000FFFF", and every character in this range takes 3 bytes when converted to UTF-8.
  1. Let's take the first character and apply the mask.
UTF-8 mask
1110 xxxx   10xx xxxx   10xx xxxx
Bytes (hex)
E6          9C          8D
Bytes (binary)
1110 0110   1001 1100   1000 1101
Extracted Unicode value (the x bits, concatenated)
0110   01 1100   00 1101  =  0110 0111 0000 1101  =  670D
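The same masking steps can be sketched in code. This is an illustrative decoder for the first sequence in a byte string, and the name `utf8_decode_first` is my own. It checks the lead and continuation byte patterns but, as a simplification, does not reject overlong encodings or surrogate values the way a strict decoder would.

```python
def utf8_decode_first(data):
    """Decode the first UTF-8 sequence in data; return (code_point, length)."""
    lead = data[0]
    if lead < 0x80:
        # 0xxxxxxx: a single byte carries the whole value
        return lead, 1
    if lead >> 5 == 0b110:
        length, cp = 2, lead & 0x1F   # 110xxxxx keeps 5 bits
    elif lead >> 4 == 0b1110:
        length, cp = 3, lead & 0x0F   # 1110xxxx keeps 4 bits
    elif lead >> 3 == 0b11110:
        length, cp = 4, lead & 0x07   # 11110xxx keeps 3 bits
    else:
        raise ValueError("invalid lead byte")
    if len(data) < length:
        raise ValueError("truncated sequence")
    for b in data[1:length]:
        if b >> 6 != 0b10:
            raise ValueError("invalid continuation byte")
        # Each 10xxxxxx byte contributes its low 6 bits.
        cp = (cp << 6) | (b & 0x3F)
    return cp, length

cp, length = utf8_decode_first(bytes([0xE6, 0x9C, 0x8D]))
print(f"U+{cp:04X}", length)  # U+670D 3
```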

  1. U+670D is 服: http://unicode.scarfboy.com/?s=U%2B670D
  2. One more thing I would like to point out: a file encoded with UTF-8 may start with a Byte Order Mark (BOM). For UTF-8 the BOM is 0xEF,0xBB,0xBF. It is optional, and many tools do not write it, but when it is present an application can read these bytes to identify which encoding the file uses. (This is one way Notepad++ can show the Chinese characters correctly.)  https://en.wikipedia.org/wiki/Byte_order_mark
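As a rough sketch of the BOM check, the helper below looks at the first three bytes of some data and strips them if they are 0xEF,0xBB,0xBF. The function name `strip_utf8_bom` is hypothetical; `codecs.BOM_UTF8` is the standard-library constant for those three bytes.

```python
import codecs

def strip_utf8_bom(data):
    """Return (had_bom, payload) for a bytes object."""
    if data.startswith(codecs.BOM_UTF8):
        # Drop the three BOM bytes and keep the rest of the data.
        return True, data[len(codecs.BOM_UTF8):]
    return False, data

had_bom, payload = strip_utf8_bom(b"\xef\xbb\xbf\xe6\x9c\x8d")
print(had_bom, payload.decode("utf-8") == "\u670d")  # True True
```

When reading text in Python, the `utf-8-sig` codec does the same job: it decodes UTF-8 and silently skips a leading BOM if one is present.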

  1. UTF-16 is a similar variable-length encoding; you can read more about it here: https://en.wikipedia.org/wiki/UTF-16
  2. http://docs.oracle.com/cd/E23943_01/bi.1111/b32121/pbr_nls003.htm#RSPUB23729  (encoding details)
