
Unicode and UTF-8 Encoding


Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language. The Unicode codespace contains 1,114,112 code points, from 0x000000 to 0x10FFFF, although only a fraction of them are assigned to characters. (The idea that Unicode is a 16-bit encoding is completely wrong.) For convenience, individual Unicode values are usually passed around as 32-bit integers (4 bytes per character), even though 21 bits are enough. Storing four bytes per character is widely considered wasteful, so several more compact representations have sprung up. The most interesting one for C programmers is UTF-8. UTF-8 is a "multi-byte" encoding scheme, meaning that it uses a variable number of bytes to represent a single Unicode value. Given a so-called "UTF-8 sequence", you can convert it to a Unicode value that refers to a character. (Source: http://www.cprogramming.com/tutorial/unicode.html)

Unicode defines three encoding forms:
  1. UTF-8
  2. UTF-16
  3. UTF-32
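
To get a feel for how the three forms differ in size, here is a small sketch that encodes a few sample characters with Python's built-in codecs. The sample characters and the helper name `encoded_sizes` are only illustrative choices; the "-le" codec variants are used so that Python does not prepend a byte order mark to the output.

```python
# Compare how many bytes the same character needs in each encoding form.
# The sample characters below are arbitrary picks, one per UTF-8 length.
samples = ["A", "\u00e9", "\u670d", "\U0001F600"]


def encoded_sizes(text):
    """Return the byte length of text in UTF-8, UTF-16 and UTF-32."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),  # "-le": no BOM prepended
        "utf-32": len(text.encode("utf-32-le")),
    }


for ch in samples:
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```

UTF-32 always uses four bytes per code point, UTF-16 uses two or four, and UTF-8 uses one to four.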

UTF-8
UTF-8 uses byte sequences of one to four bytes to represent the entire Unicode codespace. The number of bytes required depends on the range in which a code point lies.
Here is the UTF-8 mapping for Unicode code points:

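Code point range (hex)        UTF-8 byte sequence (binary)
00000000 -- 0000007F:         0xxxxxxx
00000080 -- 000007FF:         110xxxxx 10xxxxxx
00000800 -- 0000FFFF:         1110xxxx 10xxxxxx 10xxxxxx
00010000 -- 0010FFFF:         11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

The four-byte pattern has room for values up to 001FFFFF, but Unicode ends at 0010FFFF, so larger values are not valid UTF-8.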
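
The table above can be turned into code almost line for line. The following is a minimal sketch of an encoder, not a replacement for a real library; the function name `utf8_encode` is made up for this example. It rejects values above 0x10FFFF and the surrogate range D800 -- DFFF, which are not valid in UTF-8.

```python
def utf8_encode(cp):
    """Encode one Unicode code point as UTF-8 bytes (illustrative sketch)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp <= 0x7F:
        # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:
        # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


# Cross-check the sketch against Python's built-in encoder.
for cp in (0x41, 0xE9, 0x670D, 0x1F600):
    assert utf8_encode(cp) == chr(cp).encode("utf-8")
```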

Let's take an example and see how to extract the Unicode value from a UTF-8 encoded string.

In UTF-8 data, a byte whose first hex digit is "E" (binary 1110) is the lead byte of a character in the range "00000800 -- 0000FFFF", which means the character takes 3 bytes when converted to UTF-8.
  1. Let's take the first character and apply the mask.
UTF-8 mask:
1110 xxxx   10xx xxxx   10xx xxxx
Bytes: E6 9C 8D
1110 0110   1001 1100   1000 1101
Extracted Unicode value (the bits in the x positions):
0110   01 1100   00 1101  =  0110 0111 0000 1101  =  670D
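
The same extraction can be written as a few shifts and masks. This is a minimal sketch with a made-up helper name, `utf8_decode_first`; it checks the lead and continuation byte patterns but, unlike a production decoder, does not reject overlong forms or surrogates.

```python
def utf8_decode_first(data):
    """Decode the first UTF-8 sequence in data; return (code_point, length)."""
    lead = data[0]
    if lead < 0x80:                # 0xxxxxxx
        return lead, 1
    if lead >> 5 == 0b110:         # 110xxxxx
        length, cp = 2, lead & 0x1F
    elif lead >> 4 == 0b1110:      # 1110xxxx
        length, cp = 3, lead & 0x0F
    elif lead >> 3 == 0b11110:     # 11110xxx
        length, cp = 4, lead & 0x07
    else:
        raise ValueError("invalid lead byte")
    if len(data) < length:
        raise ValueError("truncated sequence")
    for b in data[1:length]:
        if b >> 6 != 0b10:         # continuation bytes look like 10xxxxxx
            raise ValueError("invalid continuation byte")
        cp = (cp << 6) | (b & 0x3F)  # append the 6 payload bits
    return cp, length


cp, length = utf8_decode_first(bytes([0xE6, 0x9C, 0x8D]))
print(f"U+{cp:04X}, {length} bytes")  # prints "U+670D, 3 bytes"
```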

  1. U+670D is 服: http://unicode.scarfboy.com/?s=U%2B670D
  2. One more thing worth pointing out: a UTF-8 encoded file may begin with a Byte Order Mark (BOM), which for UTF-8 is the three bytes 0xEF, 0xBB, 0xBF. The BOM is optional in UTF-8 and many files do not have one, but when it is present an application can read these bytes to identify the encoding. (This is one of the ways Notepad++ is able to show the Chinese characters correctly.) https://en.wikipedia.org/wiki/Byte_order_mark
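
Checking for the BOM takes only a prefix comparison. The sketch below uses the `codecs.BOM_UTF8` constant from Python's standard library; the helper name `strip_utf8_bom` is made up for this example. The absence of a BOM does not mean the data is not UTF-8, so this is a hint rather than a full encoding detector.

```python
import codecs


def strip_utf8_bom(data):
    """Return (had_bom, payload) for a bytes object read from a file."""
    if data.startswith(codecs.BOM_UTF8):      # b"\xef\xbb\xbf"
        return True, data[len(codecs.BOM_UTF8):]
    return False, data


raw = b"\xef\xbb\xbf\xe6\x9c\x8d"             # BOM followed by U+670D
had_bom, payload = strip_utf8_bom(raw)
text = payload.decode("utf-8")

# Python's "utf-8-sig" codec does the same thing while decoding:
# it skips a leading BOM if there is one.
assert raw.decode("utf-8-sig") == text
```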

  1. UTF-16 is a similar variable-length encoding; you can read more about it here: https://en.wikipedia.org/wiki/UTF-16
  2. Encoding details: http://docs.oracle.com/cd/E23943_01/bi.1111/b32121/pbr_nls003.htm#RSPUB23729
