

Showing posts from March, 2017

Unicode and UTF-8 Encoding

Unicode provides a unique number for every character, no matter what the platform, the program, or the language. Unicode officially encodes 1,114,112 code points, from 0x000000 to 0x10FFFF, so the common notion that Unicode is a 16-bit encoding is simply wrong. For maximum compatibility, individual Unicode values are often passed around as 32-bit integers (4 bytes per character), even though this is more than necessary. Because storing four bytes per character is wasteful, a variety of more compact representations have sprung up. The most interesting one for C programmers is UTF-8, a "multi-byte" encoding scheme, meaning it uses a variable number of bytes (one to four) to represent a single Unicode value. Given a so-called "UTF-8 sequence", you can convert it back to the Unicode value that identifies a character. There are three standard Unicode encoding forms: UTF-8, UTF-16, and UTF-32.

Source: http://www.cprogramming.com/tutorial/unicode.html
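To make the variable-length scheme concrete, here is a minimal sketch in C (my own illustration, not from the original post) that encodes a single code point into its 1- to 4-byte UTF-8 sequence. The function name utf8_encode and the U+20AC test value are assumptions chosen for the example; a full implementation would also reject the surrogate range 0xD800..0xDFFF.

#include <stdio.h>
#include <stdint.h>

/* Encode one Unicode code point (0x000000..0x10FFFF) as UTF-8.
 * Writes up to 4 bytes into out and returns the byte count,
 * or 0 if the code point is out of range. */
static int utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp <= 0x7F) {                     /* 1 byte:  0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    } else if (cp <= 0x7FF) {             /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    } else if (cp <= 0xFFFF) {            /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    } else if (cp <= 0x10FFFF) {          /* 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0; /* beyond U+10FFFF: not a valid Unicode code point */
}

int main(void)
{
    unsigned char buf[4];
    int n = utf8_encode(0x20AC, buf);     /* U+20AC EURO SIGN */
    printf("U+20AC -> %d bytes:", n);
    for (int i = 0; i < n; i++)
        printf(" %02X", buf[i]);
    printf("\n");                         /* prints: E2 82 AC */
    return 0;
}

Running this prints "U+20AC -> 3 bytes: E2 82 AC", matching the 3-byte pattern that UTF-8 assigns to code points in the 0x0800..0xFFFF range.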

Base64 Encoding

The base-64 encoding converts a series of arbitrary bytes into a longer sequence of common text characters that are all legal header field values. It takes a sequence of 8-bit bytes, breaks the sequence into 6-bit pieces, and assigns each 6-bit piece to one of the 64 characters comprising the base-64 alphabet. Base 64-encoded strings are about 33% larger than the original values. For example, "Ow!" encodes to "T3ch":

1. The string "Ow!" is broken into three 8-bit bytes (0x4F, 0x77, 0x21).
2. The 3 bytes form the 24-bit binary value 010011110111011100100001.
3. These bits are segmented into the 6-bit groups 010011, 110111, 011100, 100001.
4. Each 6-bit group (19, 55, 28, 33 in decimal) indexes into the base-64 alphabet, yielding the characters 'T', '3', 'c', 'h'.
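The steps above translate directly into code. Below is a minimal base-64 encoder sketch in C (my own illustration, not from the post); the helper name base64_encode is hypothetical, and inputs whose length is not a multiple of 3 are handled with the standard '=' padding.

#include <stdio.h>
#include <stddef.h>

static const char b64[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/* Encode len input bytes as base-64 into out. The caller must provide
 * at least 4 * ((len + 2) / 3) + 1 bytes of room in out. */
static void base64_encode(const unsigned char *in, size_t len, char *out)
{
    size_t i = 0, j = 0;
    for (; i + 2 < len; i += 3) {            /* full 3-byte groups */
        unsigned v = (in[i] << 16) | (in[i+1] << 8) | in[i+2];
        out[j++] = b64[(v >> 18) & 0x3F];
        out[j++] = b64[(v >> 12) & 0x3F];
        out[j++] = b64[(v >> 6)  & 0x3F];
        out[j++] = b64[v & 0x3F];
    }
    if (len - i == 1) {                      /* 1 leftover byte -> "XX==" */
        unsigned v = in[i] << 16;
        out[j++] = b64[(v >> 18) & 0x3F];
        out[j++] = b64[(v >> 12) & 0x3F];
        out[j++] = '=';
        out[j++] = '=';
    } else if (len - i == 2) {               /* 2 leftover bytes -> "XXX=" */
        unsigned v = (in[i] << 16) | (in[i+1] << 8);
        out[j++] = b64[(v >> 18) & 0x3F];
        out[j++] = b64[(v >> 12) & 0x3F];
        out[j++] = b64[(v >> 6)  & 0x3F];
        out[j++] = '=';
    }
    out[j] = '\0';
}

int main(void)
{
    char out[16];
    base64_encode((const unsigned char *)"Ow!", 3, out);
    printf("%s\n", out);                     /* prints: T3ch */
    return 0;
}

Compiling and running this prints T3ch, matching the worked example above.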