UTF-8 encoding

You will often hear that a file is "UTF-8 encoded", and occasionally you may want to check whether a given file really is. In a UTF-8 encoded text file, every character is encoded in UTF-8. ASCII characters are encoded exactly as they are in an ASCII or ANSI-encoded text file, occupying one byte each. Non-ASCII characters are encoded using 2 to 4 bytes. Interestingly, you can tell from a byte alone whether it is an ASCII character or part of a non-ASCII character. If the value of the byte (treated as an unsigned integer) is less than 0x80, it is an ASCII character. If the binary representation of the byte begins with 110, i.e., 110xxxxx, the byte is the first byte of a non-ASCII character encoded with 2 bytes. If it begins with 1110, i.e., 1110xxxx, the byte is the first byte of a non-ASCII character encoded with 3 bytes. If it begins with 11110, i.e., 11110xxx, the byte is the first byte of a non-ASCII character encoded with 4 bytes. If it begins with 10, i.e., 10xxxxxx, the byte is a continuation byte, that is, any byte other than the first of a non-ASCII character.
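The byte rules above can be turned into a rough check. The following is a minimal Python sketch, not a definitive validator: the function names classify_byte and looks_like_utf8 are made up for this illustration, and the check only verifies the lead-byte and continuation-byte structure described above. It does not reject overlong encodings or other sequences that a strict decoder refuses.

```python
def classify_byte(b: int) -> str:
    """Classify a single byte (0-255) by its leading bits."""
    if b < 0x80:
        return "ascii"          # 0xxxxxxx
    if b >> 6 == 0b10:
        return "continuation"   # 10xxxxxx
    if b >> 5 == 0b110:
        return "lead2"          # 110xxxxx, starts a 2-byte character
    if b >> 4 == 0b1110:
        return "lead3"          # 1110xxxx, starts a 3-byte character
    if b >> 3 == 0b11110:
        return "lead4"          # 11110xxx, starts a 4-byte character
    return "invalid"            # 11111xxx never appears in UTF-8


def looks_like_utf8(data: bytes) -> bool:
    """Structural check only (hypothetical helper): every lead byte must be
    followed by the right number of continuation bytes."""
    pending = 0  # continuation bytes still expected for the current character
    for b in data:
        kind = classify_byte(b)
        if pending:
            if kind != "continuation":
                return False
            pending -= 1
        elif kind == "ascii":
            continue
        elif kind.startswith("lead"):
            pending = int(kind[-1]) - 1
        else:
            # a stray continuation byte or an invalid byte
            return False
    return pending == 0  # False if the data ends in the middle of a character
```

To test a file, read it in binary mode, for example open(path, "rb").read(), and pass the bytes to looks_like_utf8. For a strict answer, Python's built-in data.decode("utf-8") raises UnicodeDecodeError on any invalid input, so it is the more reliable choice in practice.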

Reference: http://blog.csdn.net/jiangqin115/article/details/42684017