unicode,ucs-2,ucs-4,utf-8,utf-16,utf-32,big endian,little endian

Of course I know what big endian and little endian are. The problem is that I can never remember which is which. The cause is that I misread the term “endian”. Here “end” does not mean “tail” but “head”, so “big endian” means the big end comes first while “little endian” means the little end comes first. When we write a number on paper, it is always “big endian”, i.e., the most significant (big) digit is at the beginning (head). The same goes for the memory layout of a number: if the most significant (big) byte is stored at the lowest (beginning) address, we call it big endian. If the least significant byte is stored at the lowest address, we call it little endian. So the “end” in “endian” is the start (lowest address) of the number in memory.
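To make this concrete, here is a small Python sketch that lays out the same 32-bit number in both byte orders. The value 0x12345678 is just an example I picked; index 0 of each result corresponds to the lowest memory address.

```python
import sys

value = 0x12345678  # example value, chosen only for illustration

# to_bytes() returns the bytes in address order: index 0 is the lowest address.
big = value.to_bytes(4, "big")        # most significant byte first
little = value.to_bytes(4, "little")  # least significant byte first

print(big.hex())     # 12345678
print(little.hex())  # 78563412

# Byte order used natively by the machine running this script.
print(sys.byteorder)
```

The big-endian output reads exactly like the number written on paper, which is the easiest way I know to remember it.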

Unicode assigns every character a universal number called a code point. UCS-2 covers only the code points from 0 to 0xFFFF, while UCS-4 covers the code points above 0xFFFF as well. UCS-4 was defined as a 31-bit space (up to 0x7FFFFFFF), but Unicode itself stops at 0x10FFFF.
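A quick way to look at code points is Python's built-in `ord` and `chr`. The helper `fits_in_ucs2` below is a name I made up for this sketch; it only checks whether a character's code point is at most 0xFFFF.

```python
def fits_in_ucs2(ch: str) -> bool:
    """Return True if the character's code point is at most 0xFFFF."""
    return ord(ch) <= 0xFFFF

# "A" (U+0041), a CJK character (U+4E2D) and an emoji (U+1F64B).
for ch in ("A", "\u4e2d", chr(0x1F64B)):
    print(f"U+{ord(ch):04X}", fits_in_ucs2(ch))
```

Only the last character falls outside the range UCS-2 can represent.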

UTF-8, UTF-16 and UTF-32 are methods to store or transfer code points. UTF-16BE uses 2 bytes, in big-endian order, to store a code point up to 0xFFFF. If the code point is above 0xFFFF, UTF-16BE stores it in 4 bytes as a pair of surrogates. For example, code point 0x1F64B is stored as “D83D DE4B”. Similarly, UTF-16LE stores each 2-byte unit in little-endian order, so the same code point becomes “3DD8 4BDE”.

UTF-16 uses a BOM (byte order mark) at the start of the data to indicate whether it is stored as big endian or little endian. The BOM is the code point U+FEFF. If the data starts with the bytes FE FF, it is big endian. If it starts with FF FE, it is little endian. This is simply U+FEFF written in each byte order, but it also works as a memory aid if you read “endian” as “tail”: in FF FE the smaller byte FE is at the end, which means little endian, and in FE FF the bigger byte FF is at the end, which means big endian.

UTF-32 uses 4 bytes to store a code point, so it can store every Unicode code point directly without further encoding like the surrogates of UTF-16, i.e., 0x1F64B is stored as “0001F64B” in big-endian order. Like UTF-16BE and UTF-16LE, there are also UTF-32BE and UTF-32LE.

UTF-8 is more complicated. It uses 1 to 4 bytes to store a code point. For example, the code points of US-ASCII characters take only one byte each. Read my post for more about utf-8.
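The byte sequences are easy to check with Python's standard codecs. This is only a sketch using the code point 0x1F64B from the example above; the surrogate arithmetic is written out by hand to show where “D83D DE4B” comes from.

```python
import codecs

ch = chr(0x1F64B)

print(ch.encode("utf-16-be").hex())  # d83dde4b
print(ch.encode("utf-16-le").hex())  # 3dd84bde
print(ch.encode("utf-32-be").hex())  # 0001f64b
print(ch.encode("utf-32-le").hex())  # 4bf60100
print(ch.encode("utf-8").hex())      # f09f998b

# Surrogate pair by hand: subtract 0x10000, then split the remaining
# 20 bits into a high half and a low half of 10 bits each.
offset = ord(ch) - 0x10000
high = 0xD800 + (offset >> 10)
low = 0xDC00 + (offset & 0x3FF)
print(f"{high:04X} {low:04X}")  # D83D DE4B

# The BOM is U+FEFF serialized in the chosen byte order.
print(codecs.BOM_UTF16_BE.hex())  # feff
print(codecs.BOM_UTF16_LE.hex())  # fffe

# The generic "utf-16" decoder reads the BOM to pick the byte order
# and removes it from the result.
data = codecs.BOM_UTF16_BE + ch.encode("utf-16-be")
decoded = data.decode("utf-16")
print(decoded == ch)  # True
```

Note that the explicit `utf-16-be` and `utf-16-le` codecs do not write a BOM, which is why it is prepended by hand here.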
