Talk:Unicode

From Knowino

Overview of Unicode: "One of the storage schemes (UTF-16) stores the code points of the BMP (plane 0) unchanged in hexadecimal form." – Really? Unchanged, as 16-bit integers, or two bytes, or something like that, OK. However, does UTF-16 also specify their "printable" form (hexadecimal, octal, or even decimal, or whatever)? --Boris Tsirelson 14:44, 13 July 2011 (EDT)

I'm not sure I understand your question, so the following may not answer it:
1. The Unicode BMP defines a 16-bit number (code point) for each of the first 65536 characters. It is up to the "user agent" to render the code point as a glyph. Most user agents have more than one font at their disposal, so extra information (which can be set by default) is necessary to select the font and, with it, the glyph that is printed or shown on the screen.
2. While UTF-8 transforms this code point to yet another binary form (because of the ASCII legacy), UTF-16 ignores legacy and simply stores a BMP code point unchanged as 16 bits. If a user agent knows the file is in UTF-16, it can simply pick up the code point and render it without transforming it back first. If the file is in UTF-8, the stored form must first be transformed back to the code point before the glyph can be rendered.
Does this answer your question, and would it improve the article if I wove this answer somehow into the article?--Paul Wormer 02:55, 14 July 2011 (EDT)
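The difference described in point 2 can be illustrated with a minimal Python sketch. The character U+20AC (euro sign) is chosen here only as an example of a BMP code point, and the variable names are illustrative; big-endian byte order without a byte-order mark is assumed for UTF-16.

```python
# Compare how UTF-16 and UTF-8 store one BMP code point.
ch = "\u20ac"              # EURO SIGN, U+20AC (example character, an assumption)
code_point = ord(ch)       # the code point as an integer

utf16 = ch.encode("utf-16-be")   # big-endian UTF-16, no byte-order mark
utf8 = ch.encode("utf-8")

# UTF-16: the two stored bytes, read as one 16-bit integer, are the code point.
utf16_value = int.from_bytes(utf16, "big")

# UTF-8: three stored bytes; the payload bits must be extracted and
# reassembled (4 bits + 6 bits + 6 bits) to recover the code point.
b0, b1, b2 = utf8
utf8_value = ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)

print(utf16.hex(), utf8.hex())          # stored bytes, shown in hex for humans
print(utf16_value == code_point, utf8_value == code_point)
```

Code points outside the BMP are not covered by this sketch; UTF-16 stores those as two 16-bit units rather than unchanged.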
No, sorry, it does not. I agree with everything in your phrase quoted above, except for the word "hexadecimal". If we say, for instance, "octal" instead of "hexadecimal" in your phrase, will it mean the same? Is UTF-16 somehow related to hexadecimal more than (say) octal? --Boris Tsirelson 06:22, 14 July 2011 (EDT)
Hexadecimal is just a matter of hardware. Computer memory is accessed in bytes of 8 bits, and each byte divides into 2 groups of 4 bits. As you may know ;-) 2^4 = 16 and hence a hexadecimal (base 16) system is convenient. Older hardware, such as the CDC, had words of 60 bits, and then an octal system (groups of 3 bits) was more convenient. --Paul Wormer 06:46, 14 July 2011 (EDT)
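The grouping argument can be checked with a small Python sketch. The byte value and the all-ones 60-bit word are arbitrary examples, not taken from any particular machine.

```python
# An 8-bit byte splits into two 4-bit groups, one hex digit each.
byte = 0b10101100
high, low = byte >> 4, byte & 0xF      # the two 4-bit groups
hex_digits = format(byte, "02x")       # two hex digits

# A 60-bit word splits evenly into 3-bit groups, one octal digit each.
word60 = (1 << 60) - 1                 # 60 bits, all set (example value)
octal_digits = format(word60, "o")

print(high, low, hex_digits)
print(len(octal_digits), octal_digits)
```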
Hardware can make hex or octal more convenient for humans working close to the hardware. However, several programming languages I use give me the choice to write a number in hex or octal or decimal or binary, irrespective of the hardware (of which some of them are independent). Also the dump utility (on Linux) gives me the same freedom. This is why I doubt that Unicode (and/or UTF) specifies hexadecimal as the standard (or recommended?) form of writing the codes. But even if they do, still, that is about human-readable expressions, not the internal code. And your phrase "stores the code points ... unchanged in hexadecimal form" sounds strange to me. They are stored internally, as a hardware-readable combination of bits; human readability is a separate matter. --Boris Tsirelson 07:37, 14 July 2011 (EDT)
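The point that the base is only a notation can be seen in a short Python sketch. The value 0x20AC is an arbitrary example; the four literals below are assumed to be different spellings of one integer, which the comparison confirms.

```python
# One integer written in four notations; the stored value is identical.
as_hex = 0x20AC
as_octal = 0o20254
as_decimal = 8364
as_binary = 0b0010000010101100

same_value = as_hex == as_octal == as_decimal == as_binary

# The 16 stored bits (two bytes, big-endian) do not depend on the notation.
stored = as_decimal.to_bytes(2, "big")

# Printing in a chosen base is a separate, human-facing step.
print(same_value, stored)
print(hex(as_octal), oct(as_hex), bin(as_decimal), as_binary)
```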
All Unicode tables are organized in hex (U+C1FE etc.). But at the end of the day, it is of course not the hex number that is stored, but the binary equivalent (bits are set). In HTML the code points may also be specified in decimal, but given an 8-bit byte structure, hex is the most compact and convenient representation of the bit strings. If I wrote: "One of the storage schemes (UTF-16) stores the code points of the BMP (plane 0), which are numbered uniquely by strings of 16 bits, unchanged in their 16-bit form", would that solve your problem?--Paul Wormer 08:01, 14 July 2011 (EDT)
Yes, it would. --Boris Tsirelson 08:51, 14 July 2011 (EDT)