Fun with Unicode
We were therefore quite surprised when a customer complained that it was not handling Chinese characters; worse, they had a screen shot demonstrating it:
The source file is a tab-delimited text file with a mixture of European and Chinese characters, all in Unicode. As you can see from the screen shot, there is a problem of some sort with the “City” column. This led to head scratching and debugging, until finally the penny dropped.
Before supplying the answer, a brief digression into what Unicode is, and is not.
Most readers will be aware that a character is stored in a computer as a binary number. For example, the character ‘A’ may be stored as the number 65, the character ‘1’ as the number 49, etc. One of the first standards for how characters should be stored was ASCII (American Standard Code for Information Interchange). The original standard covered only 128 characters (the numbers 0 to 127): alphabetic (upper and lower case A-Z, a-z), digits, punctuation, and some special values used as control codes. Various 8-bit extensions subsequently raised the limit to 256 characters.
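As a small illustration of characters being stored as numbers, the following Python sketch looks up the codes mentioned above. The sample string is only an example chosen for this post.

```python
# Map characters to their numeric codes.
for ch in ("A", "1", "a"):
    print(ch, ord(ch))

# chr() is the inverse of ord(): it turns a number back into a character.
letter = chr(65)

# Encoding plain English text as ASCII yields exactly one byte per character,
# and every byte value is below 128.
ascii_bytes = "Inaport".encode("ascii")
codes = list(ascii_bytes)
print(codes)
```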
As languages other than English started to be used with computers, it rapidly became apparent that even 256 characters was very restrictive. This led to the development of “code pages”, where the same set of numbers, 0 to 255, was used to mean different characters. For example, in the Latin (Western European) code page (1252) the number 192 is the character À; in the Greek code page (1253) the same number is the character ΐ. This raises an immediate problem: if somebody sends me a file that includes the number 192, should it be interpreted as ‘À’ or ‘ΐ’, or indeed some other character?
This and similar considerations led to the development of Unicode, a single standard designed to cover most of the world’s writing systems. The Unicode code space has room for 1,114,112 code points (U+0000 to U+10FFFF); each assigned code point is a mapping between a number and a character, and only a fraction of the space has been assigned so far. Continuing our example, the letter ‘À’ is U+00C0, and the character ‘ΐ’ is U+0390.
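The code page ambiguity is easy to reproduce. This sketch decodes the single byte 192 under the two Windows code pages using Python's built-in `cp1252` and `cp1253` codecs, then prints the Unicode code point of each result. The helper name `code_point` is made up for this example.

```python
raw = bytes([192])  # a single byte with the value 192 (0xC0)

as_western = raw.decode("cp1252")  # Windows Western European code page
as_greek = raw.decode("cp1253")    # Windows Greek code page


def code_point(ch: str) -> str:
    """Format a character's Unicode code point in the usual U+XXXX style."""
    return f"U+{ord(ch):04X}"


# The same byte yields two different characters, each of which has its own
# unambiguous Unicode code point.
print(as_western, code_point(as_western))
print(as_greek, code_point(as_greek))
```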
In theory, this should remove all possibility of confusion: every character in every language has a unique code point, which means that a file has only one interpretation. As usual, life is not quite that simple. Storing every code point as a simple fixed-width number requires 4 bytes (32 bits) per character, whereas an ASCII character fits in one byte. If a simple translation from ASCII to Unicode were done this way, every text file or text database would suddenly be four times larger. In addition, there are enormous legacy issues: how to handle all the files, programs and operating systems designed around 8-bit characters?
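The size penalty can be seen directly by encoding the same text both ways. This sketch uses Python's `utf-32-le` codec as the fixed 4-bytes-per-character representation; the sample text is arbitrary.

```python
text = "Hello, world"

one_byte_each = text.encode("ascii")
# "utf-32-le" stores every code point in 4 bytes and adds no byte-order mark.
four_bytes_each = text.encode("utf-32-le")

print(len(text), "characters")
print(len(one_byte_each), "bytes as ASCII")
print(len(four_bytes_each), "bytes as fixed-width 32-bit values")
```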
So the transition from a small character set world to a Unicode world has been gradual and messy. In particular, the Unicode standard evolved with a set of encodings of the code points. Instead of each code point being stored as a simple number, there are different ways of actually representing it. The two most common are UTF-8 and UTF-16. UTF-8 stores all the characters from the old ASCII format as compatible 8-bit (one byte) numbers, and encodes every other code point as a sequence of two to four bytes. UTF-16 uses a single 16-bit code unit to encode the most common 63K characters, and a pair of 16-bit code units, called surrogates, to encode the 1M less commonly used code points in Unicode.
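To make the difference between the two encodings concrete, this sketch encodes four sample characters and reports how many bytes each needs. The sample characters and the helper name `encoded_sizes` are chosen for this example only.

```python
def encoded_sizes(ch: str) -> tuple:
    """Return (UTF-8 byte count, UTF-16 byte count) for one character."""
    return (len(ch.encode("utf-8")), len(ch.encode("utf-16-le")))


samples = [
    "A",           # U+0041, plain ASCII
    "\u00c0",      # U+00C0, Latin capital A with grave
    "\u4e2d",      # U+4E2D, a common Chinese character
    "\U0001F600",  # U+1F600, outside the first 64K code points
]

for ch in samples:
    utf8 = ch.encode("utf-8")
    utf16 = ch.encode("utf-16-be")
    print(f"U+{ord(ch):04X}  utf-8: {utf8.hex()}  utf-16: {utf16.hex()}")

# A code point above U+FFFF becomes a pair of 16-bit surrogates in UTF-16.
surrogate_pair = "\U0001F600".encode("utf-16-be").hex()
```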
This means that when we receive a file and attempt to display it, the first question is: what encoding is being used? If you attempt to use the incorrect encoding, you will get incorrect (and probably incomprehensible) results.
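A wrong guess at the encoding is simple to simulate. In this sketch the two-character Chinese string is an assumed sample value, not data from the customer's file; it is stored as UTF-8 and then decoded once with the wrong code page and once with the right encoding.

```python
original = "\u5317\u4eac"  # two Chinese characters ("Beijing")

# What ends up on disk when the text is saved as UTF-8.
stored = original.encode("utf-8")

# Decoding with the wrong encoding turns each byte into an unrelated character.
wrong = stored.decode("cp1252", errors="replace")

# Decoding with the encoding that was actually used recovers the text.
right = stored.decode("utf-8")

print(repr(wrong))
print(repr(right))
```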
This was the first thought when the customer presented their problem: perhaps somewhere in the sequence from reading the delimited text file to displaying it on the screen, an incorrect choice of encoding was being made. Some time was spent carefully following each step of the chain, to no avail. All was correct.
Unicode Glyphs and Fonts
You may have noticed that the discussion so far has been silent on the question of fonts, and the display of characters. This is because Unicode itself is silent on the question of fonts. The Unicode code point for upper case ‘A’ is U+0041, but this code point may be displayed in Times Roman, Arial, WingDings, or any other font we choose. Unicode defines that the code point represents LATIN CAPITAL LETTER A, but does not constrain the graphical representation of the letter; that graphical representation is the glyph.
Now consider a font. A font is a set of graphical representations (glyphs) for the numbers that make up the character set. It defines how tall each letter is, whether it has serifs or not, whether it is slanted, etc. Each code point is a different letter, and the font has to define how each and every letter it supports is to be represented.
Finally, the answer
But Unicode has a million plus code points. Does every font define characters for every code point?
The answer is no. Most commonly installed fonts define a relatively small subset of the Unicode code points. In our case, the default font used in the Inaport preview pane was MS Sans Serif which, we learnt after some Googling, can display only about 2,300 Unicode characters. In particular, it cannot display the Chinese characters from the text file. Once this was realised, the fix was very easy: simply choose a font that can display the characters. The following screen shot shows exactly the same data, but using the font Arial Unicode MS:
The Chinese characters are now being displayed correctly.
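The idea of a font covering only some code points can be sketched in a few lines. Everything here is an assumption made for illustration: `TOY_FONT_COVERAGE` is a made-up stand-in for a font's real coverage (which would have to be read from the font file's character map with a font library), the sample row is invented, and the helper names are hypothetical. It is not how Inaport checks fonts.

```python
import unicodedata

# ASSUMPTION: a toy coverage set (printable ASCII plus Latin-1 letters and
# symbols). A real font's coverage comes from its character map table.
TOY_FONT_COVERAGE = set(range(0x0020, 0x007F)) | set(range(0x00A0, 0x0100))


def missing_characters(text, coverage):
    """Return the distinct non-space characters in text outside coverage."""
    missing = []
    for ch in text:
        if ch.isspace() or ord(ch) in coverage or ch in missing:
            continue
        missing.append(ch)
    return missing


def describe(ch):
    """Format a character as its code point plus its Unicode name."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNNAMED')}"


# A hypothetical tab-delimited row mixing European and Chinese text.
row = "Z\u00fcrich\t\u5317\u4eac"
for ch in missing_characters(row, TOY_FONT_COVERAGE):
    print(describe(ch))
```

A check along these lines reports which characters a chosen font would fail to show, which is a hint to switch to a font with wider coverage.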
You can get more information about the Unicode characters that various fonts can display from here: http://en.wikipedia.org/wiki/Unicode_typefaces#List_of_Unicode_fonts
More importantly, Inaport has now been updated to allow you to select the font being used to display data in the preview panes.
Postscript: Unicode and Windows 7
Windows 7 has dramatically enhanced the basic capabilities of many of its default fonts. So far we have not found good information on this, but when testing the display of characters on Windows 7 systems we have discovered that a number of fonts that previously could not display the Chinese characters now can. If you have more information on this, please let us know so that we can update this post.