
Unicode has too many characters


Related link: http://www.hackcraft.net/xmlUnicode/


John Hanna's introduction to Unicode and XML, like most good introductions (http://tbray.org/ongoing/When/200x/2003/04/06/Unicode), skirts the dirty secret: Unicode, or more specifically its encodings UTF-8 and UTF-16, gives us exciting new opportunities for corrupting data. Text is broken.

At the moment, the web uses an ad hoc mix of defaults (ASCII for pre-90s standards, ISO 8859-1 for early 90s standards, UTF-8 for recent standards), out-of-data headers (such as MIME headers), voluntary in-data signals (such as HTML's meta tag), magic numbers (such as XML's encoding header), browser and server settings, hidden attributes (such as on HTML forms) and guesswork (such as browsers often resort to). This is rubber-banded together by a nebulous hierarchy in which some labels and defaults are trusted in preference to others.

All that is difficult enough, but often we programmers do not even know which encoding our programs read or write text files as: the default in Java, for example, is to read and write text in the default encoding of the particular platform in its current locale. What is the default encoding on your current computer? What encoding is used by your DBMS? What encoding does data sent from an HTML form use?
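The same trap exists outside Java. As a minimal sketch in Python (the file name and sample text are made up for illustration), the snippet below asks the platform what its default encoding is, then avoids depending on it by naming the encoding on both the write and the read:

```python
import locale
import os
import tempfile

# The platform default comes from the current locale, so it can differ
# from one machine to the next.
platform_default = locale.getpreferredencoding(False)
print("platform default encoding:", platform_default)

text = "dash \u2014 and quotes \u201c \u201d"

# Naming the encoding explicitly on both sides removes the guesswork.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write(text)
with open(path, "r", encoding="utf-8") as f:
    round_tripped = f.read()

print(round_tripped == text)
```

Leaving out the `encoding` argument would make the result depend on whichever machine and locale the script happens to run under.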

Without Unicode encodings, this adhocery hangs together enough so that people using the dominant platform in the dominant regions usually do not notice much of a problem.

Yet even without the addition of UTF-8 and UTF-16, many people will have experienced the common problem where you make a webpage with an em-dash or quotes on your Windows or older Mac system, only to find the dash or quotes gone awry when read on a different system. The cause? Documents not correctly labelled with their encodings coupled with systems that cannot figure out that the documents are wrongly labelled.
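A small Python sketch of that failure, assuming the page was saved as Cp1252 but read as ISO 8859-1 (the variable names are only illustrative):

```python
em_dash = "\u2014"

# A Windows authoring tool saves the dash as the single Cp1252 byte 0x97.
cp1252_bytes = em_dash.encode("cp1252")

# A reader that trusts an ISO 8859-1 label or default maps that byte to the
# invisible control character U+0097 instead of a dash.
misread = cp1252_bytes.decode("iso-8859-1")

print(cp1252_bytes, repr(misread))
```

No error is raised anywhere: the byte is legal in both encodings, so the dash simply disappears.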


How could it be fixed? One way is for everyone to use UTF-8 and UTF-16: that might be a good goal for 2010. Why not? Another way would be for all file systems to allow metadata, so that there is an unbroken metadata channel from API to data storage to server to HTTP to web agent: Microsoft seems to be taking a step forward in this regard with WinFS, just as Apple has taken a step back by adopting UNIX files.

But there is another way: follow XML's emerging example:

  • Send text with the MIME content type of application/* rather than text/*, so that application-specific defaults are not used: no mystery;
  • Put the encoding of the file in some header at the top of the file: explicit labelling;
  • Character set transcoding libraries should barf loudly when an erroneous code sequence is found, and not just swallow the codes or replace them with "?": expose corruption;
  • APIs should take care of this for grassroots programmers: don't burden folk with complications; and
  • With all this potential for data corruption, text formats need to make use of code redundancy, to catch certain mislabelling problems that cannot be detected by a vanilla transcoder: critical systems require robustness.
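To illustrate the "barf loudly" point, here is a hedged Python sketch contrasting a lenient transcoder with a strict one. The sample bytes are an assumption chosen for the demonstration: ISO 8859-1 data wrongly labelled as UTF-8.

```python
data = b"caf\xe9"  # ISO 8859-1 bytes for "café", wrongly labelled as UTF-8

# Lenient decoding hides the corruption behind a replacement character.
lenient = data.decode("utf-8", errors="replace")

# Strict decoding raises, so the mislabelling is exposed immediately.
try:
    data.decode("utf-8", errors="strict")
    strict_failed = False
except UnicodeDecodeError as exc:
    strict_failed = True
    print("rejected:", exc)

print(repr(lenient), strict_failed)
```

The lenient result looks like text and will happily be stored or forwarded, which is how corruption spreads unnoticed.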

This last point has only recently emerged as being important, and underlies the recent draft XML 1.1.


There are some critical code points which let us detect when our text file is not in the encoding we thought it was in. In engineering jargon, by creating code redundancy we allow error detection.


In particular, the C1 block of control characters from U+0080 to U+009F needs to be sacrificed: disallowed from use by text formats, so that mislabelled single-byte encodings which use the bytes 0x80 to 0x9F will be discovered. These code points are critical. So a system that expects ISO 8859-1 will spit the dummy if presented with Cp1252 ("ANSI") text, for example.
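A minimal sketch of such a check in Python follows. The helper `reject_c1_controls` is hypothetical, and it rejects the whole U+0080 to U+009F range as described above rather than reproducing the exact rules of any XML draft:

```python
def reject_c1_controls(text: str) -> str:
    """Return text unchanged, or raise if it contains U+0080..U+009F."""
    for offset, ch in enumerate(text):
        if 0x80 <= ord(ch) <= 0x9F:
            raise ValueError(
                f"C1 control U+{ord(ch):04X} at offset {offset}: "
                "the encoding label is probably wrong"
            )
    return text


# Cp1252 curly quotes (bytes 0x93 and 0x94) mislabelled as ISO 8859-1.
mislabelled = "\u201cquoted\u201d".encode("cp1252").decode("iso-8859-1")

try:
    reject_c1_controls(mislabelled)
    detected = False
except ValueError as exc:
    detected = True
    print(exc)
```

A plain transcoder accepts the mislabelled bytes without complaint. The redundancy created by banning the C1 range is what lets the format layer catch the mistake.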

Not everyone likes XML's approach. Amelia Lewis recently gave the argument on XML-DEV that in-band signals of encoding, such as magic numbers of various kinds, are a hack because they require a pre-parse of the data stream, at least in Java. I expect this will go away, not only because of hackers getting bored with rolling their own XML parser, but also because the Java NIO architecture allows autodetecting transcoders: with such a transcoder one could open up an XML file with the encoding "XML-autodetect", for example, and not need to pre-parse data. That being the case, we need some generalization of the XML header that can be applied readily to other text formats: my xtext is one idea for that, if anyone wants to get on board.
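To give a feel for what an autodetecting transcoder might do, here is a rough Python sketch. It is not the Java NIO API, and the names `sniff_xml_encoding` and `decode_xml` are invented for this example. It only looks at a byte order mark and an ASCII-compatible XML declaration, and it ignores UTF-32, EBCDIC and UTF-16 without a byte order mark.

```python
import codecs
import re

# Byte order marks, mapped to Python codecs that consume the mark on decode.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_BE, "utf-16"),
    (codecs.BOM_UTF16_LE, "utf-16"),
]

# The encoding pseudo-attribute of an XML declaration at the start of the data.
_DECL = re.compile(
    rb'<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def sniff_xml_encoding(data: bytes, default: str = "utf-8") -> str:
    """Guess the encoding from the first bytes, without parsing the document."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    match = _DECL.match(data[:200])
    if match:
        return match.group(1).decode("ascii").lower()
    return default


def decode_xml(data: bytes) -> str:
    """Decode strictly with the sniffed encoding, so bad bytes raise."""
    return data.decode(sniff_xml_encoding(data))


sample = '<?xml version="1.0" encoding="ISO-8859-1"?><p>caf\u00e9</p>'.encode(
    "iso-8859-1"
)
print(sniff_xml_encoding(sample), decode_xml(sample))
```

The caller never names an encoding. The header at the top of the data carries it, which is the property a generalized header for other text formats would need to preserve.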

After five years, XML is still state-of-the-art on this; other text formats would do well to adopt its approach. Allowing UTF-8 and UTF-16 in addition to existing encodings can make a confusing situation worse, unless we also adopt simple harm-reduction measures like the ones suggested above.


Can text be fixed?

Read More Entries by Rick Jelliffe.
