Adding XML 1.1 to a Java Application
XML 1.1 is fairly controversial: it just adds some niche features that most people don't need. We recently added some XML 1.1 support to Topologi's XML editor; I was surprised by how straightforward it was, though the effort may also have been premature.
Easier than I expected...
XML 1.1 has four aspects: NEL (a newline character used on some IBM mainframes), extra name characters (some pretty obscure characters), improved rules for control characters (more controls allowed, but safer because they must be references not literals), and coping with the different version number. Adding support for converting NEL to newlines on text import is trivial. Adding support for the extra name characters was straightforward because our editor was "broadminded" on those issues anyway (but see below on surrogates), ditto with controls, and the bump in the XML version number causes no problems at this stage. So moving our product to XML 1.1 was not disruptive or difficult.
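As a rough illustration of the import step, here is a minimal sketch of line-end normalization following the rules in XML 1.1 section 2.11 (CR LF, CR NEL, NEL, U+2028 and a lone CR all become a single LF). The class and method names are hypothetical and this is not the code from our editor.

```java
final class LineEndings {
    private static final char NEL = 0x0085; // "next line", used on some mainframes
    private static final char LS = 0x2028;  // Unicode line separator

    private LineEndings() {}

    /** Returns a copy of the text with every XML 1.1 line end replaced by a single LF. */
    static String normalize(CharSequence text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\r') {
                // CR LF and CR NEL collapse to one LF; a lone CR also becomes LF.
                if (i + 1 < text.length()) {
                    char next = text.charAt(i + 1);
                    if (next == '\n' || next == NEL) {
                        i++; // skip the second character of the pair
                    }
                }
                out.append('\n');
            } else if (c == NEL || c == LS) {
                out.append('\n');
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

An XML 1.1 parser already does this for documents it reads itself; a helper like this only matters for text that reaches the editor by some other route, such as paste or plain-text import.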
We use the latest version of Xerces (customized for error messages and a couple of tweaks) for XML processing. The latest version of Xerces can detect and switch to XML 1.1, and we didn't find any problems.
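For readers who want to see the version switch for themselves, the following is a small sketch using the standard JAXP and DOM Level 3 APIs rather than our customized Xerces build. It assumes a JAXP parser that understands XML 1.1 (the one bundled with current JDKs is derived from Xerces); the class name XmlVersionProbe is made up for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class XmlVersionProbe {
    private XmlVersionProbe() {}

    /** Parses the document and reports the XML version the parser settled on. */
    static String declaredVersion(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            // DOM Level 3: "1.0" when there is no XML declaration.
            return doc.getXmlVersion();
        } catch (Exception e) {
            // Wrapped so that callers do not need to handle checked exceptions.
            throw new IllegalArgumentException("document could not be parsed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(declaredVersion("<?xml version=\"1.1\"?><doc/>"));
        System.out.println(declaredVersion("<doc/>"));
    }
}
```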
We use IBM's open source ICU4J for internationalization classes and normalization, because these track the latest version of Unicode better than Java 1.4.2. The trick with ICU4J is to strip it down to the smallest JAR possible: their site describes how to do this. Again, this was straightforward if tedious (I hope the ICU4J team will one day provide every permutation of JAR online for download, to save users' time.)
The scariest issue in XML 1.1 is that it moves beyond using just the Unicode characters that fit into a single sixteen-bit UTF-16 code unit, as used internally by Java. UTF-16 has a kind of trick to cope with those kinds of characters: there are two dedicated ranges of code points in UTF-16 that, when combined in pairs, map to characters greater than U+FFFF. These ranges are the surrogate characters. This is not nearly as odd as it seems, because many characters already require more than one Unicode code point to represent.
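The arithmetic behind the trick is short enough to show. This is a sketch only: SurrogateDemo is a hypothetical name, and the codePointCount call it uses arrived in Java 5, so it is not available on the Java 1.4.2 runtime mentioned above.

```java
final class SurrogateDemo {
    private SurrogateDemo() {}

    /** Splits a code point above U+FFFF into its two UTF-16 code units. */
    static char[] toSurrogatePair(int codePoint) {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int offset = codePoint - 0x10000;              // 20 significant bits remain
        char high = (char) (0xD800 + (offset >> 10));  // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF)); // bottom 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        // U+10400 becomes the pair D801 DC00.
        String s = new String(toSurrogatePair(0x10400));
        System.out.println(s.length());                      // prints 2: two chars
        System.out.println(s.codePointCount(0, s.length())); // prints 1: one character
    }
}
```

The mismatch between the two printed numbers is the whole problem: any code that counts, indexes or slices by char can land in the middle of one character.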
To support surrogates in an editor you need to do three things: first, make sure that your underlying APIs support the most recent version of Unicode and that you pass around Strings (or, at least, CharSequence or Segment) rather than chars; second, make sure that your navigation and editing operations do not let you mess up a surrogate pair; and third, make sure that the correct fonts are available and your metrics understand surrogates.
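The second point can be sketched as caret movement that steps over a surrogate pair as a unit. This is an illustration under assumptions, not our editor's code: CaretMoves and its methods are hypothetical, offsets are assumed to lie between 0 and the text length, and the Character.isHighSurrogate and isLowSurrogate helpers need Java 5 or later.

```java
final class CaretMoves {
    private CaretMoves() {}

    /** Offset one character to the right, never stopping inside a surrogate pair. */
    static int next(CharSequence text, int offset) {
        if (offset >= text.length()) {
            return text.length();
        }
        boolean pair = Character.isHighSurrogate(text.charAt(offset))
                && offset + 1 < text.length()
                && Character.isLowSurrogate(text.charAt(offset + 1));
        return offset + (pair ? 2 : 1);
    }

    /** Offset one character to the left, never stopping inside a surrogate pair. */
    static int previous(CharSequence text, int offset) {
        if (offset <= 0) {
            return 0;
        }
        boolean pair = offset >= 2
                && Character.isLowSurrogate(text.charAt(offset - 1))
                && Character.isHighSurrogate(text.charAt(offset - 2));
        return offset - (pair ? 2 : 1);
    }
}
```

Deletion and selection can be built on the same two functions, so that backspace removes both halves of a pair or neither. An unpaired surrogate is simply treated as a single step, which keeps the caret from getting stuck on malformed text.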
...but premature?
And it is this last step where everything falls apart, at the moment. There are almost no fonts available that actually have the characters that surrogates might point to. Microsoft have one that is licensed for use with a particular product, and there is the open source Code2001 font, but apart from that almost nothing. (In fact, the absence of a variety of fonts made it impossible for us to have confidence that Java's proportional font metrics are 100% right for the surrogate ranges: that is another story.)
I think this is pretty telling, and is why we needn't be too hysterical or proactive about XML 1.1. Only people who set up their computers specially (or who have applications or runtimes specifically preconfigured with these fonts) can use the surrogates anyway. And only people dealing with importing certain mainframe data need worry about the NEL feature. So we just won't see a flood of XML 1.1 data. (In fact, by the time we finished adding XML 1.1 support to our product, we had convinced ourselves that probably no-one would use 1.1 anyway.)
XML 1.1 and the 'XML Stack'
One thing XML 1.1 does expose is that XML's simple n.n versioning system didn't come with enough policy to make it convenient: it would be better if there were a policy such as "XML processors must fail with a WF error when there is a different major number; XML processors must not fail only because of a difference in the minor number", so that an XML 1.0 processor would only fail on an XML 1.1 document when there is some syntax apart from the version number that fails. (Other standards groups should heed XML's version failure here.)
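To make the suggested policy concrete, here is a minimal sketch of the check a processor would apply before parsing. It encodes the hypothetical rule quoted above, not anything the XML Recommendations actually require, and the VersionPolicy name is invented for the example.

```java
final class VersionPolicy {
    private VersionPolicy() {}

    /**
     * True when a processor built for {@code supported} should go ahead and try
     * a document declaring {@code declared}: only the major numbers must match.
     */
    static boolean mayAttempt(String supported, String declared) {
        return major(supported) == major(declared);
    }

    /** Extracts the integer before the dot in an n.n version string. */
    private static int major(String version) {
        int dot = version.indexOf('.');
        if (dot <= 0) {
            throw new IllegalArgumentException("expected n.n but got: " + version);
        }
        return Integer.parseInt(version.substring(0, dot));
    }
}
```

Under this rule a 1.0 processor would attempt a 1.1 document and fail only if it met syntax it could not handle, such as a NEL line end or one of the new name characters.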
IBM's Noah Mendelsohn recently spoke at the W3C meeting about this: Making the XML Stack work with XML 1.1. Now Noah has two habits of thought that you can set your watch by: first, he really wants to think that XML provides guaranteed interoperability (it only does if you are conservative), and second, he is quite quick to invoke the bogeymen of Unimagined Complications. These two habits can conspire so that he is frequently (or adopts the role of being) extremely loath to move from any established position: let's not fix what isn't broken. In the case of XML 1.1 and protocol stacks, I think Noah should be a little conflicted: adopting XML 1.1 may have some value (IBM asked for it), but then again it requires change to other parts of the stack.
My take on whether the "XML stack" should adopt XML 1.1? The inadequate definition in XML of how to treat the minor number has fatally compromised XML 1.1's deployability. So "XML 1.1 only" is not a workable option; "XML 1.0 only" is workable, but "limited XML 1.0 guaranteed; any XML 1.0 or 1.1 available but not guaranteed" is best. XML is so useful because it has both this reliable core for general use and also an outer layer that increases its convenience and usefulness for particular, maybe 'niche' tasks. XML has demonstrated that developers are smart enough to stick to the core when they need to—they don't need subsets of standards nannying them.
By Rick Jelliffe.
