Binary XML? What about Unicode-aware CPUs instead?
Mike Champion asked on the XML-DEV mailing list this week:
To be honest the XBC (XML Binary Characterization) WG has been waiting for some community push-back
and input on both the negatives and positives of binary XML ... but so far the negatives
haven't been coming and that worries me a bit..
My guess is that most people think like me: if "Binary XML" is just a form of compression that allows a tightly coupled interface to SAX, then why not? Or if it allows some substantially different characteristic such as random access, then maybe it has a good place as an adjunct. If it does not have enough bang per buck, it will only be a niche thing. Not that there is anything wrong with niches. (And we should expect that companies who don't do well out of XML will try to spoil it and develop alternatives, while companies doing well from it will try to stifle innovation. That's show biz!) Does ASN.1 currently hurt XML?
I think that, post XML Schemas, the W3C brand is fairly diminished as far as new specs are concerned. XBC could easily go the way of XPointer, XML 1.1 and XML Fragment Interchange: like a quarrelsome but beautiful neighbour, decorative but to be avoided.
If the XBC discussion takes off, expect all the usual baloney, in particular the extremely fragrant sausage that if we reduce the number of different tags that XML recognizes, it will speed up parsers in some significant way. (Baloney because there are efficient ways of implementing parsers so that tests for rare tags don't penalize the common cases. For example, an optimized parser could detect that a document has no DOCTYPE declaration, and then switch to an implementation that does not need to do any buffer reallocation to handle entity inclusion. Java even provides jump tables to make simple parsing fast.)
My view from the armchair is that chip manufacturers (Intel, AMD, et al.) need to step up to the plate here: the Unicode character tables and properties, and Unicode transcoders for the most common character sets, should be hardcoded inside CPU chips. I have seen at least one East Asian CPU with character tables built-in, so it is not a far-fetched idea. People do not say "Maths operations take a lot of CPU power, let's ditch less common math functions", do they?
Now that XML is ubiquitous and mission critical, of course we should expect all sorts of ingenious ways to speed it up. But the prime area that is being missed, it seems to me, is how to improve XML support inside CPUs.
What kind of form could it take? The simplest form might be to provide an operation that takes an unsigned short (i.e. a UTF-16 character) and returns an int containing bits representing each binary Unicode property and its status as an XML delimiter, just by simple table lookup. (Actually, I would provide two operations: one for UTF-16 BMP which also copes with ASCII and ISO 8859-1 because they are code compatible, and one for UTF-32.) Since XML documents tend to be small, for both SAX processing and XML->DOM processing, I quite expect that not much XML parser machine code would survive in the cache between invocations of the parser or SAX. So providing a built-in table will marginally improve cache behaviour as well as allowing faster parsing without giving up on decent and suspicious parsing: since IO between the CPU and bus is the current bottleneck, this improvement, though certainly limited and sporadic, is in the right kind of area.
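To make the idea concrete, here is a minimal software sketch of that lookup, assuming a made-up flag layout. The flag names, the choice of properties, and the delimiter set are all hypothetical; a real instruction would presumably expose many more Unicode properties. The table covers only the BMP variant (one entry per 16-bit code unit) and is built once from Python's standard `unicodedata` module.

```python
import unicodedata
from array import array

# Hypothetical flag bits; the layout is an assumption for illustration only.
F_XML_CHAR = 1 << 0   # legal XML 1.0 Char within the BMP
F_XML_SPACE = 1 << 1  # XML whitespace: space, tab, CR, LF
F_XML_DELIM = 1 << 2  # character that can act as a markup delimiter
F_LETTER = 1 << 3     # Unicode general category L*
F_DIGIT = 1 << 4      # Unicode general category Nd
F_SURROGATE = 1 << 5  # one half of a UTF-16 surrogate pair

# Assumed delimiter set; chosen for the sketch, not taken from any spec table.
_DELIMS = set("<>&'\"=/?!][;")


def _flags_for(unit: int) -> int:
    """Compute the flag word for one 16-bit code unit."""
    flags = 0
    if unit in (0x9, 0xA, 0xD) or 0x20 <= unit <= 0xD7FF or 0xE000 <= unit <= 0xFFFD:
        flags |= F_XML_CHAR
    if unit in (0x20, 0x9, 0xA, 0xD):
        flags |= F_XML_SPACE
    if 0xD800 <= unit <= 0xDFFF:
        return flags | F_SURROGATE
    ch = chr(unit)
    if ch in _DELIMS:
        flags |= F_XML_DELIM
    category = unicodedata.category(ch)
    if category.startswith("L"):
        flags |= F_LETTER
    elif category == "Nd":
        flags |= F_DIGIT
    return flags


# One byte per code unit: 64 KB for the whole BMP.
TABLE = array("B", (_flags_for(u) for u in range(0x10000)))


def char_props(unit: int) -> int:
    """Software stand-in for the proposed single-instruction lookup."""
    return TABLE[unit & 0xFFFF]
```

A parser's inner loop would then test bits, for example `char_props(ord(c)) & F_XML_DELIM`, instead of running a chain of comparisons. Because ASCII and ISO 8859-1 occupy the first 256 entries, the same table serves them unchanged.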
If I were Intel or AMD, and looking for a way to add value to my CPUs, I would look into building in the Unicode character tables, especially to speed up XML processing.
Derek Denny-Brown made a good point on XML-DEV:
Most of the CPU cost of parsing is related to the abstract model of XML, not the text parsing: Duplicate attribute detection, character checking, namespace resolution/checking. Every binary-xml implementation I have researched which improves CPU utilization does so by skipping checks such as these. At that point you are no longer talking about XML.
Of course, Unicode is evolving. But nowadays only on the fringes, and really only outside the Basic Multilingual Plane (BMP: the first 65,536 code points). XML delimiter-based parsing is quite cheap, because at any one time there are usually only two significant characters to look for: & or < in data content, " or & (or ' or &) in attribute values, ] in CDATA content, whitespace or > in tag names, whitespace or = in attribute names.
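The per-context delimiter sets listed above can be sketched as a small scanner. This is only an illustration of why the scanning step is cheap, not a parser: the context names and the `scan` function are hypothetical, and the stop sets follow the list in the text rather than the full XML grammar (a real parser would, for instance, also stop on / in tag names).

```python
# Characters that end a run in each lexical context (names are hypothetical).
STOP_CHARS = {
    "content": "<&",           # character data
    "attr_dq": "\"&",          # attribute value in double quotes
    "attr_sq": "'&",           # attribute value in single quotes
    "cdata": "]",              # inside a CDATA section
    "tag_name": " \t\r\n>",    # element name
    "attr_name": " \t\r\n=",   # attribute name
}


def scan(text: str, pos: int, context: str) -> int:
    """Return the index of the next significant character at or after pos,
    or len(text) if the run continues to the end of the input."""
    stops = STOP_CHARS[context]
    i = pos
    while i < len(text) and text[i] not in stops:
        i += 1
    return i
```

For example, `scan("hello <b>", 0, "content")` stops at the `<`. Everything skipped over still has to be checked for legality, which is where a property table would earn its keep.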
It is the characters that indicate malformed XML that add checking cost: finding !@#$%^*()_+={}[];;"',<?/ or other non-element character in an element or attribute name. XML pairs its Draconian error handling with trivial inspectability of the data: this is congenial for programmers, in comparison to a binary format which may not have enough hints to allow meaningful reconstruction of the file for inspection. (Add comment about babies and bathwater here.)
Perhaps the rise of East Asian economic power also may have some impact here: when most CPUs drove PCs with ASCII documents, there was little reason to think about hardware support for large-character-set property-tables. Now that everyone has converged on Unicode, notably XML, and that China/Korea/Japan/Taiwan are such big players, this might be a useful feature.
Comments (5)
Read More Entries by Rick Jelliffe.

Unicode isn't dead
The current situation for Unicode in software is that APIs do not typically provide the latest version of Unicode. There is a lag. So using a CPU-provided table is no worse than the current situation: in fact, the lifespan of CPUs may be shorter than the lifespan of operating systems in many cases! People who need to use the very latest version of Unicode always need to make special arrangements. Inside the BMP things are highly stable now; outside the BMP there is some evolution.
But Liam is right that there are many possible optimizations and improvements, such as a kind of firmware update for the tables, or an alternative user-loadable table, and so on. The devil is in the details; I am not insisting on any particular instructions.
My point is just that saying "Some parts of XML are rather inefficient (to run on our CPUs)" is the same thing as saying "Our CPUs are rather inefficient (when it comes to handling some parts of XML)".
XSL Transformation in hardware
tkachenko blog
Unicode isn't dead
Unicode is far from unchanging. I suspect most applications and operating systems would need at the very least to support some form of update table. For example, Ethiopic script has been added since 2.0, along with many more Kanji characters.
On the other hand, I suppose UTF-8 to UTF-32 (UCS-4) conversion might be useful, as might some regular expression string matching.
Liam
SVG Accelerator embedded in Display Drive IC
NEC had a working demo at SVGopen 2004.
SVG is an XML flavour...
possibilities
At the Applied XML conference last year, I sat next to an Intel engineer at lunch, whose name I have alas forgotten. He seemed moderately surprised by how much people complained about the cost of processing text, and said he'd be happy to work on circuits more text-oriented than the ones we currently have.
I've heard similar stories a few times at conferences, though none quite so bluntly encouraging. I think you've got a great point here, and hopefully there's a lab somewhere working to make such things reality.