Compression, XML, Binary Infoset
Related link: http://www.intel.com/cd/ids/developer/asmo-na/eng/dc/pentium4/optimization/20833…
I found Chandra Mohan Lingam's good and accessible article 'HTTP Compression for Web Applications' at Intel's site. Age unknown. I think it is good material for anyone wanting to develop a critical perspective on the pro- and anti- 'XML Binary Infoset' technologies and standard under discussion at W3C.
The highlights are the numbers: compressing static content is a one-off process that more than halves CPU utilization at servers; dynamic compression adds 25% utilization; and compression halves 'page' download times. There is no mention of the cost of decompression at clients, but rather a claim that IO and network bandwidth are the bottleneck. I would be interested to see architectures and numbers to clarify when and to what extent this is true.
(What is particularly interesting is the interaction between SSL and compression. In some cases, there was no CPU penalty in having compression and SSL compared to sending raw data; when dealing with a compressed file, the SSL had less to do.)
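A minimal sketch of the static versus dynamic distinction, not taken from the Intel article: static content can be compressed once and the result reused, while dynamic content has to be compressed on every response. The class name `StaticCache`, the function `serve_dynamic` and the sample page are hypothetical, chosen only for illustration; a real server would do this through its own configuration.

```python
import gzip


class StaticCache:
    """Compress each static resource once and reuse the compressed bytes."""

    def __init__(self):
        self._cache = {}
        self.compress_calls = 0  # how many times gzip actually ran

    def get(self, name: str, body: bytes) -> bytes:
        if name not in self._cache:
            self._cache[name] = gzip.compress(body)
            self.compress_calls += 1
        return self._cache[name]


def serve_dynamic(body: bytes) -> bytes:
    """Dynamic content differs per request, so it is compressed every time."""
    return gzip.compress(body)


# Hypothetical page body, repetitive like most markup.
page = b"<html><body>" + b"<p>hello world</p>" * 500 + b"</body></html>"

cache = StaticCache()
static_responses = [cache.get("index.html", page) for _ in range(3)]
dynamic_responses = [serve_dynamic(page) for _ in range(3)]

print("static compress calls:", cache.compress_calls)
print("raw bytes:", len(page), "compressed bytes:", len(static_responses[0]))
```

The CPU cost of the static case is paid once regardless of the number of requests, which is consistent with the article's claim that static compression is cheap at the server; the sketch says nothing about decompression cost at the client.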
It strikes me that one good reason for "XML" Binary Infosets to succeed is the ease with which inefficient XML solutions can be deployed: inappropriately long element names, UTF-16 for essentially ASCII data, indented elements causing spurious node creation by parsers, slow and bloated libraries, and disabled compression. The XML guiding goal that "terseness is of minimal importance" is a statement about layering, not a statement of general denial. XML evangelists may need to spend more time promoting an awareness of how and when XML can be efficiently processed or transmitted rather than kneejerking about the evils of binary. The present success of XML is so strong and deep that I think Binary Infoset can only gain ground among people who currently cannot actually use XML (for all sorts of good reasons); the people I meet really would prefer to stay with XML if they could.
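Several of the inefficiencies listed above can be measured with a few lines of standard-library Python. The following is only a rough sketch under stated assumptions: the element names, the row count and the helper names (`build_xml`, `count_nodes`, `report`) are invented for illustration, and the figures it prints describe this toy document, not any real workload.

```python
import gzip
import xml.etree.ElementTree as ET
from xml.dom import minidom


def build_xml(record_tag: str, field_tag: str, rows: int = 200) -> str:
    """Build a compact (unindented) XML document as a Python string."""
    root = ET.Element("data")
    for i in range(rows):
        rec = ET.SubElement(root, record_tag)
        ET.SubElement(rec, field_tag).text = str(i * 100)
    return ET.tostring(root, encoding="unicode")


def count_nodes(node) -> int:
    """Count DOM nodes, including whitespace-only text nodes."""
    return 1 + sum(count_nodes(child) for child in node.childNodes)


def report(label: str, text: str, encoding: str = "utf-8"):
    raw = text.encode(encoding)
    packed = gzip.compress(raw)
    nodes = count_nodes(minidom.parseString(text))
    print(f"{label:22} raw={len(raw):6} gzip={len(packed):6} dom_nodes={nodes}")
    return len(raw), len(packed), nodes


# Hypothetical long and short element names for the same data.
verbose = build_xml("customerTransactionRecord", "transactionAmountInCents")
terse = build_xml("rec", "amt")
# Pretty-printing adds whitespace that a DOM parser keeps as text nodes.
indented = minidom.parseString(verbose).toprettyxml(indent="  ")

results = {
    "verbose, UTF-8": report("verbose, UTF-8", verbose),
    "verbose, UTF-16": report("verbose, UTF-16", verbose, "utf-16"),
    "verbose, indented": report("verbose, indented", indented),
    "terse, UTF-8": report("terse, UTF-8", terse),
}
```

Comparing the `raw` and `gzip` columns gives a feel for how much of the overhead generic compression recovers, and the `dom_nodes` column shows the extra whitespace nodes that indentation creates in a DOM parser.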
Categories: Web
Read More Entries by Rick Jelliffe.

Apologies
Oops, my previous comments were completely off the mark. I read something different to what Rick wrote, it seems. More here:
http://kontrawize.blogs.com/kontrawize/2005/12/comments_on_com.html
Sorry, Cheers, Tony.
We aren't all the same, part 2
I have put in a little reply there too, citing my blog item from Jan 2005 supporting Fast Infoset, http://www.oreillynet.com/pub/wlg/6206
Perhaps I should be clearer: I think there are people who need an efficient compressed XML and this should be standardized (e.g. through W3C); I think people who deny that anyone needs this are talking out of their hats if they make an absolute denial; however, I also think that XML efficiency can be improved quite a lot with better parsers, static compression and so on; and I think that XML evangelists who want to reduce the impact of XML Binary Infoset (out of concern it may break out of the cage and get a life of its own) would better spend their time improving high-performance XML rather than denying the need.
Just because, with more efficient XML, only 10,000 organizations might need XML Binary Infoset rather than 100,000, that doesn't mean an agreed standard approach isn't worthwhile, or that it should be discouraged or the need downplayed!
We aren't all the same, part 2
PS I have also written a bit more about this on my blog.
http://kontrawize.blogs.com/kontrawize/2005/11/yes_we_do_need_.html
Cheers, Tony.
We aren't all the same
XML has a broad spectrum of users. I think it is fair to say that most XML users have low bandwidth requirements. Either they don't send documents often over the network, or they don't send a lot of documents (as compared to the available network bandwidth). For that majority of users, textual encoding of XML is a great idea, because it's just easier to edit, debug and manage.
However, for those of us who work with clients whose requirements for XML would swamp the available network bandwidth if the XML was all sent as text, there is a real issue. A lot of my work is in finance, where the data volumes are huge and the available network bandwidth often isn't sufficient (it has to be shared with a lot of other network traffic as well). Although this is a numerical minority of users, it is a big, important market, and it deserves some attention.
What I have needed for years now is a standard binary encoding of XML that puts it on a par with ASN.1 (used for SWIFT messages). With ASN.1, it's completely normal to have a textual representation for users to read/edit/debug, and a binary format to send over the wire. Both formats are generated from a single syntax description, and the translation software is also generated automatically, so there isn't a great overhead to having both text and binary forms. Now, ASN.1 never caught on the way XML has, but there is no reason the same shouldn't apply to XML: we should be able to generate a binary equivalent to any textual XML message, and be able to generate the software to translate between the binary and XML formats.
Some people would say that you shouldn't use XML if you need a binary format, but that's avoiding the question. There is a huge base of tools around XML that doesn't exist for binary formats, so people need the chance to use and re-use those XML tools even when they need something that works in a low bandwidth context.
The MPEG-7 XML compression that has been around for some years is one I would have liked to see anointed by the W3C, but that never happened. The 'Fast Infoset' (http://java.sun.com/developer/technicalArticles/xml/fastinfoset/) is an interesting approach, using ASN.1 to supply the binary format that XML lacks.
It isn't just having a binary format that is the issue, it's having the W3C anoint it, and vendors support it, so that I can get clients to use it without feeling locked in to one vendor's implementation.
Whenever I read people arguing against binary XML, I wonder whether they have really had to work in a high network data volume situation or not. I suspect a lot of them have not, even some of the most well known figures. It would be a shame if XML were replaced by something else in this important area just because low data volume users couldn't see the point.
Cheers, Tony.