
Java's XPath APIs need grouting


I have been snuffling around various Java APIs relating to XPath this week in order to implement draft ISO Schematron in Java. But the current batch of Java APIs seems to be missing the functions I need: this must be holding back other developers too.

  • XML is a great notation compared to some binary data formats because the information typically comes pre-sorted: meta-data before primary data, primary data before secondary data. Documents that are organized sequentially to follow the access order of a typical* client program reduce the best-case latency (the time from when a document starts to arrive until the earliest time the application can start to produce output in response).

  • But this streaming advantage disappears when using DOM (in its current implementations, and I gather that XOM and JDOM are the same). What we need in the Java Standard Edition is a JIT implementation of DOM which allows early blocking read-access: a load function that returns as soon as the first element+attributes are available, blocking methods for mooching around the DOM, and perhaps an error listener for when the document is not well-formed or valid. If you just wanted to access the DOM in document order, the latency would be similar to using SAX: when a node arrives, you can use it straight away.


  • The various Java libraries around do not seem to have a way to ask, of a node in a DOM, "Do you match this XPath?", except by the brute-force way of iterating through the document finding all nodes described by the XPath until you find the node.

  • This cripples XPath and forces us to write our code in a certain way (or, more likely, to abandon XPath). Instead, we need an XPath implementation that rewrites the XPath to start from a given node and then checks whether its conditions match.


  • Finally, the Java DOM libraries do not seem to have a way to iterate through a document according to various strategies and dispatch to various functions according to which of several XPaths has been matched: .NET apparently has functions for this kind of thing.

  • Are we supposed to use XSLT calling external Java functions for this? Here is the kind of thing I would like for a dispatching API:



    DomDispatchingIterator d =
        new DomDispatchingIterator(someDom, TOPDOWN,
            new Object[] {"/x/y/z", 1, "/x/z", 2, "c[@d]", 3});

    while (d.getNext()) {
        switch (d.getPathCode()) {
            case 1:
                // handle /x/y/z
                break;
            case 2:
                // handle /x/z
                break;
            case 3:
                // handle c[@d]
                break;
        }
    }

    * Some people will scream "But in XML you don't/shouldn't/daren't/dursen't/won't have a typical application in mind." Well, I think people often do: it impacts design choices. And then, by publishing a DTD or schema, designers encourage applications to follow that typical structure.

    Does anyone know any Java libraries that provide these functions?
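
Until something better turns up, a brute-force stand-in for the dispatching API can be sketched with nothing more than the JDK's own javax.xml.xpath. This is only a sketch under stated assumptions: the class DispatchSketch and its methods are hypothetical names, every XPath is evaluated once over the whole document (so there is no streaming benefit), the relative pattern has to be spelled "//c[@d]" because each expression is evaluated from the document node, and the lookup relies on the XPath implementation handing back the DOM's own Node objects.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of any library: a brute-force stand-in
// for the wished-for DomDispatchingIterator.
class DispatchSketch {

    /** Parses the XML and returns "code:name" for each matched node, in document order. */
    static List<String> dispatch(String xml, Object[] pathsAndCodes) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            Document doc = dbf.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            // Evaluate every XPath once over the whole document (the brute-force
            // part) and remember which code each selected node belongs to.
            // Assumption: the nodes returned are the DOM's own objects.
            XPath xpath = XPathFactory.newInstance().newXPath();
            Map<Node, Integer> codes = new IdentityHashMap<>();
            for (int i = 0; i < pathsAndCodes.length; i += 2) {
                NodeList hits = (NodeList) xpath.evaluate(
                        (String) pathsAndCodes[i], doc, XPathConstants.NODESET);
                for (int j = 0; j < hits.getLength(); j++) {
                    // If two paths select the same node, the one listed first wins.
                    codes.putIfAbsent(hits.item(j), (Integer) pathsAndCodes[i + 1]);
                }
            }

            List<String> out = new ArrayList<>();
            walk(doc, codes, out);
            return out;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse or evaluate", e);
        }
    }

    /** Top-down walk over child nodes; a real handler would switch on the code here. */
    private static void walk(Node node, Map<Node, Integer> codes, List<String> out) {
        Integer code = codes.get(node);
        if (code != null) {
            out.add(code + ":" + node.getNodeName());
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            walk(c, codes, out);
        }
    }
}
```

Calling DispatchSketch.dispatch with the document <x><y><z/></y><z/><c d='1'/><c/></x> and the array {"/x/y/z", 1, "/x/z", 2, "//c[@d]", 3} visits the inner z with code 1, the outer z with code 2 and the first c with code 3. Attribute nodes are never visited, because the walk only follows children.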

    Read More Entries by Rick Jelliffe.

    12 Comments

    uche said:

    Schematron mailing list dead
    Well, I got to the finish line last month, but the track then disappeared out from under us all.

    The Schematron mailing list archives link from http://lists.sourceforge.net/lists/listinfo/schematron-love-in
    leads to an SF login page (already strange). When I log in I get a "permission denied" page.

    --Uche

    rjelliffe said:

    Race ya, Rick :-)
    Well, unless Java grows something fast, you may win, Uche! Actually, I have not checked out SAXON properly yet. Maybe that is clearer. The Xalan code has several things that look potentially useful-ish, but the documentation is horrible, the code has "this is not optimized" messages, and it smells imprudent to write code based on methods that are marked public but which probably could change in the future. (Java visibility is no warrant of an API's stability!)

    rjelliffe said:

    Nobody nodes the trouble I've seen
    Oh, sure, you need to dynamically re-evaluate the effective context query and the effective assertions. And if there are any // starting from root, you may be in for a full traversal anyway! That is all par for the course. But you don't want to be doing that for every node in the document, just for the ones that look interesting.

    bazzargh said:

    Nobody nodes the trouble I've seen
    We do something similar, but by only revalidating fragments (this is OK for our limited use cases). However, I think there's another problem with this in ISO Schematron that's not there in older versions: global let expressions. If a 'let' sets a variable for the schema that references a node you've modified, do all nodes need to be revalidated?

    Anyway, your original point is right, some things are too hard with java's XPath APIs. I guess I should join the love-in to discuss the schematron stuff...

    rjelliffe said:

    Nobody nodes the trouble I've seen
    I probably haven't flapped my use-case around enough to explain why I want backwards XPaths from a context node: I have a large document in a DOM; I make a change to the DOM; I want to check that any new or changed nodes are still valid against a Schematron schema (using path expressions), in a straightforward but efficient manner.

    What I want to avoid is iterating over the whole document again. (We may be talking documents with scores of Meg.) It is not really an issue of streaming versus random-access, though that also can have an impact on the initial validation.
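
A minimal sketch of that use case, assuming the rule contexts have already been turned into tests anchored at self:: so they can be asked of a single node. The class SubtreeRevalidation, the Rule holder and the sample rule are all hypothetical; real Schematron semantics (only the first matching rule in a pattern firing, let variables, assertions that look at nodes outside the changed subtree) are deliberately left out.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical sketch: re-check only the subtree that was changed.
class SubtreeRevalidation {

    /** Hypothetical rule: a self::-anchored context test, an assertion, and a message. */
    static final class Rule {
        final String selfContext;
        final String assertion;
        final String message;

        Rule(String selfContext, String assertion, String message) {
            this.selfContext = selfContext;
            this.assertion = assertion;
            this.message = message;
        }
    }

    /** Returns the messages of all assertions that fail inside the changed subtree. */
    static List<String> revalidate(Node changed, List<Rule> rules) {
        List<String> failures = new ArrayList<>();
        try {
            check(changed, rules, XPathFactory.newInstance().newXPath(), failures);
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException(e);
        }
        return failures;
    }

    private static void check(Node node, List<Rule> rules, XPath xpath,
                              List<String> failures) throws XPathExpressionException {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            for (Rule r : rules) {
                // "Do you match this context?" asked of this one node only.
                boolean inContext = (Boolean) xpath.evaluate(
                        r.selfContext, node, XPathConstants.BOOLEAN);
                if (inContext) {
                    boolean holds = (Boolean) xpath.evaluate(
                            r.assertion, node, XPathConstants.BOOLEAN);
                    if (!holds) {
                        failures.add(r.message);
                    }
                }
            }
        }
        // Descend into the changed subtree only; the rest of the document is untouched.
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            check(c, rules, xpath, failures);
        }
    }

    /** Small parsing helper so the sketch is self-contained. */
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            return dbf.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException(e);
        }
    }
}
```

The cost is proportional to the size of the changed subtree times the number of rules, not to the size of the document. An assertion that reaches outside the changed subtree, or a schema-wide let of the kind raised elsewhere in this thread, would still force a wider re-check.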

    uche said:

    Race ya, Rick :-)
    As I mentioned on schematron-love-in, I just decided this week to write a Python ISO Schematron impl. I think Python is much superior to Java for this sort of thing (though I'll use XPattern rather than XPath for pattern dispatch because I don't think full XPath makes sense for it).

    Anyway, good luck in the effort. It's about time ISO Stron got some implementations.

    --Uche

    davidc2 said:

    xslt pattern
    Do you need this ability for an arbitrary XPath?
    Your requirement description sounds like XSLT pattern matching, and XSLT patterns are a rather severely restricted subset of XPath (I know you know this :-)

    David

    bazzargh said:

    me again
    I got curious about your comment that .NET has a dispatch-on-match facility, and went to find it: Dare Obasanjo has an article about using XPathReader, which appears to be it. It still looks quite early, but it's nice. Reading the workspace forum, it seems they've got the same issues with XPath's random access being a mismatch to streaming, though.

    bazzargh said:

    Nobody nodes the trouble I've seen
    What I want is a function that will take "x/y/z", rewrite it to "self::z[parent::y[parent::x]]" and then test using that

    Just to make it clearer why I don't think that's the best plan: it means storing everything that the "does this node match" test might need, so you have it ready at the point when the test is done. With XPath, that means the whole tree (i.e. including nodes the stream processor hasn't reached yet).

    bazzargh said:

    Nobody nodes the trouble I've seen
    I can see why you'd want this (you're trying to test "x/y/z | x/z" and then check which matched), but I'm not sure I agree it's a good idea.

    Writing a Schematron-like processor on top of Jaxen is fairly trivial (I have one in the works for validating object graphs in our own system), but it won't perform on huge nodesets (Jaxen works by narrowing over the set of all nodes as it steps through the XPath, so it requires something tree-like).

    Hence your need for an XPath engine optimized for stream processing ... but XPath is designed for random access and can end up pulling the whole thing in anyway. Maybe a better fit would be a 'streaming Schematron' with the STXPath query language instead?

    rjelliffe said:

    Nobody nodes the trouble I've seen
    That is close, but not it: that Jaxen method evaluates the XPath from a context which you provide. If you queried your node with the XPath "x/y/z", it would answer true if it has at least one child x containing at least one child y containing at least one child z.

    What I want is a function that will take "x/y/z", rewrite it to "self::z[parent::y[parent::x]]" and then test using that, IYSWIM.
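
The rewrite described here can be sketched in a few lines on top of the JDK's javax.xml.xpath (not Jaxen). This is a sketch under narrow assumptions: the class ReverseMatch and its methods are hypothetical names, and only relative paths made of plain child steps with unprefixed element names are handled. Predicates, "//", absolute paths and namespaces are rejected rather than rewritten.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper: asks one node "do you match this path?" by
// rewriting the path to run backwards from the node.
class ReverseMatch {

    /** Rewrites "x/y/z" to "self::z[parent::y[parent::x]]". */
    static String rewrite(String path) {
        String expr = null;
        for (String step : path.split("/", -1)) {
            // Assumption: each step is a plain element name, nothing fancier.
            if (!step.matches("[A-Za-z_][A-Za-z0-9_.-]*")) {
                throw new IllegalArgumentException("unsupported step: '" + step + "'");
            }
            // Each later step wraps the earlier ones as a parent:: predicate.
            expr = (expr == null) ? step : step + "[parent::" + expr + "]";
        }
        return "self::" + expr;
    }

    /** True if the node itself is selected by the rewritten path. */
    static boolean matches(Node node, String path) {
        try {
            return (Boolean) XPathFactory.newInstance().newXPath()
                    .evaluate(rewrite(path), node, XPathConstants.BOOLEAN);
        } catch (XPathExpressionException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Small parsing helper so the sketch is self-contained. */
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true);
            return dbf.newDocumentBuilder().parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException(e);
        }
    }
}
```

For the document <x><y><z/></y><z/></x>, matches(innerZ, "x/y/z") is true and matches(outerZ, "x/y/z") is false, and neither call touches any node other than the one being asked and its ancestors. Evaluating "x/y/z" forwards from the inner z, by contrast, is false, because that z has no x child.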

    bazzargh said:

    Nobody nodes the trouble I've seen
    "Do you match this XPath?"

    Is this what you mean?
    http://jaxen.org/apidocs/org/jaxen/BaseXPath.html#booleanValueOf(java.lang.Object)

    "public boolean booleanValueOf(Object node)
    throws JaxenException

    Retrieve a boolean-value interpretation of this XPath expression when evaluated against a given context...an expression that selects zero nodes will return false, while an expression that selects one-or-more nodes will return true. "

    The other stuff sounds vaguely like it needs XPath support in StAX, though I feel there ought to be a way to do this using DOM+Jaxen. I just can't picture it right now :(
