Tuesday, January 27, 2009

Optimizing XQueries

Can we imagine programming in the SOA world without knowledge of XML technologies? If we are working with ALSB and ALDSP, knowledge of XPath and XQuery is of prime importance. Here I discuss good practices for writing XQueries.

Explanations of each of the following tips can be found at the end of the article.

DON'TS
Here are a few things that we must seek to avoid:
  1. Don't use eval ()
  2. Don't evaluate expressions several times over, and avoid redundant expressions.
  3. Don't use //
  4. Don't query constructed document fragments
DOS
Here are some recommendations for optimization:
  1. Minimize the execution of queries based on a given search expression. Try instead to use navigation paths based on the parent, children and siblings of a node which has already been retrieved.
  2. Make appropriate use of indexes adapted to your search criteria.
  3. Code quality. To do:
    1. Put $Id$ inside a comment at the top of the file, along with internal documentation of the HTTP parameters.
    2. Document in XQuery the argument types and the return type.
    3. Use meaningful names for variables and functions, without abbreviations, and avoid ambiguous terms.
    4. Use Javadoc-style tags as in xqDoc ( http://www.xqdoc.org/ ): @param, @return.
    5. Keep data retrieval separate from result construction.
        _______________________________________________________________________________________________

        EXPLANATIONS

        Don't use eval ()

        The snag is, the arguments to the eval () function can't be cached. Beyond that, using eval () leads to a style of programming that's hard to read and to debug. And eval () can always be replaced by a standard expression.
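As a hedged illustration of replacing eval () with a standard expression: when the set of possible paths is known in advance, a plain conditional can choose among them. The function name local:select, the $kind keys and the inline sample document are all invented for this sketch; in practice $root would come from a stored document.

```xquery
xquery version "1.0";

(: Hypothetical helper: maps a short key to a fixed, statically known path. :)
declare function local:select($root as node(), $kind as xs:string) as element()* {
  if ($kind eq "b") then $root/a/b
  else if ($kind eq "c") then $root/a/c
  else ()
};

(: Inline sample data, only to keep the sketch self-contained. :)
let $doc := document { <a><b>1</b><c>2</c></a> }
return local:select($doc, "b")
```

Because every path appears literally in the query, the engine can parse it once, and the code stays readable.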

        Don't evaluate expressions several times over and avoid redundant expressions

        Don't count on the XQuery engine to perform the kind of query analysis or optimization that a Java compiler does: no factoring out of repeatedly evaluated expressions, no elimination of code that won't be executed, and so on. Pay particular attention to repeatedly evaluated expressions. They should be evaluated only once and the result placed into a variable, which also makes for more readable code.
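A minimal sketch of the idea, using invented sample data: the path is evaluated once, bound to $prices, and then reused three times.

```xquery
xquery version "1.0";

(: Inline sample data, only to keep the sketch self-contained. :)
let $orders :=
  <orders>
    <order><price>10</price></order>
    <order><price>25</price></order>
  </orders>
(: Evaluate the path once and reuse the bound sequence. :)
let $prices := $orders/order/price
return
  <summary count="{count($prices)}" total="{sum($prices)}" highest="{max($prices)}"/>
```

Writing $orders/order/price inside each of the three function calls would give the same result, but the path would be walked three times.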

        Don't use //

        $a//b causes a complete traversal of all nodes of which $a is the root in search of an element b. In most cases the location of b is fairly precisely known, so it is better to specify it.
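The difference can be sketched on a small invented fragment. The two expressions are not equivalent; the point is that when the wanted b is known to sit under x, the explicit path says so and avoids visiting the rest of the tree.

```xquery
xquery version "1.0";

(: Inline sample data, only to keep the sketch self-contained. :)
let $a :=
  <a>
    <x><b>first</b></x>
    <y><z><b>second</b></z></y>
  </a>
return (
  count($a//b),    (: visits every descendant of $a: finds both b elements :)
  count($a/x/b)    (: follows one known path: finds only the b under x :)
)
```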

        Don't query constructed document fragments

        A typical example (to avoid):
        let $e := content (: $e is a constructed document fragment :)
        let $result := $e/b/text()
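One possible way around this, sketched on the assumption that the value was already available before the fragment was built: keep it in a variable and reuse that variable instead of navigating back into the constructed element. The variable $title and the element structure are invented for the example.

```xquery
xquery version "1.0";

let $title := "Optimizing XQueries"
(: $e is a constructed document fragment. :)
let $e := <a><b>{ $title }</b></a>
(: Reuse the variable rather than querying $e/b/text(). :)
let $result := $title
return ($e, $result)
```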

        Minimize the execution of queries based on a given search expression.

        A query like

        let $res := collection("/db/projects")/a/b[id = $val]

        causes a complete scan of an entire collection. Admittedly, queries like this are at the heart of an XQuery (and account for most of its execution time). But once the result $res has been retrieved, it can be efficiently used as a starting point for navigation to its parent, siblings and children:

        let $a := $res/parent::a
        let $next-sibling := $a/following-sibling::a[1]
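Put together as a runnable sketch: one search-based query, then navigation only. The inline document, with an invented projects wrapper element, stands in for collection("/db/projects"). Note that the standard XPath axis for the next sibling is following-sibling::, here restricted to the first match with [1].

```xquery
xquery version "1.0";

(: Inline stand-in for the stored collection, to keep the sketch self-contained. :)
let $doc :=
  document {
    <projects>
      <a><b><id>1</id></b></a>
      <a><b><id>2</id></b></a>
    </projects>
  }
let $val := "1"
(: The one search-based query. :)
let $res := $doc/projects/a/b[id = $val]
(: Everything else is navigation from the node already found. :)
let $a := $res/parent::a
let $next-sibling := $a/following-sibling::a[1]
return $next-sibling/b/id/string()
```

With this sample data the query returns the string "2", the id found under the a element that follows the matched one.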

        Make appropriate use of indexes adapted to your search criteria.

        There are currently three types of user-configurable indexes in the eXist XQuery database. All require pre-indexation either of the base collection or of specified node-sets in sub-collections.
        1. The fulltext index, which indexes lexical tokens ("words" in Western scripts). Indexation can be configured to include or exclude nodes specified using a limited subset of XPath.
        2. Typed indexes over nodes specified by a limited subset of XPath (called "range indexes" because they permit queries referring to a range of numerical values).
        3. Indexes by tag name ("QName index"): http://wiki.exist-db.org/space/jmvanel/New+index+by+QName

        Index 2 is slower than index 3, but has two advantages:
        • The request code doesn't have to be changed in order to use the index.
        • There is no danger of getting wrong results if the indexation hasn't been done.

        Index 3 lacks these advantages, but is almost as fast as a relational database. Such an index cannot be constrained by an XPath, but only by a tag name. Both index 2 and index 3 are typed (integers or strings), and allow matching by criteria of equality or inequality (comparison).

        Document in XQuery the argument and return types

        Don't write:

        declare function local:add($n, $m) {
        $n + $m
        };

        Write this instead. It is more explicit and self-documenting, and for the same price you get run-time argument checking. If you know for sure which types you are manipulating, declare them!

        declare function local:add($n as xs:integer, $m as xs:integer)
        as xs:integer {
        $n + $m
        };
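A complete, runnable version of the typed function might look like this. The xqDoc-style comment follows the tags recommended in the code quality list, and the return type is written as xs:integer because the sum of two integers is an integer. The wording of the comment is illustrative only.

```xquery
xquery version "1.0";

(:~
 : Adds two integers.
 :
 : @param $n the first operand
 : @param $m the second operand
 : @return the sum of $n and $m
 :)
declare function local:add($n as xs:integer, $m as xs:integer) as xs:integer {
  $n + $m
};

local:add(2, 3)
```

This evaluates to 5. A call such as local:add("2", 3) is rejected with a type error (XPTY0004) instead of being silently converted, which is the argument checking mentioned above.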

        Also, keep data retrieval separate from result construction. It is good practice to bind the child element nodes used in result construction to variables, rather than retrieving them from the root each time.
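A small sketch of that separation, with invented element names and sample data: all retrieval happens in the let clauses, and the return clause only assembles the result from the bound variables.

```xquery
xquery version "1.0";

(: Retrieval: inline sample data stands in for a stored document. :)
let $project := <project><name>Alpha</name><owner>Dana</owner><status>open</status></project>
let $name := $project/name/string()
let $owner := $project/owner/string()

(: Construction: uses only the variables bound above. :)
return
  <row>
    <name>{ $name }</name>
    <owner>{ $owner }</owner>
  </row>
```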

        2 comments:

        1. Hi,

          thanks for this useful post.

          I have an XQuery application which queries the database(which has around 100000 xml's). I need to output only the first 100 results for any query. Is there an easy way to do this ?

          Also, could you pls elaborate on how not to use eval() ? I have the search path for the query in a string form, and I am forced to use eval().

          eg.,
          collection("/db/projects")/a/b
          if this is my search path, I have '/a/b' as a string, not a node-set.

          btw, I am using eXistDB database.

          Thank you.

        2. To limit the result set returned from an XQuery operation on a database, I think this might work:

          let $resultSet := (for $rows in /table order by $rows/title descending return $rows) return subsequence($resultSet, 1, 100)

          And there is no hard rule against using eval(). It's just that, since it is used to dynamically execute a constructed XQuery expression inside a running XQuery script, it tends to become an overhead in some cases.
