As is known in the art, the eXtensible Markup Language (XML) employs a tree-structured model for representing data. Queries in XML query languages typically specify patterns of selection predicates on multiple elements that have some specified tree structured relationships. For example, the XQuery expression:book[title=‘XML///author[fn=jane’ AND In=‘doe]matches author elements that (i) have a child subelement “fn” with content lane”, (ii) have a child subelement “In” with content “doe”, and (iii) are descendants of book elements that have a child title subelement with content XML. This expression can be represented as a node-labeled twig (or small tree) pattern with elements and string values as node labels.
Finding all occurrences of a twig pattern in a database is a core operation in XML query processing, both in relational implementations of XML databases, and in native XML databases. Known processing techniques typically decompose the twig pattern into a set of binary (parent-child and ancestor-descendant) relationships between pairs of nodes, e.g., the parent-child relationships (book, title) and (author, fn), and the ancestor-descendant relationship (book, author). The query twig pattern can then be matched by (i) matching each of the binary structural relationships against the XML database, and (ii) “stitching” together these basic matches.
In one known attempt at solving the first sub-problem of matching binary structural relationships, Zhang et al., “On Supporting Containment Queries in Relational Database Management Systems,” Proceedings of ACM SIGMOD, 2001, (hereafter “Zhang”), proposed a variation of the traditional merge join algorithm, the multi-predicate merge join (MPMGJN) algorithm, based on the (Dodd, LeftPos RightPos, LevelNum) representation of positions of XML elements and string values. Zhang's results showed that the MPMGJN algorithm could outperform standard RDBMS join algorithms by more than an order of magnitude. Zhang is incorporated herein by reference.
A further sub-problem of stitching together the basic matches obtained using binary “structural” joins requires identifying a ‘good’ join ordering in a computational cost-based manner taking selectivities and intermediate result size estimates into account. A basic limitation of this traditional approach for matching query twig patterns is that intermediate result sizes can get quite large, even when the input and final result sizes are more manageable.
It would, therefore, be desirable to overcome the aforesaid and other disadvantages.