1. Field of the Invention
The present invention relates generally to an improved data processing system, and specifically to an improved method and apparatus for organizing data. More specifically, the present invention relates to a computer implemented method, an apparatus, and a computer usable program product for indexing data.
2. Description of the Related Art
A tree is a common type of data structure used to represent an Extensible Markup Language (XML) document. A tree is a data structure formed from a set of connected nodes that includes a root node, a set of internal nodes, and a set of leaf nodes. The root node is the top-most node, or the parent node from which all other nodes branch off. Child nodes descend from the root node, and the leaf nodes, which have no children, are the bottom-most nodes. The nodes between the root node and the leaf nodes are considered internal nodes. A subtree typically branches from an internal node, and the subtree rooted at an internal node includes a set of leaf nodes.
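By way of illustration only, the node roles described above can be sketched in a few lines of Python. The `Node` class, the `classify` helper, and the sample `book` document are hypothetical names chosen for this example and are not taken from any particular XML library.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    """One node of a tree; the class name and fields are illustrative."""
    label: str
    children: List["Node"] = field(default_factory=list)

    def is_leaf(self) -> bool:
        # A leaf node has no children.
        return not self.children


def classify(root: Node) -> Tuple[str, List[str], List[str]]:
    """Return (root label, internal-node labels, leaf labels) in document order."""
    internal: List[str] = []
    leaves: List[str] = []
    stack = [root]
    while stack:
        node = stack.pop()
        if node is not root:
            (leaves if node.is_leaf() else internal).append(node.label)
        # Push children in reverse so they are visited left to right.
        stack.extend(reversed(node.children))
    return root.label, internal, leaves


# Corresponds to <book><title/><author><first/><last/></author></book>
doc = Node("book", [
    Node("title"),
    Node("author", [Node("first"), Node("last")]),
])

print(classify(doc))
```

In this sample, `book` is the root node, `author` is the only internal node, and `title`, `first`, and `last` are leaf nodes.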
Efficiently evaluating twig queries in XML documents is at the core of structured query processing. A twig is a branch extending or descending from the root node. A query is a method for extracting information from a data structure. Therefore, a twig query is the extraction of information from a tree data structure via one of the branches extending from the root node.
Typically, to evaluate twig queries, current approaches disassemble a query into multiple root-to-leaf simple paths. In other words, current approaches break down the query tree so that each branch is serialized into a single path. With the assistance of some indexing structures, the simple path queries are each independently evaluated, and the results of each independent evaluation are subsequently joined together to form a final answer. However, the process of disassembling a query and joining the intermediate results into a final result is an expensive operation because the process consumes substantial memory and other hardware resources, which degrades the performance of other requests being processed in the data processing system. Consequently, joining intermediate results into a final result is one of the most significant costs in evaluating twig queries.
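The disassemble-then-join pattern described above can be sketched as follows, purely as an illustration. The tuple layout `(label, [children])`, the function names, and the sample documents are assumptions made for this example. The join step is simplified to an intersection of matching document identifiers; an actual structural join must also verify that the matched paths share the same branching node, which is one reason the operation is costly.

```python
def root_to_leaf_paths(node, prefix=()):
    """Disassemble a tree into its root-to-leaf simple paths."""
    label, children = node
    path = prefix + (label,)
    if not children:
        return [path]
    paths = []
    for child in children:
        paths.extend(root_to_leaf_paths(child, path))
    return paths


def has_path(node, path):
    """Return True if the label sequence `path` occurs starting at `node`."""
    label, children = node
    if label != path[0]:
        return False
    if len(path) == 1:
        return True
    return any(has_path(child, path[1:]) for child in children)


def evaluate_twig(twig, docs):
    """Evaluate each simple path independently, then join the partial results.

    Simplification (assumption): the join is a set intersection on document
    identifiers rather than a full structural join.
    """
    partial_results = [
        {doc_id for doc_id, tree in docs.items() if has_path(tree, path)}
        for path in root_to_leaf_paths(twig)
    ]
    return set.intersection(*partial_results)


# Hypothetical twig query: a book with a title and an author's last name.
twig = ("book", [("title", []), ("author", [("last", [])])])

# Hypothetical document collection.
docs = {
    "d1": ("book", [("title", []), ("author", [("first", []), ("last", [])])]),
    "d2": ("book", [("title", [])]),
    "d3": ("book", [("author", [("last", [])])]),
}

paths = root_to_leaf_paths(twig)
print(paths)
print(evaluate_twig(twig, docs))
```

Each simple path produces its own intermediate result set, and every such set must be held until the join completes, so the intermediate storage grows with the number of branches in the query.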
One method for eliminating intermediate joining operations is a sequence-based query process. The sequence-based query process converts documents into one-dimensional sequences, with each sequence including enough information that it can be converted back to the original tree format. However, current sequence-based approaches are under-optimized in both index space and query time, because tree structures are inherently incompatible with one-dimensional sequence structures. When trees are converted into one-dimensional sequences, the total order of the nodes from the original document is not translated into the final reconstructed tree. Therefore, redundancies in tree paths can exist in the final reconstructed tree, thereby increasing overall query times.
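As a minimal sketch of what converting a tree into a one-dimensional sequence can look like, the following example encodes each node as a `(depth, label)` pair in preorder and then rebuilds the tree from that sequence. This encoding is an assumption chosen for brevity and is not the encoding used by any particular sequence-based approach; the function names and the `(label, [children])` layout are likewise hypothetical.

```python
def to_sequence(node, depth=0):
    """Flatten a tree into a preorder list of (depth, label) pairs."""
    label, children = node
    sequence = [(depth, label)]
    for child in children:
        sequence.extend(to_sequence(child, depth + 1))
    return sequence


def from_sequence(sequence):
    """Rebuild the tree from a preorder list of (depth, label) pairs."""
    root = None
    stack = []  # stack[d] holds the most recently seen node at depth d
    for depth, label in sequence:
        node = (label, [])
        del stack[depth:]  # drop nodes that are not ancestors of this one
        if stack:
            stack[-1][1].append(node)  # attach to the parent's child list
        else:
            root = node
        stack.append(node)
    return root


# Hypothetical document tree.
doc = ("book", [("title", []), ("author", [("first", []), ("last", [])])])

sequence = to_sequence(doc)
print(sequence)
print(from_sequence(sequence) == doc)
```

The depth value stored with each label is the extra information that makes the sequence convertible back to the original tree; a sequence of labels alone would be ambiguous.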
Furthermore, the priority of current sequence-based approaches is to ensure representation equivalence in sequencing and query equivalence in query processing. In other words, the sequence-based approach ensures that no ambiguity exists in the data representation, and that the original and only the original tree structure can be derived from the sequence. However, as indicated above, the reduction in ambiguity results in redundancies in tree paths, which in turn results in non-optimized index sizes and ultimately in larger indexes that cost more to store.