XML is a markup language for documents containing structured information. A document that conforms to the XML standard (“an XML document”) contains one or more elements, the boundaries of which are delimited by angle brackets using start-tags and end-tags, or, for empty elements, by an empty-element tag. For example,
        <background> </background>
is an element bounded by a start-tag and an end-tag, and
        <standalone/>
is an example of an empty-element tag.
Each element has a type, identified by name, and may have a set of attribute specifications. For example, the type of
        <background class="example"> </background>
is ‘background.’ Each attribute specification has a name and a value. In XML, all attribute values are quoted. Thus, the name of the attribute specification (or simply the attribute) in the above element is ‘class,’ while the value is ‘example.’
Elements may circumscribe or be associated with displayed content. For example, the following ‘background’ element,
        <background class="example">Hello World</background>
circumscribes the text “Hello World.” In addition to or instead of text, other elements may appear between the start-tag and the end-tag of an element. For ease of explanation, any text or elements within the start-tag and end-tag of an element will be said to be circumscribed by that element.
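The terms introduced above (element type, attribute name and value, circumscribed text) can be illustrated with a minimal sketch that parses the ‘background’ element using Python's standard xml.etree.ElementTree module. The variable names are chosen here for illustration only, and the attribute value is written with straight quotes, as XML syntax requires.

```python
import xml.etree.ElementTree as ET

# The 'background' element from the text, with straight quotes around
# the attribute value.
doc = '<background class="example">Hello World</background>'

element = ET.fromstring(doc)

element_type = element.tag          # name identifying the element's type
attributes = dict(element.attrib)   # attribute name -> attribute value
circumscribed_text = element.text   # text between the start-tag and end-tag

print(element_type)        # background
print(attributes)          # {'class': 'example'}
print(circumscribed_text)  # Hello World
```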
Various approaches may be used to retrieve existing XML documents based on a set of search criteria. XML documents that satisfy a particular set of search criteria are referred to hereafter as “matching XML documents”. For example, one may wish to retrieve all matching XML documents that contain a specified set of elements and/or element attributes.
One approach for retrieving matching XML documents is to perform a brute force search. A brute force search for XML documents is characterized by examining each of the XML documents, one at a time, to determine if the XML document currently being analyzed corresponds to the set of search criteria before analyzing another XML document. If a set of XML documents is stored in a set of one or more database tables, where one XML document resides in each row of the one or more tables, a brute force search of those documents would be performed by examining each row of the one or more tables to determine if the XML document in that row meets the set of search criteria. The brute force search is undesirable because it is slow and inefficient, especially if the table storing the set of XML documents to be searched is large, as a full table scan must be performed.
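The cost of the brute force approach can be seen in a small sketch. Here the table is modeled, purely as an assumption for illustration, by a Python dictionary mapping a row identifier to the XML document stored in that row; the sample documents and the function name brute_force_search are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical table: one XML document per row, keyed by a row id.
rows = {
    1: "<A><B><C/></B><B/></A>",
    2: "<A><B/></A>",
    3: '<D class="example"><C/></D>',
}

def brute_force_search(rows, element_type):
    """Return the ids of rows whose document contains an element of the given type."""
    matches = []
    for row_id, text in rows.items():   # full scan: every row is visited
        root = ET.fromstring(text)      # every document is parsed in turn
        if any(el.tag == element_type for el in root.iter()):
            matches.append(row_id)
    return matches

print(brute_force_search(rows, "C"))  # [1, 3]
```

Every row is parsed and inspected regardless of whether it matches, so the work grows with the total size of the table rather than with the number of matching XML documents.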
Another approach for retrieving matching XML documents involves using a node oriented tree index. FIG. 1 is an illustration of a node oriented tree index 100 used in retrieving XML documents according to this approach. Displayed on FIG. 1 is a set of elements 102 and a node oriented tree index 100 that represents elements 102. The set of elements 102 is an example of the elements that may be found within an XML document. Individual nodes of node oriented tree index 100 contain information related to the elements. The top-level node 110 of node oriented tree index 100 corresponds to an element of type A. Nodes 112 and 114, which are child nodes of the top-level node 110, correspond to those elements immediately circumscribed by that element, namely, two elements of type B. The first element of type B itself circumscribes an element of type C, which is represented by node 116.
Node oriented tree index 100 may comprise an arbitrary number of levels. As a result, node oriented tree indexes are hard to analyze: multiple-level jumps are difficult to perform because the nodes do not contain information about the overall structure of the index, but merely contain references to parent and child nodes. For example, upon analyzing node 110, one cannot determine how many nodes one must traverse in order to locate elements of type C, or the most efficient way to determine where a particular element is represented. In the worst case, one may have to traverse the entire tree to locate the representation of a particular element. As the tree becomes deeper and wider, the inefficiencies of searching the entire tree increase.
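The limitation can be sketched with a small model of such an index. The Node class and the find_all function below are hypothetical names introduced for illustration, and the tree shape is an assumption based on the description of FIG. 1 (node 110 of type A, child nodes 112 and 114 of type B, and node 116 of type C taken to be a child of node 112).

```python
class Node:
    """Index node that knows only its parent and its children."""

    def __init__(self, element_type, parent=None):
        self.element_type = element_type
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

def find_all(start, element_type):
    """Depth-first walk; returns the matching nodes and the number of nodes visited."""
    found, visited, stack = [], 0, [start]
    while stack:
        node = stack.pop()
        visited += 1
        if node.element_type == element_type:
            found.append(node)
        # Nothing in a node says where type C lives, so every branch is explored.
        stack.extend(reversed(node.children))
    return found, visited

# Assumed shape of node oriented tree index 100.
node_110 = Node("A")
node_112 = Node("B", parent=node_110)
node_114 = Node("B", parent=node_110)
node_116 = Node("C", parent=node_112)

found, visited = find_all(node_110, "C")
print(len(found), visited)  # 1 4
```

Locating the single node of type C requires visiting all four nodes, because no node records anything about the structure beyond its immediate neighbors.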
An alternate approach for retrieving XML documents that meet a set of search criteria involves using an inverted index. In this context, an inverted index is an index that uses entries that reference individual documents in a set of documents. For example, consider an inverted index that indexes a set of text-based documents. Each entry in the inverted index comprises a word and a list of documents, possibly with locations within the text, where that word occurs.
For example, suppose one wishes to search three documents, named “1”, “2”, and “3”, whose contents are respectively: “the cat in the hat,” “the cat on the mat,” and “I put the hat on the mat.” If the index is in the format of ‘word (text where word is found, position of word within the text)’, the index with location information may be represented by:
        the (1,1); (1,4); (2,1); (2,4); (3,3); (3,6)
        cat (1,2); (2,2)
        in (1,3)
        hat (1,5); (3,4)
        on (2,3); (3,5)
        mat (2,5); (3,7)
        I (3,1)
        put (3,2).
The word “cat” is in document 1 (“the cat in the hat”) starting at position 2, and therefore has an entry (1,2). To find, for instance, documents with both “on” and “mat,” first look up the words in the index, and then find the intersection of the documents in each list. In this case, documents 2 and 3 have both words. Thus, the inverted index may be used to retrieve a list of documents that contain the specified search terms.
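The construction and lookup described above can be sketched as follows, using the three example documents. The function names build_inverted_index and documents_containing_all are hypothetical, and words are split on whitespace with positions counted from 1, matching the ‘word (text, position)’ format of the example index.

```python
from collections import defaultdict

documents = {
    1: "the cat in the hat",
    2: "the cat on the mat",
    3: "I put the hat on the mat",
}

def build_inverted_index(documents):
    """Map each word to a list of (document id, position) entries."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        for position, word in enumerate(text.split(), start=1):
            index[word].append((doc_id, position))
    return index

def documents_containing_all(index, words):
    """Intersect the sets of document ids listed under each search word."""
    doc_sets = [{doc_id for doc_id, _ in index.get(word, [])} for word in words]
    return sorted(set.intersection(*doc_sets)) if doc_sets else []

index = build_inverted_index(documents)
print(index["cat"])                                   # [(1, 2), (2, 2)]
print(documents_containing_all(index, ["on", "mat"]))  # [2, 3]
```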
Even though a measure of how close the words appear to each other may be determined by comparing the positions of words within the document, inverted indexes do not store the relationship between the words and thus cannot perform complex queries. For example, inverted indexes could not be used to retrieve all documents that contain the word “cat” in the first sentence of the third paragraph. Queries that involve the additional complexity introduced by the XML language within documents are likewise beyond the capabilities of inverted indexes.
Based on the foregoing, it is highly desirable to provide a mechanism for processing a query to retrieve XML documents that overcomes the problems and limitations of the prior art.
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.