A search engine helps a user to locate information. Using a search engine, a user can enter one or more search query terms and obtain a list of resources that contain or are associated with subject matter that matches those search query terms. While search engines may be applied in a variety of contexts, search engines have been used especially for locating resources that are accessible through the Internet.
XML (eXtensible Markup Language) is becoming increasingly popular as the format for describing and storing data. Thus, supporting search over XML documents is an important problem for modern search engines. XML documents have a hierarchical structure. Search engines that locate resources on the Internet cannot rely on the structure of the web pages being searched, and thus are not well-suited to take advantage of that structure when performing a search. The documents searched by an Internet search engine are treated as just a sequence of bytes, with no distinction between meta-data indicating structure and the document content.
XML elements are delimited by a start tag and a corresponding end tag. For example, in the following XML fragment, <Author> is a start tag and </Author> is an end tag to delimit an element.
<book>My book
  <publication publisher="Doubleday" date="January"></publication>
  <Author>Mark Berry</Author>
  <Author>Jane Murray</Author>
</book>
The data between the element start and end tags is referred to as the element's content. An element's content may include values and other elements. In the case of the Author element, the content of the element is the text data value Mark Berry. In the case of the book element, the content includes the value My book, the element publication, and two Author elements. An element is herein referred to by its element name. For example, the element delimited by the start and end tags <publication> and </publication> is referred to as publication.
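As an illustration only, the element content described above can be inspected with Python's standard xml.etree.ElementTree module. This is a minimal sketch, not part of the described approach; the variable names are hypothetical, and the fragment is written with straight quotation marks so that it parses.

```python
import xml.etree.ElementTree as ET

# The fragment from the text, written with straight quotes so that it parses.
FRAGMENT = (
    '<book>My book'
    '<publication publisher="Doubleday" date="January"></publication>'
    '<Author>Mark Berry</Author>'
    '<Author>Jane Murray</Author>'
    '</book>'
)

book = ET.fromstring(FRAGMENT)

# Text value that appears directly inside <book>, before its first child element.
book_value = (book.text or "").strip()

# Names of the elements contained in <book>, in document order.
child_names = [child.tag for child in book]

# The content of each Author element is a text data value.
authors = [author.text for author in book.findall("Author")]

print(book_value)    # My book
print(child_names)   # ['publication', 'Author', 'Author']
print(authors)       # ['Mark Berry', 'Jane Murray']
```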
Information about the structure of specific types of XML documents may be specified in documents referred to as “XML schemas”. For example, the XML schema for a particular type of XML document may specify the names for the data items contained in that particular type of XML document, the hierarchical relationship between the data items contained in that type of XML document, data types of the data items contained in that particular type of XML document, etc. In addition to wanting to search for data values contained within a collection of XML documents, users may find it useful to be able to search and discover the structure of the documents as well. Internet search engines do not provide the user the ability to search for elements of the structure of the document, such as for tag names.
Node Tree Model
An important standard for XML is the XQuery 1.0 and XPath 2.0 Data Model (see W3C Working Draft 9 Jul. 2004, which is incorporated herein by reference). One aspect of this model is that an XML document is represented by a hierarchy of nodes that reflects the hierarchical nature of the XML document. A tree hierarchy of nodes is composed of nodes at multiple levels. Nodes at the same level are sibling nodes. In a tree hierarchy or node tree, each child node has only one parent node that resides at a higher level than the child node, but a parent node may have multiple child nodes residing at a lower level than the parent node. In a tree hierarchy, the root node is one that has no parent node, and a leaf node has no child nodes. A tree hierarchy has a single root node.
In a node tree that represents an XML document, a node can correspond to an element. Each child node of that node corresponds to an attribute of the element or to another element contained in the element.
The node may be associated with a name. For example, the name of the node representing the element book is book. For a node representing the attribute publisher, the name of the node is publisher.
For convenience of expression, elements and other parts of an XML document are referred to as nodes within a tree of nodes that represents the document. Thus, a node representing an element may be referred to by the element name, and a node value may be referred to as the element value. For example, referring to ‘My book’ as the value of the node with the name book is just a convenient way of expressing that the value of the element associated with node book is My book. The name of an element, attribute, or node is also referred to herein as a tag name.
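The node tree described above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions: the Node class and the build and leaf_names helpers are hypothetical names introduced here, only element and attribute nodes are modeled (as in the description above), and the sample document is the book fragment used earlier.

```python
from dataclasses import dataclass, field
from typing import List
import xml.etree.ElementTree as ET

DOC = (
    '<book>My book'
    '<publication publisher="Doubleday" date="January"></publication>'
    '<Author>Mark Berry</Author>'
    '<Author>Jane Murray</Author>'
    '</book>'
)

@dataclass
class Node:
    """Hypothetical node type: one node per element or attribute."""
    name: str                      # tag name of the element or attribute
    kind: str                      # "element" or "attribute"
    value: str = ""                # element text or attribute value
    children: List["Node"] = field(default_factory=list)

def build(elem: ET.Element) -> Node:
    """Build a node for an element; its attributes and child elements become child nodes."""
    node = Node(elem.tag, "element", (elem.text or "").strip())
    for attr_name, attr_value in elem.attrib.items():
        node.children.append(Node(attr_name, "attribute", attr_value))
    for child in elem:
        node.children.append(build(child))
    return node

def leaf_names(node: Node) -> List[str]:
    """Return the names of all nodes below (or at) `node` that have no child nodes."""
    if not node.children:
        return [node.name]
    names: List[str] = []
    for child in node.children:
        names.extend(leaf_names(child))
    return names

root = build(ET.fromstring(DOC))           # the single root node, named book
siblings = [c.name for c in root.children]  # nodes at the same level under book

print(root.name, "=", root.value)  # book = My book
print(siblings)                    # ['publication', 'Author', 'Author']
print(leaf_names(root))            # ['publisher', 'date', 'Author', 'Author']
```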
The path for a node in an XML document is the series of nodes, starting from an ancestor node in the XML document, that is traversed to arrive at a particular node further down in the hierarchy. For example, the path from the root of the XML document to node publication is represented by ‘/book/publication’.
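For illustration, the root-to-node paths of the sample book document can be enumerated with a short recursive walk. This is a sketch only: node_paths is a hypothetical helper, and writing attribute nodes with a leading ‘@’ follows the abbreviated XPath syntax rather than anything stated above.

```python
import xml.etree.ElementTree as ET

DOC = (
    '<book>My book'
    '<publication publisher="Doubleday" date="January"></publication>'
    '<Author>Mark Berry</Author>'
    '<Author>Jane Murray</Author>'
    '</book>'
)

def node_paths(elem, prefix=""):
    """Return the path from the root to `elem` and to every node beneath it."""
    path = f"{prefix}/{elem.tag}"
    paths = [path]
    # Attribute nodes, written with '@' (an assumption borrowed from XPath syntax).
    paths.extend(f"{path}/@{name}" for name in elem.attrib)
    # Child element nodes, visited in document order.
    for child in elem:
        paths.extend(node_paths(child, path))
    return paths

paths = node_paths(ET.fromstring(DOC))
for p in paths:
    print(p)
# /book
# /book/publication
# /book/publication/@publisher
# /book/publication/@date
# /book/Author
# /book/Author
```

The two Author nodes share the same path, which shows that a path names a location in the hierarchy rather than a single node.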
Path Expressions
XML documents may be searched by using an XML query language such as XQuery/XPath. XML Query Language (“XQuery”) and XML Path Language (“XPath”) are important standards for querying data in XML documents. The primary syntactic construct in XPath is an expression, which is evaluated to yield an object. XPath expressions are described in Section 3 (“Expressions”) of “XML Path Language (XPath)” (version 1.0). A path is a location of a node within an XML document hierarchy; a path expression is a representation of (a way of expressing or specifying) that location. Constructing a path expression may require that the user know the structure of the document. Thus, when the collection of XML documents does not have a schema that expresses their structure, or there is not one common schema to which all XML documents in the collection conform, it can be difficult to formulate an XQuery/XPath query that finds information in those documents.
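As a rough illustration of why a path expression depends on knowledge of the document structure, the following sketch evaluates a few expressions against the sample book document. It assumes Python's xml.etree.ElementTree module, which supports only a limited subset of XPath and evaluates expressions relative to a context element (here the book element), so ‘./Author’ plays the role of ‘/book/Author’. The element name Writer is a deliberately wrong guess made up for this example.

```python
import xml.etree.ElementTree as ET

DOC = (
    '<book>My book'
    '<publication publisher="Doubleday" date="January"></publication>'
    '<Author>Mark Berry</Author>'
    '<Author>Jane Murray</Author>'
    '</book>'
)

book = ET.fromstring(DOC)

# Correct path expression: the user knows that Author is a child of book.
authors = [a.text for a in book.findall("./Author")]

# Path expression with a predicate on an attribute value.
publishers = [p.get("publisher")
              for p in book.findall("./publication[@date='January']")]

# A guess made without knowing the structure: the expression is valid,
# but no element named Writer exists, so nothing is found.
missing = book.findall("./Writer")

print(authors)     # ['Mark Berry', 'Jane Murray']
print(publishers)  # ['Doubleday']
print(missing)     # []
```

A full XPath or XQuery processor would accept richer expressions, but the same limitation applies: an expression that names the wrong element or the wrong level of the hierarchy returns nothing.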
What is needed is a way to provide assistance to users wanting to search a collection of XML documents when the user does not have complete knowledge of the documents' structure. Such an approach can be used for targeting the search to a particular portion of the XML hierarchy.
Approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.