The present invention relates to the field of data processing, and particularly to a software system and associated method for use with a search engine, to search indexed and raw data obtained from systems that are linked together over an associated network such as the Internet. This invention more particularly pertains to a computer software product for retrieving resources in XML documents through the extensive use of schema (Document Type Definition) for query processing and optimization.
The World Wide Web (WWW) is comprised of an expansive network of interconnected computers upon which businesses, governments, groups, and individuals throughout the world maintain inter-linked computer files known as web pages. There is an increasing interest in caching and indexing web pages so as to facilitate the query process using the indexed information. To this end, a crawler crawls the web and extracts metadata, or information about the data on the web, and stores the metadata in a data repository or data store. The vastness of the unstructured WWW causes users to rely primarily on Internet search engines to retrieve or locate information.
Pages on the WWW commonly use HTML as the computer language that describes and displays the page information on the web. HTML is a computer language primarily used for browsing or viewing documents on the web. An HTML document is typically treated as a single block of text, which is easily indexed and searched.
Increasingly, web pages are written using XML, a computer language more complex than HTML that renders information for viewing on the web. Also, the hierarchical nature of XML makes it suitable for describing meta (index) information. The XML standard specification requires that XML documents be well formed, i.e. always have a proper nested structure. However, XML documents need not be valid to meet the XML specification, i.e. the document may not have a schema or may not be validated against an associated schema. In such cases, it is possible to infer the schema from the document. But the multiplicity of non-uniform schemas is difficult, time consuming, and inefficient for a search engine to search or query. To optimize the query process, the number of schemas queried must be minimized.
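The distinction between a well-formed document and a valid one can be illustrated with a minimal sketch. It uses Python's standard xml.etree.ElementTree parser, which checks nesting but does not validate against a DTD; the helper name is_well_formed and the sample strings are illustrative assumptions, not part of the described system.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text: str) -> bool:
    """Return True if the text parses as XML, i.e. its tags are properly nested."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Properly nested and carrying no DTD: well formed, although never validated.
nested = "<book><title>XML</title></book>"
# Overlapping tags: not well formed, so the parser rejects it.
overlapping = "<book><title>XML</book></title>"

print(is_well_formed(nested))       # True
print(is_well_formed(overlapping))  # False
```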
Traditionally, query process optimization has focused either on the structure or on the content of the XML document. The optimization technique based on structure or schemas does not look at the entire XML document but rather searches the schema or its Document Type Definition (DTD). A DTD describes the structure of an XML document. A structure-based query optimizer extracts from the DTD three kinds of structural relationship sets or rules:
Obligation, which for a particular element is the set of dependents that this element must necessarily have.
Exclusivity, which for an element is the set of ancestors that must be present for that particular element.
Entrance location, which describes the elements that should be present in any path between two nodes.
The limitation of this query optimization is that only three structural relationship sets are used. Queries usually contain other structural conditions that are not captured by these three relationships.
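As a rough illustration of the first two relationship sets, the following sketch computes them over a simplified, hypothetical DTD. The dictionary representation (element mapped to its children, each flagged as required or optional), the element names, and the function names are all assumptions made for illustration; the sketch also assumes a non-recursive DTD. Entrance location, which is defined over pairs of nodes, is omitted for brevity.

```python
# Hypothetical, simplified DTD: element -> {child: is the child required?}
DTD = {
    "book": {"title": True, "author": True, "appendix": False},
    "author": {"name": True, "email": False},
    "appendix": {"title": True},
    "title": {}, "name": {}, "email": {},
}

def obligation(element, dtd):
    """Descendants that every instance of `element` must necessarily contain."""
    result, stack = set(), [element]
    while stack:
        current = stack.pop()
        for child, required in dtd[current].items():
            if required and child not in result:
                result.add(child)
                stack.append(child)  # follow required children transitively
    return result

def exclusivity(element, dtd, root):
    """Ancestors that lie on every path from `root` down to `element`."""
    def paths(node, trail):
        if node == element:
            yield set(trail)
            return
        for child in dtd[node]:
            yield from paths(child, trail + [node])
    found = list(paths(root, []))
    return set.intersection(*found) if found else set()

print(sorted(obligation("book", DTD)))           # ['author', 'name', 'title']
print(sorted(exclusivity("name", DTD, "book")))  # ['author', 'book']
print(sorted(exclusivity("title", DTD, "book"))) # ['book']
```

In this example a title can appear directly under book or under appendix, so only book is guaranteed to be among its ancestors.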
The second traditional approach to query optimization is to use indices. Indexing has typically been studied by persons skilled in the database field, who have not treated indexing as a requirement for tree pattern matching but have instead considered path indices, which are insufficient for tree pattern matching.
There is therefore still an unsatisfied need for an XML query optimization system that meets the following criteria:
The number of schemas queried by the query optimizer must be minimized in order to optimize the query process with respect to the query processing time and the size of the query index.
The structure and the content of an XML document must be indexed together.
The system must provide an indexing structure for tree pattern matching.
The structure extracted by the query optimizer must consider all structural relationship sets present in the schema.
Existing query optimization systems do not meet all of the foregoing criteria, and the need for such a system has heretofore remained unsatisfied.
This invention describes a system and method that satisfy this need and provide the capability to optimize the XML document query process by:
minimizing the number of schemas examined by a search engine by reducing the number of different schemas in the data store;
indexing both structure and content of the XML document to minimize the number of steps in the query process; and
extracting from the DTD all structural relationship sets applying to a specific XML document.
These features enable the system of the present invention to minimize the time required to process a search request while also minimizing data storage requirements.
The query system of the present invention provides several features and advantages, related to the following:
data model;
query language; and
index format.
The data model of the query system of the present invention views the XML repository as a set of schema/data pairs where, for every schema, the query system maintains the set of documents that conform to that schema. The data in an XML document is viewed by the query system as a graph with the “edges” between the graphs used to represent inter-document links. This data model allows queries on content, structure, plus inter-document links and intra-document links.
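One way to picture this data model is the following minimal sketch. The class name Repository, its method names, and the sample schema and document identifiers are hypothetical and chosen only for illustration; the sketch keeps, for each schema, the set of conforming documents, and records inter-document links as edges between document identifiers.

```python
from collections import defaultdict

class Repository:
    """Illustrative schema/data-pair repository with inter-document links."""

    def __init__(self):
        self.docs_by_schema = defaultdict(set)  # schema id -> conforming document ids
        self.links = defaultdict(set)           # document id -> linked document ids

    def add_document(self, schema_id, doc_id):
        self.docs_by_schema[schema_id].add(doc_id)

    def add_link(self, source_doc, target_doc):
        self.links[source_doc].add(target_doc)

    def candidates(self, schema_ids):
        """Documents to examine once a query has been matched to some schemas."""
        return set().union(*(self.docs_by_schema.get(s, set()) for s in schema_ids))

repo = Repository()
repo.add_document("catalog.dtd", "a.xml")
repo.add_document("catalog.dtd", "b.xml")
repo.add_document("invoice.dtd", "c.xml")
repo.add_link("a.xml", "c.xml")

print(sorted(repo.candidates(["catalog.dtd"])))  # ['a.xml', 'b.xml']
print(sorted(repo.links["a.xml"]))               # ['c.xml']
```

Grouping documents by schema means a query that can be ruled out against a schema never touches the documents stored under it, which is why fewer distinct schemas implies less work per query.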
The query language uses XML syntax and supports tree pattern matching. This allows the query system to compute a DTD for the query language and use it to validate the user query during query formulation. Using a tree structure instead of the traditional path expression makes the query language easier to use and allows the query language to specify conditions on siblings in the tree structure. The query language also:
returns both the subtree rooted at the element and the element (with or without attributes);
allows restructuring of documents, since many queries require the query result in a different structure than the source document;
allows specification of grouping result elements;
supports specifications of regular path expressions for reaching an element;
supports queries on the order among child nodes;
identifies data types associated with the different content values in an XML document;
supports common database constructs for supporting set operations;
supports the semantic as well as the literal view for both intra-document and inter-document links; and
queries the text content of the nodes of an XML document individually, combined at the node level, and combined at the subtree level.
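The idea of a query written in XML syntax and evaluated by tree pattern matching can be sketched as follows. The actual syntax of the query language is not given here, so this example simply assumes that a plain XML fragment serves as the pattern; the function names matches and find_matches and the sample document are likewise hypothetical. A document element matches when its tag equals the pattern's tag, any text given in the pattern equals its text, and every child of the pattern matches some child of the element.

```python
import xml.etree.ElementTree as ET

def matches(pattern, node):
    """True if `node` satisfies the tree pattern rooted at `pattern`."""
    if pattern.tag != node.tag:
        return False
    wanted = (pattern.text or "").strip()
    if wanted and wanted != (node.text or "").strip():
        return False
    # Each pattern child must be matched by at least one child of the node.
    return all(any(matches(pc, nc) for nc in node) for pc in pattern)

def find_matches(pattern, root):
    """Return every element of the document whose subtree matches the pattern."""
    return [node for node in root.iter() if matches(pattern, node)]

doc = ET.fromstring(
    "<library>"
    "<book><title>XML</title><author>Lee</author></book>"
    "<book><title>SQL</title></book>"
    "</library>"
)
# Sibling condition: a book that has both a title and an author.
pattern = ET.fromstring("<book><title/><author/></book>")
hits = find_matches(pattern, doc)
print([h.findtext("title") for h in hits])  # ['XML']
```

The pattern places conditions on two siblings at once, which a single path expression such as book/title cannot express.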
The query system of this invention maintains three kinds of indices for an XML document: a value index corresponding to text; a structure index corresponding to tree structure patterns; and a link index corresponding to link relationships. These indices are organized as a trie, ensuring that the index access cost is linear with respect to the size of the query and independent of the number of indices present. For storage and search efficiency, the structure pattern is converted to a string.
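The following minimal sketch illustrates converting a structure pattern to a string and storing it in a trie. The string encoding (tag followed by a parenthesized, comma-separated list of children in document order), the "s:" and "v:" key prefixes used to place structure and value entries in one trie, and the names pattern_to_string and Trie are all assumptions made for illustration; the actual encoding is not specified here.

```python
import xml.etree.ElementTree as ET

def pattern_to_string(element):
    """Serialize a tree pattern as a string, e.g. book(title,author)."""
    children = [pattern_to_string(child) for child in element]
    return element.tag + ("(" + ",".join(children) + ")" if children else "")

class Trie:
    """Character trie mapping index keys to sets of document ids."""

    def __init__(self):
        self.children = {}
        self.postings = set()

    def insert(self, key, doc_id):
        node = self
        for ch in key:
            node = node.children.setdefault(ch, Trie())
        node.postings.add(doc_id)

    def lookup(self, key):
        node = self
        for ch in key:  # one step per character of the key
            node = node.children.get(ch)
            if node is None:
                return set()
        return node.postings

index = Trie()
pattern = ET.fromstring("<book><title/><author/></book>")
key = "s:" + pattern_to_string(pattern)  # assumed prefix for structure entries
index.insert(key, "doc1")
index.insert("v:XML", "doc1")            # assumed prefix for value entries

print(key)                # s:book(title,author)
print(index.lookup(key))  # {'doc1'}
```

A lookup walks one trie node per character of the key, so its cost depends on the length of the serialized query rather than on how many entries the trie holds.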