1. Field of the Invention
The present invention relates generally to indexing Web-based hierarchical documents for search.
2. Description of the Related Art
Millions of documents, including but not limited to hypertext markup language (HTML) documents and extensible markup language (XML) documents, exist on the World Wide Web and are accessible to user computers via the Internet. Given the immense number of accessible documents, the document search engine is an essential part of Internet technology. By means of a search engine, users can rapidly query for and locate documents of interest.
Typically, a search engine has three main parts. The first part is a crawler, which accesses Web documents and gathers information about the documents. The information is summarized either by the document producer or by the crawler, with each summary being arranged in a hierarchy and being referred to as “metadata”. The metadata is “marked up” by means of tags, i.e., each item of information in the hierarchy is labelled by a corresponding tag, to identify the item of information.
Once the crawler has generated the metadata, an index engine indexes the metadata. The index is essentially a catalogue of the metadata. A query executor portion of the search engine then responds to a user query by accessing the indexed metadata and returning the names (also referred to as “uniform resource locators”, or URLs) of documents that satisfy the query.
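By way of illustration only, the index and query phases described above can be sketched as follows. The sketch is a simplification under stated assumptions and does not represent any particular search engine: the metadata summaries, the URLs, and the names build_index and execute_query are hypothetical, and the index is a plain inverted index that catalogues tags and tagged text alike.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical crawler output: one tagged metadata summary per document URL.
METADATA = {
    "http://example.com/a": (
        "<doc><title>Hierarchical indexing</title>"
        "<crawler>bot-1</crawler></doc>"
    ),
    "http://example.com/b": (
        "<doc><title>Query execution</title>"
        "<crawler>bot-1</crawler></doc>"
    ),
}


def build_index(metadata):
    """Index phase: catalogue every token, tag names and text alike."""
    index = defaultdict(set)
    for url, summary in metadata.items():
        for element in ET.fromstring(summary).iter():
            # The tag name and the words it labels are indexed together.
            tokens = [element.tag] + (element.text or "").lower().split()
            for token in tokens:
                index[token].add(url)
    return index


def execute_query(index, term):
    """Query phase: return the URLs of documents that contain the term."""
    return sorted(index.get(term.lower(), set()))


index = build_index(METADATA)
print(execute_query(index, "indexing"))
```

In this sketch a query for a tag name such as "title", or for the crawler name, matches every document, because nothing distinguishes the markup or the crawler's own bookkeeping from document content.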
The focus of the present invention is on the indexing phase of a search engine. As recognized by the present invention, the metadata that a crawler creates includes not only data about document content, which is useful to a query executor during the search phase, but also internally useful information such as the name of the crawler, the date of the crawl, and so on. Moreover, as noted above, the metadata summary is marked up with tags that identify the various elements in the summary.
As understood herein, the tags (as opposed to the information identified by the tags) and the internally useful information are not necessarily useful to the query executor; rather, in the context of the query phase, they constitute noise. Moreover, the present invention understands that, depending on the document type, some information identified by the tags is more useful in the context of the query phase than other information. Unfortunately, as recognized by the present invention, current indexing engines do not separate tags from the data identified by the tags, do not provide a means for weighting relatively important information more highly than less important information, and do not provide a means for eliminating from the index information that is completely useless from a query execution standpoint. Thus, the present invention understands that current indexing engines do not optimize the subsequent performance of query executors. The present invention recognizes the above-noted problems and provides the solutions disclosed herein.
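For purposes of illustration only, the three capabilities noted above as lacking (separating tags from tagged data, weighting some information more highly, and eliminating useless information) might be sketched as follows. This is a hypothetical example and not the method disclosed herein: the tag names, the weight values in TAG_WEIGHTS, and the function build_weighted_index are all assumptions made for the sketch.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Assumed per-tag weights (illustrative values only). A weight of zero
# removes the tagged information from the index altogether.
TAG_WEIGHTS = {"title": 3.0, "body": 1.0, "crawler": 0.0, "crawl_date": 0.0}


def build_weighted_index(metadata, tag_weights, default_weight=1.0):
    """Index only the tagged text, scored by the weight of its tag."""
    index = defaultdict(dict)
    for url, summary in metadata.items():
        for element in ET.fromstring(summary).iter():
            weight = tag_weights.get(element.tag, default_weight)
            if weight <= 0:
                continue  # internally useful information is left out
            # The tag name itself is never indexed, only the text it labels.
            for token in (element.text or "").lower().split():
                index[token][url] = index[token].get(url, 0.0) + weight
    return index
```

With these assumed weights, a word appearing in a title contributes three times as much to a document's score as the same word in the body, while the crawler name and crawl date never reach the index.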