1. Field of the Invention
The present invention is generally related to document search engines utilizing text based queries to identify query related relevant documents and, in particular, to an enhanced document search engine providing high-speed query response against a real-time updatable, potentially distributed document collection.
2. Description of the Related Art
The volume and variety of informational content available on the World Wide Web (the "Web") is increasing and will likely continue to increase at a rather substantial pace. This growth, combined with the highly decentralized nature of the Web, creates a substantial difficulty in locating particular informational content. Various Web-based search sites support generalized content based searching. Even when content is found through a Web document search, a further difficulty occurs in trying to evaluate the relative merit or relevance of concurrently found documents. The search for specific content based on a few key words will almost always identify documents whose individual relevance is highly variable. Thus, the sheer volume of documents available on the Web tends to undermine the usefulness of the Web in finding and making use of particular informational content.
The Web search services generally perform an incremental scan of the Web to generate various, often substantial indexes that can be later searched in response to a user's query. The generated indexes are essentially proprietary databases of document identification information; proprietary in the sense that the structure and content of unrelated index instances may vary considerably. Currently, full Web index files are easily on the order of 200 gigabytes and increasing at an average rate of about 5% per day.
Web search services typically need to support a number of specific search capabilities to be at least perceived as a useful document locator service within the Web community. These capabilities include performing relatively complete searches of all of the available Web information, providing fast user-query response times, and developing an appropriate relevance ranking of the documents identified through the index search, among others.
In order to support a relatively complete search over any collection of documents, the derived document collection index managed by a Web search service may store a list of the terms, or individual words, that occur within the indexed document collection. Words, particularly simple verbs, conjunctions and prepositions, are often preemptively excluded from the term index as presumptively carrying no informationally significant weight. Various heuristics can be employed to identify other words that appear too frequently within a document collection to likely serve to contextually differentiate the various documents of the collection. As can be expected, these heuristics are often complex and difficult to implement without losing substantive information from the index. Furthermore, as these heuristics generally operate on a collection-wide basis to minimize unbalanced loss of information, a distributed database architecture for storing the document collection variously becomes prohibitively complex to implement, slow in terms of query response time and quite limited in providing global relevancy ranking.
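By way of illustration only, the term exclusion described above might be sketched as follows. The stop list, the `max_doc_fraction` threshold and the function name `index_terms` are assumptions introduced for this example and are not drawn from any particular search service. The sketch drops preemptively excluded words and any word that appears in too large a fraction of the documents.

```python
from collections import Counter

# Hypothetical stop list; real lists are far longer and language specific.
STOP_WORDS = {"is", "and", "of", "the", "in", "to", "a"}

def index_terms(docs, max_doc_fraction=0.8):
    """Return the set of terms kept in the term index.

    A term is dropped if it is on the stop list or if it occurs in more
    than max_doc_fraction of the documents (an assumed heuristic).
    """
    doc_freq = Counter()
    for text in docs.values():
        # Count each term once per document (document frequency).
        doc_freq.update(set(text.lower().split()))
    limit = max_doc_fraction * len(docs)
    return {
        term for term, df in doc_freq.items()
        if term not in STOP_WORDS and df <= limit
    }

# Toy collection, illustrative only.
DOCS = {
    "d1": "the engine of the search service",
    "d2": "search ranking is a service",
    "d3": "the search index",
}
TERMS = index_terms(DOCS)
```

Note that the frequency threshold needs the document frequency of each term over the whole collection, which is why such heuristics are awkward to apply when the collection is split across servers.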
In order to improve query response time, conventional Web search services often strive to minimize the size of their indexes. A minimum index format provides identifications of any number of documents against particular indexed terms. Thus, word terms of a client query can be matched against the collection index terms to identify documents within the collection that have at least one occurrence of the query terms. A conventional relevancy score can be based on the combined frequency of occurrence of the query terms on a per document basis. Other weighting heuristics, such as the number of times that any of the query terms occur within a document, can also be used. These relevance ranking systems typically presume that increasing occurrences of specific query terms within a document means that the document is more likely relevant and responsive to the query. A query report listing the identified documents ranked according to relevancy score is then presented to the client user.
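A minimal sketch of such an occurrence index and a frequency-based relevancy score is given below, assuming whitespace tokenization and a simple sum of per-document term counts. The names `build_occurrence_index` and `score_by_frequency` are hypothetical and chosen for this example only.

```python
from collections import Counter, defaultdict

# Toy collection; document ids and text are illustrative only.
DOCS = {
    "d1": "fast search engine for web documents",
    "d2": "web search search results ranking",
    "d3": "cooking recipes for pasta",
}

def build_occurrence_index(docs):
    """Map each term to {doc_id: occurrence count}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, count in Counter(text.lower().split()).items():
            index[term][doc_id] = count
    return index

def score_by_frequency(index, query):
    """Sum the per-document occurrence counts of the query terms."""
    scores = Counter()
    for term in query.lower().split():
        for doc_id, count in index.get(term, {}).items():
            scores[doc_id] += count
    # Highest score first; ties broken by document id for repeatability.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

INDEX = build_occurrence_index(DOCS)
RANKED = score_by_frequency(INDEX, "web search")
```

In this toy collection, "d2" ranks above "d1" because it contains "search" twice, and "d3" is not reported because it contains neither query term.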
Simple occurrence indexes as described above are, nonetheless, quite large. In general, a term occurrence index maintained in a conventional relational database management system will be approximately 30% of the total size of the entire collection.
At the expense of index size, proximity information is conventionally utilized to improve document relevancy scoring. The basic occurrence index is expanded into a proximity index by storing location-of-occurrence information with the document identifications for each of the indexed terms in a document collection. Storing the expanded term-proximity information results in the size of the index typically being on the order of 60 to 70 percent of the total size of the document collection.
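The expansion from an occurrence index to a proximity index can be sketched as follows, assuming that location of occurrence is recorded as a zero-based word position within each document. The function name `build_proximity_index` is hypothetical.

```python
from collections import defaultdict

def build_proximity_index(docs):
    """Map each term to {doc_id: [word positions of the term]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split()):
            index[term][doc_id].append(position)
    return index

# Toy collection, illustrative only.
DOCS = {"d1": "search engine search", "d2": "engine room"}
PROX = build_proximity_index(DOCS)
```

Because every occurrence now contributes a stored position rather than a single per-document entry, the index grows with the total number of word occurrences, which is consistent with the larger index sizes noted above.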
The term-proximity information provides an additional basis for evaluating the relevancy of the various documents responsive to a particular client query. Conventional search engines can post-process the client query identified documents to take into account the relative proximity of the search terms in individual documents. In effect, a revised relevancy ranking of the documents is generated based on whether and to what degree query terms are grouped in close proximity to one another within the identified document. Again, the conventional presumption is that the closer the proximity of the terms, the more likely the document will be particularly relevant and responsive to the client query.
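One possible form of such post-processing is sketched below for a two-term query, assuming that proximity is measured as the smallest distance in word positions between any occurrence of one term and any occurrence of the other. This measure and the function names are assumptions made for illustration; actual search engines may weigh proximity quite differently.

```python
from itertools import product

def min_pair_distance(positions_a, positions_b):
    """Smallest absolute distance between any position in a and any in b."""
    return min(abs(a - b) for a, b in product(positions_a, positions_b))

def rerank_by_proximity(candidates, term_a, term_b):
    """Order documents by how closely the two query terms occur.

    candidates: {doc_id: {term: [positions]}} for documents already
    identified as containing both terms.
    """
    distances = {
        doc_id: min_pair_distance(postings[term_a], postings[term_b])
        for doc_id, postings in candidates.items()
    }
    # Closest grouping first; ties broken by document id.
    return sorted(distances.items(), key=lambda kv: (kv[1], kv[0]))

# Hypothetical positional postings for two query identified documents.
CANDIDATES = {
    "d1": {"document": [0, 9], "search": [5]},
    "d2": {"document": [3], "search": [4, 12]},
}
RERANKED = rerank_by_proximity(CANDIDATES, "document", "search")
```

Here "d2" is ranked first because its two query terms are adjacent, while the closest pair in "d1" is four positions apart.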
Various schemes can be utilized to further weight and balance the relevancy scores derived from term frequency and term proximity. While a number of such schemes are known, the schemes operate on the reasonable and necessary premise that all relevant documents need to be initially identified from the collection before a final relative relevancy score can be computed. The relative relevancy is then calculated based on the full set of query identified documents. Thus, existing search systems cannot effectively operate against a document collection index that, due perhaps to size or to support parallel access, is fragmented over multiple server systems or against multiple collection indexes that are served from multiple distributed servers.
Furthermore, to ensure a proper, consistent ranking of the full set of query identified documents, the ranking scores conventionally must be calculated over the full set of identified documents. Large amounts of information must therefore be pooled from the potentially multiple index servers in order to perform the aggregate relevancy scoring. Consequently, the convenience as well as capacity and performance potentially realized by use of distributed servers is not generally realized in the implementation of conventional search systems.
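The need to pool information can be illustrated with a small sketch. It assumes, purely for the purpose of the example, that the relative relevancy score of a document is its raw score divided by the highest raw score in the set being ranked. The shard contents and the function name `normalize` are hypothetical.

```python
def normalize(raw_scores):
    """Scale raw scores relative to the highest score in the given set."""
    top = max(raw_scores.values())
    return {doc_id: score / top for doc_id, score in raw_scores.items()}

# Hypothetical raw scores held by two separate index servers.
SHARD_A = {"a1": 10, "a2": 5}
SHARD_B = {"b1": 2, "b2": 1}

# Each server normalizes on its own, and the results are simply merged.
per_shard = {**normalize(SHARD_A), **normalize(SHARD_B)}

# The raw scores are pooled first and normalized over the full set.
pooled = normalize({**SHARD_A, **SHARD_B})
```

When each server scores in isolation, "b1" receives the same relative score as "a1" (1.0) even though its raw score is one fifth as large. Only the pooled calculation, which requires collecting the raw scores from every server, yields a consistent ranking.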
Another significant limitation of conventional search systems relates to the need to ensure the timeliness of the information maintained in the collection indexes. For large collections, the collection indexes need to be rebuilt to add or remove individual document-to-term relations. The process of building and rebuilding a collection index is quite time consuming. The rapid rate of document collection content change, however, requires that the indexes be updated frequently to include new index references to added or changed documents. Known index preparation functions and procedures are unfortunately one, if not many, orders of magnitude slower than the typical rate of document collection content change. Ever larger and faster monolithic computer systems are therefore required to reduce the document collection indexing time. While computer performance continues to steadily increase, the rate of document additions and changes appears to be far greater. Furthermore, any increase in computer performance comes at a much increased cost. Thus, practical considerations have generally become limiting factors on the performance, size and assured timeliness in searching collections for query identified documents.
Consequently, there is a clear and present need for a collection search system that is scalable without loss of performance or repeatable accuracy and that can be actively maintained current substantially in real-time.