The present invention is generally directed to the field of information search and retrieval, and more particularly to techniques for locating and ranking documents and/or other forms of information that are contained in multiple collections accessible via a network.
Computer networking technology has made large quantities of digital content available to users, resulting in a phenomenon popularly known as information overload; users have access to much more information and entertainment than they can absorb. Significant practical and commercial value has therefore been provided by search technologies, whose goal is to identify the information that is of greatest utility to a user within a given content collection.
The quality of a search is typically quantified by two measures. First, a search should find all the information in a collection that is relevant to a given query. Second, it should suppress information that is irrelevant to the query. These two measures of success correspond to recall and precision, respectively. A search is considered less effective to the extent that it cannot maximize both measures simultaneously. Thus, while one may be able to increase recall by relaxing parameters of the search, such a result may be achieved only at the expense of precision, in which case the overall effectiveness of the search has not been enhanced.
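By way of illustration only, the two measures can be computed as follows. The function name and the sample document identifiers are hypothetical; the formulas are the conventional set-based definitions of precision and recall rather than anything specific to the techniques discussed here.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for sets of retrieved and relevant document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    # Precision: fraction of retrieved documents that are relevant.
    precision = hits / len(retrieved) if retrieved else 0.0
    # Recall: fraction of relevant documents that were retrieved.
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


relevant = {"d1", "d2", "d3", "d4"}

# A strict search returns only relevant documents but misses half of them.
narrow = precision_recall({"d1", "d2"}, relevant)

# A relaxed search finds every relevant document but admits as many irrelevant ones.
broad = precision_recall(
    {"d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"}, relevant
)
```

In this sketch the relaxed search raises recall from 0.5 to 1.0 while precision falls from 1.0 to 0.5, which is the trade-off described above.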
A metasearch combines results from more than one search, with each search typically being conducted over a different content collection. Often, the various content collections are respectively associated with different information resources, e.g. different file servers or databases, in which case the metasearch is sometimes referred to as a distributed search. The present invention is concerned with the difficulty of maximizing both the recall and precision of a metasearch, particularly one that is conducted via distributed resources. The following discussion explains the sources of such difficulty.
For simplicity of exposition, the issues will be discussed herein with reference to keyword-based queries of text-based content. As practitioners familiar with the field will recognize, the disclosed principles are easily generalized to queries of text-based content that are not purely keyword-based (such as natural-language queries into parsed documents), as well as to queries of content that is not text-based (such as digital sounds and images). The applicability of the present invention to methods for processing such queries will be readily apparent to those skilled in the art.
To facilitate an understanding of the invention, the following definitions are used in the context of exemplary keyword-based searches that are employed to describe the invention. A “term” is defined to be a word or a phrase. A “query” is a set (mathematically, a bag) of terms that describes what is being sought by the user. A “document” is a pre-existing set of terms. A “collection” is a pre-existing set of documents. A “metacollection” is a pre-existing set of collections.
A ranked search is the procedure of issuing a query against a collection and finding the documents that score highest with respect to that query and that collection. The dependence of each score on the entire collection often stems from the well-known technique of weighting most strongly those search terms that are least common in the collection. For example, the query “high-tech farming” would be likely to select the few documents in a computer collection that contain the term “farming”, and the few documents in an agriculture collection that contain the term “high-tech”.
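The weighting technique can be sketched as follows. The smoothed logarithmic inverse document frequency used here is one common choice and is an assumption of this example, as are the function names and the four-document computer collection; no particular scoring formula is prescribed by the foregoing description.

```python
import math


def idf(term, collection):
    """Weight a term by its rarity: the fewer documents contain it, the larger the weight."""
    n_docs = len(collection)
    doc_freq = sum(1 for doc in collection if term in doc)
    return math.log((n_docs + 1) / (doc_freq + 1))


def score(query, doc, collection):
    """Sum the weights of the query terms that the document contains."""
    return sum(idf(term, collection) for term in query if term in doc)


# Each document is modeled as a set of terms, per the definitions above.
computer_collection = [
    {"high-tech", "chips"},
    {"high-tech", "software"},
    {"high-tech", "networks"},
    {"high-tech", "farming"},
]
query = ["high-tech", "farming"]

weights = {term: idf(term, computer_collection) for term in query}
best = max(
    range(len(computer_collection)),
    key=lambda i: score(query, computer_collection[i], computer_collection),
)
```

Because every document in this hypothetical computer collection contains "high-tech", that term carries no weight, and the single document containing "farming" is selected.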
A metasearch is the procedure of responding to a query against a metacollection by combining results from multiple searches. For the metasearch to be maximally precise, it should find the documents that score highest with respect to the metacollection, not those that score highest with respect to the individual collections in which they reside. For example, in a metasearch over the two aforementioned collections, if a query contains the term “computer,” an incorrect implementation would give undue weight to computer-related documents that appear in the agriculture collection. The practical impacts of this effect are substantial to the extent that a metacollection is used to cull information from diverse collections, each with a different specialty or focus.
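The effect can be illustrated numerically. The sketch below assumes a smoothed logarithmic inverse document frequency as the term weight and uses two small hypothetical collections; both are assumptions made for the example only.

```python
import math


def idf(term, docs):
    """Smoothed inverse document frequency of a term within a list of documents."""
    doc_freq = sum(1 for doc in docs if term in doc)
    return math.log((len(docs) + 1) / (doc_freq + 1))


computer = [
    {"computer", "chips"},
    {"computer", "software"},
    {"computer", "networks"},
    {"computer", "farming"},
]
agriculture = [
    {"farming", "soil"},
    {"farming", "crops"},
    {"farming", "computer"},
    {"farming", "irrigation"},
]
metacollection = computer + agriculture

# Weight of "computer" as seen by each search engine in isolation.
weight_in_computer = idf("computer", computer)        # term is ubiquitous here
weight_in_agriculture = idf("computer", agriculture)  # term is rare here

# Weight of "computer" with respect to the metacollection as a whole.
weight_in_metacollection = idf("computer", metacollection)
```

With local statistics, the lone computer-related document in the agriculture collection receives a larger weight for "computer" than any document in the computer collection. With metacollection statistics, every document containing the term receives the same weight.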
A process that executes an individual search is called a search engine. A process that invokes search engines and combines results is known as a metasearch engine. FIG. 1 depicts the general components of a metasearch system. Typically, the user presents a query to a metasearch engine 10. The metasearch engine forwards this query on to multiple search engines 12a, 12b . . . 12n, each of which is associated with a collection 14a-14n of information content, e.g. documents 15. Most documents are likely to appear in only one collection. However, some documents can appear in more than one collection, as depicted by the overlap of the sets of documents 15 in collections 14b and 14n. In such a case, multiple references to a document can appear in the results of a metasearch which employs both of these collections. A well-designed metasearch engine attempts to remove duplicates whenever possible.
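One simple way to remove duplicate references is sketched below. It assumes that each document carries an identifier that is stable across collections and that the scores returned by different search engines are comparable; neither assumption is guaranteed in practice, and the function and identifier names are hypothetical.

```python
def merge_without_duplicates(result_lists):
    """Merge per-engine result lists, keeping one reference per document id.

    Each result list holds (doc_id, score) pairs. When a document is returned
    by more than one engine, only its highest score is retained.
    """
    best = {}
    for results in result_lists:
        for doc_id, score in results:
            if doc_id not in best or score > best[doc_id]:
                best[doc_id] = score
    # Highest score first; ties broken by document id for a stable order.
    return sorted(best.items(), key=lambda item: (-item[1], item[0]))


# "doc-7" resides in two overlapping collections and is returned twice.
from_engine_12b = [("doc-7", 0.9), ("doc-3", 0.4)]
from_engine_12n = [("doc-7", 0.8), ("doc-5", 0.6)]

merged = merge_without_duplicates([from_engine_12b, from_engine_12n])
```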
The relationship between search engines and collections need not be one-to-one. For example, as depicted in FIG. 1, two different search engines 12b and 12c may both execute a query against the same collection 14b. In the context of the present invention, this situation is considered to be within the meaning of executing a query on different collections, namely the collection 14b as processed by the search engine 12b, and the collection 14b as processed by the search engine 12c. In some cases, the two search engines could operate with different sets of heuristics. In such a situation the two search engines might produce different results, e.g., different rankings of the documents within the same collection. In the particular situation depicted in FIG. 1, since some documents are common to collections 14b and 14n, three references to those documents could be returned to the metasearch engine by search engines 12b, 12c and 12n, respectively.
The metasearch engine 10 and the various search engines 12 execute on computers that communicate with one another via a network. In a fully distributed metasearch, each engine 10, 12 executes on a different machine. In a less distributed system, two or more of these engines may execute on the same machine. For instance, the search engines 12a and 12b may execute on the same computer 16, or the metasearch engine 10 and one or more of the search engines 12 may execute on the same computer. Similarly, the various collections 14 may reside in different respective storage systems, or any two or more of them can share a common storage facility. The efficiency with which information is exchanged between the metasearch engine 10 and the various search engines 12 via the network is a significant factor in the overall user experience.
In a system that implements metasearch capability, it is desirable to identify the documents that score highest with respect to the metacollection, i.e. the totality of the collections 14a-14n. The more significant components of the system are the search engines, the metasearch engine, and the protocol by which they communicate. When the search engines exist on different machines in a distributed network, it is further desirable for the communication protocol to minimize the amount of latency perceived by the user, as well as the resource burden in terms of bandwidth and processing power.
Numerous metasearch implementations exist in the commercial world and in the academic literature. Because of fundamental differences in approach, these vary significantly in precision. FIG. 2 illustrates a taxonomy of the various implementations for metasearch techniques. Before discussing these implementations, however, one distinctive concept should be noted. Much of the prior art operates by centralized indexing, which is not a form of metasearching. With centralized indexing, the original documents remain in their distributed locations but an index database is stored in a central location. The index is built by “crawling”, i.e. copying each document to the centralized facility to be indexed. Unless the copy of a document is required for future retrieval from the central location, it can be destroyed after it has been indexed. Relative to metasearching, central indexing schemes have three main disadvantages. First, because the indexing process takes time, the index is more difficult to keep up to date. For example, crawling the entire Internet takes weeks. Second, unless a protocol is in place specifically to skip the indexing of unchanged documents, the indexing process wastes processing time at the central index and network bandwidth. Third, the hardware resources required to store the central index and to execute queries against it grow at least linearly with the size of the collection being indexed. Thus, metasearch techniques provide significant practical and commercial advantages.
Turning now to FIG. 2, metasearch techniques can be classified into two broad categories, Boolean searches and ranked searches. The weakest metasearch techniques apply Boolean rules that either accept or reject each document, and do not supply a score. Thus, they cannot prioritize results in the event that the result set is too large for the user to consume. In contrast, ranked metasearch techniques permit such prioritization.
The category of ranked metasearches can be further divided into centralized and decentralized ranking. In centralized ranking implementations, each search engine sends candidate documents to the metasearch engine, which then ranks them as a single, new collection. Some variants of centralized ranking transmit each document to the metasearch engine in its entirety. Others make more economical use of network bandwidth by transmitting only the minimal amount of statistical information about each document that is needed to compute the necessary scores. In many cases, a Boolean search initially conducted by each search engine eliminates from consideration any completely irrelevant documents.
Centralized ranking is either network intensive or extremely imprecise. It is network intensive if each search engine returns all documents, or all documents that satisfy the Boolean pass. It is extremely imprecise if a limit is placed on the number of documents returned by each search engine; without performing any scoring or ranking, a search engine cannot prioritize its result list.
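The imprecision that results from limiting an unscored result list can be seen in a small sketch. The Boolean rule (accept any document containing at least one query term) and the stand-in scoring rule (count of matched query terms) are assumptions of this example, as are the document identifiers.

```python
def boolean_match(query, docs):
    """Boolean pass: accept every document containing at least one query term.

    The returned list follows storage order; no score is computed.
    """
    return [doc_id for doc_id, terms in docs.items() if any(t in terms for t in query)]


def overlap_score(query, terms):
    """Stand-in scoring rule for the central ranker: distinct query terms matched."""
    return len(set(query) & terms)


docs = {
    "d1": {"farming"},
    "d2": {"high-tech"},
    "d3": {"high-tech", "farming"},
}
query = ["high-tech", "farming"]

candidates = boolean_match(query, docs)

# The search engine may return at most two documents and cannot tell which
# candidates matter most, so it simply cuts the list.
truncated = candidates[:2]

# The document the central ranker would have placed first.
best_overall = max(docs, key=lambda d: overlap_score(query, docs[d]))
```

Here the document matching both query terms is stored last and is therefore lost to truncation before the metasearch engine has any opportunity to rank it.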
To help overcome the extreme imprecision and high bandwidth utilization described above, decentralized ranking techniques are preferred. Decentralized ranking can be carried out by using either local statistics or global statistics. In metasearch implementations that employ local statistics, each search engine initially returns the results that score best with respect to the given query and the individual collection. These results are subsequently combined and manipulated for presentation to the user. Within this rubric there are many subsidiary variations.
a) Some metasearch implementations do not attempt to re-rank results. Instead, they either group results by search server or interleave them.
b) Others attempt to remove duplicates and re-rank results, applying scoring rules with heuristic measures such as the number of duplications across search servers, rank within each search, and concentration of search terms within the title or summary received from each search server.
c) Still others perform a final, centralized ranking at the metasearch engine, treating the union of results returned from the search engines as a single collection.
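Variation (a), interleaving without re-ranking, can be sketched as a round-robin merge. The function name and result identifiers are hypothetical, and other interleaving orders are equally possible.

```python
from itertools import zip_longest


def interleave(result_lists):
    """Round-robin merge: the top hit of each engine, then each second hit, and so on.

    The order within each engine's list is preserved; no scores are compared.
    """
    merged = []
    for tier in zip_longest(*result_lists):
        # Engines with shorter lists contribute None once exhausted.
        merged.extend(ref for ref in tier if ref is not None)
    return merged


engine_a = ["a1", "a2", "a3"]
engine_b = ["b1"]
engine_c = ["c1", "c2"]

merged = interleave([engine_a, engine_b, engine_c])
```

Because the merge never compares scores across engines, a weak top hit from one engine is placed ahead of a strong second hit from another.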
Regardless of these variations, these implementations suffer from lack of precision when applied across collections with disparate statistics. The scoring function applied within each collection does not in general match the scoring function applied across the metacollection because the local statistics do not match. Thus, in the example described previously, the relatively few documents from the agriculture collection might be ranked higher than the potentially more relevant documents from the computer collection, due to the limited focus of each search engine. Consequently, decentralized ranking with local statistics is most appropriate when the constituent search engines are not under control of the metasearch provider, and each search engine has access to substantially the same content. Those conditions hold true, for example, when combining multiple search engines on the Internet.
To obtain a correct ranking in a network-efficient manner, decentralized ranking with global statistics is most preferable. In this approach, some portion of the computation is executed by the individual search engines using global statistics (statistics that depend on the entire metacollection). The desirability of such metacollection-level statistics has been recognized in the published literature since 1995. See, for example, C. Viles and J. French, “Dissemination of Collection Wide Information in a Distributed Information Retrieval System,” Technical Report CS-95-02, University of Virginia, Jan. 6, 1995. Particular methods for accomplishing such computations have been refined during the intervening years.
One class of approaches is to precompute all necessary metacollection-level statistics and store them at each search engine, as taught, for example, in U.S. Pat. No. 6,163,782. This approach eliminates request-response delays perceived by the user that would otherwise be incurred during the computation of metacollection-level statistics. However, it is appropriate only if the same metacollection is used for every metasearch. In general, that condition does not hold. With each query, the user might specify a different set of collections to be included in the metacollection. Even if the user does not make such a specification, many metasearch systems perform a database selection process that determines the collections which are most likely to have appropriate results for a given query. Each query might include conditions, such as access rights, that cause different documents to be included within the metacollection. The set of available documents itself might be dynamic, with documents being continuously added, deleted, and edited. Each of these conditions makes it impractical to use pre-computed metacollection statistics.
In a variation of such an approach, only certain metacollection-level statistics, such as document frequencies, are precomputed and stored at each search engine. Such approaches are subject to similar limitations.
Thus, for correctness and practicality it is preferable for at least some of the metacollection-level statistics to be dynamically computed in response to each query. It is the goal of the present invention to provide such correctness and generality. It is a further goal to do so in a manner that minimizes both perceived latency and consumed memory, processing, and network resources. These goals become increasingly important as the scale of the collections, metacollection, and result sets increases.
In accordance with the invention, these objectives are achieved by means of a multi-phase approach in which local and global statistics are exchanged between the search engines and the metasearch engine. In the first phase, the query is transmitted to the search engines from the metasearch engine, and each search engine computes, or retrieves previously computed, local statistics for the query terms in its associated document collection. In the second phase, each search engine returns its local statistics. A third phase consists of computing metacollection-level statistics at the metasearch engine, based upon the information contained in the local statistics. The metacollection-level statistics are disseminated to the search engines. In the final phase, the search engines determine scores for the documents in their respective collections pursuant to the metacollection-level statistics, and transmit document references to the metasearch engine. The metasearch engine merges the results from the individual search engines, to produce a single ranked results list for the user.
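The phases described above can be sketched as a single-process simulation. This is an illustrative sketch only: the class and function names are hypothetical, network transport is replaced by direct method calls, the local statistics are assumed to be document counts and per-term document frequencies, the scoring formula is an assumed smoothed inverse document frequency, and summing document frequencies assumes that the collections do not overlap.

```python
import math


class SearchEngine:
    def __init__(self, docs):
        self.docs = docs  # mapping: document id -> set of terms

    def local_stats(self, query):
        # Phases 1 and 2: receive the query, return local statistics.
        doc_freq = {
            t: sum(1 for terms in self.docs.values() if t in terms) for t in query
        }
        return len(self.docs), doc_freq

    def ranked(self, query, total_docs, global_df, limit):
        # Final phase: score local documents using metacollection-level statistics.
        def weight(term):
            return math.log((total_docs + 1) / (global_df[term] + 1))

        scored = []
        for doc_id, terms in self.docs.items():
            s = sum(weight(t) for t in query if t in terms)
            if s > 0:
                scored.append((s, doc_id))
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        return scored[:limit]


def metasearch(query, engines, limit=3):
    stats = [engine.local_stats(query) for engine in engines]
    # Third phase: combine local statistics into metacollection-level statistics.
    total_docs = sum(n for n, _ in stats)
    global_df = {t: sum(df[t] for _, df in stats) for t in query}
    # Disseminate the global statistics and gather each engine's best references.
    results = []
    for engine in engines:
        results.extend(engine.ranked(query, total_docs, global_df, limit))
    # Merge into a single ranked list; scores are directly comparable.
    results.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in results[:limit]]


computer = SearchEngine({
    "c1": {"computer", "chips"},
    "c2": {"computer", "software"},
    "c3": {"computer", "farming"},
})
agriculture = SearchEngine({
    "a1": {"farming", "soil"},
    "a2": {"farming", "crops"},
    "a3": {"farming", "computer"},
})

top = metasearch(["computer", "farming"], [computer, agriculture])
```

Because every engine scores against the same statistics, the two documents that contain both query terms rank first regardless of the collection in which they reside, and only statistics and the top document references cross between the engines.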
By dynamically computing the local and metacollection statistics at the time a query is presented, the present invention maximizes precision for the collections that are being searched, without compromising recall. In addition, the amount of data that is transmitted over the network is kept to a minimum, thereby providing efficient bandwidth utilization and reduced latency. Furthermore, the topology of the search mechanism readily supports a multi-tier hierarchy of search engines, thereby facilitating the scalability of the metasearch system to any number of document collections and search engines.
Additional features of the invention which contribute to both the efficacy of the search and the efficiency of network communications are described hereinafter with reference to exemplary embodiments depicted in the drawings.