In the traditional Information Retrieval (IR) framework, a single search engine retrieves documents from a single document collection. This can be generalized to the case where one or more search engines query one or more document collections, and the results from the different dataset and search engine pairs are merged to form a single ranking of documents. This setting is known as distributed information retrieval.
In general, the case where a single engine queries several collections is known as federation, whereas the case of multiple search engines querying the same collection is known as metasearch.
The problem in merging search results is that neither search engines nor document collections are created equal. Search engines differ in their indexing methods, term weighting schemes, and document weighting schemes, while document collections differ in the type and relevance of the documents they contain.
In the Information Retrieval field, the process of fusion is usually divided into three phases: collection selection, document selection, and merging. The aim of collection selection is to narrow the queried datasets to the most relevant collections, thus reducing the amount of noise present. Document selection is the process of deciding how many documents should be retrieved from each collection, the simplest policy being to retrieve an identical number of documents from each. Finally, merging is the process of generating a unified list from the retrieved documents.
Most search engines provide very little information, beyond the document rankings, with which to perform the merging. The document score (DS) assigned by a search engine to a document retrieved from a collection may or may not be provided. When it is provided, the DS can be used as additional information for merging. However, it is difficult to re-rank the documents, since DSs are local to each specific dataset and engine combination. This can be avoided by computing global statistics, for example the IDF (inverse document frequency) of the query terms as though all datasets were merged into a single collection. When the DS is not provided, only the document ranking and some a priori knowledge about the datasets can be used for merging the different result sets.
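As an illustration of the global statistics mentioned above, the following minimal sketch computes a global IDF from per-collection counts, as though the collections were merged. The function name `global_idf`, the input layout, and the smoothed formula log((N + 1) / (df + 1)) are assumptions made for the example; the text does not prescribe a particular IDF variant, and the sketch assumes the collections do not share documents.

```python
import math

def global_idf(term, collections):
    """Return the IDF of `term` as if all collections were one collection.

    collections: list of (num_docs, df_by_term) pairs, where df_by_term maps
    a term to the number of documents in that collection containing it.
    """
    total_docs = sum(num_docs for num_docs, _ in collections)
    total_df = sum(df_by_term.get(term, 0) for _, df_by_term in collections)
    # Assumed smoothing: adding 1 avoids division by zero for unseen terms.
    return math.log((total_docs + 1) / (total_df + 1))

# Two hypothetical collections with their per-term document frequencies.
collections = [
    (1000, {"fusion": 10, "search": 400}),
    (9000, {"fusion": 90, "search": 500}),
]
idf_fusion = global_idf("fusion", collections)
idf_search = global_idf("search", collections)
```

Because the counts are summed before the logarithm is taken, the rarer term receives the same IDF regardless of which collection a document came from, which is what makes scores comparable across collections.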
Selberg, E. & Etzioni, O. (1995), “Multi-service search and comparison using the MetaCrawler”, Proceedings of the 4th International World-Wide Web Conference, Darmstadt, Germany, describe a method that uses a document's rank and its appearance in the result lists of several engines to perform merging. This is done by summing the ranks of duplicate documents. Other approaches to the problem of merging assign a weight to every ranking, such that each collection is given a score based on its statistics. This weight is then used to merge the different rankings by weighting the DSs.
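Rank-based merging of this kind can be sketched as follows. This is only an illustration of summing ranks across result lists, not the MetaCrawler implementation. The function name `merge_by_rank_sum` and the penalty rank given to a document missing from a list (the list length plus one) are assumptions, since the text does not say how absent documents are handled.

```python
def merge_by_rank_sum(rankings):
    """Merge ranked lists of document ids (best first) by summed rank.

    A lower total means the document was ranked highly by more engines.
    """
    all_docs = {doc for ranking in rankings for doc in ranking}
    totals = {}
    for doc in all_docs:
        total = 0
        for ranking in rankings:
            if doc in ranking:
                total += ranking.index(doc) + 1  # 1-based rank
            else:
                total += len(ranking) + 1  # assumed penalty for a missing document
        totals[doc] = total
    # Ties are broken by document id to keep the output deterministic.
    return sorted(all_docs, key=lambda doc: (totals[doc], doc))

# Hypothetical result lists from two engines.
engine_a = ["d1", "d2", "d3"]
engine_b = ["d2", "d4", "d1"]
merged = merge_by_rank_sum([engine_a, engine_b])
# merged == ["d2", "d1", "d4", "d3"]
```

Documents returned by both engines accumulate small rank sums and rise to the top, while documents seen by only one engine are pushed down by the penalty term.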
Two known algorithms that use this approach are CORI (J. Callan, Z. Lu, and W. Croft, “Searching distributed collections with inference networks.”, in Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 21-28, Seattle, Wash., 1995.), which is applicable to the framework of federation, and ProFusion (Gauch, S., Wang, G., & Gomez, M. (1996), “Profusion: Intelligent fusion from multiple, distributed search engines.”, Journal of Universal Computer Science, 2, 637-649), created for metasearch. CORI requires, in addition to DSs, term probabilities and IDFs. ProFusion creates an engine-specific weight by measuring the precision at 10 (P@10) of each search engine over a known set of 25 queries.
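The P@10 measurement behind such an engine-specific weight can be sketched as below. This is a hedged illustration, not ProFusion's code: the function names are hypothetical, and averaging P@k over the training queries to obtain a single weight is an assumption, since the text states only that precision is measured over a known query set.

```python
def precision_at_k(ranked_docs, relevant_docs, k=10):
    """Fraction of the top-k retrieved documents that are relevant."""
    top_k = ranked_docs[:k]
    return sum(1 for doc in top_k if doc in relevant_docs) / k

def engine_weight(results_per_query, relevant_per_query, k=10):
    """Assumed weighting: mean P@k of one engine over the training queries."""
    scores = [
        precision_at_k(results_per_query[query], relevant_per_query[query], k)
        for query in results_per_query
    ]
    return sum(scores) / len(scores)

# Toy data for one engine; k=2 keeps the example small.
results_per_query = {"q1": ["a", "b", "c"], "q2": ["x", "y"]}
relevant_per_query = {"q1": {"a", "c"}, "q2": {"x", "y"}}
weight = engine_weight(results_per_query, relevant_per_query, k=2)
# weight == 0.75 (P@2 is 0.5 for q1 and 1.0 for q2)
```

Note that dividing by k rather than by the number of documents returned penalizes an engine that retrieves fewer than k results, which matches the usual definition of precision at a fixed cutoff.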
More recently, Joachims, T. (2002), “Optimizing search engines using clickthrough data.” Proceedings of the ACM Conference on Knowledge Discovery and Data Mining (KDD), Association for Computing Machinery, demonstrated a user-driven approach to metasearch. This system learns the preferences of particular users from their past activity and uses them to assign weights to individual search engines. Thus, this system is similar to ProFusion, the main difference being that weights are assigned based on individual user preference rather than search engine precision.
The proposed approach is based on a method that estimates how successful a search engine working on a dataset was on a given query. This estimate is used to decide which search engine and dataset pairs are more likely to have retrieved better documents, so that the documents retrieved from those pairs are ranked higher.
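One way such an estimate could drive merging is sketched below. This paragraph does not specify how the estimate is combined with the results, so everything in the sketch is an assumption: the function name `merge_by_estimated_success`, the min-max normalization of each result list's scores, the multiplication of normalized scores by the estimate, and keeping the best weighted score for a document returned by more than one pair.

```python
def merge_by_estimated_success(result_sets):
    """Merge results from several (engine, dataset) pairs.

    result_sets: list of (success_estimate, [(doc_id, score), ...]).
    Returns document ids ordered by estimate-weighted score, best first.
    """
    merged = {}
    for estimate, results in result_sets:
        if not results:
            continue
        scores = [score for _, score in results]
        lowest, highest = min(scores), max(scores)
        span = highest - lowest
        for doc_id, score in results:
            # Assumed normalization: scores are local to each pair, so map
            # them to [0, 1] before applying the pair's success estimate.
            normalized = (score - lowest) / span if span > 0 else 1.0
            weighted = estimate * normalized
            merged[doc_id] = max(merged.get(doc_id, 0.0), weighted)
    return sorted(merged, key=lambda doc_id: (-merged[doc_id], doc_id))

# Two hypothetical pairs with different score scales and success estimates.
pair_a = (0.9, [("a1", 12.0), ("a2", 9.0), ("a3", 6.0)])
pair_b = (0.4, [("b1", 0.8), ("b2", 0.5), ("a2", 0.2)])
ranking = merge_by_estimated_success([pair_a, pair_b])
# ranking == ["a1", "a2", "b1", "b2", "a3"]
```

In this toy run the pair with the higher estimate contributes the top two documents, although its lowest-scored document still falls to the bottom of the merged list.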
The approach is based on the assumption that only minimal information is supplied by the search engine operating on a specific dataset. Access may be provided to the document scores (i.e., the DSs) or document ranks, and to the document frequency (DF) of all query terms. Thus the method we describe uses less information than the prior art methods.