For more than twenty years, information services have provided access to multiple databases. For example, Dialog Information Services, now known as Knight-Ridder Information, Inc., makes several hundred databases (a.k.a. collections) available to searchers. Some of these databases contain bibliographic abstracts, while others contain full-text documents. A searcher is able to apply a query to one or to a plurality of databases. At the outset, the searcher either selects individual databases of interest, based on past experience, or selects a group of databases that the information provider has assembled around a particular topic. For example, a searcher might select the topic of patents, for which the information service has grouped a number of patent-specific databases. When a query is applied to the group of databases, the information service returns the number of hits in each database. The searcher then accesses databases of interest to view individual records. This system was originally designed for librarians and professional researchers who know where to look for desired information.
As wide area networks, such as the Internet, have become available, new opportunities in searching have opened up, not only to searching professionals, but also to lay users. New types of information providers are arising who use public, as well as private, databases to provide bibliographic research data and documents to users. When a user has an interest in a topic, such as patents, he may not know what resources can be assembled for a search, nor the location of the resources. Since the resources frequently change, a searcher will have less interest in the source of a reply than in its relevance. It has been recognized by others that distributed collections, available over wide area networks, can be treated as a single collection. Each sub-collection is searched individually, and the reports are combined in a single list. It has also been recognized by others that documents can be ranked by search engines in accordance with an algorithm and given a weight, taking into account the nature of a particular collection. Document scores can be normalized to obtain the scores that would be obtained if the individual document collections were merged into a single, unified collection.
One problem that exists in the prior art is that the score for each document is not absolute, but depends on the statistics of each collection and on the algorithm associated with each search engine. A second problem is that the standard prior art procedure requires two passes. In a first pass, statistics are collected from each search engine in order to compute the weight for each query term. In a second pass, the information from the first pass is passed back to each search engine, which then assigns a particular weight or score to each hit or identified document. A third problem is that the prior art requires that all collections use the same search engine.
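The two-pass procedure described above can be illustrated with a minimal sketch. The sketch assumes a simple term-frequency and inverse-document-frequency weighting, which the text does not specify, and the collection names, document contents, and function names are all hypothetical. The first pass gathers per-collection statistics, and the second pass sends the resulting term weights back so that each collection scores its own documents on a common basis.

```python
import math
from collections import Counter

# Hypothetical toy collections: each maps a document id to its text.
COLLECTIONS = {
    "patents_a": {"a1": "laser printer patent", "a2": "laser optics"},
    "patents_b": {"b1": "laser printer patent", "b2": "printer ink",
                  "b3": "ink cartridge"},
}


def collect_statistics(docs, query_terms):
    """Pass 1: a collection reports its size and per-term document frequency."""
    df = Counter()
    for text in docs.values():
        words = set(text.split())
        for term in query_terms:
            if term in words:
                df[term] += 1
    return len(docs), df


def global_weights(stats, query_terms):
    """Combine the per-collection statistics into one weight per query term."""
    total_docs = sum(n for n, _ in stats)
    weights = {}
    for term in query_terms:
        total_df = sum(df[term] for _, df in stats)
        # Smoothed inverse document frequency (an assumed formula).
        weights[term] = math.log((1 + total_docs) / (1 + total_df)) + 1.0
    return weights


def score_collection(docs, weights):
    """Pass 2: a collection scores its documents with the shared weights."""
    scores = {}
    for doc_id, text in docs.items():
        tf = Counter(text.split())
        score = sum(tf[term] * w for term, w in weights.items())
        if score > 0:
            scores[doc_id] = score
    return scores


def two_pass_search(collections, query):
    terms = query.split()
    # First pass: every collection is contacted once for statistics.
    stats = [collect_statistics(docs, terms) for docs in collections.values()]
    weights = global_weights(stats, terms)
    # Second pass: every collection is contacted again for scoring.
    merged = {}
    for docs in collections.values():
        merged.update(score_collection(docs, weights))
    # Highest score first; ties are broken by document id.
    return sorted(merged.items(), key=lambda item: (-item[1], item[0]))


results = two_pass_search(COLLECTIONS, "laser printer")
```

In this sketch the identical documents `a1` and `b1` receive the same score, but only because every collection is contacted twice and every collection applies the same scoring function, which mirrors the second and third problems noted above.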
An object of the invention was to devise a method for searching multiple collections in a single pass, with documents ranked on a consistent basis so that if the same document appears in two different databases, it is scored the same when the results are merged. It is not required that the same search engine be used for all collections.