The quantity and diversity of information and services available over public (Internet-type) and private (Intranet-type) local and wide area networks, generically referred to as the “Internet,” have grown substantially. A number of independent Internet search services exist to provide context-based, content-derived indexes searchable over the Internet through a query-based interface. The variety of information accessible through such Internet-based services is growing rapidly in terms of both scope and depth.
Access to certain information available through the Internet may be free of charge, such as from Wikipedia or Google, but much of the information in the press release field may be available only on a for-fee basis. To make a particular fee-based collection, and preferably related sets of fee-based collections, more attractive to users, a collection access provider will acquire licensed rights to make available a wide variety of individual collections of content-related documents as discrete databases that a user can manually select for search. Typically, searches and retrievals of information from the discrete databases are subject to specific access fees determined by the relative commercial worth of the information maintained in the individual databases. Consequently, access fees are typically calculated from the number of documents that are variously searched, reviewed, and retrieved from a particular database in preparing a search report.
A known problem in providing access to multiple databases is the relative difficulty or inefficiency of identifying an optimal database or set of databases that should be searched to obtain the best search report for a particular unstructured, or ad hoc, database query. To support even the possibility of ad hoc queries, the database search must be conducted on a full-text or content-established basis. Existing full-text search engines typically allow a user to search many databases simultaneously. Consequently, selecting the most appropriate set of databases to search places a substantial burden on the user for each query. The user must manually determine and select a particular set of databases that must, by definition, contain the desired results of a query. Such a database set selection is difficult because the selection is made preemptively and independently of the query. This burden may be even more of an issue where access fees are charged for conducting a search against a database even where no search-responsive documents are found or examined. In the aggregate, this problem is typically referred to as the “collection selection problem.”
Previous work in related fields that attempts to solve the “collection selection problem” has centered on optimizing federated (multiple-database) search by deciding which of a number of databases to search, such as described in U.S. Pat. No. 5,845,278 to Kirsch et al., issued Dec. 1, 1998 and titled “Method for automatically selecting collections to search in full text searches.”
In the press release field (and similar fields, such as investor relations), an organization trying to track the “viral spread” of a press release may purchase up to thousands of news feeds from websites, but the set of available feeds numbers in the tens of thousands. The organization attempts to find the most important news feeds in some ad hoc way, without any particular theory of optimization and without any systematic algorithm, while avoiding those news feeds that carry only duplicates of stories that appear elsewhere. Conventional methods for avoiding duplicate stories include taking sample queries, measuring how many stories are returned from one wire, and averaging the relevance coefficients of those stories. The next wire is then queried, stories that are near duplicates of stories already retrieved are discarded, and the average relevance is computed again. This process continues, and the wires with the highest total relevance are selected after some arbitrarily chosen number of wires have been examined.
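The conventional procedure described above can be illustrated with a minimal sketch. It is not taken from any cited reference, and several details are assumptions made only for illustration: the `Story` type, the function names `score_wires` and `select_wires`, the use of word-set (Jaccard) overlap with a fixed threshold as the near-duplicate test, and the reading of “total relevance” as the story count multiplied by the average relevance of the stories kept for each wire.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Story:
    text: str
    relevance: float  # relevance coefficient for the sample queries (assumed to be given)


def jaccard(a: str, b: str) -> float:
    """Word-set overlap; a stand-in (assumption) for any near-duplicate measure."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa and not wb:
        return 1.0
    return len(wa & wb) / len(wa | wb)


def score_wires(wires, max_examined, dup_threshold=0.8):
    """Score wires in the order given, stopping after max_examined wires.

    wires: ordered list of (name, [Story, ...]) pairs, i.e. the stories each
    wire returned for the sample queries. dup_threshold is a hypothetical
    cutoff above which two stories count as near duplicates.
    """
    seen = []    # texts of every story kept so far, across all wires
    scores = {}
    for name, stories in wires[:max_examined]:
        novel = []
        for story in stories:
            # Discard near duplicates of stories that were already retrieved.
            if any(jaccard(story.text, t) >= dup_threshold for t in seen):
                continue
            novel.append(story)
            seen.append(story.text)
        count = len(novel)
        average = sum(s.relevance for s in novel) / count if count else 0.0
        scores[name] = {"count": count, "average": average, "total": average * count}
    return scores


def select_wires(scores, k):
    """Return the k examined wires with the highest total relevance."""
    return sorted(scores, key=lambda n: scores[n]["total"], reverse=True)[:k]
```

In this sketch the scores depend on the order in which the wires are examined and on the arbitrary `max_examined` cutoff: a wire examined later is credited only for stories that no earlier wire returned, and a wire beyond the cutoff is never scored at all.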
It is desirable to improve upon such techniques in a number of ways. In particular, it is desirable to develop a systematic algorithm that overcomes the “collection selection problem” and, more importantly, that is operable to combine seemingly less relevant and less inclusive content sources (none of which would be selected on its own) into a more comprehensive, more relevant, and less costly content source.
It is also desirable that such a method be operable to select a subset of content sources (e.g., newswires) from a large collection of content sources so as to minimize expenditure by avoiding duplicate stories while maintaining high relevance.