1. Field of the Invention
The present invention relates generally to information retrieval and, more specifically, to a system providing improved methodologies for performing information retrieval using Bayesian techniques.
2. Description of the Background Art
The World Wide Web (or “Web”) contains a vast amount of information in the form of hyperlinked documents (e.g., web pages) that are loosely organized and accessed through the “Internet.” The vast majority of the information includes hyperlink “shortcuts” to other information located in other hyperlinked documents. The unstructured nature and sheer volume of data available via the Internet make it difficult to navigate efficiently through related information while avoiding irrelevant information.
To cope with the vast information available, users often employ a computerized search engine to sort through the large quantity of information accessible via data networks such as the Internet. Here, the user simply provides the search engine with a search request or query—that is, a set of one or more search terms for the desired topic. In response, the search engine typically performs a text-based search and returns the most relevant information located. Search engines typically return results as a list of web sites, sorted by relevance to the user's search terms. For example, Google™ and Yahoo™ search engines display search result pages that list links associated with the located web pages, together with a brief description of the content provided by the web pages.
As the Web continues to grow, the goal of providing highly relevant search results in real-time becomes increasingly important. One approach to improving search engine performance is to distribute the query (and associated work) among multiple machines. A distributed query is one in which the query may access underlying base data from multiple network locations; various distributed query models exist today. For simple keyword information retrieval (IR) systems, such as Boolean IR systems where documents either match or they do not, there are a number of architectures for distributing the indexed data and the query processing.
The simplest model is Central Data, Central Query Processing. Here, all document indexing and query processing are done on a central server. However, the solution is not scalable and therefore is not a viable approach for use in a next generation search engine. A slightly more complicated model is Remote Data, Central Query Processing. In this model, document indexing and data storage are done remotely, whilst all the data needed for the query processing is transported across the network from each remote server to the central server. However, this solution is no longer viable, as it suffers from excessive network traffic. As more and more data is indexed, more and more calculation data is required to be sent over the network to the central processing server. For example, if 10 remote servers each contained 500,000 documents, there could be as many as 5 million integers being passed to the central server for each term in the query. For 10 query terms, this is 50 million integers. If each integer requires at least three bytes (e.g., to account for more than 65,536 documents), then each query transfers 150 million bytes, a staggering 143 MB of data. Remote Data, Remote Query Processing attempts to improve upon that approach by performing both the document indexing and query processing remotely. The lists of matching documents are combined on a central server before being returned to the user. However, the model is only possible when the local query processing is unaffected by remote query processing being performed on other servers.
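The traffic estimate in the foregoing example can be reproduced with a short calculation. The following is a minimal sketch only; the function name query_traffic_bytes and the constant names are hypothetical and chosen for illustration, and the figure of one integer per document per query term is the worst case assumed in the example above.

```python
# Back-of-the-envelope traffic estimate for Remote Data, Central Query Processing.
# All names here are illustrative; the numbers come from the example in the text.
SERVERS = 10
DOCS_PER_SERVER = 500_000
QUERY_TERMS = 10
BYTES_PER_INTEGER = 3  # three bytes can address more than 65,536 documents


def query_traffic_bytes(servers, docs_per_server, query_terms, bytes_per_integer):
    """Worst case: one integer per document per query term is sent to the hub."""
    integers_per_term = servers * docs_per_server
    return integers_per_term * query_terms * bytes_per_integer


total_bytes = query_traffic_bytes(
    SERVERS, DOCS_PER_SERVER, QUERY_TERMS, BYTES_PER_INTEGER
)
total_mb = total_bytes / (1024 * 1024)  # megabytes counted as 2**20 bytes

print(f"{total_bytes:,} bytes is about {total_mb:.0f} MB per query")
```

The 143 MB figure follows when a megabyte is counted as 1,048,576 bytes; the cost grows linearly with both the number of indexed documents and the number of query terms.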
Apart from architectural considerations (i.e., central versus distributed), different information retrieval strategies are available, including keyword searching, Bayesian searching, and linguistic-based searching. Today, most enterprise-level searches utilize either keyword searching or Bayesian searching. The main difference between the two is that with keyword searching one may calculate the search results independently of other considerations. With Bayesian searching, on the other hand, a knowledge of the environment (e.g., number of documents indexed throughout an entire system) is required. This difference is perhaps best illustrated by way of example.
Consider, for instance, a Boolean keyword search for documents containing the word “Java,” where a first machine contains 100 matches and a second machine contains 200 matches. Here, the number of matches on the first machine (i.e., 100 matches) is independent of what is happening on any other machine. Thus, in this simple example of a distributed Boolean search, there is no problem with indexing and searching of documents that exist on many remote machines (e.g., servers). Therefore, with that architecture (i.e., Boolean keyword searching), one may employ a central hub together with many remote servers, each performing its own independent indexing (i.e., independent of the other servers). When a query is performed, each remote server can simply return its own local results (i.e., yes/no for individual documents) to the central hub, which may in turn simply tabulate a list of top results (i.e., by simple combination of the local results of each remote server). Importantly, with a simple Boolean keyword search there is no problem with a distributed type of search, as each local result can be computed independently.
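The independence of local results in a distributed Boolean search may be illustrated with a minimal sketch. The functions local_boolean_search and hub_combine, the server names, and the tiny document sets are all hypothetical and serve only to show that each server answers without any knowledge of the others.

```python
def local_boolean_search(documents, term):
    """Return ids of local documents containing the term.

    Uses only this server's own documents; nothing from other servers is needed.
    """
    term = term.lower()
    return [
        doc_id
        for doc_id, text in documents.items()
        if term in text.lower().split()
    ]


def hub_combine(per_server_results):
    """Central hub: simply concatenate each server's independent matches."""
    combined = []
    for server_name, doc_ids in per_server_results.items():
        combined.extend((server_name, doc_id) for doc_id in doc_ids)
    return combined


# Hypothetical documents indexed on two remote servers.
server_a = {1: "Java tutorial", 2: "Python guide"}
server_b = {1: "Learning Java", 2: "Java island travel", 3: "C primer"}

results = hub_combine({
    "A": local_boolean_search(server_a, "Java"),
    "B": local_boolean_search(server_b, "Java"),
})
print(results)
```

Because each match is a plain yes/no decision, adding or removing documents on one server never changes the results produced by another.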
With a Bayesian search, in contrast, searching is performed in a manner that relies on all the information that is gathered about all the servers. Consider, for instance, a query for “Java” where a first machine has 100 documents, 10 of which contain “Java,” and a second machine has 200 documents, 50 of which contain “Java.” With a Bayesian calculation, instead of simply returning a yes or no match (for the given query), the system actually attempts to determine the relevance as well (i.e., how well the documents match). The calculation that determines how well a document matches needs to know not only how important the search term (e.g., “Java”) is on the current machine (e.g., first machine) but also how important the search term is on all the other machines where there are documents indexed (e.g., second machine and so forth). In particular, the Bayesian calculation relies on giving each term in a given query its own term weighting. The term weighting can in fact change over time, as more documents are indexed. Also, each individual server can have an effect on the overall term weighting.
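The dependence of a term weighting on system-wide statistics may be illustrated with the figures from the foregoing example. The sketch below is an assumption made for illustration: it uses a simple inverse-document-frequency-style weight, log(N/n), as a stand-in, since the actual Bayesian weighting formula is not specified here. The name term_weight is hypothetical.

```python
import math


def term_weight(total_docs, docs_with_term):
    """Hypothetical stand-in weight (IDF-style): rarer terms weigh more.

    This is NOT the formula of any particular Bayesian system; it only shows
    that the weight depends on how many documents are counted.
    """
    return math.log(total_docs / docs_with_term)


# Figures from the example: (documents indexed, documents containing "Java").
server_1 = (100, 10)
server_2 = (200, 50)

local_1 = term_weight(*server_1)  # computed from server 1 alone
local_2 = term_weight(*server_2)  # computed from server 2 alone
global_weight = term_weight(
    server_1[0] + server_2[0],  # 300 documents system-wide
    server_1[1] + server_2[1],  # 60 of them contain "Java"
)

print(f"server 1 local weight: {local_1:.3f}")
print(f"server 2 local weight: {local_2:.3f}")
print(f"system-wide weight:    {global_weight:.3f}")
```

Under this assumed weighting, the two servers disagree with each other and with the system-wide value, so scores computed from purely local statistics are not directly comparable when merged.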
Today, existing solutions do not solve the problem of how to get each server participating in a Bayesian distributed search system to return the same accurate relevance score for different documents. In order to achieve enterprise-level scalability, it is necessary to distribute both document indexing and query processing on separate servers. Given that, a problem remains as to how the local results (i.e., matching documents) come back with a correct score from each server.
The current best solution tries to estimate how accurate each individual server's calculation is likely to be. For example, if a first server has 100 documents indexed and a second server has 1,000,000 documents indexed, the current best solution assumes that the server that has the most data indexed (i.e., the second one, which has the larger number of documents indexed) is going to be more correct. This assumption is reasonable because, with the Bayesian calculation, the larger the sample size, the more accurate the result. Therefore, the current best solution does not try to alter the calculations on individual servers, but instead tries to scale all of the results coming back from the various servers. With this approach, each server is effectively given a particular weighting (i.e., scaling factor). Unfortunately, the results obtained by applying a scaling factor leave a lot to be desired.
The basic problem is that the approach is too coarse. Given a lot of documents on one server and a lot of documents on another server, if for some reason the scaling factor is less than completely accurate (e.g., due to insufficient data), then all the documents from one server are inaccurately scaled, in effect, tarring all of the documents with the same brush. For example, suppose 100 documents are returned from a particular server but because of scaling the results from that server are not trusted, so the score for every one of those documents is reduced by 50%. It may turn out, however, that the top documents (e.g., top 10) from that particular server could have a very good score. Therefore, the approach of applying a global scaling factor may incorrectly reduce the score of documents that in fact are highly relevant matches (i.e., should have a good score). In effect, good scores are lumped in with the bad scores. All told, the approach is very temperamental, depending heavily on which documents happen to be indexed on which servers. The approach does not take into account documents indexed on other servers at a fine enough granularity.
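The effect of a per-server scaling factor on otherwise good results may be illustrated with a minimal sketch. The document names, scores, and the helper scale_results are hypothetical; only the 50% reduction is taken from the example above.

```python
def scale_results(results, factor):
    """Apply one scaling factor to every (document, score) pair from a server."""
    return [(doc, score * factor) for doc, score in results]


# Hypothetical local scores. The small server holds one highly relevant document.
large_server = [("large-1", 0.60), ("large-2", 0.55)]
small_server = [("small-1", 0.95), ("small-2", 0.20)]

# The small server is not trusted, so all of its scores are reduced by 50%.
merged = sorted(
    large_server + scale_results(small_server, 0.5),
    key=lambda pair: pair[1],
    reverse=True,
)
ranking = [doc for doc, _ in merged]
print(ranking)
```

In this sketch the document with the best local score (small-1) falls to third place behind two mediocre documents, which is the lumping of good scores with bad scores described above.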
Most distributed IR systems do not use Bayesian statistics to produce lists of matching documents, and therefore do not suffer from the foregoing distributed query calculation problem. At the same time, however, an approach based on Bayesian technique is likely to provide more relevant results, that is, results that are more pertinent to the user's task at hand. What is needed is a system incorporating methodologies for performing Bayesian-based information retrieval, but doing so in a manner that avoids the distributed query calculation problem. The present invention fulfills this and other needs.