The present invention relates in general to searching a corpus of documents, and in particular to search systems and methods utilizing a Bloom filter for caching query results.
The World Wide Web (Web) provides a large collection of interlinked information sources in various formats including texts, images, and media content and relating to virtually every subject imaginable. As the Web has grown, the ability of users to search this collection and identify content relevant to a particular subject has become increasingly important, and a number of search service providers now exist to meet this need. In general, a search service provider publishes a Web page via which a user can submit a query indicating what the user is interested in. In response to the query, the search service provider generates and transmits to the user a list of links to Web pages or sites considered relevant to that query, typically in the form of a “search results” page.
Query response generally involves the following steps. First, a pre-created index or database of Web pages or sites is searched using one or more search terms extracted from the query to generate a list of hits (usually target pages or sites, or references to target pages or sites, that contain the search terms or are otherwise identified as being relevant to the query). Next, the hits are ranked according to predefined criteria, and the best results (according to these criteria) are given the most prominent placement, e.g., at the top of the list. The ranked list of hits is transmitted to the user, usually in the form of a “results” page (or a set of interconnected pages) containing a list of links to the hit pages or sites. Other features, such as sponsored links or advertisements, may also be included on the results page.
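The steps above can be illustrated with a minimal sketch, offered only as an example and not as the method of any particular search provider. It assumes a toy in-memory inverted index; the names `build_index`, `search`, and `PAGES`, the sample page identifiers, and the ranking criterion (total occurrences of the search terms) are all hypothetical choices made for this illustration.

```python
from collections import defaultdict


def build_index(pages):
    """Pre-create an index mapping each lowercase term to the pages containing it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for term in text.lower().split():
            index[term].add(page_id)
    return index


def search(index, pages, query):
    """Return the pages containing every search term, best-ranked first."""
    terms = query.lower().split()
    if not terms:
        return []
    # Step 1: generate the list of hits (pages containing all search terms).
    hits = set.intersection(*(index.get(t, set()) for t in terms))

    # Step 2: rank the hits. The criterion here is an assumption made for
    # illustration: the total number of occurrences of the search terms.
    def score(page_id):
        words = pages[page_id].lower().split()
        return sum(words.count(t) for t in terms)

    # Ties are broken by page id so the ordering is deterministic.
    return sorted(hits, key=lambda p: (-score(p), p))


# Hypothetical sample corpus.
PAGES = {
    "a.example/1": "bloom filter basics",
    "a.example/2": "bloom filter cache bloom filter design",
    "a.example/3": "gardening tips",
}
INDEX = build_index(PAGES)
```

Under these assumptions, `search(INDEX, PAGES, "bloom filter")` places `a.example/2` ahead of `a.example/1`, because the second page contains the search terms more often. A production system would apply far richer ranking criteria and would render the ranked list as a results page.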
Such systems, as well as other very large information query systems, require a significant amount of on-demand database processing. For example, when responding to a query, multiple database “join” operations may be performed over several large database tables when searching the index or database of Web pages. In such a large database, these operations may take a long time to process and thus lengthen the end-to-end response time experienced by the user.
One approach to providing a quicker end-to-end response time has been to pre-compute and cache potential search results. Using such a cache, a front end of a search system can process a user's query and return the result quickly from the cache rather than performing a more extensive and time-consuming search of the entire database. However, such a system has significant disadvantages. First, since users' interests and needs can vary widely, user-requested data may be expansive and occupy a significant amount of cache storage. Therefore, caching useful amounts of such data is infeasible in a very large system. Additionally, some search results may be, by the nature of the system and/or the information stored therein, confidential or otherwise restricted to use by certain users or relatively small groups of users. Caching such results would therefore offer little benefit, since each cached entry could be served to only a few users.
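The storage limitation described above can be seen in a small sketch of a caching front end. This is a simplified stand-in, not the system described here: it caches results on demand instead of pre-computing them, and it uses least-recently-used eviction, which the text does not specify. The class name `CachingFrontEnd`, its `backend` callable, and the `capacity` parameter are hypothetical.

```python
from collections import OrderedDict


class CachingFrontEnd:
    """Answers queries from a bounded cache and falls back to a slow backend."""

    def __init__(self, backend, capacity):
        self.backend = backend      # callable performing the full search
        self.capacity = capacity    # maximum number of cached queries
        self.cache = OrderedDict()  # query -> result, oldest entry first
        self.backend_calls = 0      # how often the full search was needed

    def query(self, q):
        if q in self.cache:
            # Cache hit: mark as recently used and return the stored result.
            self.cache.move_to_end(q)
            return self.cache[q]
        # Cache miss: perform the full, time-consuming search.
        self.backend_calls += 1
        result = self.backend(q)
        self.cache[q] = result
        if len(self.cache) > self.capacity:
            # Storage is limited, so the least recently used entry is evicted.
            self.cache.popitem(last=False)
        return result
```

When the set of distinct queries is much larger than `capacity`, most requests miss the cache and fall through to the backend, which mirrors the first disadvantage noted above.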
Thus, it would be desirable to provide a more efficient way to perform searches of a large corpus of information and return results to the end user quickly.