1. Field of the Invention
The present invention generally relates to systems and methods for caching web documents.
2. Background
A search engine is an information retrieval system used to locate documents and other information stored on a computer system. Search engines are useful for reducing the amount of time required to find information. One well-known type of search engine is a web search engine, which searches for documents, such as web pages, on the “World Wide Web.” The World Wide Web is formed by a large number of interlinked documents hosted on computer systems that are accessible over the Internet. Other types of search engines include personal search engines, mobile search engines, and enterprise search engines that search on intranets.
Web search engines can provide fast and accurate results to user queries, usually as a list of web documents. The web search engine usually provides the results by identifying the web documents that are responsive to the user query and then locating and retrieving those web documents from one or more storage locations. In order to provide fast retrieval of web documents from the storage location(s), the web search engine may access a cache that stores the most frequently accessed web documents.
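By way of illustration only, the following minimal Python sketch shows one way a cache of the most frequently accessed web documents might sit in front of a slower storage location. It is a sketch under stated assumptions, not a description of any particular search engine: the class name `FrequencyCache`, its `backing_store` and `capacity` parameters, and the eviction rule (replace the least frequently accessed cached document only when a newly requested document has been accessed more often) are all hypothetical.

```python
from collections import Counter


class FrequencyCache:
    """Hypothetical cache that keeps the most frequently accessed documents."""

    def __init__(self, backing_store, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.backing_store = backing_store  # slow storage: doc_id -> content
        self.capacity = capacity            # maximum number of cached documents
        self.hits = Counter()               # access count per doc_id
        self.cache = {}                     # fast storage: doc_id -> content

    def get(self, doc_id):
        """Return a document, serving it from the cache when possible."""
        self.hits[doc_id] += 1
        if doc_id in self.cache:
            return self.cache[doc_id]
        content = self.backing_store[doc_id]  # slow path
        self._admit(doc_id, content)
        return content

    def _admit(self, doc_id, content):
        """Cache the document if there is room or if it is hotter than the coldest entry."""
        if len(self.cache) < self.capacity:
            self.cache[doc_id] = content
            return
        coldest = min(self.cache, key=lambda d: self.hits[d])
        if self.hits[doc_id] > self.hits[coldest]:
            del self.cache[coldest]
            self.cache[doc_id] = content


# Usage: "a" is requested twice, so it stays cached while colder documents compete.
store = {"a": "page A", "b": "page B", "c": "page C"}
cache = FrequencyCache(store, capacity=2)
for doc_id in ["a", "a", "b", "c", "c"]:
    cache.get(doc_id)
```

In this sketch a document requested only once cannot displace an equally cold cached document, which keeps one-off requests from churning the cache.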
Development of a search engine that can index a large and diverse collection of documents, yet still return a list of resulting web documents to a user in a timely manner in response to a query, has been recognized to be a difficult problem. A user of a search engine typically supplies a short query to the search engine, the query containing only a few terms, such as “hazardous waste” or “country music.” The search engine attempts to return a list of relevant documents in a timely manner. Although the search engine may return a list of tens or hundreds of documents, most users are likely to access only the top documents (e.g., the top 10-100) on the list.
Thus, to be useful to a user, a search engine (e.g., a web search engine) should be able to access and/or retrieve, from potentially billions of web documents, the top resulting web documents in a timely manner, in response to any query submitted by the user. A storage system may store the most frequently accessed web documents in a cache that is easily accessible by the web search engine. However, a search engine may receive millions, or even billions, of different user queries that potentially correspond to millions, or even billions, of resulting web documents. It is difficult to efficiently store each resulting web document out of billions of web documents that correspond to results of billions of user queries such that the retrieval time for each web document is minimized. Thus, it would be beneficial to efficiently store the billions of resulting web documents that correspond to billions of user queries such that the retrieval time for each web document is minimized.