The World Wide Web (“web”) contains a vast amount of information. Locating a desired portion of the information, however, can be challenging. This problem is compounded because the amount of information on the web and the number of new users inexperienced at web searching are growing rapidly. Search engines assist users in locating desired portions of this information by cataloging web pages. Typically, in response to a user's request, the search engine returns references to documents relevant to the request.
A web crawler is typically used to find and retrieve documents (e.g., web pages) on the web. To retrieve a document from the web, the web crawler sends a request for the document to, for example, a web server, downloads the entire document, and then provides the document to an indexer. The indexer typically takes the text of the crawled document, extracts individual terms from the text, and sorts those terms (e.g., alphabetically) into a search index. The web crawler and indexer repeat this process as the web crawler crawls documents across the web. Each entry in the search index contains a term stored in association with a list of the documents in which the term appears and the locations within each document's text where the term appears. The search index thus permits rapid access to documents that contain terms matching the search terms of a user-supplied search query. To improve search performance, the indexer typically ignores common words, called stop words (e.g., the, is, on, or, of, how, why), when creating or updating the search index. Existing indexers create a single search index that contains terms extracted from all documents crawled on the web.
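The indexing process described above can be illustrated with a minimal sketch. It is not the implementation of any particular search engine: the tokenization rule, the function names (extract_terms, build_index), the document identifiers, and the sample text are all assumptions made for illustration, and the stop-word list contains only the example words given above.

```python
import re
from collections import defaultdict

# Illustrative stop-word list (only the examples named in the text);
# a production indexer would presumably use a much larger list.
STOP_WORDS = {"the", "is", "on", "or", "of", "how", "why"}


def extract_terms(text):
    """Split text into lowercase terms.

    Assumed tokenization: each run of letters or digits is one term.
    """
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Build a search index from {doc_id: text}.

    Each entry maps a term to the documents in which it appears and,
    for each document, the positions at which it appears.
    """
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, term in enumerate(extract_terms(text)):
            if term in STOP_WORDS:
                continue  # stop words are ignored, but still occupy a position
            index[term].setdefault(doc_id, []).append(position)
    # Sort the terms alphabetically, as in the example in the text.
    return dict(sorted(index.items()))


# Hypothetical crawled documents.
docs = {
    "doc1": "The crawler downloads the web page",
    "doc2": "How the indexer sorts the web terms",
}
index = build_index(docs)
print(index["web"])  # documents and positions for the term "web"
```

In this sketch all documents feed a single index, mirroring the single search index that existing indexers create.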
Generally, search engines may base their determination of the user's interest on search terms (called a search query) entered by the user. The goal of the search engine is to identify links to high-quality, relevant results (e.g., web pages) based on the search query using the search index. Typically, the search engine accomplishes this by matching the terms in the search query to terms contained in the search index, and retrieving a list of documents associated with each matching term in the search index. Documents that contain the user's search terms are considered “hits” and are returned to the user. The “hits” returned by the search engine may be ranked among one another by the search engine based on some measure of the quality and/or relevancy of the hits. A basic technique for sorting the search hits relies on the degree to which the search query matches the hits. For example, documents that contain every term of the search query, or that contain multiple occurrences of the terms in the search query, may be deemed more relevant than documents that contain fewer than all of the terms of the search query or only a single occurrence of a term. Such documents may therefore be ranked more highly by the search engine.
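The basic matching and ranking technique described above can be sketched as follows. This is one possible reading of that technique, not a definitive implementation: the function name (search), the layout of the index (term mapped to documents and positions), the whitespace-based query splitting, and the tie-breaking order are assumptions, and the sample index is hypothetical.

```python
from collections import defaultdict


def search(index, query):
    """Return the hits for a query, most relevant first.

    `index` is assumed to map term -> {doc_id: [positions]}.
    A document is a hit if it contains at least one query term.
    Hits are ordered by the number of distinct query terms matched,
    then by the total number of occurrences of those terms, then by
    document identifier so that the ordering is deterministic.
    """
    matched_terms = defaultdict(int)  # distinct query terms found per document
    occurrences = defaultdict(int)    # total occurrences of query terms per document
    for term in set(query.lower().split()):
        for doc_id, positions in index.get(term, {}).items():
            matched_terms[doc_id] += 1
            occurrences[doc_id] += len(positions)
    return sorted(
        matched_terms,
        key=lambda d: (-matched_terms[d], -occurrences[d], d),
    )


# Hypothetical search index.
index = {
    "web": {"doc1": [4], "doc2": [5, 9], "doc3": [0]},
    "crawler": {"doc1": [1]},
    "index": {"doc2": [2]},
}
hits = search(index, "web crawler")
print(hits)
```

Here doc1 contains both query terms and is ranked first; doc2 and doc3 each contain only one query term, and doc2 is ranked ahead of doc3 because that term occurs in it twice.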