Search engines are a commonly used tool for identifying relevant documents from indexed document collections stored locally on disk or remotely over a private or public network, such as an enterprise network or the Internet, respectively. In a document search, a user typically enters a query into a search engine. The search engine evaluates the query against the document collection and returns a set of candidate documents (i.e., a filtered set) that matches the query. If the query is made through a web browser, for example, then the filtered set may be presented as a list of uniform resource locators (“URLs”).
A typical query includes one or more keywords. The search engine may search for the keywords in numerous sources, including the body of documents, the metadata of documents, and additional metadata that may be contained in data stores (e.g., anchor text). Depending on the implementation, the search engine may search for documents that contain all of the keywords in the query (i.e., a conjunctive query) or for documents that contain one or more of the keywords in the query (i.e., a disjunctive query). To process queries efficiently, the search engine may utilize an inverted index data structure that maps each keyword to the documents containing it. The inverted index data structure enables a search engine to determine which documents contain one or more keywords without scanning every document in the collection.
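By way of illustration only, the inverted index and the two query types described above can be sketched as follows. This is a minimal sketch that assumes whitespace-tokenized, lowercased document bodies held in memory; the function names `build_inverted_index`, `conjunctive_query`, and `disjunctive_query` are hypothetical and do not correspond to any particular search engine.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each keyword to the set of document ids whose text contains it.

    `docs` is assumed to be a dict of {doc_id: text}; tokenization is a
    simple lowercase whitespace split, chosen only for illustration.
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def conjunctive_query(index, keywords):
    """Return documents that contain ALL of the keywords."""
    postings = [index.get(k.lower(), set()) for k in keywords]
    if not postings:
        return set()
    return set.intersection(*postings)


def disjunctive_query(index, keywords):
    """Return documents that contain ONE OR MORE of the keywords."""
    return set().union(*(index.get(k.lower(), set()) for k in keywords))


docs = {1: "red apple pie", 2: "green apple", 3: "red car"}
index = build_inverted_index(docs)
filtered_all = conjunctive_query(index, ["red", "apple"])
filtered_any = disjunctive_query(index, ["red", "apple"])
```

In this sketch the conjunctive query intersects the posting sets of the keywords, whereas the disjunctive query takes their union, so the disjunctive filtered set is never smaller than the conjunctive one.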
For large collections of documents, the number of candidate documents can be very large (potentially in the millions), depending on how common the keywords in the query are. Users cannot reasonably be expected to review this many results themselves. In order to reduce the number of search results and to provide more relevant search results, many search engines rank the candidate documents according to relevance, which is typically expressed as a numerical score. In this way, the search engine may sort results according to ranking and return only the most relevant search results to the user. The relevance may be based upon one or more factors, such as the number of times a keyword appears in a document and the location of the keyword within the document.
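As a hedged illustration of ranking by relevance, the following sketch scores each candidate document by the number of keyword occurrences and gives extra weight to occurrences near the start of the document. The scoring formula, the `early_window` and `early_bonus` parameters, and the function names are assumptions made for this example only and are not taken from any particular ranking methodology.

```python
import heapq


def relevance(text, keywords, early_window=10, early_bonus=2.0):
    """Return a numerical relevance score for one document.

    Each keyword occurrence counts 1.0; an occurrence within the first
    `early_window` words counts `early_bonus` instead (assumed weighting).
    """
    terms = {k.lower() for k in keywords}
    score = 0.0
    for position, word in enumerate(text.lower().split()):
        if word in terms:
            score += early_bonus if position < early_window else 1.0
    return score


def top_k(docs, candidate_ids, keywords, k=10):
    """Score every candidate and return the ids of the k highest-scoring."""
    scored = ((relevance(docs[d], keywords), d) for d in candidate_ids)
    return [doc_id for _, doc_id in heapq.nlargest(k, scored)]


docs = {1: "apple apple pie", 2: "pie with one apple", 3: "car"}
best = top_k(docs, [1, 2, 3], ["apple"], k=2)
```

Note that `top_k` in this sketch computes a score for every candidate document before selecting the highest-ranked ones, so its cost grows with the size of the filtered set.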
While numerous methodologies exist for ranking candidate documents, these methodologies typically rank the entire filtered set. When the filtered set is sufficiently large (e.g., when the collection of documents is large and the query includes common words), ranking the entire filtered set can be resource intensive and create performance problems. In particular, not only can the ranking operation be computationally expensive, but reading the necessary data from disk to rank the candidate documents can be time consuming. By reducing the number of candidate documents in the filtered set, the ranking operation can be more efficiently performed and the amount of data read from disk can be significantly reduced. However, randomly removing candidate documents from the filtered set may eliminate potentially relevant search results.
It is with respect to these considerations and others that the disclosure made herein is presented.