Search Engine Context
Typically, in building a search-efficient data collection management system, such as a web search engine, data items are indexed according to some or all of the possible search terms that may be contained in search queries. Thus, conventionally an “inverted index” of the data collection is created, maintained, and updated by the system. The inverted index will comprise a large number of “posting lists” to be reviewed during execution of a search query. Each posting list corresponds to a potential search term and contains “postings”, which are references to the data items in the data collection that include that search term (or otherwise satisfy some other condition that is expressed by the search term). For example, if the data items are text documents, as is often the case for Internet (or “Web”) search engines, then search terms are individual words (and/or some of their most often used combinations), and the inverted index comprises one posting list for every word that has been encountered in at least one of the documents.
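By way of illustration only, the construction of an inverted index over text documents may be sketched as follows. This is a minimal sketch under stated assumptions: the function name build_inverted_index, the whitespace tokenization and the three sample documents are hypothetical and are not taken from any particular search engine.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each word to a posting list of document numbers.

    Assumption: a document's number is simply its position in the
    input list, and words are obtained by lower-casing and splitting
    on whitespace.
    """
    index = defaultdict(list)
    for doc_id, text in enumerate(documents):
        # A set ensures each document appears at most once per posting list.
        for term in set(text.lower().split()):
            index[term].append(doc_id)
    return dict(index)


# Hypothetical sample data collection.
documents = [
    "web search engines index documents",
    "an inverted index maps words to documents",
    "posting lists reference documents",
]
index = build_inverted_index(documents)
```

Because the documents are visited in ascending numerical order, each posting list in this sketch is already sorted by document number.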
Search queries, especially those made by human users, typically have the form of a simple list of one or more words, which are the “search terms” of the search query. Every such search query may be understood as a request to the search engine to locate every data item in the data collection containing each and every one of the search terms specified in the search query. Processing of a search query will involve searching through one or more posting lists of the inverted index. As was discussed above, typically there will be a posting list corresponding to each of the search terms in the search query. Posting lists are searched as they can be easily stored and manipulated in a fast access memory device, whereas the data items themselves cannot (the data items are typically stored in a slower access storage device). This generally allows search queries to be performed at a much higher speed.
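For illustration, locating the data items that contain each and every search term amounts to intersecting the posting lists of those terms. The following is a minimal sketch, assuming that each posting list is sorted in ascending order of document number; the names intersect, search_all_terms and toy_index are hypothetical.

```python
def intersect(a, b):
    """Merge-intersect two posting lists sorted in ascending order."""
    i = j = 0
    common = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            common.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return common


def search_all_terms(index, query):
    """Return the documents containing every term of the query."""
    terms = query.lower().split()
    if not terms:
        return []
    # A term with no posting list matches no document.
    lists = [index.get(term, []) for term in terms]
    # Starting from the shortest list keeps intermediate results small.
    lists.sort(key=len)
    result = list(lists[0])
    for postings in lists[1:]:
        result = intersect(result, postings)
        if not result:
            break
    return result


# Hypothetical inverted index: term -> sorted document numbers.
toy_index = {
    "inverted": [1, 4, 7],
    "index": [0, 1, 3, 4, 9],
    "search": [1, 2, 4, 8],
}
matches = search_all_terms(toy_index, "inverted index search")
```

Only the posting lists are touched in this sketch; the data items themselves are never read, which is what allows the search to run in fast access memory.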
Typically, each data item in a data collection is numbered. Rather than being ordered in some chronological, geographical or alphabetical order in the data collection, data items are commonly ordered (and thus numbered) within the data collection in descending order of what is known in the art as their “query-independent relevance” (hereinafter abbreviated to “QIR”). QIR is a system-calculated heuristic parameter defined in such a way that data items with a higher QIR value are statistically more likely to be considered sufficiently relevant by the requester of any given search query. The data items in the data collection will be ordered so that those with a higher QIR value will be found first when a search is done. They will thus appear at (or towards) the beginning of the search result list (which is typically shown in various pages, with those results at the beginning of the search result list being shown on the first page). Thus, each posting list in the inverted index will contain postings, that is, references to the data items containing the term with which that posting list is associated, with the postings being ordered in descending QIR value order. (This is very commonly the case in respect of web search engines.)
It should be evident, however, that such a heuristic QIR parameter may not provide for an optimal ordering of the search results in respect of any given specific query, as a data item which is generally relevant in many searches (and thus high in terms of QIR) may not be specifically relevant in any particular case. Further, the relevance of any one particular data item will vary between searches. Because of this, conventional search engines implement various methods for filtering, ranking and/or reordering search results to present them in an order that is believed to be relevant to the particular search query yielding those search results. The relevance so determined is known in the art as “query-specific relevance” (hereinafter abbreviated to “QSR”). Many parameters are typically taken into account when determining QSR. These parameters include various characteristics of the search query, of the search requester, and of the data items to be ranked, as well as data collected during (or, more generally, some “knowledge” learned from) past similar search queries.
Thus, the overall process of executing a search query can be considered as having two broad, distinct stages: a first stage wherein the search results are collected based (in part) on their QIR values, aggregated and ordered in descending QIR order; and a second stage wherein at least some of the search results are reordered according to their QSR. Afterwards a new QSR-ordered list of the search results is created and delivered to the search requester. The search result list is typically delivered in parts, starting with the part containing the search results with the highest QSR.
Typically, in the first stage, the collecting of the search results stops after some predefined maximum number of results has been attained or some predefined minimum QIR threshold has been reached. This is known in the art as “pruning”. Pruning is applied because, once the pruning condition has been reached, it is very likely that the relevant data items have already been located.
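The pruning described above may be sketched as follows, purely as an illustration. The sketch assumes that the postings are already ordered in descending QIR value order, so that collection can stop at the first posting that fails either condition; the function name collect_candidates, its parameters and the sample QIR values are hypothetical.

```python
def collect_candidates(postings, qir, max_results, min_qir):
    """First-stage collection with pruning.

    postings:    document numbers in descending QIR order (assumed).
    qir:         mapping from document number to its QIR value.
    max_results: predefined maximum number of results to collect.
    min_qir:     predefined minimum QIR threshold.
    """
    results = []
    for doc_id in postings:
        # Every later posting has an equal or lower QIR, so it is safe to stop.
        if len(results) >= max_results or qir[doc_id] < min_qir:
            break
        results.append(doc_id)
    return results


# Hypothetical QIR values; document 0 has the highest QIR.
qir = {0: 0.9, 1: 0.8, 2: 0.5, 3: 0.2, 4: 0.1}
postings = [0, 1, 2, 3, 4]

by_count = collect_candidates(postings, qir, max_results=2, min_qir=0.0)
by_threshold = collect_candidates(postings, qir, max_results=10, min_qir=0.3)
```

In the first call the maximum number of results ends the collection; in the second call the QIR threshold does.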
Typically, in the second stage, a shorter, QSR-ordered list (which is a subset of the search results of the first stage) is produced. This is because a conventional web search engine, when conducting a search of its data collection (which contains several billions of data items) for data items satisfying a given search query, may easily produce a list of tens of thousands of search results (and even more in some cases). It is impractical to provide the search requester with so many search results. It is therefore of great importance to narrow down the search results actually provided to the requester to a few tens of result items that are potentially of highest relevance to the search requester.
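As an illustration of the second stage, the narrowing of a large candidate list to a short QSR-ordered list may be sketched as a top-k selection. This is a sketch under stated assumptions: the function name rerank is hypothetical, and the QSR values are supplied as a fixed table, whereas an actual system would compute them from the query, the requester, the data items and past queries.

```python
import heapq


def rerank(candidates, qsr_score, k):
    """Return the k candidates with the highest QSR, best first.

    candidates: document numbers collected in the first stage.
    qsr_score:  callable returning the QSR of a document number
                (assumed to be provided by some ranking model).
    k:          number of results to keep.
    """
    # nlargest avoids fully sorting a list of tens of thousands of items.
    return heapq.nlargest(k, candidates, key=qsr_score)


# Hypothetical QSR values for four first-stage candidates.
qsr = {0: 0.2, 1: 0.9, 2: 0.5, 3: 0.7}
top_two = rerank([0, 1, 2, 3], qsr.get, k=2)
```

Here document 0, which was first in QIR order, drops out of the short list because its QSR for this particular query is low.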
In order to address the ranking needs of web search engines, including but not limited to the generation of QIR values and/or QSR values, multiple ranking models have been developed in recent years. These ranking models may enable ranking of documents (e.g., web pages, text files, image files and/or video files) according to one or more parameters. Under some approaches, machine-learning algorithms are used for the construction and operation of ranking models; such approaches are typically referred to as machine-learned ranking (hereinafter abbreviated to “MLR”). As a person skilled in the art of the present technology may appreciate, MLR is not limited to web search engines per se but may be applicable to a broad range of information retrieval systems.