Today's large data centers manage collections of data comprising billions of data items. In such large collections of data items, searching for particular items that meet the conditions of any given search query is a task that consumes a significant amount of computing resources and takes a noticeable amount of time.
Search Engines—General Discussion
Typically, in building a search-efficient data collection management system, data items are indexed according to some or all of the possible search terms that may be contained in search queries. Thus, conventionally an “inverted index” of the data collection is created, maintained, and updated by the system. The inverted index will comprise a large number of “posting lists” to be reviewed during execution of a search query. Each posting list corresponds to a potential search term and contains “postings”, which are references to the data items in the data collection that include that search term (or otherwise satisfy some other condition that is expressed by the search term). For example, if the data items are text documents, as is often the case for Internet (or “Web”) search engines, then search terms are individual words (and/or some of their most often used combinations), and the inverted index comprises one posting list for every word that has been encountered in at least one of the documents.
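By way of illustration only, the construction of such an inverted index can be sketched as follows. This is a minimal sketch and not the implementation of any particular search engine: it assumes that the data items are short text documents keyed by data item number, that search terms are lower-cased words separated by whitespace, and the names `build_inverted_index` and `docs` are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each term to a posting list of data item numbers.

    `documents` is a hypothetical dict {data_item_number: text}.
    Data items are visited in ascending number order, so every
    posting list comes out in ascending order as well.
    """
    index = defaultdict(list)
    for doc_id in sorted(documents):
        # A set gives one posting per (term, data item) pair,
        # however many times the term occurs in the item.
        for term in set(documents[doc_id].lower().split()):
            index[term].append(doc_id)
    return dict(index)


docs = {
    1: "baby stroller review",
    2: "vintage perambulator for sale",
    3: "stroller and perambulator compared",
}
index = build_inverted_index(docs)
```

In this sketch a posting is just a data item number; a practical posting would typically carry additional data as well.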
Search queries, especially those made by human users, typically have the form of a simple list of one or more words, which are the “search terms” of the search query. Every such search query may be understood as a request to the search engine to locate every data item in the data collection containing each and every one of the search terms specified in the search query. Processing of a search query will involve searching through one or more posting lists of the inverted index. As was discussed above, typically there will be a posting list corresponding to each of the search terms in the search query. Posting lists are searched as they can be easily stored and manipulated in a fast access memory device, whereas the data items themselves cannot (the data items are typically stored in a slower access storage device). This generally allows search queries to be performed at a much higher speed.
QIR & QSR
Typically, each data item in a data collection is numbered. Rather than being ordered in some chronological, geographical or alphabetical order in the data collection, data items are commonly ordered (and thus numbered) within the data collection in descending order of what is known in the art as their “query-independent relevance” (hereinafter abbreviated to “QIR”). QIR is a system-calculated heuristic parameter defined in such a way that the data items with a higher QIR value are statistically more likely to be considered by a search requester of any search query as sufficiently relevant to them. The data items in the data collection will be ordered so that those with a higher QIR value will be found first when a search is done. They will thus appear at (or towards) the beginning of the search result list (which is typically shown in various pages, with those results at the beginning of the search result list being shown on the first page). Thus, each posting list in the inverted index will contain postings, a list of references to data items containing the term with which that posting list is associated, with the postings being ordered in descending QIR value order. (This is very commonly the case in respect of Web search engines.)
It should be evident, however, that such a heuristic QIR parameter may not provide for an optimal ordering of the search results in respect of any given specific query, as it will clearly be the case that a data item which is generally relevant in many searches (and thus high in terms of QIR) may not be specifically relevant in any particular case. Further, the relevance of any one particular data item will vary between searches. Because of this, conventional search engines implement various methods for filtering, ranking and/or reordering search results to present them in an order that is believed to be relevant to the particular search query yielding those search results. This is known in the art as “query-specific relevance” (hereinafter abbreviated “QSR”). Many parameters are typically taken into account when determining QSR. These parameters include: various characteristics of the search query; of the search requester; of the data items to be ranked; data having been collected during (or, more generally, some “knowledge” learned from) past similar search queries.
Thus, the overall process of executing a search query can be considered as having two broad distinct stages: A first stage wherein all of the search results are collected based (in part) on their QIR values, aggregated and ordered in descending QIR order; and a second stage wherein at least some of the search results are reordered according to their QSR. Afterwards a new QSR-ordered list of the search results is created and delivered to the search requester. The search result list is typically delivered in parts, starting with the part containing the search results with the highest QSR.
Typically, in the first stage, the collecting of the search results stops after some predefined maximum number of results has been attained or some predefined minimum QIR threshold has been reached. This is known in the art as “pruning”; it is applied because, once the pruning condition has been reached, it is very likely that the relevant data items have already been located.
Typically, in the second stage, a shorter, QSR-ordered, list (which is a subset of the search results of the first stage) is produced. This is because a conventional Web search engine, when conducting a search of its data collection (which contains several billions of data items) for data items satisfying a given search query, may easily produce a list of tens of thousands of search results (and even more in some cases). Obviously the search requester cannot be provided with such an amount of search results. Hence the great importance of narrowing down the search results actually provided to the requester to a few tens of result items that are potentially of highest relevance to the search requester.
The Quorum Rule
Conventional Web search engines face a problem, however, in that human search requesters may express their search queries in an imprecise or sub-optimal way. For example, a search query may contain a search term T (e.g. “perambulator”) that relates to the “theme” of the search (e.g. “baby carriages”; the search requester is looking for some information related to baby carriages) instead of the more common expression “strollers”. Because the expression “perambulator” is much less common in everyday usage than the expression “strollers”, many data items that would actually be relevant search results (with respect to the theme of the search) would not actually be located in a specific search that required the presence of the word “perambulator” in order to be a result item, because they contain the expression “stroller” instead of the search term “perambulator”. Thus, in the first (e.g. result item collecting) stage of the search, many data items relevant to the theme of the search would not even be included in the search result list, because they would not contain the term T.
To handle this issue (and for other reasons which are not relevant to the present discussion), conventional search engines are often configured to search not only for data items including occurrences of each one of the search terms of the search query, but also to search for additional data items that lack one or more of search terms (e.g. less significant search terms), while nonetheless containing all of the others of the search terms (e.g. more significant search terms). This is known in the art as “the quorum rule”.
In a very general form, the quorum rule consists in heuristically assigning different weights to each of the individual search terms T1, T2, . . . Tn in a search query, and setting a threshold “quorum value” that is less than the total weight of all of the search terms (i.e. the sum of the individual weights of each and every one of the search terms). Data items that attain the search quorum weight value are considered as valid search results notwithstanding the fact that they may be lacking one or more of (e.g. the less significant) search terms (in this example the “less significant” search terms being those terms having the least weights). (The search quorum weight value of any particular data item with respect to any particular search is the sum of the weights of the search terms actually found in that data item.)
As an example, a very simple form of the application of the quorum rule that may be used for the purposes of illustration may be as follows: The search terms of a search query may be divided into insignificant search terms (e.g. words very uncommonly used in American English, words that are extremely commonly used in American English, prepositions, conjunctions, articles, auxiliary verbs, etc.) and significant search terms (e.g. all search terms other than the insignificant search terms). The total number of significant search terms in the search query may then be represented by the variable K, and each of the significant search terms may then be assigned an equal weight of 1/K. A threshold quorum value may be established as ⅔. Thus, any data item whose weight with respect to a particular search query is at least ⅔ will qualify as a search result for that search query; and any data item whose weight is less than ⅔ will not qualify as a search result for that search query. (Thus, with respect to the above example, a data item might qualify as a search result with respect to a search containing the search term “perambulator” (a word very uncommonly used in American English) were it to contain all of the other search terms of the search query even if it were missing the word “perambulator”.)
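The simple form of the quorum rule just described can be sketched as follows. This is an illustrative sketch only: the `INSIGNIFICANT` word list, the function names and the sample query are assumptions made for the example, and exact fractions are used so that the comparison with ⅔ is not affected by rounding.

```python
from fractions import Fraction

# Hypothetical list standing in for the "insignificant" search terms
# (very rare words, very common words, articles, prepositions, etc.).
INSIGNIFICANT = {"the", "a", "an", "of", "for", "and", "perambulator"}


def quorum_weight(query_terms, item_terms):
    """Sum of the 1/K weights of the significant terms found in the item."""
    significant = [t for t in query_terms if t not in INSIGNIFICANT]
    k = len(significant)
    if k == 0:
        return Fraction(0)
    found = sum(1 for t in significant if t in item_terms)
    return Fraction(found, k)


def meets_quorum(query_terms, item_terms, threshold=Fraction(2, 3)):
    """True if the data item attains the threshold quorum value."""
    return quorum_weight(query_terms, item_terms) >= threshold


query = ["lightweight", "folding", "perambulator", "reviews"]  # K = 3
full_match = {"lightweight", "folding", "stroller", "reviews"}
partial_match = {"lightweight", "stroller", "reviews"}
weak_match = {"stroller", "reviews"}
```

Here `full_match` has a weight of 3/3 and `partial_match` a weight of 2/3, so both qualify even though neither contains “perambulator”, while `weak_match` (1/3) does not.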
Applying the quorum rule at the first stage of a search (i.e. the search result item collection stage) generally increases the total number of search results collected (as compared with the case where only those data items including every one of the search terms are included in the search results) because in this case not all the search terms need appear in a data item for that data item to be a result of the search. Thus, applying the quorum rule at the first stage of the search makes the second stage of the search process (QSR-ranking) even more important, yet at the same time more difficult to perform, as there are many more data items than would have been the case had the quorum rule not been applied. To handle this problem, conventional search engines implement ranking algorithms based on machine learning principles, using not only information that can be deduced from the search query then currently being executed, but also using a large amount of information collected from previous search queries.
Click-Through Data
In this respect, one very important type of such information from previous search queries is what is termed “click-through” data. At the end of any search query execution, the search requester is usually presented with a search engine result page (“SERP”) that shows a portion of the search results. On the SERP, each data item being a search result is typically shown with its title, a hyperlink to the data item's location on the Internet, and a “snippet” (a short citation from the body of the data item typically containing some or all of the search terms of the search query). The information shown on the SERP can be used by the search requester in selecting the data items most interesting to them for further inspection. Typically, the search requester selects just a few of the data items, by clicking on their hyperlinks, to open them for further reading. Thus, many other data items are left alone without too much attention having been paid to them. While not every data item clicked on (“clicked through to”) by the search requester will be considered by them as an interesting data item, those “clicked through” data items can nevertheless be considered on average as a group as being of greater interest to the search requester than those data items not clicked through. Such clicked through data items can thus be considered as being of a higher QSR with respect to that search query than the non-clicked through data items.
Such “click through” data is conventionally stored in the search engine's database(s). This information can be very helpful for future similar search queries as it can be used later to improve the QSR-ranking of the search results (for future search queries with the same or mostly the same search terms). When ranking the search results of such a future search query, click-through data from past similar queries can be used to assign the clicked-through data items a higher QSR. Thus, such data items can be shown to the then current search requester before other data items having been found during the result collecting stage (the first stage) of the current search query but that were not clicked-through in the past in respect of similar search queries.
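The simplest conceivable use of such data can be sketched as follows. This is an assumption-laden illustration rather than a description of any actual ranking algorithm (which would combine many more signals): the function name `rerank_with_clicks` is hypothetical, and the only rule applied is that previously clicked-through data items are moved ahead of the others while the original QIR order is kept within each group.

```python
def rerank_with_clicks(qir_ordered_results, clicked_items):
    """Promote data items that were clicked through for similar past queries.

    `sorted` is stable, so items keep their QIR order within the
    "clicked" group and within the "not clicked" group.
    """
    return sorted(qir_ordered_results, key=lambda item: item not in clicked_items)


results = [10, 20, 30, 40]   # data item numbers, in descending QIR order
clicked = {30}               # clicked through for a past similar query
reranked = rerank_with_clicks(results, clicked)
```

With these sample values, data item 30 moves to the head of the list and the remaining items keep their relative order.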
Search Engines—Server Types & Functionalities
In order to provide a better understanding of such conventional search engine systems, referring to FIG. 1, the following example is provided: A typical conventional Internet search engine 10, includes four different types of servers (or groups of servers), shown in FIG. 1 as “web-crawler” server 12, “indexing” server 14, “searching” server 16, and “query” server 18, which are each individually described below.
Web-crawler server 12 implements a conventional Internet “web crawler”, whose function it is to seek out and collect copies of webpages from the World-Wide Web (shown as “Web” 28 in FIG. 1) and store each of those pages as “data items” in the “data items” database 20. For each data item, web-crawling server 12 calculates and stores in the data items database 20 a “query-independent relevance” (“QIR”) value. (In some systems, this functionality may be carried out by a separate server that is independent of the web-crawler server 12.)
Indexing server 14 is a conventional indexing server that (re)numbers the data items in the data items database 20. (Indexing server 14 thus receives the QIR value for each data item from the web-crawler server 12.) Indexing server 14 also creates and maintains an inverted index of the data items in the “inverted index” database 22. Thus, indexing server 14 is responsible for actually reviewing each data item and determining what key words are in the data item and then inserting a posting to the relevant posting lists in respect of that data item.
Searching server 16 is a conventional searching server that receives search queries from query server 18 (see below), performs searches across the inverted index stored in the inverted index database 22 in respect of such search queries, and builds a QIR-ordered search result list.
Query server 18 is a conventional query server that receives and parses search queries from search requesters (represented by personal computer 26); and for each search query received, query server 18 initiates a search operation by the searching server 16. Query server 18 obtains the QIR-ordered “search result list” from searching server 16 in respect of the search. Query server 18 calculates for at least some of the data items in the search result list a “query-specific relevance” (“QSR”), and query server 18 builds a QSR-ranked search result list in respect of the search. Query server 18 extracts a “title” and a query-specific “snippet” from the data items database 20 (not particularly shown in the drawings) for each data item in the search result list. Query server 18 delivers to the search requester 26 portions of the QSR-ranked search list, together with their titles and snippets. (Each of the aforementioned functionalities of query server 18 are conventional and are well known in the art.) As is also known in the art, query server 18 further records the search requester's actions of “clicking through” on some of the data items shown to them as part of the search results, and stores appropriate data regarding such click-throughs in its “query database” 24. Query server 18 also searches information regarding past queries in the query database 24 when preparing the search results for a current query and defines the QSR-ranking of at least some search results as a function of the information found in the query database 24 before delivering the search results to the search requester.
Search Engines—Server Operations
Having described the general overall functions of each of the servers 12, 14, 16, and 18, some of the specific operations of the servers 12, 14, 16 and 18 will now be described. In this respect, web-crawling server 12 implements a web crawler that (permanently or periodically—as the case may be) explores the World Wide Web finding new (or recently updated) web pages (illustrated by data path 30). For each such web page that is found a data item is created in the data items database 20 (illustrated by data path 32). In a typical conventional Internet search engine, each data item in the data items database 20 includes a local copy of the corresponding web page on the Internet, a hyperlink to the original web page on the Internet (also called its web address), and a set of data-item attributes that were assigned to the data item during the course of its processing by the search engine system 10. Some of these data-item attributes may be described herein, however others not mentioned herein may also be defined and used by various conventional search engines.
With respect to any new data item, the first operation carried out is to define that data item's QIR value. As QIR values are used for ordering data items, they are typically implemented as a numerical (although not necessarily integer) characteristic of a data item. A QIR value is calculated by the search engine system 10 using many different attributes of the data item itself (including, but not limited to, its title, creation date, original web page location, etc.), and using the number and qualities of references to that data item on other Web pages, and likely also using some “historical” data having been “learned” by the system 10 from data items having been previously entered into the system, from previously executed search queries, and other conventionally-used information. In this respect, there exist a few methods that are well-known in the art for defining a QIR value in a practically suitable manner. In most conventional Internet systems, the calculation of a QIR value for each new data item is performed by the web-crawler server 12; however in some others it is performed by a different server, such as, for example, indexing server 14 or a dedicated QIR server.
Each data item stored in the data items database 20 is known within the system 10 by its unique system-assigned identifier, which is typically an ordinal number. Typically the entire collection of data items managed by a large Internet search engine is too large to be contained on one database server, and thus it is customarily split into several database “shards”. Where such is the case, each shard will typically have its own data item numbering scheme and its own logic for performing a search on its portion of the document database. When executing a search query each of the partial per-shard search result lists, once generated, are merged into one common QIR-ordered list, which is then QSR ordered.
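The merging of per-shard result lists can be sketched as follows. Because each shard has its own numbering scheme, this sketch merges on the QIR value itself rather than on data item numbers; the tuple layout `(qir, shard_id, local_item_number)` and the function name are assumptions made for the example.

```python
import heapq


def merge_shard_results(per_shard_results):
    """Merge per-shard result lists into one list in descending QIR order.

    Each input list holds (qir, shard_id, local_item_number) tuples and
    must already be sorted by descending qir, as produced by each shard.
    """
    return list(heapq.merge(*per_shard_results, key=lambda r: r[0], reverse=True))


shard_a = [(0.9, "A", 1), (0.4, "A", 7)]
shard_b = [(0.8, "B", 2), (0.5, "B", 3), (0.1, "B", 9)]
merged = merge_shard_results([shard_a, shard_b])
```

A lazy merge of this kind reads each per-shard list once from its beginning, which fits the fact that every shard already delivers its results in QIR order.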
Data items are numbered by the system 10 in descending order of their QIR, rather than in the order that they were obtained by the web-crawler server 12. Data items having the same QIR can be numbered in any order, for example in inverse chronological order (the latest data items being assigned lesser numbers, in order to be found before the earlier ones). Hence, if a newly received data item D appears to have its QIR value less than that of an existing data item (say #999), but greater than or equal to the QIR value of the next data item (#1000), then D will be assigned #1000, while the old #1000 will become #1001 and so on. Hence, both the data item numbers and the content of the inverted index (see below) are permanently or periodically updated. Typically the data item (re)numbering operation is performed by the indexing server 14, but this is not required to be the case.
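The choice of a number for a newly received data item can be sketched as follows. This is an illustrative sketch under stated assumptions: the QIR values of the existing data items are held in a hypothetical list `qirs_desc` in which position i holds the QIR of data item number i+1, and ties are resolved in inverse chronological order as in the example above.

```python
import bisect


def assign_number(qirs_desc, new_qir):
    """Return the data item number for a new item with QIR `new_qir`.

    The new item is placed after every existing item with a strictly
    greater QIR and before existing items with an equal or lesser QIR.
    """
    # bisect works on ascending sequences, so search the negated values.
    negated = [-q for q in qirs_desc]
    return bisect.bisect_left(negated, -new_qir) + 1


qirs_desc = [0.9, 0.7, 0.7, 0.2]   # QIR of data items #1 to #4
```

For instance, a new data item with a QIR of 0.7 receives #2 (ahead of the two older items with the same QIR), and the existing items from #2 onward each move up by one.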
Once a data item (e.g. D) is received by the web crawler server 12, stored in the data items database 20, assigned its QIR value, assigned its data item number (e.g. #1000), it is passed on to the indexing server 14 (data path 34 on FIG. 1) for further processing by the latter (bidirectional data path 36). The indexing server 14 manages its database 22 (bidirectional data path 38), which basically comprises an inverted index of the data item collection contained in the data items database 20.
Postings & Posting Lists
As was described hereinabove, the inverted index basically comprises a number of posting lists. The indexing server 14 inspects the new data item #1000, discerns in it various “searchable terms”, and for each searchable term found in the data item it creates a new entry (e.g. a “posting”) in the appropriate posting list.
A posting in a posting list basically includes a data item number (or other information sufficient to calculate a data item number), and optionally includes some additional data. Every posting list corresponds to a searchable term, and comprises a series of postings referencing each of those data items in the data items database 20 that contain at least one occurrence of that searchable term.
Additional data may also be found in a posting; for example, the number of occurrences of a given searchable term in a given data item; whether this search term occurs in the title of the data item, etc. This additional information may be different depending on the search engine.
Searchable terms are typically, but not exclusively, words or other character strings. A general use Web search engine typically deals with practically every word in a number of different languages, as well as proper names, numbers, symbols, etc. Also included may be “words” having commonly found typographical errors. In the present specification, any such searchable term may be referred to as a “word” or a “term”. For each searchable term that has been encountered in at least one data item, the indexing server 14 updates the corresponding posting list, or creates a new one if the term is being encountered for the first time. Hence the total number of posting lists may be as large as a few million. The length of a given posting list depends on how commonly used the corresponding word is in the data items universe (e.g. on the Internet). A very commonly used word may have a posting list of as long as one billion entries (or even more—there is no limit). (In practical use, when the data items database 20 is split into several “shards”, each shard maintains its own separate inverted index 22, thus greatly reducing the length of posting lists in each shard.)
In each posting list, data item postings are placed in an ascending order of their data item numbers, that is, in the descending order of their QIR. Hence, the process of indexing a new data item D is not limited to inserting the data item number of D, say #1000, into the posting list of every word Ti occurring in D. Rather, when assigning to D an already existing data item number #1000, every existing posting in every posting list, to data item number equal or greater than #1000, must be updated (incremented by 1 in this example). In actuality, conventional search engines typically perform this update operation periodically for batches of data items having been received since the previous time that the inverted index database 22 was updated.
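The update of the posting lists on insertion of a single new data item can be sketched as follows. This sketch is deliberately naive (it handles one data item at a time, whereas the batch processing mentioned above is what is typically done in practice), postings are reduced to bare data item numbers, and the function name `insert_data_item` is hypothetical.

```python
import bisect


def insert_data_item(index, new_number, new_terms):
    """Insert a data item numbered `new_number` into the inverted index.

    `index` maps each term to an ascending list of data item numbers.
    Every existing posting >= new_number is incremented by 1, then the
    new number is added to the posting list of each of its terms.
    """
    for postings in index.values():
        # Lists are ascending, so everything from `start` onward shifts.
        start = bisect.bisect_left(postings, new_number)
        for i in range(start, len(postings)):
            postings[i] += 1
    for term in new_terms:
        bisect.insort(index.setdefault(term, []), new_number)


index = {"stroller": [998, 1000, 1001], "review": [999, 1000]}
insert_data_item(index, 1000, {"stroller", "pram"})
```

After the call, the old #1000 appears as #1001 in every posting list in which it occurred, and the new data item occupies #1000 in the lists of its own terms.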
Conventional Execution of Search Queries
Data items stored in the data items database 20 and indexed in the inverted index database 22 can then be searched for. Again with reference to FIG. 1, search queries are made by human users (“search requesters” which are collectively depicted on FIG. 1 by an image of a personal computer 26) and are received by the query server 18 (data path 50 in FIG. 1). The query server 18 parses each search query received into its various search terms (which may include optionally dropping auxiliary words such as prepositions and conjunctions not to be searched for because of their ubiquity), and may also perform some other conventional actions. For example, a search query Q1, received at time t0, may comprise four search terms T1, T2, T3, T4. This is denoted as Q1[T1,T2,T3,T4] in FIG. 2.
The query Q1 is then passed by the query server 18 to the searching server 16 (data path 44). The latter basically operates on the inverted index database 22, that is, on the inverted index with its many posting lists. In this example, the search process, or execution of a search query, consists of finding the data item numbers of all those data items that contain occurrences of each search term specified in the search query (as was discussed above this is the simplest form of a search process; in a further example described below a quorum principle will be introduced). Typically this is done by exploring in parallel each of the posting lists corresponding to the search terms of the query, starting from the beginning of each posting list. In the present example, posting lists P1, P2, P3, P4 correspond to the search terms T1, T2, T3, T4 respectively (as shown on the upper part of FIG. 2). (In a more general manner the posting list corresponding to a term Tn is denoted in this specification as Pn). A data item whose number is encountered in each posting list relevant to the search query is considered to be a search result (sometimes also conventionally called a “hit”), and is placed in a search result list as the search result list's then next element (i.e. after hits already having been placed in the result list). In this way, the search result list of a search query is in ascending order of data item numbers, and thus in descending order of QIR value.
This procedure of finding further search results stops either when reaching the end of one of the posting lists, or when some “pruning condition” (as was mentioned above) has been satisfied. In various conventional examples, the pruning condition might, for example, be defined by the query server 18 on a per query basis and provided with each query Q by the query server 18 to the searching server 16; alternatively the pruning condition might be fixed with respect to the system and be the same for all queries. In either case, the pruning condition could be expressed, for example, as a maximum number of data items in the search result list, or as a minimum QIR value for a data item to be included in the search result list, or in another different conventional manner. In any case, application of a pruning condition is supposed to “pick” the best results in terms of their QIR.
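The parallel exploration of the posting lists, together with a pruning condition expressed as a maximum number of results, can be sketched as follows. This is a minimal sketch assuming that each posting is a bare data item number and that every posting list is in ascending order; the function name and the parameter `max_results` are hypothetical.

```python
def intersect_postings(posting_lists, max_results=None):
    """Collect data item numbers present in every posting list.

    All lists are walked in parallel from their beginning. Collection
    stops when any list is exhausted or when `max_results` hits have
    been found (the pruning condition).
    """
    if not posting_lists:
        return []
    cursors = [0] * len(posting_lists)
    results = []
    while all(c < len(p) for c, p in zip(cursors, posting_lists)):
        current = [p[c] for c, p in zip(cursors, posting_lists)]
        target = max(current)
        if all(v == target for v in current):
            results.append(target)  # a hit: present in every list
            if max_results is not None and len(results) >= max_results:
                break  # pruning condition satisfied
            cursors = [c + 1 for c in cursors]
        else:
            # Advance only the cursors lagging behind the largest number.
            cursors = [c + 1 if v < target else c
                       for c, v in zip(cursors, current)]
    return results


P1 = [1, 3, 5, 8, 13]
P2 = [3, 4, 5, 13, 21]
P3 = [2, 3, 5, 6, 13]
P4 = [3, 5, 7, 13]
hits = intersect_postings([P1, P2, P3, P4])
pruned = intersect_postings([P1, P2, P3, P4], max_results=2)
```

Because hits are appended in ascending order of data item number, the result list is automatically in descending order of QIR, and pruning after the first few hits keeps those with the highest QIR.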
The search result list prepared by the searching server 16 for a given query, e.g. for Q1, is then sent back by searching server 16 to the query server 18 (data path 42). (In the following description and in FIGS. 2 and 3 the search result list for a query Qm is denoted as “R(Qm)”.) In terms of two-stage query execution described above, the first stage (collection of search results) is now terminated, and the second stage, that of ranking, or reordering, of the search result list starts. In this respect, the query server 18, before delivering the results to the search requester, reorders them in a way presumably most suitable for this particular given query, by placing at the highest positions in the list those search results (data items) that have the highest query-specific relevance (QSR) for that particular given query. This QSR-ranking and reordering of the originally QIR-ordered search result list is probably the most sophisticated operation performed by a Web search engine, and the one most influencing final user (e.g. search requester) satisfaction.
In order to define a best QSR ranking for a particular given query, information from many different sources is taken into account at the same time. Part of the information used in assessing the QSR of a data item may be found in the data item itself; for example, the total number of occurrences in the data item of each search term of the given search query; occurrences of two or more of the search terms found in close proximity to each other (e.g. in the same phrase), or, yet better, following each other in the same order as in the search query; search terms found in the title of the document, etc. However, all these are limited-scope criteria that might not necessarily reflect the level of “user satisfaction” with a given data item in the context of a given particular query.
Hence, some conventional Web search engines make use of historical information collected from a large quantity of previously executed search queries, and stored in a database. This “query database” is shown on FIG. 1 in association with reference number 24, and accessed by the query server 18 via bidirectional data-path 46. As is known in the art, from each query, diverse information can be extracted, stored and processed, and then used for better QSR-ranking of results for the next query. In the context of the present example, only “click-through” data as was briefly discussed above is considered to be relevant. In this respect, a user U1 having made a search query, say, Q1[T1,T2,T3,T4], receives from the query server 18 a list of search results having been found for the query by the searching server 16 and further having been ranked by the query server 18 (as was previously discussed above). In many cases the list is very long, so it is sent to the user in portions (or “pages”) of, for example, 20 entries each. Every entry is “clickable”, that is, if clicked by the user with their mouse or other pointing device, causes the data item to open, for example, in another window or another tab of the browser application on the user's computer. It is likely beneficial for the user to be provided with a quick glance at each of the search results prior to opening them, so that they do not waste their time having to open data item after data item trying to locate the right one. To that end, the query server 18 typically provides the user with a “snippet”, a short citation (or a few yet shorter fragments collected together) from the data item where the requested search terms occur in a presumably self-explanatory context. After looking at the snippet (as well as the other information provided) the user can decide whether to open the data item (by “clicking through” to it), or not.
Illustration of Conventional Use of Click-Through Data
Upon opening a data item, the user can look at it more carefully and decide whether it is definitely of interest to them or not. While the search engine has no way of explicitly “knowing” whether or not the data item is of interest to the user, the search engine can record the mere fact of the user having clicked-through to a given data item appearing on the search result page. This is because the search result page is typically provided to the user by the search engine in a Web application that is typically programmed in a way that every “click-through” action on the page is first sent back to the search engine (in the present example to query server 18 of the system 10). The query server 18 then redirects the user to the web-page of the requested data item (or, alternatively, shows them a copy of the data item stored in the data items database 20). In this way, the query server 18 is capable of recording all the click-through actions performed by users on search result pages provided to them.
It has been statistically verified that, among the search results of a query that were actually shown to the query issuer, those that were clicked through were on average of more interest to the issuer than those that were not. Moreover, the last clicked-through data item in the list, that is, the one after which the user stopped further inspection of the list and did not click through to any other items, has proven to be on average of yet more interest to the user than all the previously clicked-through data items. These statistical observations are the reason why the “click-through history” of past search queries is used to better rank the search result list of each subsequent search query.
In FIG. 2, the query database 24 stores click-through data from past queries in the form of records <Dk; Qm[T1,T2,T3, . . . Tn]> indicating that the document Dk had been clicked through by the issuer of the query Qm[T1,T2,T3, . . . Tn] when he/she was exploring the search results for that query. Optionally, as is known in the art, there could also be recorded (and then used at some later time) data with respect to the search requester (e.g. their IP address), the query execution time, etc. The above collection of records represents a database that can be sorted by documents clicked through, or by some or all of the search terms used in queries, or in any other way.
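By way of illustration only, the record collection described above could be sketched as follows. The names ClickRecord, QueryDatabase, by_doc and by_term are hypothetical and do not appear in the figures; the sketch merely assumes an in-memory store that can be looked up either by clicked-through document or by search term.

```python
from collections import defaultdict
from typing import NamedTuple


class ClickRecord(NamedTuple):
    """One record <Dk; Qm[T1, ..., Tn]> (hypothetical representation)."""
    doc_id: str
    query_id: str
    terms: tuple


class QueryDatabase:
    """Assumed in-memory stand-in for the query database 24."""

    def __init__(self):
        self.records = []
        self.by_doc = defaultdict(list)   # document id -> records
        self.by_term = defaultdict(list)  # search term -> records

    def add(self, doc_id, query_id, terms):
        rec = ClickRecord(doc_id, query_id, tuple(terms))
        self.records.append(rec)
        self.by_doc[doc_id].append(rec)
        for term in rec.terms:
            self.by_term[term].append(rec)
        return rec


db = QueryDatabase()
db.add("D1", "Q1", ["T1", "T2", "T3", "T4"])  # record <D1; Q1[T1,T2,T3,T4]>
```

Keeping one index per access path is only one possible design; a real system could equally keep a single table and sort it on demand.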
In FIG. 2, for example, the user U1 issues a query Q1[T1,T2,T3,T4], which is executed by the searching server 16 by examining the posting lists P1, P2, P3, P4 of the search terms T1, T2, T3, T4 (respectively) of the search query Q1. Illustratively, a data item D1 (more exactly, a posting (i.e. a reference) to D1) is found in each of these posting lists; hence D1 is included in the search result list R(Q1) for the query Q1. The search result list is, after some QSR reordering, presented to the user U1. The user U1 clicks through the entry corresponding to the data item D1 in the list, considering that it might be of interest to them. (The fact of a data item having been clicked through is schematically indicated on both FIG. 2 and FIG. 3 by an asterisk “*”.) This information is stored in the query database 24 as a record <D1; Q1[T1, T2, T3, T4]>.
At some later point in time, by comparing queries with “almost the same” search terms, and/or with “mostly the same” search result lists, especially those with “mostly the same” subsets of their “clicked-through” results, the system 10 (namely, its query server 18) can establish some “degree of similarity” among past queries, and also between a next query, e.g. Q2, and some of the past queries, e.g. Q0. Because the manner in which this occurs is both complicated and conventional, the details thereof will not be discussed herein; what is important for present purposes is to understand how information from past queries similar to a current query Q2 is conventionally used to help a search engine deliver more appropriate results to the current search requester.
In this respect, if a then current query, e.g. Q2, is found to be similar to some past query, e.g. Q1, and if among the search results for Q2 there is a data item D1, for which a record <D1; Q1[ . . . ]> exists in the query database 24, signifying that the document D1 was among the results for Q1 as well, and, moreover, had been clicked through by a past issuer of Q1, then the data item D1 is considered as being of higher QSR for Q2 than other results for Q2 with same or similar other characteristics. In other words, the above criterion of “having been clicked through in one or more past similar queries”, while not decisive, is used as one of the criteria capable of increasing the QSR of D1 for Q2, and hence of pushing D1 higher in the ordered list of search results for Q2. Thus D1 will be shown to the search requester in the search result list at an earlier time (i.e. at a higher position in the list) than it would have been had D1 not previously been clicked through.
This is illustratively shown on FIG. 2. A user U2 (which may be the same as U1 or may be another user) issues a search query Q2[T1,T2,T4,T5] that differs from the previously considered query Q1[T1,T2,T3,T4] in that it does not include the search term T3, but rather includes some other search term T5 instead. Again, the searching server 16 looks through the posting lists corresponding to the search terms, this time the posting lists P1, P2, P4, P5 corresponding to search terms T1, T2, T4, T5 of the query Q2. (In FIG. 2 this is shown in a second image of the indexing database 22, denoted 22(2).) Illustratively, the same document D1 is again found in each of the posting lists; hence D1 is included in the search result list R(Q2) for query Q2. However, this time the result list R(Q2) contains too many other documents of presumably higher relevance to the user U2, for the document D1 to be even shown to them. This is illustratively depicted on FIG. 2 by placing D1 in a lower position within the list R(Q2).
In accordance with the conventional use of click-through data, however, the query server 18 (not shown on FIG. 2), before presenting the result list R(Q2) to the user U2, consults the query database 24 and finds there (amongst probably other information) the previously stored record <D1;Q1[T1,T2,T3,T4]> showing that the document D1 had been clicked through in one of the previous queries, namely in the query Q1[T1,T2,T3,T4], which differs from the then present query Q2[T1,T2,T4,T5] in just one of its four search terms. Considering that the fact of having been clicked through brings some additional value to D1, the query server 18 now upgrades the document D1 to a higher position in the list R(Q2), as shown by a dotted-arc arrow on FIG. 2, such that D1 will now be presented to user U2.
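The upgrade of D1 described above may be sketched as follows. This is a simplified illustration under stated assumptions, not the conventional method itself: query similarity is approximated here by the Jaccard overlap of the search terms, and the boost is a fixed bonus added to a numeric score. The functions jaccard and rerank, the threshold min_similarity, the bonus value, and the documents D5, D7, D9 with their scores are all hypothetical.

```python
def jaccard(terms_a, terms_b):
    """Assumed similarity measure: shared terms divided by all distinct terms."""
    a, b = set(terms_a), set(terms_b)
    return len(a & b) / len(a | b) if (a | b) else 0.0


def rerank(results, query_terms, click_records, min_similarity=0.5, bonus=45):
    """Raise the score of results clicked through in a similar past query.

    results:       list of (doc_id, score) pairs for the current query
    click_records: list of (doc_id, past_query_terms) pairs
    """
    boosted = []
    for doc_id, score in results:
        clicked_in_similar = any(
            rec_doc == doc_id
            and jaccard(query_terms, rec_terms) >= min_similarity
            for rec_doc, rec_terms in click_records
        )
        boosted.append((doc_id, score + bonus if clicked_in_similar else score))
    # Highest score first; sorted() is stable, so ties keep their input order.
    return sorted(boosted, key=lambda item: item[1], reverse=True)


# Record <D1; Q1[T1,T2,T3,T4]> stored from the earlier query Q1.
click_records = [("D1", ("T1", "T2", "T3", "T4"))]

# Hypothetical result list R(Q2) with D1 initially at the bottom.
r_q2 = [("D7", 90), ("D5", 80), ("D9", 70), ("D1", 40)]
ranked = rerank(r_q2, ("T1", "T2", "T4", "T5"), click_records)
```

With these assumed values, Q1 and Q2 share three of five distinct terms, which exceeds the threshold, so D1 moves from the last position to the second. In keeping with the text, the click-through criterion raises the position of D1 without being decisive on its own.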
Illustration of Conventional Use of the Quorum Rule
Before continuing on, it is helpful to have an understanding of another concept used in the prior art (and briefly introduced herein above): that of a quorum in multi-criteria data search. Generally, a quorum-based search means that, when executing a search for a multi-criteria query, search results are not only those data items that satisfy all the criteria of a search query, but also other data items that possibly satisfy just some of the criteria, according to a “quorum rule”.
The quorum rule is typically expressed in terms of a minimum value wq for the sum of “weights” of all the search criteria that are satisfied for a given data item, or, more specifically, of all the search terms in the query that are contained in that data item. So, if for a query Qm[T1,T2, . . . Tn] with n terms, the respective weights of the terms are established at w1, w2, . . . wn, then wq will be fixed at some value lower than the sum w1+w2+ . . . +wn, so that some of the data items that do not contain each and every search term T1, T2, . . . Tn, will nevertheless be considered as valid search results, provided that the sum of weights of all terms that such data item does contain is still not lower than the quorum value wq.
In more precise terms,
let OCC(T,D) be a Boolean function indicating the presence of a term T in a data item D, which is equal to 1 when T occurs at least once in D, and is equal to 0 otherwise;
let W(T,Q) be the weight of a term T in a query Q containing that term; and
let W(D,Q) be a weighting function of a document D for a query Q[T1, T2, . . . Tn], defined as:
W(D,Q)=W(T1,Q)·OCC(T1,D)+W(T2,Q)·OCC(T2,D)+ . . . +W(Tn,Q)·OCC(Tn,D).
Then a quorum condition means, for a quorum value wq, that every document D, for which W(D,Q)≥wq, is considered a search result for Q.
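The definitions above can be illustrated by a short sketch. The function names occ, doc_weight and satisfies_quorum, as well as the integer weights and the quorum value used in the example, are assumptions chosen for illustration; a document is represented simply as the set of terms it contains.

```python
def occ(term, doc_terms):
    """OCC(T, D): 1 if the term occurs at least once in the document, else 0."""
    return 1 if term in doc_terms else 0


def doc_weight(doc_terms, term_weights):
    """W(D, Q): sum of W(Ti, Q) * OCC(Ti, D) over all terms Ti of the query."""
    return sum(w * occ(t, doc_terms) for t, w in term_weights.items())


def satisfies_quorum(doc_terms, term_weights, wq):
    """Quorum condition: the document is a search result when W(D, Q) >= wq."""
    return doc_weight(doc_terms, term_weights) >= wq


# Hypothetical weights W(Ti, Q) for a four-term query, and a quorum value wq.
weights = {"T1": 4, "T2": 3, "T3": 2, "T4": 1}
wq = 7

doc_a = {"T1", "T2", "T9"}  # contains T1 and T2: W = 4 + 3 = 7
doc_b = {"T2", "T3", "T4"}  # contains T2, T3, T4: W = 3 + 2 + 1 = 6
```

Here doc_a satisfies the quorum with only two of the four terms, whereas doc_b does not despite containing three, which shows how unequal weights make some terms count for more than others.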
As was discussed hereinabove, the simplest form of a quorum rule corresponds to a case where all the search terms in a query have the same weight, and the quorum value is established so as to allow some proportion of terms to be missing in a document. For example, all terms in an n-term query Q may be defined as having the same weight 1 (so that their total weight is n), and the quorum value wq is established at ⅔·n. Another form of functionally the same quorum rule may consist in assigning, for a query Q[T1,T2, . . . Tn] with any number n of search terms, W(T1,Q)=W(T2,Q)= . . . =W(Tn,Q)=1/n (so that their total weight is 1), and establishing wq=⅔. In the subsequent examples this simple form of a quorum rule will mostly be used.
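That the two equal-weight formulations are functionally the same can be checked with a brief sketch. The function names below are hypothetical, and exact rational arithmetic (Python's fractions module) is assumed in order to avoid rounding effects at the ⅔ boundary.

```python
from fractions import Fraction


def passes_count_form(query_terms, doc_terms):
    """Each term weighs 1; quorum value wq = 2/3 * n."""
    n = len(query_terms)
    matched = sum(1 for t in query_terms if t in doc_terms)
    return matched >= Fraction(2, 3) * n


def passes_normalized_form(query_terms, doc_terms):
    """Each term weighs 1/n; quorum value wq = 2/3."""
    n = len(query_terms)
    weight = sum(Fraction(1, n) for t in query_terms if t in doc_terms)
    return weight >= Fraction(2, 3)


query = ("T1", "T2", "T3", "T4")
three_of_four = {"T1", "T2", "T4"}  # weight 3 of 4, i.e. 3/4 >= 2/3
two_of_four = {"T1", "T3"}          # weight 2 of 4, i.e. 1/2 <  2/3
```

For a four-term query, a document containing any three of the terms passes under both forms, and a document containing only two fails under both, since dividing every weight and the quorum value by n leaves the comparison unchanged.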