The increasing demand for Web-based search engines and services has helped fuel a corresponding demand for ever more accurate and cogent search results. A number of Web search services exist which permit a user to type in desired search terms and in response be presented with a ranked list of Web sites containing material which is potentially relevant to that search. As the number of available Web pages has grown into the billions, the necessity for an accurate assessment of the relatedness or relevance of Web site results to a user's search terms has grown even more acute.
There are a variety of known techniques for assessing the relevance of search results to a user's query. These include the examination of search or query logs, for instance logs stored on a server, to review the search terms, the resulting ranked list of results and the user's ultimate click-through or other selections from that list. The results which users most frequently choose to access may be empirically presumed to be the results which they, as consumers of the information, judged to be most relevant to their query.
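As an illustration of the query-log approach described above, the following is a minimal sketch that counts click-throughs per query and orders results by how often users selected them. The record layout (a query string paired with a clicked URL), the sample data and the function name `rank_by_clicks` are assumptions made for this example and do not reflect any particular log format.

```python
from collections import Counter, defaultdict

# Hypothetical log records: (query, clicked_url). Real query logs
# typically carry more fields (timestamp, result position, session).
LOG = [
    ("jaguar speed", "example.com/cats"),
    ("jaguar speed", "example.com/cats"),
    ("jaguar speed", "example.com/cars"),
    ("python tutorial", "example.org/learn"),
]


def rank_by_clicks(log):
    """Return {query: [(url, clicks), ...]}, most-clicked URL first."""
    clicks = defaultdict(Counter)
    for query, url in log:
        clicks[query][url] += 1
    return {query: counts.most_common() for query, counts in clicks.items()}


ranked = rank_by_clicks(LOG)
```

In this sketch the click count stands in for relevance, which is the empirical presumption stated above. Raw counts are also influenced by factors such as the position at which a result was displayed.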
Results may also in some cases be evaluated by teams of human search analysts, who may review search terms and results and formulate their own assessment of the relatedness of search hits to the query. Human-annotated results may require more time to complete and therefore may not always represent a practical large-scale or real-time rating scheme. However, because human reviewers can reach judgments about relevance with greater flexibility than many algorithmic or heuristic approaches, those relevance measures, when available, may be considered equally or more likely to be accurate than other metrics.
Similarly, a company deploying a search service may perform evaluations of the quality of its search relevance algorithms by consulting the result rankings for the same or similar searches produced by other public or commercial Web search engines or sites. A significant divergence between the relevance ratings generated by two separate search services may indicate that the evaluation methods of one or both engines may be inaccurate or incomplete.
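One way to quantify such a divergence between two engines' rankings is a rank correlation. The sketch below computes a Kendall tau coefficient between two orderings. It assumes, for simplicity, that both engines rank exactly the same set of results with no ties, which will often not hold in practice. The function name `kendall_tau` and the choice of this particular measure are illustrative assumptions, not a method stated in the text.

```python
from itertools import combinations


def kendall_tau(rank_a, rank_b):
    """Kendall tau between two rankings of the same items (no ties).

    Returns 1.0 for identical orderings and -1.0 for reversed ones.
    """
    if set(rank_a) != set(rank_b) or len(rank_a) != len(rank_b):
        raise ValueError("both rankings must contain the same items")
    if len(rank_a) < 2:
        raise ValueError("need at least two items to compare")

    pos_a = {item: i for i, item in enumerate(rank_a)}
    pos_b = {item: i for i, item in enumerate(rank_b)}

    concordant = discordant = 0
    for x, y in combinations(rank_a, 2):
        # A pair is concordant when both rankings order x and y the same way.
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)
```

A value near 1.0 would suggest the two services largely agree, while a low or negative value would flag the kind of divergence described above.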
A company deploying a search service may desire to make the quality and accuracy of its search results as high as possible, to attract more users, to deliver greater satisfaction to those users, and to make the search experience as efficient as possible. Service providers in that position may therefore wish to consult the relevance measures generated by diverse sources such as query logs, human-annotated ratings, other search service ratings and other sources, in order to assess and improve the accuracy of their own engines. For example, providers may wish to assimilate those diverse relevance ratings to train the self-learning or other heuristics or algorithms employed in their search infrastructure, and to adjust weights and other functions to generate more accurate and satisfactory results.
However, no mechanism exists to access and assimilate the relevance ratings of disparate sources to generate a higher-level, composite rating or “ideal set” of relevance rating data. This is in part because diverse sources of relevance ratings may each generate or encode a ranking of relevance on a different scale than the others, making direct comparisons, averaging or other aggregate processing impossible. Other problems in search technology exist.