Information retrieval (IR) systems are traditionally evaluated in terms of the relevance of webpages to individual queries. Conventional techniques for evaluating IR systems commonly use test collections and standard evaluation measures, with judges asked to assign an absolute relevance assessment to each search result.
More recently, pairwise preference judgments for IR evaluation have gained popularity. With such approaches, judges provide preference judgments over two search result lists returned in response to a common query. In preference-judgment-based IR evaluation, judges are asked to indicate which of two paired systems produced the search result list they prefer, instead of providing an absolute evaluation of a system in isolation.
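As a minimal illustration of how such pairwise judgments might be summarized, the following Python sketch computes the share of decided judgments that favor one system. The function name `preference_rate`, the labels "A", "B", and "tie", and the sample data are assumptions made for illustration only; the approaches described above do not prescribe a particular aggregation.

```python
from collections import Counter


def preference_rate(judgments):
    """Return the fraction of non-tied judgments that prefer system A.

    `judgments` is an iterable of labels "A", "B", or "tie" (hypothetical
    labels), one per judged pair of result lists. Returns None when no
    judgment expresses a preference.
    """
    counts = Counter(judgments)
    decided = counts["A"] + counts["B"]  # ties carry no preference
    if decided == 0:
        return None
    return counts["A"] / decided


# Illustrative data only: five judgments over result lists for common queries.
sample = ["A", "A", "B", "tie", "A"]
rate = preference_rate(sample)
print(rate)
```

Ties are excluded here so that the result reads as the share of expressed preferences going to system A; counting a tie as half a vote for each system would be an equally plausible choice.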
Preference-based evaluation can be employed to directly answer the question “will users prefer A over B?” In contrast, standard measurements on test collections can be used only indirectly to predict which system users will prefer. Preference judgments may also be easier for assessors to make than absolute judgments, which can enhance the reliability of such evaluation.
Unlike traditional query-document evaluation, collecting preference judgments over two search result lists takes the context of documents, and hence the interaction between search results, into consideration. Moreover, preference judgments may provide more accurate results than absolute judgments. However, result list preference judgments typically have high annotation costs and are commonly time-intensive.