The Internet, which allows access to billions of content items stored on host computers around the world, represents a particularly diverse and large collection of content items. Development of a search engine that can index such a large and diverse collection of content items, yet provide the user a short, relevant result set of content items in response to a query, has long been recognized as a problem in information retrieval. For example, a user of a search engine typically supplies a query that contains only a few terms and expects the search engine to return a result set comprising relevant content items. Although a search engine may return a result set comprising tens, hundreds, or more content items, most users are likely to view only the top several content items in the result set. Thus, to be useful to a user, a search engine should determine those content items in a given result set that are most relevant to the user, or that the user would be most interested in, on the basis of the query that the user submits.
A user's perception of the relevance of a content item to a query is influenced by a number of factors, many of which are highly subjective. These factors are generally difficult to capture in an algorithmic set of rules that define a relevance function. Furthermore, these subjective factors may change over time, for example when current events become associated with a particular query term. As another example, changes over time in the aggregate content of the content items available through the Internet may also alter a user's perception of the relative relevance of a given content item to a given query. Users who receive search result sets that contain results not perceived to be highly relevant become frustrated and may abandon the search engine. Designing effective and efficient retrieval functions is therefore of high importance to information retrieval.
In the past, search engine designers have attempted to construct relevance functions that take a query and a content item as inputs and return a relevance value, which indicates the relevance of the content item to the query. The relevance value may be used, for example, to order by relevance a set of content items that are responsive to a given query. For the ordering to be useful, however, the underlying relevance function should accurately and quickly determine the relevance of a given content item to a given query. Many retrieval models and methods are known to those of skill in the art, including vector space models, probabilistic models and language modeling methods, and these have been applied with varying degrees of success.
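By way of illustration only, the notion of a relevance function used to order a result set may be sketched as follows. The sketch assumes a toy vector space model in which the relevance value is the cosine similarity between raw term-count vectors; the names relevance and rank, and the whitespace tokenization, are hypothetical and are not drawn from any particular search engine.

```python
import math
from collections import Counter


def relevance(query: str, item: str) -> float:
    """Return a relevance value for (query, item).

    Assumption: a toy vector space model, i.e. cosine similarity
    between lower-cased, whitespace-tokenized term-count vectors.
    """
    q = Counter(query.lower().split())
    d = Counter(item.lower().split())
    dot = sum(q[term] * d[term] for term in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(
        sum(v * v for v in d.values())
    )
    return dot / norm if norm else 0.0


def rank(query: str, items: list[str]) -> list[str]:
    """Order content items by descending relevance value for the query."""
    return sorted(items, key=lambda item: relevance(query, item), reverse=True)
```

Calling rank with a query and a list of content item texts returns the same items ordered so that the items sharing the most weight with the query terms appear first. A practical relevance function would rely on far richer signals than term overlap.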
According to another technique, referred to as a “feature oriented” method for probabilistic indexing and retrieval, features of query-content item pairs are extracted, and regression methods and decision trees are used to learn relevance functions. The sample space for feature-oriented methods is the collection of the feature vectors in the corpus of content items, which is adequate in traditional information retrieval systems where retrieval functions are usually designed to work for small or homogeneous text corpora. For more diverse corpora such as the World Wide Web (which introduces a substantial number of non-textual features), however, similar feature vectors may be labeled differently in a relevance assessment, and very different feature vectors may be labeled similarly.
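The labeling ambiguity described above may be illustrated with a minimal sketch. The training samples, feature values and labels below are hypothetical. The sketch does not implement the regression or decision tree methods themselves; it computes the best squared-error prediction available to any relevance function that sees only the feature vector, namely the mean label for each distinct vector.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical training samples: (query, feature vector, relevance label).
# The first two samples share a feature vector but carry opposite labels.
samples = [
    ("query A", (0.5, 0.25), 1.0),
    ("query B", (0.5, 0.25), 0.0),
    ("query A", (0.75, 0.5), 1.0),
]


def fit_by_feature_vector(samples):
    """Map each distinct feature vector to the mean of its labels.

    For a function of the feature vector alone, the mean label is the
    prediction that minimizes squared error; the query is ignored.
    """
    labels = defaultdict(list)
    for _query, features, label in samples:
        labels[features].append(label)
    return {features: mean(values) for features, values in labels.items()}


model = fit_by_feature_vector(samples)
```

Here the vector (0.5, 0.25) receives a prediction midway between the two conflicting labels, so the learned function is wrong for both queries by the same margin. No amount of additional fitting on feature vectors alone can separate the two samples.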
Thus, systems and methods are needed that incorporate query differentiation to disambiguate feature vectors within the framework of feature-oriented methods.