1. Technical Field
A “General Click Model” (GCM) provides various techniques for modeling or predicting user click behaviors towards URLs displayed on a search engine results page, and in particular, various techniques for using a nested Bayesian network that is inherently capable of modeling “tail queries” by building a click model on multiple attribute values that are shared across queries in combination with query-specific attributes for individual queries.
2. Related Art
Recent advances in “click modeling” have positioned it as an attractive method for representing user preferences in web search and online advertising. However, most conventional click modeling techniques focus on training the click model for individual queries, and, due to a lack of sufficient training data, cannot accurately model the tail queries (i.e., those queries ranked below a set of the top n most frequent queries, such that relatively few instances of each are available). In addition, most conventional techniques consider only the query, URL, and position, neglecting other useful attributes in click log data, such as the local time, geographic region, demographic data, etc.
Utilizing implicit feedback allows a search engine to better respond to its millions or billions of users. Given a query, whether the user clicks a URL is strongly correlated with the user's opinion of that URL. Moreover, implicit feedback is readily available. In fact, terabytes of such data are produced every day, with which a search engine can automatically adapt to the needs of users by putting the most relevant search results and advertisements in the most conspicuous places.
Various conventional techniques use implicit feedback such as click data in various ways, including the optimization of search engine ranking functions, the evaluation of different ranking functions, and even the display of advertisements or news. Most such techniques rely on a core method that involves learning a click model. In general, conventional search engines log a large number of real-time query sessions, along with the user's click-or-not flags (i.e., whether or not the user clicked on a particular URL). This real-world data is then used as the training data for learning the click model, which is then used for predicting click-through rates (CTR) of future query sessions. The CTR can help improve the normalized discounted cumulative gain (NDCG) (or other statistical measures) of the search results, and plays an essential role in search auctions for ad placements.
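As a rough illustration of the logged data described above, the following minimal sketch aggregates click-or-not flags into an empirical CTR per (query, URL) pair. The session layout, the function name `empirical_ctr`, and the sample queries and URLs are assumptions made for this example only and are not taken from any particular search engine's log format.

```python
from collections import defaultdict

def empirical_ctr(sessions):
    """Return the clicks / impressions ratio for every (query, url) pair.

    `sessions` is a hypothetical log layout: a list of
    (query, [(url, was_clicked), ...]) tuples, one per query session.
    """
    shown = defaultdict(int)    # impressions per (query, url)
    clicked = defaultdict(int)  # clicks per (query, url)
    for query, results in sessions:
        for url, was_clicked in results:
            shown[(query, url)] += 1
            clicked[(query, url)] += int(was_clicked)
    return {key: clicked[key] / shown[key] for key in shown}

# Hypothetical sample log: two sessions for one query, one for another.
sessions = [
    ("weather", [("a.example", True), ("b.example", False)]),
    ("weather", [("a.example", False), ("b.example", False)]),
    ("rare query", [("c.example", True)]),
]
ctr = empirical_ctr(sessions)
```

In this sample, the estimate for the query seen only once rests on a single impression, which hints at why per-query statistics are unreliable when few sessions are logged. The raw ratio also makes no correction for where on the page each URL was displayed.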
However, as is well known to those skilled in the art, clicks are biased with respect to URL presentation order (e.g., URLs higher on the page tend to be clicked more often), user-side configuration differences (e.g., display resolution, web browser being used, etc.), reputation of sites, etc. In fact, one eye-tracking experiment observed that users tend to click web documents at or near the top of a page even if the search results are shown in reverse order.
Recently, a number of studies have tried to explain position-biased click data. For example, one study suggested that the relevance of a document at position i should be further multiplied by a position-dependent term x_i, and this idea was later formalized as the conventional “examination hypothesis” or the “position model”. Another later study compared the examination hypothesis to the conventional “cascade model”, which describes a user's behavior by assuming she scans from top to bottom. Because the cascade model takes into account the relevance of URLs above the clicked URL, it has been observed to outperform the examination hypothesis.
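The difference between the two conventional models can be made concrete with a small sketch. It assumes the commonly cited forms of each model: under the examination hypothesis the click probability at position i is the document relevance times the position term x_i, and under the cascade model the user scans from the top, clicks a document with probability equal to its relevance, and stops at the first click. The function names and numeric values below are illustrative assumptions, not part of the original description.

```python
def examination_click_probs(relevance, examine):
    """Examination hypothesis: P(click at i) = relevance[i] * examine[i].

    `examine[i]` plays the role of the position-dependent term x_i.
    """
    return [r * x for r, x in zip(relevance, examine)]

def cascade_click_probs(relevance):
    """Cascade model: P(click at i) = relevance[i] * prod_{j<i}(1 - relevance[j]).

    The user is assumed to scan top to bottom and stop at the first click,
    so position i is reached only if every document above it was skipped.
    """
    probs = []
    reach = 1.0  # probability that the user gets as far as this position
    for r in relevance:
        probs.append(reach * r)
        reach *= 1.0 - r
    return probs

# Illustrative values only.
relevance = [0.5, 0.5, 0.5]
position_terms = [1.0, 0.5, 0.25]

exam_probs = examination_click_probs(relevance, position_terms)
casc_probs = cascade_click_probs(relevance)
```

Note that in the cascade sketch the chance of reaching a position depends on the relevance of the documents above it, whereas in the examination sketch the position terms are fixed regardless of what is displayed higher on the page.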
Yet another conventional technique was used to extend the examination hypothesis by considering the dependency on the positional distance to the previous click in the same query session. Related techniques have used Bayesian network click models that generalized the cascade model by analyzing user behavior in a chain-style network, within which the probability of examining the next result depends on the position and the identity of the current document.
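The first extension mentioned above, in which examination depends on the positional distance to the previous click, might be sketched as follows. This is only an illustration under stated assumptions: the examination term is looked up by (position, distance to the previous click), a session with no earlier click is treated as if a click had occurred at a virtual position 0, and the table `gamma`, the function name, and all numbers are hypothetical.

```python
def distance_aware_click_probs(relevance, gamma, clicks):
    """Click probability per position, given the observed clicks above it.

    relevance: per-position document relevance (top to bottom).
    gamma:     dict mapping (position, distance_to_previous_click) to an
               examination probability; positions are 1-based.
    clicks:    observed click flags for the same session.
    """
    probs = []
    last_click = 0  # virtual position 0 stands for "no earlier click"
    for pos, (rel, was_clicked) in enumerate(zip(relevance, clicks), start=1):
        distance = pos - last_click
        probs.append(rel * gamma[(pos, distance)])
        if was_clicked:
            last_click = pos
    return probs

# Hypothetical examination table for a three-result page.
gamma = {
    (1, 1): 1.0,
    (2, 1): 0.8, (2, 2): 0.5,
    (3, 1): 0.7, (3, 2): 0.6, (3, 3): 0.3,
}
session_probs = distance_aware_click_probs(
    relevance=[0.5, 0.4, 0.2], gamma=gamma, clicks=[True, False, False]
)
```

Here the click at the first position shortens the distance used for the results below it, so their examination terms differ from those of a session with no click at all.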
Nevertheless, despite their successes, the conventional techniques mentioned above suffer from several limitations. First, they focus on training the click model for each individual query, and therefore cannot accurately predict clicks for tail queries (i.e., low-frequency queries that fall outside some number n of the top queries, which are generally referred to as “head queries”) because too little training data is available for them. Second, the aforementioned models consider only position bias, neglecting other session-specific factors that could bias user clicks.