Web advertising provides financial support for a large portion of today's Internet ecosystem, catering to a diverse set of websites such as blogs, news sites, and review sites. Spurred by the growth in web traffic in terms of volume, number of consumers, consumer engagement, and content diversity, the years since 2008 have seen tremendous growth in spending on web advertising.
A major part of the advertising on the web falls into the category of textual ads, which are short textual messages usually marked as “sponsored links” or similar. There are two main types of textual ads on the web today:
1. Sponsored search (i.e., paid search) advertising places ads on the result pages of a web search engine based on the search query. All major current web search engines support such ads and act simultaneously as a search engine and an ad agency.
2. Contextual advertising (i.e., Context Match) places ads within the content of a generic, third-party web page. There is usually a commercial intermediary, called an ad-network, in charge of optimizing the ad selection with the twin goals of increasing revenue (shared between publisher and ad-network) and improving consumer experience. Here, the main players are the major search engines; however, there are also many smaller players.
While the methods proposed in this paper could be adapted for both sponsored search and contextual advertising, the relevant background is primarily contextual advertising.
Studies have shown that displaying ads that are closely related to the content of the page provides a better consumer experience and increases the probability of clicks. This intuition is analogous to that in conventional publishing, where there are very successful magazines (e.g., Vogue) in which a majority of the content is topical advertising (e.g., fashion, in the case of Vogue). Thus, estimating the relevance of an ad to a page is critical in serving ads at run-time.
Previously published approaches estimated relevance based on the co-occurrence of the same words or phrases within the ad and within the page. The model used in this body of work translates the ad search into a similarity search in a vector space. Each ad is represented as a vector of features such as unigrams, phrases, and classes. The page is translated into a vector in the same space as the ads. The search for the best ads then becomes a search for the ad vectors that are closest to the page vector. To make the search efficient and scalable to hundreds of millions of ads and billions of requests per day, an ad system can use an inverted index and an efficient similarity search algorithm. A drawback of this method is that it relies on a priori information and does not use the feedback (a posteriori) information that is collected in the form of ad impressions (displays) and clicks.
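The vector-space matching described above can be illustrated with a minimal sketch. It assumes cosine similarity as the closeness measure and a plain in-memory inverted index; the function names, feature weights, and toy ads are hypothetical and chosen only for illustration, not taken from any particular system.

```python
import math
from collections import defaultdict


def normalize(vec):
    """Scale a sparse feature vector to unit length (empty if all-zero)."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {f: w / norm for f, w in vec.items()} if norm else {}


def build_inverted_index(ads):
    """Map each feature to the (ad_id, normalized weight) pairs containing it."""
    index = defaultdict(list)
    for ad_id, vec in ads.items():
        for feature, weight in normalize(vec).items():
            index[feature].append((ad_id, weight))
    return index


def top_ads(index, page_vec, k=2):
    """Return the k ads with the highest cosine similarity to the page.

    Only ads sharing at least one feature with the page are scored,
    which is what makes the inverted index cheaper than a full scan.
    """
    scores = defaultdict(float)
    for feature, page_weight in normalize(page_vec).items():
        for ad_id, ad_weight in index.get(feature, ()):
            scores[ad_id] += page_weight * ad_weight
    # Sort by descending score; break ties by ad id for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:k]


# Hypothetical ads and page, each a sparse vector of unigram weights.
ads = {
    "ad_shoes": {"running": 1.0, "shoes": 2.0},
    "ad_camera": {"camera": 2.0, "lens": 1.0},
    "ad_marathon": {"running": 2.0, "marathon": 1.0},
}
index = build_inverted_index(ads)
page = {"running": 3.0, "marathon": 1.0, "training": 1.0}
ranked = top_ads(index, page)
print(ranked)
```

In this toy example the camera ad shares no feature with the page and is never scored. Note that the ranking depends only on the content of the ads and the page; no impression or click data enters the computation.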
Another line of work uses click data to produce a click-through rate (CTR) estimate for an ad, independent of the page (or query page, in the sponsored search scenario). The CTR is estimated based on features extracted from the ads, which are then used in a learning framework to build models that estimate the CTR of unseen ads. In this approach, the assumption is that the ad system selects the ads by a deterministic method, namely by matching the bid phrase to a phrase from the page (or the query page in sponsored search). Accordingly, to select the most clickable ads, the ad system only needs to estimate the CTR of the ads with the matching bid phrase. This simplifying assumption about the matching process is an obvious drawback of these approaches. Another drawback is that these methods do not account for differential click probabilities on different pages: if some pages in the corpus attract an audience that clicks on ads significantly more than average, then the learned feature weights will be biased towards ads that were (only by circumstance) shown on such pages.
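The page-related bias can be made concrete with a small numerical sketch. The impression log, ad and page names, and the "lift" normalization (observed clicks divided by the clicks expected from each page's average CTR) are all assumptions introduced here for illustration; they are not the estimation method of the approaches discussed above.

```python
from collections import defaultdict


def naive_ctr(log):
    """Page-independent CTR per ad: total clicks / total impressions."""
    imps, clicks = defaultdict(int), defaultdict(int)
    for _page, ad, n_imps, n_clicks in log:
        imps[ad] += n_imps
        clicks[ad] += n_clicks
    return {ad: clicks[ad] / imps[ad] for ad in imps}


def page_adjusted_lift(log):
    """Observed clicks per ad relative to clicks expected from page averages."""
    page_imps, page_clicks = defaultdict(int), defaultdict(int)
    for page, _ad, n_imps, n_clicks in log:
        page_imps[page] += n_imps
        page_clicks[page] += n_clicks
    page_ctr = {p: page_clicks[p] / page_imps[p] for p in page_imps}

    observed, expected = defaultdict(float), defaultdict(float)
    for page, ad, n_imps, n_clicks in log:
        observed[ad] += n_clicks
        expected[ad] += n_imps * page_ctr[page]
    return {ad: observed[ad] / expected[ad] for ad in observed if expected[ad] > 0}


# Hypothetical log rows: (page, ad, impressions, clicks).
# "page_hot" attracts an audience that clicks far more than "page_cold".
log = [
    ("page_hot", "ad_a", 1000, 50),
    ("page_hot", "ad_c", 1000, 150),
    ("page_cold", "ad_b", 1000, 10),
    ("page_cold", "ad_d", 1000, 0),
]
naive = naive_ctr(log)
lift = page_adjusted_lift(log)
print(naive["ad_a"], naive["ad_b"])
print(lift["ad_a"], lift["ad_b"])
```

Here ad_a has the higher page-independent CTR only because it was shown on the high-click page. Relative to the page averages it underperforms, while ad_b outperforms, so a model trained on the page-independent figures would favor the wrong ad.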