Web advertising is typically implemented according to two general schemes: content match and sponsored search. Content match refers to placement of advertisements (“ads”) within a webpage on the basis of the content of the webpage. Sponsored search refers to placing ads on a search results page generated by a web search engine, the ads being responsive to a query that a given user submits to the web search engine. The ads placed on the search results page are selected via analysis of a query string entered into the web search engine. Those of skill in the art recognize that other factors or parameters beyond the query string may influence the selection of ads for placement on a search results page that the web search engine generates, including a score that indicates the quality of the ad, a time zone of the user, user browsing history, demographic information, etc. A content match system can generate data indicating each instance in which an ad is displayed on a webpage (an “impression”).
An ad network, an intermediary entity that selects the ad in the content match system, determines a most relevant ad to place on the webpage to entice a user to click on that ad. For example, on a webpage related to sports, the ad network may select ads for soft drinks, because a demographic of visitors interested in sports may be substantially similar to a demographic likely to buy soft drinks. By computing a ratio of a number of clicks on the ads to a number of impressions, the ad network can determine a click-through rate (CTR) indicative of, inter alia, the relevancy of the ads that are selected. Thus, the CTR becomes a valuable indicator for ad networks seeking to attract business from advertisers. However, the number of clicks is typically very low compared to the number of impressions. Conventional estimation algorithms based on frequencies of event occurrences incur high statistical variance and fail to provide satisfactory predictions of the CTR because the number of clicks is negligible in view of the large number of impressions. Furthermore, estimating the CTR from the entire corpus of data might involve storing information for each impression. In a content match system, this might involve crawling pages and storing the entire content of each page, which is expensive in terms of both storage and bandwidth requirements.
Therefore, there exists a need for a reliable sampling model for determining an occurrence of a rare event within large volumes of data.
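By way of illustration only, the CTR ratio and the variance problem described above can be sketched in a few lines of Python. The function names `ctr` and `relative_std_error` are hypothetical, and the sketch assumes that each impression is an independent trial with the same click probability (a binomial model), an assumption that the foregoing description does not itself state.

```python
import math


def ctr(clicks: int, impressions: int) -> float:
    """Frequency-based CTR estimate: number of clicks divided by number of impressions."""
    if impressions <= 0:
        raise ValueError("impressions must be positive")
    return clicks / impressions


def relative_std_error(true_ctr: float, impressions: int) -> float:
    """Standard error of the frequency estimate, relative to the true CTR.

    Assumption (not from the source text): clicks follow a binomial
    distribution, so the standard error is sqrt(p * (1 - p) / n) and the
    relative error is sqrt((1 - p) / (n * p)).
    """
    if not 0.0 < true_ctr < 1.0:
        raise ValueError("true_ctr must lie strictly between 0 and 1")
    if impressions <= 0:
        raise ValueError("impressions must be positive")
    return math.sqrt((1.0 - true_ctr) / (impressions * true_ctr))


if __name__ == "__main__":
    # Illustrative figures only: 10 clicks observed over 10,000 impressions.
    print(ctr(10, 10_000))
    # With a click probability of 0.1%, the relative error stays large
    # until the number of impressions grows by orders of magnitude.
    for n in (10_000, 1_000_000, 100_000_000):
        print(n, relative_std_error(0.001, n))
```

Under these assumptions, a click probability of 0.1% observed over 10,000 impressions yields a relative standard error of roughly 32%, which is consistent with the high variance attributed above to frequency-based estimation of rare events.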