1. Technical Field
A “Real-Time-Ready Analyzer,” as described herein, combines a data stream management system (DSMS) with a map-reduce (M-R) framework to construct a streaming map-reduce framework that is suitable for use in performing temporal queries, such as real-time Behavioral Targeting (BT), on very large data sets.
2. Background
As the Web becomes increasingly ubiquitous, online advertisement delivery platforms are seeing a growing volume of users performing activities such as searches and webpage visits. For example, consider the problem of display advertising, in which ads must be selected and shown to users as they browse the Web. Behavioral Targeting (BT) is a relatively new technology in which the system selects the most relevant ads to display to users based on their observed prior behavior, such as searches and webpages visited. Briefly, BT builds a behavior profile for each user (also referred to as a UBP or “user behavior profile”), and uses these profiles, together with the ad click behavior of previous users, to predict the relevance of each ad for a current user to whom an ad is to be delivered. A common measure of relevance for BT is click-through-rate (CTR), which represents the fraction of ad impressions that result in a click. Note that BT differs from both content matching, in which ads are chosen based on the webpage content, and sponsored search, which relies only on the session information (i.e., the search) to choose ads on search result pages. Many well-known advertising companies use BT as a part of their advertising platform.
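As a minimal illustration of the CTR measure defined above, the following sketch computes the fraction of impressions that resulted in a click. The function name and the choice to return 0.0 when there are no impressions are assumptions made for this example only.

```python
def click_through_rate(clicks: int, impressions: int) -> float:
    """Return the fraction of ad impressions that resulted in a click.

    Illustrative helper only; returning 0.0 for zero impressions is an
    assumption of this sketch, not something the text specifies.
    """
    if clicks < 0 or impressions < 0:
        raise ValueError("counts must be non-negative")
    if clicks > impressions:
        raise ValueError("clicks cannot exceed impressions")
    if impressions == 0:
        return 0.0
    return clicks / impressions
```

For instance, an ad shown 1,000 times and clicked 5 times has a CTR of 5/1000 under this definition.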
In general, conventional advertisement systems collect and store data related to billions of users and hundreds of thousands of ads. For effective BT, multiple steps need to be performed on the data in a scalable manner. These steps include:
Bot Elimination: Advertisement systems generally detect and eliminate bots, which are automated surfers and ad clickers, to eliminate spurious data before further analysis, for more accurate BT.
Data Reduction: The UBPs are sparse and of extremely high dimensionality, with millions of possible keywords and URLs. Thus, it is beneficial to eliminate useless information in a manner that retains and amplifies the most important signals for subsequent operations. Some common data reduction schemes used for BT include: (1) mapping keywords to a smaller set of concepts by feature extraction; and (2) retaining only the most popular attributes by feature selection.
Model Building and Scoring: Advertisement systems work best with accurate models built from the behavior profiles, based on historical information about ad effectiveness. For example, one conventional technique groups similar users using clustering algorithms, while another conventional technique fits a Poisson distribution as a model for the number of clicks and impressions. The models are then used to score active users in real-time, i.e., predict ad relevance in order to choose suitable ads for delivery.
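The first two steps above can be sketched on a toy event log. This is only an illustration under stated assumptions: the (user, keyword) event format, the volume-threshold bot heuristic, and all function names are hypothetical and are not taken from any particular advertisement system. Data reduction is shown as popularity-based feature selection.

```python
from collections import Counter

def eliminate_bots(events, max_events_per_user):
    """Drop every event from users whose activity exceeds a volume threshold.

    The threshold rule is a hypothetical stand-in for real bot detection.
    """
    counts = Counter(user for user, _ in events)
    return [(u, k) for u, k in events if counts[u] <= max_events_per_user]

def select_popular_keywords(events, top_k):
    """Feature selection: keep only events whose keyword is among the
    top_k most frequent keywords across all users."""
    popularity = Counter(keyword for _, keyword in events)
    keep = {keyword for keyword, _ in popularity.most_common(top_k)}
    return [(u, k) for u, k in events if k in keep]

def build_profiles(events):
    """Build a sparse per-user behavior profile: keyword -> count."""
    profiles = {}
    for user, keyword in events:
        profiles.setdefault(user, Counter())[keyword] += 1
    return profiles

# Toy input: one very active "bot" and two ordinary users.
events = [("bot", "spam")] * 5 + [
    ("u1", "cars"), ("u2", "cars"), ("u1", "loans"),
]
clean = eliminate_bots(events, max_events_per_user=3)
reduced = select_popular_keywords(clean, top_k=1)
profiles = build_profiles(reduced)
```

Here the reduced profiles retain only the most popular keyword, which is the kind of input a later model building and scoring step would consume.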
Prior work on BT has focused on algorithms and techniques that scale well for large-scale historical offline data using the well-known “map-reduce” (M-R) framework. However, many BT queries are fundamentally temporal and not easily expressible in the M-R framework. Consequently, the generally high turnaround time for BT can result in missed ad presentation opportunities since such systems are not typically capable of operating and analyzing real-time data feeds directly.
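To make the notion of a temporal query concrete, the following sketch counts each user's events within a trailing time window as events arrive. It is an illustration only, not the method of any system discussed here: the class name, the per-user deque state, and the assumption that timestamps arrive in non-decreasing order for each user are all choices made for this example.

```python
from collections import defaultdict, deque

class SlidingWindowCounter:
    """Count each user's events within a trailing time window.

    Assumes timestamps (in seconds) arrive in non-decreasing order per user.
    """

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.events = defaultdict(deque)  # user -> timestamps inside window

    def observe(self, user, timestamp):
        """Record one event and return the user's count in the window
        (timestamp - window, timestamp]."""
        q = self.events[user]
        q.append(timestamp)
        # Evict timestamps that have fallen out of the trailing window.
        while q and q[0] <= timestamp - self.window:
            q.popleft()
        return len(q)
```

The result for each event depends on the order and timing of earlier events for the same user, which is state that a single batch pass over unordered records does not naturally carry.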
More specifically, existing BT techniques are geared towards offline processing over a map-reduce cluster. For example, current data reduction proposals for UBPs (i.e., user behavior profiles) include: (a) reducing data using popularity-based feature selection, i.e., retaining the most popular keywords; and (b) mapping keywords to a smaller set of categories. Unfortunately, neither of these techniques performs well for detecting important signals in the massive volume of data, or for responding quickly to rapidly changing trends and interests.
Current temporal analysis methodologies for BT and other applications work only on offline data, using custom SCOPE/map-reduce scripts that process the data in a scalable manner on a cluster. These solutions are generally difficult to specify, implement, test, debug, and maintain, due to the fundamentally temporal nature of the data. Further, these solutions do not work directly on real-time data streams.