This specification relates to data processing such as data mining and outcome estimation.
The performance of content items is often tracked by a system that manages the content items. For example, an advertisement server may track the performance of the advertisements it serves by recording the number of impressions each advertisement receives and the number of clicks on each advertisement. Such performance data can be processed to generate predictive models that predict the performance of the same or similar content items in future situations.
Data mining is one such example process. Data mining is used, for example, to identify feature values that are associated with a data set of content items and that are indicative of a particular result. A feature value is a value that represents a state or measurement of a feature. Feature values are often used to represent characteristics of content items (e.g., advertisements, audio, video, or text). For example, feature values can be values that represent specific colors, animation characteristics, size characteristics, similarity measures, and other features of content items. Feature values can be selected from a specified set of discrete values (e.g., 0 or 1) or feature values can be selected from a continuous range of values (e.g., 0-10). For example, a feature value of 0 (representing “no”) or 1 (representing “yes”) can be used to specify whether an advertisement is a static advertisement (i.e., is not animated). Similarly, a set of feature values can be used to specify one or more colors (e.g., 00 representing black and 01 representing yellow) that are included in an advertisement.
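As an illustration of the feature values described above, the following minimal sketch shows one way a data record for an advertisement might store discrete and continuous feature values. The names `AdRecord`, `encode_ad`, `COLOR_CODES`, and `size_score` are hypothetical and chosen only for this example; the color codes follow the "00 representing black and 01 representing yellow" example, and the 0-10 range is assumed for the continuous feature.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical color codes, following the "00 = black, 01 = yellow" example.
COLOR_CODES = {"black": "00", "yellow": "01"}


@dataclass
class AdRecord:
    """Hypothetical data record holding feature values for one advertisement."""
    is_static: int            # discrete: 1 = static (not animated), 0 = animated
    colors: Tuple[str, ...]   # discrete: one code per color in the advertisement
    size_score: float         # continuous: assumed to lie in the range 0-10


def encode_ad(animated: bool, colors: List[str], size_score: float) -> AdRecord:
    """Map raw characteristics of an advertisement to feature values."""
    if not 0.0 <= size_score <= 10.0:
        raise ValueError("size_score must lie in the range 0-10")
    return AdRecord(
        is_static=0 if animated else 1,
        colors=tuple(COLOR_CODES[c] for c in colors),
        size_score=size_score,
    )


# Example: a static black-and-yellow advertisement.
record = encode_ad(animated=False, colors=["black", "yellow"], size_score=7.5)
```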
The identified feature values for the data set and results (e.g., performance data) associated with the data set can be used to create and train a model that predicts future outcomes or results for a content item represented by a data record storing feature values that describe the content item. For example, curve fitting techniques (e.g., regression analysis or logistic regression) can be used to generate a model that specifies relationships between feature values and outcomes. In turn, the model can be applied to the feature values of a data record to obtain an outcome or result based on those feature values. Data classifiers (e.g., support vector machines) can also be used to classify data into one or more specified data classifications.
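The train-then-apply flow described above can be sketched as follows. This is a minimal illustration that assumes logistic regression fitted by batch gradient descent; the function names, the learning rate, the epoch count, and the toy records (two feature values per record, with an outcome of 1 for a click and 0 for no click) are all invented for this example and do not come from any real data set.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(records, outcomes, lr=0.5, epochs=2000):
    """Fit weights and a bias by batch gradient descent on the log loss."""
    n_features = len(records[0])
    n = len(records)
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(records, outcomes):
            p = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            err = p - y  # gradient of the log loss with respect to the logit
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def predict(weights, bias, record):
    """Apply the trained model to the feature values of one data record."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, record)))


# Toy training data (illustrative only): [is_static, normalized size].
records = [[1, 0.2], [1, 0.4], [0, 0.6], [0, 0.9], [1, 0.8], [0, 0.1]]
outcomes = [0, 0, 1, 1, 1, 0]

weights, bias = train_logistic(records, outcomes)
p_large = predict(weights, bias, [0, 0.9])
p_small = predict(weights, bias, [0, 0.1])
```

In this toy data the larger advertisements are the ones that were clicked, so the fitted model assigns a higher estimated outcome to `[0, 0.9]` than to `[0, 0.1]`.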
The quality of models generated using different modeling techniques is generally judged using different measures of prediction quality (e.g., accuracy measures and/or error measures). For example, regression techniques may use a measure such as Mean Square Error to measure how accurately a regression model estimates outcome values, while a ranking model (e.g., a support vector machine) that is generated to estimate relative rankings of data records may use a measure such as the area under a receiver operating characteristic (ROC) curve to determine how well the ranking model estimates relative ranks for data records. These prediction quality measures, however, do not necessarily correlate positively. Thus, a regression model may be judged as having very good prediction quality using the Mean Square Error measure, but may be judged as having lower prediction quality using the area under the ROC curve measure.
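The divergence between these two measures can be illustrated with a small sketch. The two sets of predictions below are invented solely for this example: hypothetical model A has the lower Mean Square Error, while hypothetical model B has the higher area under the ROC curve. The area is computed here with the pairwise formulation (the fraction of positive/negative pairs that are ranked correctly, with ties counted as one half), which is one standard way to compute it.

```python
def mean_squared_error(y_true, y_pred):
    """Average squared difference between actual and predicted outcomes."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(y_true, scores):
    """Area under the ROC curve, computed as the fraction of
    (positive, negative) pairs in which the positive record scores higher;
    ties count as one half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Invented outcomes and predictions, for illustration only.
y_true = [0, 0, 1, 1]
model_a = [0.10, 0.60, 0.50, 0.90]  # close to the outcomes, one pair mis-ranked
model_b = [0.70, 0.72, 0.74, 0.76]  # perfectly ranked, but far from the outcomes

mse_a, mse_b = mean_squared_error(y_true, model_a), mean_squared_error(y_true, model_b)
auc_a, auc_b = roc_auc(y_true, model_a), roc_auc(y_true, model_b)

# Model A is better by Mean Square Error; model B is better by ROC area.
print(f"A: MSE={mse_a:.4f}, AUC={auc_a:.2f}")
print(f"B: MSE={mse_b:.4f}, AUC={auc_b:.2f}")
```

Model B ranks every positive record above every negative record, yet its predictions cluster far from the actual outcomes, so it is penalized under Mean Square Error even though its ranking is perfect.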