Website traffic data often comprises many millions of visitors, of whom only a small portion actually generate revenue. Imbalanced data refers to data that exhibits between-class imbalance, for example, a few objects or events of interest (referred to herein as positives) versus a large number of irrelevant cases (referred to herein as negatives). Data imbalance may be categorized into two types: intrinsic and extrinsic. Intrinsic imbalance results from the nature of the data space itself. Extrinsic imbalance covers other cases, for instance, a stream of data that is balanced overall but not uniformly distributed, so that a data sample taken over some interval might be imbalanced.
Because businesses deal with very large volumes of data (hundreds of millions of data points), it is impractical to feed all the data points to a prediction module that may be utilized to identify potentially valuable customers. Moreover, to satisfy customers and avoid unnecessary website delays, prediction models need to work in near real time. Further, because of memory and efficiency constraints, the prediction models need to work with a sample of the data rather than the entire data set. However, taking a sample can make the imbalance problem even worse in at least two ways: 1) the absolute number of positive cases is significantly reduced; and 2) if the data is not uniformly distributed, the percentage of positive cases in the sample could be even lower than in the original data (the extrinsic imbalance).
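The two effects of sampling described above can be illustrated with a small simulation. The following is a minimal sketch, not part of any described system: the stream sizes, the 0.1% positive rate, and the helper names `make_stream` and `positive_rate` are all hypothetical, and the non-uniform case is modeled, as an assumption, by placing every positive at the end of the stream and sampling a contiguous interval from its start.

```python
import random


def positive_rate(labels):
    """Return the fraction of labels equal to 1 (positives)."""
    return sum(labels) / len(labels) if labels else 0.0


def make_stream(n_total, n_positive, clustered, seed=0):
    """Build a hypothetical visitor stream of 0/1 labels.

    If clustered is True, all positives sit at the end of the stream
    (a non-uniform arrangement); otherwise they are shuffled uniformly.
    """
    labels = [0] * (n_total - n_positive) + [1] * n_positive
    if not clustered:
        random.Random(seed).shuffle(labels)
    return labels


# Illustrative numbers only; they are assumptions, not real traffic data.
N_TOTAL = 1_000_000
N_POSITIVE = 1_000       # 0.1% of visitors are positives
SAMPLE_SIZE = 10_000

# Effect 1: a uniform random sample keeps roughly the same positive rate,
# but the absolute number of positives shrinks with the sample size.
uniform_stream = make_stream(N_TOTAL, N_POSITIVE, clustered=False)
random_sample = random.Random(1).sample(uniform_stream, SAMPLE_SIZE)
expected_positives = SAMPLE_SIZE * N_POSITIVE / N_TOTAL

# Effect 2: on a non-uniform stream, a sample drawn from one interval
# can have a lower positive rate than the stream as a whole.
clustered_stream = make_stream(N_TOTAL, N_POSITIVE, clustered=True)
interval_sample = clustered_stream[:SAMPLE_SIZE]

print("positives in full stream:      ", N_POSITIVE)
print("expected positives in sample:  ", expected_positives)
print("observed positives in sample:  ", sum(random_sample))
print("positive rate, full stream:    ", positive_rate(clustered_stream))
print("positive rate, interval sample:", positive_rate(interval_sample))
```

Under these assumed numbers, a 1% sample is expected to retain only about 1% of the positives, and the interval sample from the clustered stream contains none at all, even though the stream as a whole has the same positive rate in both cases.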
In general, standard algorithms for learning and predicting expect balanced class distributions. When dealing with imbalanced data containing only a small number of positives, they tend to overfit the training data and perform poorly on unseen testing data. In addition, the resulting model is unstable and difficult to reproduce. The estimated models are therefore noisy and unlikely to produce reliable predictions.