The performance of traditional classification methods is prone to deterioration when presented with significant class imbalance. Class imbalance occurs when the instances of one class are fewer in number than the instances of another class. More specifically, the term “class imbalance” refers to a relative imbalance between two classes, i.e., a minority class and a majority class, with class instance ratios on the order of 100 to 1, 1000 to 1, or higher.
The class imbalance issue has attracted considerable attention in recent years because class imbalances are inherent in many applications, including, for example, fraud detection, anomaly detection, and medical diagnosis. In addition, class imbalances may arise in any application in which the class distribution is not explicitly controlled during data collection. In many cases, class imbalance makes the minority class difficult to detect. For instance, in a medical test, there are typically significantly more negative instances than positive instances. Because positive instances are scarce, a classifier that favors the negative class will still produce a low overall error rate. However, false negatives are potentially catastrophic, while false positives simply warrant more testing. Thus, providing fair classification with respect to minority classes is important.
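By way of illustration only, the following minimal sketch shows how a classifier that always predicts the negative class can achieve a low overall error rate while missing every positive instance. The instance counts (1000 negative, 10 positive) and all variable names are hypothetical and are chosen only to match the 100 to 1 ratio mentioned above.

```python
# Hypothetical medical-test labels: 1000 negatives (0) and 10 positives (1),
# i.e., a 100 to 1 class imbalance. These counts are assumed for illustration.
labels = [0] * 1000 + [1] * 10

# A degenerate classifier that always predicts the majority (negative) class.
predictions = [0] * len(labels)

# Overall error rate: fraction of instances that are misclassified.
errors = sum(p != y for p, y in zip(predictions, labels))
error_rate = errors / len(labels)

# False negatives: positive instances that were predicted negative.
false_negatives = sum(p == 0 and y == 1 for p, y in zip(predictions, labels))

# Recall on the minority class: fraction of positives correctly detected.
minority_recall = 1 - false_negatives / sum(labels)

print(f"overall error rate: {error_rate:.4f}")
print(f"minority-class recall: {minority_recall:.2f}")
```

In this sketch the overall error rate is below one percent even though no positive instance is detected, which is why the overall error rate alone is a poor measure of performance on imbalanced data.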
Many methods designed to handle class imbalance fall into one of two categories: sampling methods and cost-sensitive methods. Sampling methods operate on the data itself, attempting to reduce the imbalance between classes by oversampling the minority class and/or under-sampling the majority class. Cost-sensitive methods apply more weight to errors made on the minority class, and may be applied to the data or incorporated into the classification algorithms themselves. Both sampling and cost-sensitive methods are tuned, either through the amount of sampling or through the relative costs assigned to each class, to provide the desired balance between classes, and this tuning must be provided by the user. In many applications, particularly those that classify online streaming data, the degree of imbalance changes over time. Because the algorithm is tuned to the degree of imbalance present in the training data set, such changes must be accounted for through user intervention, which may become costly and time-consuming for the user.
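By way of illustration only, the two categories described above may be sketched as follows. The function names `oversample_minority` and `inverse_frequency_weights`, the `target_ratio` parameter, and the instance counts are hypothetical. The inverse-frequency weighting shown is one common heuristic for assigning class costs, not a method prescribed by this document.

```python
import math
import random
from collections import Counter


def oversample_minority(rows, minority_label, target_ratio, seed=0):
    """Sampling method (sketch): randomly duplicate minority rows until the
    majority-to-minority ratio is at most `target_ratio`.

    `rows` is a list of (feature, label) pairs. `target_ratio` is the
    user-supplied tuning parameter. Assumes at least one minority row exists.
    """
    rng = random.Random(seed)
    minority = [r for r in rows if r[1] == minority_label]
    majority = [r for r in rows if r[1] != minority_label]
    wanted = math.ceil(len(majority) / target_ratio)
    extra = [rng.choice(minority) for _ in range(max(0, wanted - len(minority)))]
    return majority + minority + extra


def inverse_frequency_weights(labels):
    """Cost-sensitive method (sketch): weight each class inversely to its
    frequency, so that errors on the rarer class cost more."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * cnt) for cls, cnt in counts.items()}


# Hypothetical training set with a 100 to 1 imbalance.
train = [(i, 0) for i in range(1000)] + [(i, 1) for i in range(1000, 1010)]

balanced = oversample_minority(train, minority_label=1, target_ratio=1.0)
balanced_counts = Counter(label for _, label in balanced)

train_weights = inverse_frequency_weights([label for _, label in train])

# Hypothetical later window of streaming data with a 10 to 1 imbalance.
# Weights tuned on the training set no longer match this distribution.
stream_weights = inverse_frequency_weights([0] * 1000 + [1] * 100)

print(balanced_counts)
print(train_weights[1] / train_weights[0], stream_weights[1] / stream_weights[0])
```

In this sketch, both `target_ratio` and the class weights are derived from the training data. When the degree of imbalance in the stream departs from that of the training data, the appropriate cost ratio changes as well, and the parameters must be re-derived, which corresponds to the user intervention described above.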