1. Field
The disclosure relates generally to supervised machine learning and more specifically to generating artificial data samples for a minority data class from an imbalanced training data set to train a multi-class classifier model of a supervised machine learning program.
2. Description of the Related Art
Supervised machine learning programs require training data that includes different classes of data to train multi-class classifier models. Supervised learning is the machine learning task of inferring a function from labeled training data. A supervised machine learning program analyzes the training data and generates an inferred function, which is used for mapping new examples. In supervised machine learning, multi-class classification is the problem of classifying data into two or more classes. Unfortunately, in many real-world applications, the available training data set is highly imbalanced, that is, one class of data in the available training data set is very sparse or non-existent. In other words, a training data set is imbalanced if the data classes are not equally represented (i.e., one data class (a minority class) includes a smaller number of examples than other data classes in the training data set). For example, in anomaly detection systems or diagnosis systems, anomalous data may be extremely difficult to collect, mainly because such abnormal events occur rarely. Data class imbalance creates difficulties for supervised machine learning programs and decreases classifier performance. Consequently, a multi-class classifier model trained using a highly imbalanced training data set tends to ignore the minority data class.
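As an illustrative, non-limiting sketch of the imbalance problem described above, the following Python example measures the class distribution of a training data set and shows how a classifier that always predicts the majority class can score high overall accuracy while never identifying the minority class. The labels "normal" and "anomaly", the 990-to-10 split, and the helper function names are assumptions made only for this illustration and do not come from any particular system.

```python
from collections import Counter


def class_distribution(labels):
    """Return the fraction of the training data set held by each class."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: n / total for cls, n in counts.items()}


def majority_class(labels):
    """Return the most frequent class label in the training data set."""
    return Counter(labels).most_common(1)[0][0]


def recall(y_true, y_pred, cls):
    """Fraction of true examples of `cls` that were predicted as `cls`."""
    predictions_for_cls = [p for t, p in zip(y_true, y_pred) if t == cls]
    if not predictions_for_cls:
        return 0.0
    hits = sum(p == cls for p in predictions_for_cls)
    return hits / len(predictions_for_cls)


# Hypothetical imbalanced training data set: 990 normal, 10 anomalous examples.
labels = ["normal"] * 990 + ["anomaly"] * 10

distribution = class_distribution(labels)

# A degenerate classifier that always outputs the majority class.
predicted = [majority_class(labels)] * len(labels)

accuracy = sum(t == p for t, p in zip(labels, predicted)) / len(labels)
minority_recall = recall(labels, predicted, "anomaly")

print(distribution)
print(f"accuracy={accuracy:.2f}, minority recall={minority_recall:.2f}")
```

In this sketch, overall accuracy alone hides the failure on the minority class, which is why a per-class measure such as recall is computed alongside it.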