A classifier is an example of an algorithm that may be derived using supervised machine learning. In order to make accurate predictions, supervised machine learning classifiers are derived through training using a set of labeled data examples. In modeling a classification problem in which the classifier must make a categorical prediction, the training data set should contain many labeled examples of each possible category to ensure that the classifier will make accurate predictions for new input examples that might fall into one of the categories.
A common way to improve a classifier's prediction performance is to sample a labeling data set from the general population, obtain true labels for the labeling set, and add these labels to the training set used to derive the classifier. For some classification problems in which instances of one or more of the classification categories are relatively rare (e.g., predicting gene mutations, earthquakes, or ad click throughs), the distribution in the general population of the true labels is skewed in favor of the most commonly occurring instances. If naïve (random) sampling from a general population were used to generate a labeling set, it is likely that the labeling set, after being labeled, also will have a skewed distribution of labels. A skewed distribution of labels may not add much support to the model to make predictions for rare events.
Current methods for dynamic optimization of a data set distribution exhibit a plurality of problems that make current systems insufficient, ineffective and/or the like. Through applied effort, ingenuity, and innovation, solutions to improve such methods have been realized and are described in connection with embodiments of the present invention.