A conventional machine learning algorithm undergoes a training process by inputting known data and comparing actual output to expected output. As this process is iteratively performed, the machine learning algorithm is updated in an attempt to have the actual output match (or be within a predefined error bound of) the expected output. After the actual output matches the expected output, the machine learning algorithm may operate on unknown input data and an operator can be confident that the output generated is correct.
When using the machine learning algorithm on very large data set, the training process can be onerous. That is, an operator typically selects a training set comprising a number of samples from the data set. However, given the volume of the data set, it is entirely unrealistic to assume that every sample in the training set can be manually labeled such that when they are passed to the machine learning algorithm, correct labels are output. Additionally, arbitrary selection of the samples in the training set does not ensure that those samples are the best to train the machine learning algorithm, e.g., there is no indication that the selected samples have an objectively greater impact on the efficiency of training the machine learning algorithm.
Therefore, there exists a need for identifying a training set comprising samples of a data set that may most effectively train a machine learning algorithm.