Active learning is a semi-supervised method for training models when the number of classified examples is orders of magnitude smaller than the number of examples to be classified in the future. The iterative process starts from a small set of classified samples, referred to as seed data. The required size of this initial set depends on the nature of the classification model being developed and on the dimensionality of the data.
For example, in a text classification model the dimensionality of each data point is very high, because it corresponds to the number of unique words in the entire data set. For such a classifier, the seed data should include at least a few hundred classified examples, and the larger the seed data set, the better each round of active learning will be. To begin, the seed data points (text samples) are classified manually. A rough classifier is built from this small number of examples and used to label unlabeled data. The most confident labels are manually verified and then added to the seed data. The enlarged set, containing the original seed data and the newly added labeled data, is used to train a higher-quality classifier, since more data is available for training. This process can be repeated for multiple rounds. It is laborious, however, and the number of samples used for training is limited by the amount of time that can be spent manually classifying samples. Thus, the accuracy of models produced in this way can be limited.
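The loop described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not a reference implementation: the text does not name a classifier, so a small multinomial naive Bayes model over word counts is assumed here, and the names `train`, `predict`, `active_learning`, and `verify` are hypothetical. The `verify` callback stands in for the human reviewer who confirms or corrects each proposed label, and the toy seed set is far smaller than the few hundred examples suggested above.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    return text.lower().split()

def train(labeled):
    """Fit a multinomial naive Bayes model (assumed classifier) on (text, label) pairs."""
    word_counts = defaultdict(Counter)   # label -> word -> count
    class_counts = Counter()             # label -> number of documents
    vocab = set()                        # its size is the dimensionality of the data
    for text, label in labeled:
        class_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocab.add(word)
    return word_counts, class_counts, vocab

def predict(model, text):
    """Return (label, confidence); confidence is the normalized posterior of the top label."""
    word_counts, class_counts, vocab = model
    total_docs = sum(class_counts.values())
    log_scores = {}
    for label, n_docs in class_counts.items():
        denom = sum(word_counts[label].values()) + len(vocab)  # Laplace smoothing
        score = math.log(n_docs / total_docs)
        for word in tokenize(text):
            if word in vocab:            # words never seen in training are ignored
                score += math.log((word_counts[label][word] + 1) / denom)
        log_scores[label] = score
    top = max(log_scores.values())
    weights = {label: math.exp(s - top) for label, s in log_scores.items()}
    best = max(weights, key=weights.get)
    return best, weights[best] / sum(weights.values())

def active_learning(seed, pool, verify, rounds=3, batch=2):
    """Grow the labeled set round by round, as described in the text."""
    labeled, pool = list(seed), list(pool)
    for _ in range(rounds):
        if not pool:
            break
        model = train(labeled)           # rough classifier from the current labeled data
        # Rank unlabeled samples by the classifier's confidence, highest first.
        ranked = sorted(pool, key=lambda t: predict(model, t)[1], reverse=True)
        chosen, pool = ranked[:batch], ranked[batch:]
        for text in chosen:
            proposed = predict(model, text)[0]
            # Manual verification step: the reviewer confirms or corrects the label.
            labeled.append((text, verify(text, proposed)))
    return train(labeled), labeled

# Toy data, invented purely for illustration.
seed = [
    ("the team won the match", "sports"),
    ("the striker scored a goal", "sports"),
    ("the compiler reported a bug", "tech"),
    ("the server runs new software", "tech"),
]
truth = {
    "the goalkeeper saved the match": "sports",
    "software update fixed the bug": "tech",
    "the coach praised the team": "sports",
    "the laptop runs the compiler": "tech",
}
# Simulated reviewer: always returns the true label, whatever was proposed.
model, labeled = active_learning(seed, truth.keys(), lambda text, proposed: truth[text],
                                 rounds=2, batch=2)
print(len(labeled), predict(model, "the team scored a goal")[0])
```

In this sketch the `batch` and `rounds` parameters make the cost visible: every sample added to the training set passes through `verify`, so the final size of the labeled set is bounded by how much reviewer time is available, which is the limitation the paragraph above points out.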