Machine learning evolved from the study of pattern recognition and computational learning theory in artificial intelligence. Machine learning explores the study and construction of algorithms that can learn from and make predictions on data. Such algorithms operate by building a machine-implemented model from example inputs in order to make data-driven predictions or decisions rather than following strictly static program instructions.
One type of machine learning involves supervised learning based on a training set as part of a classification process. Examples of machine learning algorithms used for classification include the well-known Naïve Bayes and C4.5 algorithms, or a so-called “stacked” combination of two or more such algorithms. The machine learning algorithm examines the input training set, and the computer ‘learns’ or generates a classifier, which is able to classify a new document or another data object under one or more categories. In other words, the machine learns to predict whether a document or another type of data object, usually provided in the form of a vector of predetermined attributes describing the document or data object, belongs to a category. When a classifier is being trained, classifier parameters for classifying objects are determined by examining data objects in the training set that have been assigned labels indicating to which category each object in the training set belongs. After the classifier is trained, the classifier's goal is to predict to which category an object provided to the classifier for classification belongs.
A technical problem associated with classifiers is that, in practice, the classifiers that assign objects to categories make mistakes. For example, classifiers may generate false positives, i.e., instances of mistakenly assigning an object to a category, and false negatives, i.e., instances of mistakenly failing to assign an object to a category when the object belongs in the category. These mistakes are often caused by deficiencies of the training set. For example, typically, the larger the training set, the better the classification accuracy. In many instances, large training sets may be unavailable. Also, the accuracy of the labeled data in the training set impacts the classification accuracy. In some cases, the data in the training set may not be correctly labeled, causing the classification accuracy to be compromised.