The present invention relates to online learning of a classifier in a machine learning framework that includes supervised and semi-supervised online learning algorithms.
Such classifiers may be trained using training samples, and then used in a so-called testing or prediction stage to classify test samples. For example, such classifiers may be used in an automated factory inspection application to detect defects in an image of a product. In this case, a “sample” may consist of a set of data derived from an image or from pixels in a region of the image, and the task of the classifier is to classify each sample as “defect” (positive class) or “non-defect” (negative class). As another example, such classifiers may be used to classify defects into different categories.
Machine learning applications usually include two stages: a training stage and a prediction stage. In the training stage, traditionally all training samples are available as a batch at the beginning; a statistical model can then be trained based on the training samples. In the prediction stage, the statistical model obtained from the training stage is used to classify new samples into different categories. However, in some machine learning tasks, not all the training samples are available at the initial training stage. More samples will be acquired, and may be labeled, as time goes on. It is desirable to use this new information to refine and improve the classifier for future use. In some other applications, the data properties might change over time, or the data might not even be generated from any distribution. A model trained with the initial samples can only accommodate the initial properties, so it might become useless as new samples arrive over time.
One way of solving this problem would be to re-train the model with all samples, including the initial samples and the newly obtained samples. However, re-training from scratch will usually be time-consuming, and it is not efficient to perform the re-training frequently in an online application. Therefore, a mechanism is desirable such that the model can be updated with the newly obtained samples in an online fashion during the prediction stage, without complete re-training.
Many statistical models can be used as the classifier, including Normal Support Vector Machines, Decision Trees, Boosted Decision Trees and Neural Networks. The Boosted Decision Tree (or Boosting Tree) may be the statistical model of the classifier. Thus each of the types of statistical models may be generally referred to as a respective class.
Boosting is based on the use of an ensemble of weak classifiers, and it is not constrained to any specific type of weak classifier. In a boosting algorithm, each weak classifier is trained with respect to the samples and their associated weights. At each iteration, a weak classifier is added to the ensemble, which forms the final strong classifier. The weak classifiers are typically weighted by their accuracy. After a weak classifier is added, the samples are re-weighted: the weights of the misclassified samples are increased, and the weights of the correctly classified samples are decreased. Weak classifiers that are subsequently added are trained on the re-weighted samples, and therefore focus more on the misclassified samples.
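The re-weighting loop described above can be sketched as follows. This is only an illustrative sketch of a discrete AdaBoost-style procedure, not necessarily the boosting variant contemplated here. It assumes labels of +1 and -1 and uses single-threshold "stumps" as the weak classifiers; the function names (`train_stump`, `reweight`, `adaboost`, `predict_ensemble`) are hypothetical.

```python
import math

def train_stump(X, y, w):
    """Return (weighted_error, feature, threshold, polarity) of the best stump."""
    best = None
    for f in range(len(X[0])):
        for thr in sorted({row[f] for row in X}):
            for pol in (1, -1):
                preds = [pol if row[f] < thr else -pol for row in X]
                err = sum(wi for wi, p, yi in zip(w, preds, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, f, thr, pol)
    return best

def stump_predict(stump, row):
    _, f, thr, pol = stump
    return pol if row[f] < thr else -pol

def reweight(w, alpha, y, preds):
    """Increase weights of misclassified samples, decrease the rest, renormalize."""
    new_w = [wi * math.exp(-alpha * yi * p) for wi, yi, p in zip(w, y, preds)]
    total = sum(new_w)
    return [wi / total for wi in new_w]

def adaboost(X, y, rounds=5):
    n = len(X)
    w = [1.0 / n] * n                      # start with uniform sample weights
    ensemble = []
    for _ in range(rounds):
        stump = train_stump(X, y, w)
        # Clamp the error so the logarithm below stays finite.
        err = min(max(stump[0], 1e-10), 1.0 - 1e-10)
        alpha = 0.5 * math.log((1.0 - err) / err)   # more accurate => larger vote
        ensemble.append((alpha, stump))
        preds = [stump_predict(stump, row) for row in X]
        w = reweight(w, alpha, y, preds)
    return ensemble

def predict_ensemble(ensemble, row):
    """Weighted vote of all weak classifiers."""
    score = sum(alpha * stump_predict(s, row) for alpha, s in ensemble)
    return 1 if score >= 0 else -1
```

In this sketch the vote weight `alpha` grows as the weighted error of a weak classifier shrinks, which is one common way of weighting weak classifiers by their accuracy.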
The weak classifier may be in the form of a Decision Tree (DT). A Decision Tree is a binary tree (i.e., a tree in which each non-leaf node has exactly two child nodes). The training and prediction procedures of the Decision Tree are described as follows.
Training Decision Trees. The tree is built recursively, starting from the root node. The whole training data set (feature vectors and responses) is used to split the root node. In each node the optimum decision rule (i.e., the best “primary” split) is found based on some criterion (the Gini “purity” criterion is used for classification). Then, if necessary, surrogate splits are found that most closely resemble the results of the primary split on the training data; all data are divided between the left and the right child node using the primary and the surrogate splits (just as is done in the prediction procedure). The procedure then recursively splits both the left and the right nodes. At each node the recursive procedure may stop (i.e., stop splitting the node further). When the tree is built, it may be pruned using a cross-validation procedure, if needed. That is, some branches of the tree that may lead to model overfitting are cut off. Normally, this procedure is only applied to standalone decision trees, while tree ensembles usually build small enough trees and use their own protection schemes against overfitting.
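As a minimal sketch of the recursive construction just described, the following code selects each primary split by minimizing the weighted Gini impurity of the two child nodes. It is an illustration under simplifying assumptions: features are numeric, candidate thresholds are midpoints between adjacent observed values, splitting stops at a fixed depth or when a node is pure, and surrogate splits and pruning are omitted. The names `gini`, `best_split`, and `build_tree`, and the dictionary node layout, are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0.0 for a pure node, larger for mixed nodes."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(X, y):
    """Return (feature, threshold, weighted_gini) of the best split, or None."""
    best = None
    n = len(X)
    for f in range(len(X[0])):
        values = sorted({row[f] for row in X})
        # Candidate thresholds: midpoints between adjacent distinct values.
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2.0
            left = [yi for row, yi in zip(X, y) if row[f] < thr]
            right = [yi for row, yi in zip(X, y) if row[f] >= thr]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (f, thr, score)
    return best

def build_tree(X, y, depth=0, max_depth=3):
    """Recursively split nodes; stop when pure, too deep, or unsplittable."""
    split = best_split(X, y) if depth < max_depth and gini(y) > 0 else None
    if split is None:
        # Leaf node: respond with the majority class of the samples it holds.
        return {"leaf": True, "value": Counter(y).most_common(1)[0][0]}
    f, thr, _ = split
    left = [(row, yi) for row, yi in zip(X, y) if row[f] < thr]
    right = [(row, yi) for row, yi in zip(X, y) if row[f] >= thr]
    return {
        "leaf": False,
        "feature": f,
        "threshold": thr,
        "left": build_tree([r for r, _ in left], [l for _, l in left],
                           depth + 1, max_depth),
        "right": build_tree([r for r, _ in right], [l for _, l in right],
                            depth + 1, max_depth),
    }
```

Keeping `max_depth` small reflects the remark that tree ensembles usually build small trees rather than relying on pruning.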
Predicting with Decision Trees. To reach a leaf node, and thus to obtain a response for the input feature vector, the prediction procedure starts at the root node. From each non-leaf node the procedure goes to the left or to the right child node based on the value of a certain variable, which is compared with a threshold associated with that node. If the value of the variable is less than the threshold, the procedure goes to the left; otherwise, it goes to the right. This pair (the variable and its threshold) is called a split. Once a leaf node is reached, the value assigned to that node is used as the output of the prediction procedure.
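The traversal can be illustrated with a short sketch. The nested-dictionary node layout, the function name `predict_tree`, and the example tree with its thresholds are assumptions made purely for illustration.

```python
def predict_tree(node, features):
    """Walk from the root to a leaf and return the value stored in that leaf."""
    while not node["leaf"]:
        # Each non-leaf node holds a split: a variable index and a threshold.
        if features[node["feature"]] < node["threshold"]:
            node = node["left"]
        else:
            node = node["right"]
    return node["value"]

# A hand-built example tree with two splits (hypothetical values).
example_tree = {
    "leaf": False, "feature": 0, "threshold": 2.5,
    "left": {"leaf": True, "value": "defect"},
    "right": {
        "leaf": False, "feature": 1, "threshold": 0.5,
        "left": {"leaf": True, "value": "non-defect"},
        "right": {"leaf": True, "value": "defect"},
    },
}
```

For instance, with this example tree the feature vector `[3.0, 0.2]` goes right at the root (3.0 is not less than 2.5) and then left at the second split (0.2 is less than 0.5), reaching the "non-defect" leaf.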
The foregoing and other objectives, features, and advantages of the invention may be more readily understood upon consideration of the following detailed description of the invention, taken in conjunction with the accompanying drawings.