Binary and multi-class classification problems arise in a wide variety of applications, ranging from visual target tracking to speech recognition, where they often serve as building blocks. Most of these problems require processing large amounts of continuously streamed data, so incremental learning (or updating) of the classification model is almost unavoidable. An incremental learner should be computationally simple, capable of handling complex separations in the data, and quick to adapt to variations in the source statistics, such as appearance changes of a target being tracked. The method in this invention achieves these goals.
It should first be emphasized that the incremental learner of the method in this invention is based on a set of partitionings of the observation space that varies over the course of the input data stream, i.e., the partitionings are tuned to the observed data so as to minimize the classification or regression error, with a local classifier located on each distinct region of the partitionings. This set of varying partitionings and the corresponding local classifiers are organized via a binary tree. Our method processes a data stream by pushing one more data point to the tree at every time step. To test a data point, all the regions defined by the partitionings at the corresponding time step are considered, and the classification results of the local classifiers in the regions that contain this data point are combined, i.e., weighted and summed, to obtain the final classification. Here, the local classifiers, the combination weights, and the aforementioned set of partitionings are incrementally updated to lower the empirical classification error at each step.
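The weighted combination along the tree can be illustrated with a minimal Python sketch. It is not the claimed implementation: the class `Node`, the functions `path`, `predict` and `update`, the tanh-output linear classifiers, the running-loss weighting, and the parameters `lr`, `beta` and `eta` are all assumptions introduced here for illustration. In particular, the splits are fixed random hyperplanes in this sketch, whereas the text describes the partitionings as being adapted as well.

```python
import numpy as np

class Node:
    """One region of the observation space with its own local linear classifier."""
    def __init__(self, dim, depth, max_depth, rng):
        self.w = np.zeros(dim + 1)              # local classifier, bias in last slot
        self.split = rng.normal(size=dim + 1)   # hyperplane separating the two children
        self.loss = 1.0                         # running squared loss (a zero output scores 1)
        self.children = None
        if depth < max_depth:
            self.children = (Node(dim, depth + 1, max_depth, rng),
                             Node(dim, depth + 1, max_depth, rng))

def path(root, x):
    """Return the nodes whose regions contain x, ordered from root to leaf."""
    xa = np.append(x, 1.0)
    nodes, node = [], root
    while True:
        nodes.append(node)
        if node.children is None:
            return nodes
        node = node.children[int(node.split @ xa > 0)]

def predict(root, x, eta=2.0):
    """Weight and sum the local outputs of every region that contains x."""
    xa = np.append(x, 1.0)
    nodes = path(root, x)
    weights = np.exp(-eta * np.array([n.loss for n in nodes]))
    weights /= weights.sum()                    # convex combination weights
    outputs = np.array([np.tanh(n.w @ xa) for n in nodes])
    return float(weights @ outputs)

def update(root, x, y, lr=0.1, beta=0.1):
    """One incremental step for a label y in {-1, +1}; the point is not stored."""
    xa = np.append(x, 1.0)
    for n in path(root, x):
        out = np.tanh(n.w @ xa)
        n.loss = (1 - beta) * n.loss + beta * (y - out) ** 2   # track recent loss
        n.w += lr * (y - out) * (1 - out ** 2) * xa            # gradient step on squared loss
```

In this sketch, a node that has seen few points keeps a running loss near its initial value, so the coarser nodes near the root dominate the combination until the finer regions have accumulated enough data.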
The most relevant designs in the literature (Ozkan et al.), (Wang et al.) exploit the idea of localized classifiers only to a limited degree. In a recent conference paper (Wang et al.), a rejection cascade of classifiers is designed for classification problems in the batch processing setting, where each node of the cascade corresponds to a region in the space of observations. These regions and the resulting partitioning structure are adjusted iteratively given a batch of data. No results or extensions for incremental processing are provided. Furthermore, the cascade structure in that method is very simple, and a natural extension via balanced trees is not mentioned. Also, their local classifiers are located at the deep leaf nodes, which creates an immediate weakness when sufficient data are not available, since the amount of data needed to populate the regions at the leaf nodes increases exponentially with the data dimensionality. This is precisely the case at the beginning of a high-dimensional data stream in the sequential setting, which is handled by the adaptive mixtures in this method.
In another design (Ozkan et al.), (1) the authors also exploit the idea of space partitioning. However, the set of partitionings in their method is obtained by listing all possible partitionings (up to a certain region granularity) and is fixed over the course of the data stream. This adversely affects the adaptiveness of the corresponding incremental learner, since in high dimensions listing all possible partitionings up to an arbitrary granularity is not feasible. (2) The authors (Ozkan et al.) also combine local classifiers, which in that case are a simplified version of LDA classifiers. However, the combination they use does not directly target minimizing the classification error, which clearly sacrifices classification accuracy. Moreover, LDAs were not originally designed for incremental learning. Although several incremental versions of LDA exist, their updates are usually more complicated than those of perceptrons, which are perhaps a more natural choice.
The United States patent publication numbered U.S. Pat. No. 5,799,311 A discloses a method for generating a decision-tree classifier from a training set of records, independent of the system memory size. In that invention, each node split is determined based on only one attribute using an appropriate criterion such as the Gini index. Also, each split is determined only once and is then fixed. Clearly, the method disclosed in U.S. Pat. No. 5,799,311 A is not appropriate for sequential processing. Moreover, the corresponding partitioning is fixed, and no mixture-of-experts type of approach is used to handle the case of an insufficient number of available data points.
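For reference, a single-attribute split chosen by the Gini index can be sketched as follows. This is a generic textbook illustration, not the procedure of the cited patent; the function names `gini`, `split_gini` and `best_threshold` are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a collection of class labels (0.0 means pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_gini(values, labels, threshold):
    """Size-weighted Gini impurity after splitting one attribute at a threshold."""
    left = [lab for val, lab in zip(values, labels) if val <= threshold]
    right = [lab for val, lab in zip(values, labels) if val > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

def best_threshold(values, labels):
    """Pick the observed attribute value that gives the purest two-way split."""
    return min(sorted(set(values)), key=lambda t: split_gini(values, labels, t))
```

Such a split inspects one attribute at a time and needs the whole set of records in hand to evaluate candidate thresholds, which illustrates why a criterion of this kind does not carry over directly to the sequential setting.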
The United States patent publication numbered US20050246307 A1 discloses a method for “employing a hybrid Bayesian decision tree for classification”, which has the capability of incrementally updating classification trees. The decision trees of this patent evaluate only one attribute at each node. Moreover, incremental learning in that context only means creating a new tree for some synthetic data points for which no tree model exists, as opposed to incrementally adjusting one existing tree structure.
The United States patent publication numbered U.S. Pat. No. 6,269,353 B1 discloses a method for constructing decision trees, which learns multi-feature splits at the nodes of a decision tree using a neural network. This method does not target the case of sequential processing, since it assumes the availability of a training set in advance. Hence, it does not have the capability of incrementally updating the structure of the decision tree on demand. No comment is made regarding the computational complexity; by contrast, our method processes every data point only once and does not store it. Our method also uses a mixture-of-experts idea to handle sparse data, which is an important cold-start problem at the beginning of a data stream in the sequential setting. Our mixture-of-experts implementation favors simpler models trained with all available data points at the beginning and starts favoring more complex models as more data are streamed.
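The shift from simpler to more complex models can be illustrated with a small sketch based on exponential weighting of cumulative losses. The weighting rule, the function names `mixture_weights` and `simulate`, the parameter `eta`, and above all the per-step loss values are assumptions made up for this illustration; they are not measurements and not the weighting claimed by the invention.

```python
import math

def mixture_weights(cumulative_losses, eta=1.0):
    """Exponential weights: an expert with lower cumulative loss gets more weight."""
    best = min(cumulative_losses)   # subtract the minimum for numerical stability
    raw = [math.exp(-eta * (loss - best)) for loss in cumulative_losses]
    total = sum(raw)
    return [r / total for r in raw]

def simulate(steps, eta=1.0):
    """Track the weights of a simple and a complex expert over a synthetic stream."""
    cumulative = [0.0, 0.0]          # [simple expert, complex expert]
    history = []
    for t in range(1, steps + 1):
        cumulative[0] += 0.3         # made-up loss: the simple model is mediocre but stable
        cumulative[1] += 1.0 / t     # made-up loss: the complex model improves with data
        history.append(mixture_weights(cumulative, eta))
    return history
```

With these synthetic losses, the simple expert holds most of the weight during the first few steps, and the complex expert takes over once its cumulative loss falls below that of the simple one.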
In general, the existing work and inventions regarding incremental updates of decision trees share common drawbacks: (1) the split at each node of the tree is usually based on a single attribute, and (2) the split criterion is usually based on class label purity measures such as the Gini index. In our invention, we take a rather radical approach, which treats each split as a separate binary classification model and operates sequentially, i.e., the classification model for each split is learned incrementally.
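A split that is itself an incrementally learned binary classifier can be sketched with a plain online perceptron. The class name `PerceptronSplit`, its methods, and the use of a 0/1 routing target supplied by the caller are assumptions for illustration; the text does not specify which model or training signal the invention uses for its splits.

```python
import numpy as np

class PerceptronSplit:
    """A node split over all features, learned one data point at a time."""
    def __init__(self, dim):
        self.w = np.zeros(dim + 1)   # hyperplane weights, bias in last slot

    def side(self, x):
        """Route x to child 1 if it lies on the positive side, else to child 0."""
        return 1 if self.w @ np.append(x, 1.0) > 0 else 0

    def update(self, x, target):
        """Classic perceptron rule: adjust the hyperplane only on a routing mistake."""
        if self.side(x) != target:
            sign = 1.0 if target == 1 else -1.0
            self.w += sign * np.append(x, 1.0)

# Each point is seen once and then discarded
split = PerceptronSplit(dim=2)
stream = [((2.0, 1.0), 1), ((-1.0, -2.0), 0), ((1.0, 2.0), 1), ((-2.0, -1.0), 0)]
for x, target in stream:
    split.update(np.array(x), target)
```

Unlike a single-attribute threshold, the hyperplane here combines every feature of the data point, and it is never recomputed from stored records.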