The task of classifying/recognizing, detecting and clustering general objects in natural scenes is extremely challenging. The difficulty is due to many reasons: large intra-class variation and inter-class similarity, articulation and motion, different lighting conditions, orientations/viewing directions, and the complex configurations of different objects. FIG. 1 shows a multitude of different images. The first row 102 of FIG. 1 displays some face images. The rest of the rows 104-110 show some typical images from the Caltech 101 categories of objects. Some of the objects are highly non-rigid and some of the objects in the same category bear little similarity to each other. For the categorization task, very high level knowledge is required to put different instances of a class into the same category.
The problem of general scene understanding can be viewed in two aspects: modeling and computing. Modeling addresses the problem of how to learn/define the statistics of general patterns/objects. Computing tackles the inference problem. Let x be an image sample and its interpretation be y. Ideally, the generative models p(x|y) are obtained for a pattern to measure the statistics about any sample x. Unfortunately, not only are such generative models often out of reach, but they also create large computational burdens in the computing stage. For example, faces are considered a relatively easy class to study. Yet, there is no existing generative model which captures all the variations for a face such as multi-view, shadow, expression, occlusion, and hair style. Some sample faces can be seen in the first row 102 of FIG. 1. Alternatively, a discriminative model p(y|x) is learned directly, in which y is just a simple variable to say, “yes” or “no”, or a class label.
A known technique referred to as AdaBoost and its variants have been successfully applied in many problems in vision and machine learning. AdaBoost approaches the posterior p(y|x) by selecting and combining a set of weak classifiers into a strong classifier. However, there are several problems with the current AdaBoost method. First, though it asymptotically converges to the target distribution, it may need to pick hundreds of weak classifiers. This poses a huge computational burden. Second, the order in which features are picked in the training stage is not preserved. The order of a set of features may correspond to high-level semantics and, thus, it is very important for the understanding of objects/patters. Third, the re-weighing scheme of AdaBoost may cause samples previously correctly classified to be misclassified again. Fourth, though extensions from two-class to multi-class classification have been proposed, learning weak classifiers in the multi-class case using output coding is more difficult and computationally expensive.
Another known method typically referred to as AdaTree combines AdaBoost with a decision tree. The main goal of the AdaTree method is to speed up the AdaBoost method by pruning. The AdaTree method learns a strong classifier by combining a set of weak classifiers into a tree structure, but it does not address multi-class classification.
A number of approaches exist to handle object classification and detection. Cascade approaches that are used together with AdaBoost have shown to be effective in rare event detection. The cascade method can be viewed as a special case of the method of the present invention. In the cascade, a threshold is picked such that all the positive samples are pushed to the right side of the tree. However, pushing positives to the right side may cause a big false positive rate, especially when the positives and negatives are hard to separate. The method of the present invention naturally divides the training sets into two parts. In the case where there are many more negative samples than positive samples, most of the negatives are passed into leaf nodes close to the top. Deep tree leaves focus on classifying the positives and negatives which are hard to separate.
Decision trees have been widely used in vision and artificial intelligence. In a traditional decision tree, each node is a weak decision maker and thus the result at each node is more random. In contrast, in the present invention, each tree node is a strong decision maker and it learns a distribution q(y|x). Other approaches include A*, generative models, EM and grammar and semantics. There is a need for a framework that is capable of learning discriminative models for use in multi-class classification that is not computationally burdensome.