The present invention relates to automated classification of images.
Image classification systems take images (or video frames) as inputs and output labels indicating the semantic categories of the input images. Such systems have wide applications in face recognition, object recognition, scene classification, and handwriting recognition, among others. Many state-of-the-art image classification systems have two key components: a feature extractor and a classifier.
In practice, visual patterns vary widely in scale, translation, illumination, and deformation. An ideal feature extractor therefore has to be invariant to these changes, which means that feature extraction should capture the salient features of the image. The classifier, in turn, should be trainable on a large number of training examples and able to process each image to be categorized efficiently.
One popular model for representing an image for categorization is the bag-of-features (BoF) model, which is based on collections of appearance descriptors (e.g., SIFT descriptors, Geometric-Blur, SURF, image patches, among others) extracted from local patches. The method treats an image as an unordered collection of such descriptors, quantizes them into discrete “visual words”, and then computes a compact histogram representation for semantic image classification, e.g., object recognition or scene categorization. The key idea is that the continuous high-dimensional feature space is quantized into discrete “visual words”, and the histogram is computed by assigning each local feature to its nearest “visual word”. The spatial pyramid matching (SPM) kernel represents the state-of-the-art method that extends the “bag-of-words” approach to further consider the spatial structure of the visual words at several different scales. Under this SPM representation, support vector machine (SVM) classifiers with nonlinear kernel functions have achieved very good accuracy in image classification.
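The quantization and histogram steps described above can be illustrated with a minimal sketch. It assumes the codebook of visual words has already been learned (for example by k-means, which is not shown), and the function name `bof_histogram` and the toy arrays are hypothetical, chosen only for illustration.

```python
import numpy as np

def bof_histogram(descriptors, codebook):
    """Bag-of-features histogram by hard assignment (vector quantization).

    descriptors: (N, D) array of local appearance descriptors.
    codebook:    (K, D) array of visual words (assumed already learned).
    Returns a length-K histogram normalized to sum to 1.
    """
    # Squared Euclidean distance from every descriptor to every visual word.
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    # Assign each descriptor to its nearest visual word.
    words = d2.argmin(axis=1)
    # Count how often each visual word occurs; spatial order is discarded.
    hist = np.bincount(words, minlength=codebook.shape[0]).astype(float)
    return hist / max(hist.sum(), 1.0)

# Toy data (hypothetical): three 2-D visual words and four descriptors.
codebook = np.array([[0.0, 0.0], [10.0, 10.0], [0.0, 10.0]])
descriptors = np.array([[0.1, -0.2], [9.5, 10.2], [0.3, 0.1], [1.0, 9.0]])
hist = bof_histogram(descriptors, codebook)
```

In this toy case two descriptors fall nearest the first word and one each nearest the other two, so the histogram records only word frequencies and nothing about where in the image each descriptor was found.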
The BoF approach discards the spatial order of local descriptors, which severely limits the descriptive power of the image representation. To overcome this problem, one particular extension of the BoF model, called spatial pyramid matching (SPM), has achieved remarkable success on a range of image classification benchmarks and has been the major component of state-of-the-art systems. The method partitions an image into 2^l × 2^l segments at different scales l = 0, 1, 2, computes the BoF histogram within each of the resulting 21 segments (1 + 4 + 16), and finally concatenates all the histograms to form a vector representation of the image. When only the scale l = 0 is used, SPM reduces to BoF.
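The partition-and-concatenate step can be sketched as follows. This is an illustrative sketch under stated assumptions, not the method of any particular system: descriptor positions are assumed to be normalized to the unit square, visual-word indices are assumed to be precomputed, and the per-level weights used in some pyramid matching formulations are omitted. The name `spm_histogram` is hypothetical.

```python
import numpy as np

def spm_histogram(words, xy, num_words, levels=(0, 1, 2)):
    """Concatenate per-cell BoF histograms over a spatial pyramid.

    words:     (N,) integer visual-word index of each descriptor.
    xy:        (N, 2) descriptor positions, assumed normalized to [0, 1].
    num_words: codebook size K.
    Returns a vector of length K * sum(4**l for l in levels).
    """
    parts = []
    for l in levels:
        cells = 2 ** l  # the image is split into cells x cells segments
        # Grid coordinates of each descriptor (clipped so x == 1.0 stays in range).
        cx = np.minimum((xy[:, 0] * cells).astype(int), cells - 1)
        cy = np.minimum((xy[:, 1] * cells).astype(int), cells - 1)
        cell_id = cy * cells + cx
        # One histogram of K bins per cell, laid out cell after cell.
        counts = np.bincount(cell_id * num_words + words,
                             minlength=cells * cells * num_words)
        parts.append(counts.astype(float))
    return np.concatenate(parts)

# Toy data (hypothetical): three descriptors, codebook of two words.
words = np.array([0, 1, 1])
xy = np.array([[0.1, 0.1], [0.9, 0.9], [0.6, 0.2]])
vec = spm_histogram(words, xy, num_words=2)
bof_only = spm_histogram(words, xy, num_words=2, levels=(0,))
```

With K = 2 the full vector has 21 × 2 = 42 entries, and restricting the levels to l = 0 yields the plain BoF count vector, consistent with the statement that SPM reduces to BoF in that case.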
It has been found empirically that, to obtain good performance, both BoF and SPM must be applied together with a particular type of nonlinear Mercer kernel, e.g., the intersection kernel or the Chi-square kernel. Accordingly, the nonlinear SVM incurs a computational complexity of O(n^3) and a memory complexity of O(n^2) in the training phase, where n is the number of training examples. Furthermore, since the number of support vectors grows linearly with n, the computational complexity of classifying each test image is O(n). This poor scalability is a severe limitation: it is nontrivial to apply these methods to real-world applications, whose training sets are typically far larger than a few thousand examples.
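As a rough illustration of the two kernels named above and of where the O(n^2) memory cost comes from, the sketch below builds the full n × n kernel (Gram) matrix over a set of histograms. The function names are hypothetical, and the Chi-square kernel is written in one common exponential form with an assumed bandwidth parameter `gamma`; other variants exist.

```python
import numpy as np

def intersection_kernel(X, Y):
    """K[i, j] = sum_k min(X[i, k], Y[j, k]) for histogram rows of X and Y."""
    return np.minimum(X[:, None, :], Y[None, :, :]).sum(axis=2)

def chi_square_kernel(X, Y, gamma=1.0, eps=1e-12):
    """One common form: K[i, j] = exp(-gamma * sum_k (x_k - y_k)^2 / (x_k + y_k)).

    gamma and eps are assumed illustrative parameters; eps avoids division
    by zero when both histogram bins are empty.
    """
    diff2 = (X[:, None, :] - Y[None, :, :]) ** 2
    denom = X[:, None, :] + Y[None, :, :] + eps
    return np.exp(-gamma * (diff2 / denom).sum(axis=2))

# Toy data (hypothetical): n = 2 training histograms with 2 bins each.
X = np.array([[0.5, 0.5], [1.0, 0.0]])
K_int = intersection_kernel(X, X)  # n x n matrix: memory grows as n^2
K_chi = chi_square_kernel(X, X)
```

A kernel SVM trained on n examples works with a matrix of this n × n shape, which is why memory grows quadratically with the training set, and a test image must be compared against every support vector, which is why the test cost grows with n.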
The bag-of-features model and the SPM approach both employ vector quantization (VQ) to extract visual words. VQ is a very coarse coding method that does not capture the salient properties of images. As a consequence, the classifier has to employ additional operations, such as the nonlinear kernels described above, to achieve good performance, which makes the training procedure very expensive and the testing procedure slow. To date, this state-of-the-art approach can handle only several thousand training examples.