The present application relates to image classification.
Recent state-of-the-art image classification systems consist of two major parts: bag-of-features (BoF) and spatial pyramid matching (SPM). The BoF method represents an image as a histogram of its local features. It is especially robust against spatial translations of features, and demonstrates decent performance in whole-image categorization tasks. However, the BoF method disregards the spatial layout of features, so it cannot capture shapes or locate an object. Of the many extensions of the BoF method, including generative part models, geometric correspondence search, and discriminative codebook learning, the most successful results have been reported using SPM. The SPM method partitions the image into increasingly finer spatial sub-regions and computes histograms of local features from each sub-region. Typically, subregions, are used. Other partitions such as have also been attempted to incorporate domain knowledge for images with “sky” on top and/or “ground” on bottom. The resulting “spatial pyramid” is a computationally efficient extension of the orderless BoF representation, and has shown very promising performance on many image classification tasks.
A typical flowchart of the SPM approach based on BoF is illustrated on the left of FIG. 1. First, feature points are detected or densely sampled on the input image, and descriptors such as “SIFT” or “color moment” are extracted from each feature point (highlighted by the blue circle in FIG. 1). This yields the “Descriptor” layer. Then, a codebook with entries is applied to quantize each descriptor and generate the “Code” layer, where each descriptor is converted into a code (highlighted by the green circle). If hard vector quantization (VQ) is used, each code has only one non-zero element, while for soft-VQ, a small group of elements can be non-zero. Next, in the “SPM” layer, the codes inside each sub-region are pooled together by averaging and normalizing into a histogram. Finally, the histograms from all sub-regions are concatenated to generate the final representation of the image for classification.
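The Descriptor, Code, and SPM layers described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the implementation of any particular system: the function names `hard_vq_codes` and `spm_histogram`, the `positions` and `image_size` arguments, and the grid sizes `(1, 2)` are all hypothetical choices made for the example. Hard VQ is assumed, so each code is one-hot and averaging the codes in a sub-region directly yields a normalized histogram.

```python
import numpy as np

def hard_vq_codes(descriptors, codebook):
    """Hard VQ: map each descriptor to a one-hot code over the codebook entries."""
    # Squared Euclidean distance from every descriptor to every codebook entry.
    diffs = descriptors[:, None, :] - codebook[None, :, :]
    dists = np.sum(diffs ** 2, axis=2)
    nearest = np.argmin(dists, axis=1)
    codes = np.zeros((descriptors.shape[0], codebook.shape[0]))
    codes[np.arange(descriptors.shape[0]), nearest] = 1.0
    return codes

def spm_histogram(codes, positions, image_size, grids=(1, 2)):
    """Average the codes inside each sub-region and concatenate the histograms.

    `grids` lists the number of cells per side at each pyramid level; the
    default is an arbitrary illustrative choice, not a value from the text.
    """
    width, height = image_size
    num_entries = codes.shape[1]
    pooled = []
    for g in grids:
        # Grid cell (row, col) of each feature point on a g x g partition.
        col = np.minimum((positions[:, 0] / width * g).astype(int), g - 1)
        row = np.minimum((positions[:, 1] / height * g).astype(int), g - 1)
        for r in range(g):
            for c in range(g):
                mask = (row == r) & (col == c)
                if mask.any():
                    # Mean of one-hot codes is a histogram that sums to 1.
                    pooled.append(codes[mask].mean(axis=0))
                else:
                    pooled.append(np.zeros(num_entries))
    return np.concatenate(pooled)
```

With a codebook of K entries and the assumed grids `(1, 2)`, the final vector has (1 + 4) × K elements, one histogram per sub-region.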
Although the traditional SPM approach works well for image classification, it has been found empirically that, to achieve good performance, traditional SPM must use classifiers with nonlinear Mercer kernels, e.g., the Chi-square kernel. The nonlinear classifier therefore incurs additional computational complexity, which implies poor scalability of the SPM approach for real applications.
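To make the cost of a nonlinear kernel concrete, the sketch below computes a Chi-square kernel matrix between two sets of histograms. The exponential form used here is one common variant and is an assumption, since the text does not specify which form is meant; the function name `chi2_kernel_matrix` and the parameters `gamma` and `eps` are likewise hypothetical.

```python
import numpy as np

def chi2_kernel_matrix(X, Y, gamma=1.0, eps=1e-12):
    """Exponential Chi-square kernel between rows of X and rows of Y.

    Assumed form: k(x, y) = exp(-gamma * sum_i (x_i - y_i)^2 / (x_i + y_i)).
    `eps` guards against division by zero for empty histogram bins.
    """
    diff = X[:, None, :] - Y[None, :, :]
    summ = X[:, None, :] + Y[None, :, :]
    dist = np.sum(diff ** 2 / (summ + eps), axis=2)
    return np.exp(-gamma * dist)
```

Every pair of images requires one kernel evaluation, so the matrix for n training images has n × n entries, whereas a linear classifier needs only a single dot product per image at test time.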
To improve scalability, researchers aim to obtain nonlinear feature representations that work better with linear classifiers. In a method called ScSPM, sparse coding (SC) is used instead of VQ to obtain nonlinear codes. In ScSPM, the restrictive cardinality constraint of VQ is relaxed, and a small number of basis vectors from the codebook can be selected to jointly reconstruct the input descriptor. The final representation achieves superior image classification performance using only linear SVM classifiers. Although the ScSPM method avoids computing the Chi-square kernel in a nonlinear classifier, it shifts the cost from the classifier to the feature extractor, because the SC process is computationally very demanding. This is because the objective function in SC is not differentiable at 0. Most existing SC solvers, such as Matching Pursuit (MP), Orthogonal MP, Coordinate Descent, and LARS, among others, operate iteratively.
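The iterative nature of these solvers can be seen in a minimal sketch of plain Matching Pursuit, one of the solvers named above. This is a textbook-style greedy loop under stated assumptions, not the solver used by ScSPM: the function name `matching_pursuit` and the parameter `num_iters` are hypothetical, and the rows of `codebook` are assumed to be unit-norm basis vectors.

```python
import numpy as np

def matching_pursuit(x, codebook, num_iters=3):
    """Greedy Matching Pursuit for one descriptor.

    Assumes each row of `codebook` is a unit-norm basis vector. Returns the
    code and the final residual. The same basis vector may be selected more
    than once, so the code has at most `num_iters` non-zero elements.
    """
    residual = x.astype(float).copy()
    code = np.zeros(codebook.shape[0])
    for _ in range(num_iters):
        # Correlation of every basis vector with the current residual.
        correlations = codebook @ residual
        best = np.argmax(np.abs(correlations))
        # Add the best basis vector's contribution and remove it from the residual.
        code[best] += correlations[best]
        residual -= correlations[best] * codebook[best]
    return code, residual
```

Each iteration scans the whole codebook, and the loop runs once per descriptor, which illustrates why coding cost grows quickly with dense feature extraction.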