Image category classification, or recognition, enables many emerging multimedia applications, e.g., photo album management, mobile visual search, and image tagging as a service. The most popular paradigm of image classification is based on bag-of-words (BoW), which generally involves four steps: local descriptor extraction, descriptor coding, pooling, and classification.
In a conventional BoW process, sparse or dense local invariant descriptors are extracted and then encoded by a coding method such as sparse coding, local coordinate coding (LCC), or super-vector coding. The coding vectors are then pooled to construct image-level representations, which are fed to classifiers such as linear SVMs to output category predictions. The coding methods encode BoW histograms into high-dimensional feature spaces, e.g., using codebooks with 8K to 16K visual words, to enhance the discriminative power of the image features. However, these approaches are computationally intensive, which becomes costly when processing a large number of images. For example, some existing approaches must run advanced coding algorithms over the local invariant features to generate the image-level representation before any classifier is applied. Other state-of-the-art image classification methods often involve heavy computation in both feature extraction and classifier training.
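The coding, pooling, and classification steps described above can be sketched in a few lines. This is only an illustrative sketch under stated assumptions: hard-assignment (nearest visual word) coding stands in for the sparse coding, LCC, or super-vector coding named in the text, average pooling is assumed, and the codebook, descriptors, weights, and all function names are hypothetical toy values rather than anything taken from an actual system.

```python
# Illustrative sketch only. Hard-assignment coding is a simple stand-in for
# the more advanced coding methods named in the text; all data are toy values.

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def code_descriptor(descriptor, codebook):
    """Coding step: one-hot vector marking the nearest visual word."""
    nearest = min(range(len(codebook)),
                  key=lambda k: squared_distance(descriptor, codebook[k]))
    return [1.0 if k == nearest else 0.0 for k in range(len(codebook))]

def pool(codes):
    """Pooling step: average the coding vectors into one image-level vector."""
    n = len(codes)
    return [sum(column) / n for column in zip(*codes)]

def linear_score(features, weights, bias):
    """Classification step: decision value of a linear classifier,
    e.g. one class of a linear SVM whose weights were trained elsewhere."""
    return sum(f * w for f, w in zip(features, weights)) + bias

# Hypothetical toy input: a 3-word codebook and 4 two-dimensional descriptors.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
descriptors = [[0.1, 0.0], [0.9, 0.1], [1.0, -0.1], [0.0, 0.8]]

codes = [code_descriptor(d, codebook) for d in descriptors]
image_vector = pool(codes)  # normalized BoW histogram over the 3 words
score = linear_score(image_vector, weights=[-1.0, 2.0, -1.0], bias=0.0)
```

Even in this simplified form, coding one image compares each of its descriptors against every visual word, so the work grows with the number of descriptors times the codebook size times the descriptor dimension. With codebooks of 8K to 16K words, that product is one way to see where the computational cost noted above comes from.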