The present invention relates to systems and methods for determining image representations at a pixel level.
Sparse coding refers to a general class of techniques that automatically select a sparse set of vectors from a large pool of possible bases to encode an input signal. While originally proposed as a possible computational model for the efficient coding of natural images in the visual cortex of mammals, sparse coding has been successfully applied to many machine learning and computer vision problems, including image super-resolution and image restoration. More recently, it has gained popularity among researchers working on image classification, due to its state-of-the-art performance on several image classification problems.
Many image classification methods apply classifiers based on a Bag-of-Words (BoW) image representation, where vector-quantization (VQ) is applied to encode the pixels or descriptors of local image patches, after which the codes are linearly pooled within local regions. In this approach, prior to encoding, a codebook is learned with an unsupervised learning method, which summarizes the distribution of signals by a set of “visual words.” The method is very intuitive because the pooled VQ codes represent the image through the frequencies of these visual words.
Sparse coding can easily be plugged into the BoW framework as a replacement for vector quantization. One approach uses sparse coding to construct high-level features, showing that the resulting sparse representations perform much better than conventional representations, e.g., raw image patches. A two stage approach has been used where sparse coding model is applied over hand-crafted SIFT features, followed by a spatial pyramid max pooling. When applied to general image classification tasks, this approach has achieved state-of-the-art performance on several benchmarks when used with a simple linear classifier. However, this is achieved using sparse coding on top of hand-designed SIFT features.
A limitation of the above approaches is that they encode local patches independently, ignoring the spatial neighborhood structure of the image.