The present invention relates to image processing techniques.
State-of-the-art category-level image classification algorithms usually adopt a local patch based, multiple-layer pipeline to find good image features. Many methods start from local image patches using either normalized raw pixel values or hand-crafted descriptors such as SIFT or HOG, and encode them into an overcomplete representation using various algorithms such as K-means or sparse coding. After coding, global image representations are formed by spatially pooling the coded local descriptors. Such global representations are then fed into non-linear classifiers or linear classifiers, with the latter being more popular recently due to their computation efficiency. Methods following such a pipeline have achieved competitive performance on several challenging classification tasks, such as Caltech-101 and Pascal VOC.
During the last decade, much emphasis has been directed at the coding step. Dictionary learning algorithms have been discussed to find a set of basis that reconstructs local image patches or descriptors well, and several encoding methods have been proposed to map the original data to a high-dimensional space that emphasizes certain properties, such as sparsity or locality. The relationship between dictionary learning and encoding have been used to provide simple yet effective approaches that achieve competitive results. The neuroscience justification of coding comes from simple neurons in the human visual cortex, which have been believed to produce sparse and overcomplete activations.
Similarly, the idea of spatial pooling dates back to Hubel's seminal paper about complex cells in the mammalian visual cortex, which identifies mid-level image features that are invariant to small spatial shifting. The spatial invariance property also reflects the concept of locally orderless images, which suggests that low-level features are grouped spatially to provide information about the overall semantics. Most recent research on spatial pooling aims to find a good pooling operator, which could be seen as a function that produces informative statistics based on local features in a specific spatial area. For example, average and max pooling strategies have been found in various algorithms respectively, and systematic comparisons between such pooling strategies have been used to pool over multiple features in the context of deep learning.
Relatively little effort has been put into better designs or learning of better spatial regions for pooling, although it has been discussed in the context of learning local descriptors. A predominant approach to define the spatial regions for pooling, which we will also call the receptive fields (borrowing the terminology from neuroscience) for the pooled features, comes from the idea of spatial pyramids, where regular grids of increasing granularity are used to pool local features. The spatial pyramids provide a reasonable cover over the image space with scale information, and most existing classification methods either use them directly, or use slightly modified/simplified versions.