The exemplary embodiment relates to image representation, for tasks such as classification and retrieval, and finds particular application in a system and method for aggregating encoded local descriptors using a pooling function which allows more weight to be placed on local descriptors that are less frequently occurring in the pool of descriptors.
Conventional image classification methods include extracting patches from the image and generating a representation of each patch, called a local descriptor or patch descriptor. The patch descriptors (such as SIFT or color descriptors) are then encoded using an embedding function φ that maps the descriptors in a non-linear fashion into a higher-dimensional space to form embedded patch descriptors. The embedded descriptors are then aggregated into a fixed-length vector or image representation using a pooling function. Representations of this type include the Bag-of-Visual-Words (BOV) (see, G. Csurka, et al., “Visual categorization with bags of keypoints,” ECCV SLCV workshop 2004, hereinafter, Csurka 2004; J. Sivic, et al., “Video Google: A text retrieval approach to object matching in videos,” ICCV 2003, and U.S. Pub. No. 20080069456), the Fisher Vector (FV) (see, F. Perronnin, et al., “Fisher kernels on visual vocabularies for image categorization,” CVPR 2007, hereinafter, Perronnin 2007, and U.S. Pub. Nos. 20070005356 and 20120076401), the Vector of Locally Aggregated Descriptors (VLAD) (see, H. Jégou, et al., “Aggregating local image descriptors into compact codes,” TPAMI 2012, hereinafter, Jégou 2012), the Super Vector (SV) (see, Z. Zhou, et al., “Image classification using super-vector coding of local image descriptors,” ECCV 2010, hereinafter, Zhou 2010), and the Efficient Match Kernel (EMK) (see, L. Bo, et al., “Efficient match kernel between sets of features for visual recognition,” NIPS 2009, hereinafter, Bo 2009).
Pooling is the operation of aggregating several patch embeddings into a single representation. While pooling achieves some invariance to perturbations of the descriptors, it may lead to a loss of information. To reduce this loss as much as possible, only descriptors that are close to one another should be pooled together. To enforce the pooling of close descriptors in the geometric space, it is possible to use spatial pyramids (see, S. Lazebnik, et al., “Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories,” CVPR, 2006). In the descriptor space, the closeness constraint is achieved through the choice of an appropriate embedding φ.
Pooling is typically achieved either by averaging/summing or by taking the maximum response. A common pooling mechanism involves averaging the descriptor embeddings (see, Csurka 2004, Perronnin 2007, Jégou 2012, Zhou 2010, and Bo 2009). Given a set of patch descriptors {x_1, . . . , x_M}, the average-pooled representation is simply
$$\frac{1}{M}\sum_{i=1}^{M}\varphi(x_i).$$
An advantage of average pooling is its generality, since it can be applied to any embedding. A disadvantage of this method, however, is that frequent descriptors will be more influential in the final representation than rarely-occurring ones. By “frequent descriptors” it is meant descriptors which, although not necessarily identical, together form a mode in descriptor space. However, such frequently-occurring descriptors are not necessarily the most informative ones.
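The dominance of frequent descriptors under average pooling can be illustrated with a minimal sketch. It assumes, purely for illustration, a one-hot BOV-style embedding over a hypothetical three-word vocabulary; the function name `average_pool` and the toy assignments are not taken from any cited work.

```python
import numpy as np

def average_pool(embeddings: np.ndarray) -> np.ndarray:
    """Average-pool an (M, D) array of embedded patch descriptors into a D-vector."""
    return embeddings.mean(axis=0)

# Hypothetical toy data: 10 patches hard-assigned to a 3-word vocabulary.
# Nine patches fall on visual word 0 (e.g. background), one on word 2.
assignments = np.array([0] * 9 + [2])
phi = np.eye(3)[assignments]  # one-hot embeddings, shape (10, 3)

pooled = average_pool(phi)
print(pooled)  # word 0 receives weight 0.9, word 2 only 0.1
```

In this toy case the single rare patch contributes one tenth of the pooled mass, regardless of how informative it is.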
As an example, consider a fine-grained classification task where the goal is to distinguish bird species. In a typical bird image, most patches might correspond to background foliage or sky and therefore carry little information about the bird class. On the other hand, the most discriminative information might be highly localized and therefore correspond to only a handful of patches. Hence, it is desirable to ensure that even those rare patches contribute significantly to the final representation.
The problem of reducing the influence of frequent descriptors has received a great deal of attention in computer vision. This issue can be addressed at the pooling stage or a posteriori by performing some normalization on the image-level pooled descriptor. Several approaches have been proposed to address the problem of frequent descriptors at the pooling stage. However, all of these solutions are heuristic in nature and/or limited to certain types of embeddings. For example, one approach, referred to as max pooling (see, Y.-L. Boureau, et al., “A theoretical analysis of feature pooling in visual recognition,” ICML 2010), is applicable only to descriptor embeddings which can be interpreted as counts, as is the case for the BOV. It is not directly applicable to those representations which compute higher-order statistics, such as the FV, the VLAD, the SV or the EMK.
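As a rough illustration of why max pooling suits count-like embeddings, the sketch below applies an element-wise maximum to the same kind of hypothetical one-hot BOV embedding. The function name `max_pool` and the toy data are assumptions made for this example only.

```python
import numpy as np

def max_pool(embeddings: np.ndarray) -> np.ndarray:
    """Element-wise maximum over an (M, D) array of embedded patch descriptors."""
    return embeddings.max(axis=0)

# Hypothetical toy data: nine patches on visual word 0, one patch on word 2.
assignments = np.array([0] * 9 + [2])
phi = np.eye(3)[assignments]  # one-hot embeddings, shape (10, 3)

avg_pooled = phi.mean(axis=0)  # frequency-weighted: word 0 dominates
max_pooled = max_pool(phi)     # presence/absence: words 0 and 2 respond equally
print(avg_pooled, max_pooled)
```

Here the maximum records only whether a visual word occurs at all, so the frequent and the rare word receive the same response. For signed, higher-order embeddings such as the FV or the VLAD, an element-wise maximum has no comparable presence/absence reading.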
Several extensions to the standard average and max pooling frameworks have been proposed. For example, a smooth transition from average to max pooling can be considered. It is also possible to add weights to obtain a weighted pooling (see, T. de Campos, et al., “Images as sets of locally weighted features,” CVIU, 116 (1), pp. 68-85 (2012), hereinafter, de Campos 2012). The weights in de Campos 2012 are computed from a separate saliency model to attempt to cancel out the influence of irrelevant descriptors, but such a model may not necessarily equalize the influence of frequent and rare descriptors.
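The two extensions can be sketched as follows, under stated assumptions. The generalized-mean form used for the average-to-max transition is one possible choice for non-negative embeddings and is not asserted to be the formulation of any cited work; the per-patch weights are hand-picked placeholders rather than the output of a saliency model. All function names are hypothetical.

```python
import numpy as np

def weighted_pool(embeddings: np.ndarray, weights) -> np.ndarray:
    """Weighted sum of (M, D) embeddings, with one scalar weight per patch."""
    weights = np.asarray(weights, dtype=float)
    return weights @ embeddings

def generalized_mean_pool(embeddings: np.ndarray, p: float) -> np.ndarray:
    """((1/M) * sum_i phi(x_i)**p) ** (1/p), assuming non-negative embeddings.

    p = 1 gives average pooling; large p approaches max pooling.
    """
    return np.mean(embeddings ** p, axis=0) ** (1.0 / p)

# Hypothetical toy data: nine patches on visual word 0, one patch on word 2.
assignments = np.array([0] * 9 + [2])
phi = np.eye(3)[assignments]

# Placeholder weights (assumed, not from a real saliency model).
placeholder_weights = np.array([0.05] * 9 + [0.55])
weighted = weighted_pool(phi, placeholder_weights)

as_average = generalized_mean_pool(phi, p=1)
near_max = generalized_mean_pool(phi, p=200)
print(weighted, as_average, near_max)
```

With these placeholder weights the rare word ends up with slightly more mass than the frequent one, but only because the weights were chosen by hand; weights produced by a separate model carry no such guarantee.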
There remains a need for a pooling method which is generic and applicable to all aggregation-based representations.