The exemplary embodiment relates to image representation and finds particular application in connection with a system and method for representing images using weight gradients extracted from a neural network.
Image representations are widely used for image classification (also referred to as image annotation), which involves describing an image with one or more pre-determined labels, and for similarity computation. One form of representation is the bag-of-visual-words (BOV). See, Sivic, et al., “Video Google: A text retrieval approach to object matching in videos,” ICCV, vol. 2, pp. 1470-1477, 2003; Csurka, et al., “Visual categorization with bags of keypoints,” ECCV SLCV workshop, pp. 1-22, 2004. The BOV entails extracting a set of local descriptors, encoding them using a visual vocabulary (i.e., a codebook of prototypes), and then aggregating the codes into an image-level (or region-level) descriptor. These descriptors can then be fed to classifiers, typically kernel classifiers such as SVMs. Approaches which encode higher-order statistics, such as the Fisher Vector (FV) (Perronnin, et al., “Fisher kernels on visual vocabularies for image categorization,” CVPR, pp. 1-8, 2007, hereinafter, “Perronnin 2007”; and Perronnin, et al., “Improving the fisher kernel for large-scale image classification,” ECCV, pp. 143-156, 2010, hereinafter, “Perronnin 2010”), have led to improved results on a number of image classification tasks. See, Sanchez, et al., “Image classification with the fisher vector: Theory and practice,” IJCV, 2013.
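By way of illustration only, the three BOV steps described above (local descriptors, encoding against a codebook, aggregation) can be sketched as follows. The toy two-dimensional descriptors, the three-entry codebook, and the function names are assumptions made for this sketch and do not come from the cited references; hard nearest-prototype assignment with an L1-normalized histogram is one common choice among several.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_word(descriptor, codebook):
    """Index of the codebook prototype closest to the descriptor."""
    return min(range(len(codebook)),
               key=lambda k: squared_distance(descriptor, codebook[k]))


def bov_histogram(descriptors, codebook):
    """Encode each local descriptor as its nearest visual word, then
    aggregate the codes into a normalized image-level histogram."""
    counts = [0.0] * len(codebook)
    for d in descriptors:
        counts[nearest_word(d, codebook)] += 1.0
    total = sum(counts)
    return [c / total for c in counts] if total else counts


# Hypothetical toy data: a 3-word codebook and 4 local descriptors.
codebook = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
descriptors = [[0.1, -0.1], [0.9, 1.2], [1.1, 0.8], [4.8, 5.1]]

histogram = bov_histogram(descriptors, codebook)
print(histogram)
```

In practice the codebook would typically be learned from training descriptors (for example, by clustering), and the resulting histogram would be passed to a classifier such as an SVM.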
Convolutional Networks (ConvNets) have also been used for labeling images. See, Krizhevsky, et al., “ImageNet classification with deep convolutional neural networks,” NIPS, pp. 1106-1114, 2012, hereinafter, “Krizhevsky 2012”; Zeiler, et al., “Visualizing and understanding convolutional networks,” ECCV, pp. 818-833, 2014, hereinafter, “Zeiler 2014”; Sermanet, et al., “OverFeat: Integrated recognition, localization and detection using convolutional networks,” ICLR, 2014; Simonyan, et al., “Very Deep Convolutional Networks for Large-Scale Image Recognition,” arXiv 1409.1556, 2014, hereinafter, “Simonyan 2014.” ConvNets are trained in a supervised fashion on large amounts of labeled data. These models are feed-forward architectures involving multiple computational layers that alternate linear operations, such as convolutions or average-pooling, and non-linear operations, such as max-pooling and sigmoid activations. The end-to-end training of the large number of parameters inside ConvNets, from pixel values to the specific end-task, is a source of their usefulness.
ConvNets have recently been shown to have good transferability properties when used as “universal” feature extractors. Yosinski, et al., “How transferable are features in deep neural networks?” NIPS, pp. 3320-3328, 2014. If an image is fed to a ConvNet, the output of one of the intermediate layers can be used as a representation of the image. Several such methods have been proposed. See, for example, Donahue, et al., “DeCAF: A deep convolutional activation feature for generic visual recognition,” ICML, 2014, hereinafter, “Donahue 2014”; Oquab, et al., “Learning and transferring mid-level image representations using convolutional neural networks,” CVPR, pp. 1717-1724, 2014, hereinafter, “Oquab 2014”; Zeiler 2014; Chatfield, et al., “Return of the devil in the details: delving deep into convolutional nets,” BMVC, 2014, hereinafter, “Chatfield 2014”; Razavian, et al., “CNN features off-the-shelf: An astounding baseline for recognition,” CVPR Deep Vision Workshop, pp. 512-519, 2014, hereinafter, “Razavian 2014.” To use these representations in a classification setting, a linear classifier is typically used.
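The idea of reading an intermediate-layer output as an image representation and scoring it with a linear classifier can be illustrated with the minimal sketch below. It assumes a toy two-layer fully connected network with rectified-linear activations standing in for a trained ConvNet; all weights, the input vector, and the function names are hypothetical and chosen only so the arithmetic is easy to follow.

```python
def relu(v):
    """Element-wise rectified linear unit."""
    return [max(0.0, x) for x in v]


def linear(weights, bias, v):
    """Affine map: one output per row of `weights`."""
    return [sum(w * x for w, x in zip(row, v)) + b
            for row, b in zip(weights, bias)]


def forward(x, layers):
    """Run the input through every layer, keeping each layer's output."""
    activations = []
    h = x
    for weights, bias in layers:
        h = relu(linear(weights, bias, h))
        activations.append(h)
    return activations


# Hypothetical "pre-trained" network: 2 inputs -> 3 hidden units -> 1 output.
layers = [
    ([[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]], [0.0, 0.0, 0.0]),
    ([[1.0, 1.0, 1.0]], [0.0]),
]

image = [1.0, 2.0]                     # stand-in for pixel values
activations = forward(image, layers)
representation = activations[0]        # intermediate-layer output as the feature

# A linear classifier applied to the representation (weights are assumptions).
classifier_weights = [0.5, -0.25, 1.0]
classifier_bias = 0.1
score = sum(w * f for w, f in zip(classifier_weights, representation)) + classifier_bias
print(representation, score)
```

Which layer to read out is a design choice in this sketch; the network itself is left unchanged and only the linear classifier would be trained on the new task.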
Hybrid approaches have also been proposed which combine the benefits of deep learning using ConvNets with “shallow” bag-of-patches representations that are based on higher-order statistics, such as the FV. For example, it has been proposed to stack multiple FV layers, each defined as a set of five operations: i) FV encoding, ii) supervised dimensionality reduction, iii) spatial stacking, iv) ℓ2 normalization, and v) PCA dimensionality reduction. When combined with the original FV, such networks can lead to significant performance improvements in image classification. See, Simonyan, et al., “Deep Fisher Networks for Large-scale Image Classification,” NIPS, 2013. Improvements on the FV framework have been achieved by jointly learning the SVM classifier and the GMM visual vocabulary. See, Sydorov, et al., “Deep Fisher kernels - End to end learning of the Fisher kernel GMM parameters,” CVPR, pp. 1402-1409, 2014. In that approach, the gradients corresponding to the SVM layer are back-propagated to compute the gradients with respect to the GMM parameters. Good results on a number of classification tasks have also been obtained by extracting mid-level ConvNet features from large patches, embedding them using VLAD (vector of locally aggregated descriptors) encoding (an extension of the Bag-of-Words representation), and aggregating them at multiple scales. See, Gong, et al., “Multi-scale orderless pooling of deep convolutional activation features,” ECCV, pp. 392-407, 2014.
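For readers unfamiliar with VLAD encoding, a minimal single-scale sketch is given below. It assumes the commonly described formulation, in which each descriptor is assigned to its nearest centroid, the residuals are summed per centroid, and the concatenated result is L2-normalized. The toy centroids, the descriptors (which would be patch-level ConvNet features in the approach described above), and the function names are assumptions for illustration and are not taken from the cited reference.

```python
def nearest_centroid(descriptor, centroids):
    """Index of the centroid closest to the descriptor (squared Euclidean)."""
    return min(
        range(len(centroids)),
        key=lambda k: sum((x - c) ** 2 for x, c in zip(descriptor, centroids[k])),
    )


def vlad_encode(descriptors, centroids):
    """Sum the residuals (descriptor minus assigned centroid) per centroid,
    concatenate them, and L2-normalize the resulting vector."""
    dim = len(centroids[0])
    residuals = [[0.0] * dim for _ in centroids]
    for d in descriptors:
        k = nearest_centroid(d, centroids)
        for j in range(dim):
            residuals[k][j] += d[j] - centroids[k][j]
    flat = [x for row in residuals for x in row]
    norm = sum(x * x for x in flat) ** 0.5
    return [x / norm for x in flat] if norm > 0 else flat


# Hypothetical toy data: 2 centroids and 3 patch-level descriptors.
centroids = [[0.0, 0.0], [4.0, 4.0]]
patch_features = [[1.0, 0.0], [0.0, 1.0], [4.0, 6.0]]

encoding = vlad_encode(patch_features, centroids)
print(encoding)
```

Unlike a bag-of-words histogram, which only counts assignments, this encoding keeps first-order information about where the descriptors lie relative to each centroid, so its length is the number of centroids times the descriptor dimension.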
The present system and method provide an efficient way to use ConvNets for generating representations that are particularly useful for computing similarity.