The amount of digital image data grows exponentially over time. The amount of digital image data available, for instance on the internet, is huge. Various methods have been proposed for searching this digital image data.
Currently, a computational approach used for category recognition applies convolutional neural networks.
In a learning phase, these networks have large numbers of parameters to learn. This is their strength, as they can solve extremely complicated problems. At the same time, the large number of parameters is a limiting factor in terms of the time needed and of the amount of data needed to train them (A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. NIPS, 2012; A. Coates and A. Y. Ng. Selecting receptive fields in deep networks. NIPS, 2011). As for the computation time, the GoogLeNet architecture (C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. arXiv:1409.4842, 2014) trains for up to 21 days on a million images in a thousand classes on top-of-the-line GPUs to achieve a 4% top-5 error.
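The top-5 error quoted above counts a prediction as wrong only when the true class is absent from the five highest-scoring classes. A minimal sketch of that metric, assuming per-class scores are available as plain lists (the function name `top_k_error` and the toy scores are illustrative and do not come from the cited references):

```python
def top_k_error(scores, labels, k=5):
    """Fraction of samples whose true label is not among the k highest-scoring classes."""
    misses = 0
    for row, label in zip(scores, labels):
        # Class indices ordered from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label not in ranked[:k]:
            misses += 1
    return misses / len(labels)


# Toy data (assumed values): two samples, six classes.
scores = [
    [0.10, 0.50, 0.04, 0.20, 0.12, 0.06],  # true class 2 has the lowest score
    [0.90, 0.01, 0.02, 0.03, 0.04, 0.05],  # true class 5 has the second-highest score
]
labels = [2, 5]

top5 = top_k_error(scores, labels, k=5)
top1 = top_k_error(scores, labels, k=1)
```

With these toy values the first sample is a top-5 miss and the second a top-5 hit, while both are top-1 misses.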
For many practical small-data problems, pre-training on a large general dataset is an alternative, as is unsupervised pre-training on subsets of the data.
In the literature, an elegant approach to reduce model complexity has been proposed by J. Bruna and S. Mallat. Invariant scattering convolution networks. IEEE T-PAMI, 35(8):1872-1885, 2013: the convolutional scattering network, which cascades wavelet transform convolutions with nonlinearity and pooling operators. On various subsets of the MNIST benchmark, they show that this approach results in an effective tool for small dataset classification. The approach computes a translation-invariant image representation that is stable to deformations, while avoiding information loss by recovering wavelet coefficients in successive layers. It yields state-of-the-art results on handwritten digit and texture classification, as these datasets exhibit the described invariants. However, the approach is also limited in that one has to keep almost all possible cascade paths (equivalent to all possible filter combinations) according to the model to achieve general invariance. Only if the invariance group that solves the problem at hand is known a priori can one hard code the invariance network to reduce the feature dimensionality. This is effective when the problem and its invariances are known precisely, but for many image processing applications this is rarely the case. In addition, the reference does allow for infinite group invariances.
Other attempts to tackle the complicated and extensive training in convolutional neural networks rely heavily on regularization and data augmentation, for example by dropout. The Maxout networks (Ian J Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, and Yoshua Bengio. Maxout networks. arXiv preprint arXiv:1302.4389, 2013) leverage dropout by introducing a new activation function. The approach improved state-of-the-art results on different common vision benchmarks. Another perspective on reducing sample complexity has been offered by Robert Gens and Pedro M Domingos, Deep symmetry networks, In Advances in neural information processing systems, pages 2537-2545, 2014, by introducing deep symmetry networks. These networks apply non-fixed pooling over arbitrary symmetry groups and have been shown to greatly reduce sample complexity compared to convolutional neural networks on NORB and rotated MNIST digits when aggregated over the affine group. Also focussing on modelling invariants is the convolutional kernel network approach introduced by J. Mairal, P. Koniusz, Z. Harchaoui, and C. Schmid. Convolutional kernel networks. NIPS, 2014, which learns parameters of stacked kernels. It achieves impressive classification results with fewer parameters to learn than a convolutional neural network.
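For illustration, the maxout activation can be read as a unit that outputs the maximum over several learned affine functions of its input. The sketch below assumes that reading; the function name `maxout` and all weight values are hypothetical and chosen only to show that such a unit can reproduce familiar activations:

```python
def maxout(x, weights, biases):
    """One maxout unit: the maximum over k affine functions of the input vector x.

    weights is a list of k weight vectors and biases a list of k scalars
    (illustrative values, not learned parameters).
    """
    return max(
        sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
        for w, b in zip(weights, biases)
    )


# Two pieces with slopes +1 and -1 behave like the absolute value.
abs_like = maxout([-3.0], [[1.0], [-1.0]], [0.0, 0.0])

# Two pieces with slopes 1 and 0 behave like a rectifier.
relu_like = maxout([-3.0], [[1.0], [0.0]], [0.0, 0.0])
```

Because the pieces are learned, the shape of the activation itself becomes a trainable quantity rather than a fixed design choice.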
J. Bruna and S. Mallat. Invariant scattering convolution networks. IEEE T-PAMI, 35(8):1872-1885, 2013, according to the abstract: A wavelet scattering network computes a translation invariant image representation, which is stable to deformations and preserves high frequency information for classification. It cascades wavelet transform convolutions with non-linear modulus and averaging operators. The first network layer outputs SIFT-type descriptors whereas the next layers provide complementary invariant information which improves classification. The mathematical analysis of wavelet scattering networks explains important properties of deep convolution networks for classification. A scattering representation of stationary processes incorporates higher order moments and can thus discriminate textures having same Fourier power spectrum. State of the art classification results are obtained for handwritten digits and texture discrimination, with a Gaussian kernel SVM and a generative PCA classifier. This requires a complete set of filters, and/or knowledge of the data set in order to select the relevant filters. Furthermore, rotation, scaling and other transformations need to be taken into account.
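The cascade of filter convolutions, modulus, and averaging can be sketched in one dimension as follows. This is a simplified illustration and not the implementation of the cited reference: Haar-like difference filters stand in for the wavelets, boundaries are circular, the averaging is global, and every filter combination is kept as a path. The names `circ_conv`, `haar_filter`, and `scattering` are hypothetical.

```python
from itertools import product


def circ_conv(signal, kernel):
    """Circular convolution, so that shifting the input shifts the output."""
    n = len(signal)
    return [
        sum(kernel[k] * signal[(i - k) % n] for k in range(len(kernel)))
        for i in range(n)
    ]


def haar_filter(scale):
    """Haar-like band-pass filter: difference of two adjacent box averages."""
    half = 2 ** scale
    return [1.0 / (2 * half)] * half + [-1.0 / (2 * half)] * half


def scattering(signal, num_scales=2, depth=2):
    """Map each cascade path (tuple of scale indices) to one averaged coefficient."""
    filters = [haar_filter(j) for j in range(num_scales)]
    coeffs = {(): sum(signal) / len(signal)}  # order 0: plain average
    for order in range(1, depth + 1):
        for path in product(range(num_scales), repeat=order):
            u = signal
            for j in path:
                # Filter convolution followed by the modulus non-linearity.
                u = [abs(v) for v in circ_conv(u, filters[j])]
            coeffs[path] = sum(u) / len(u)  # global averaging
    return coeffs


x = [0.0, 1.0, 4.0, 2.0, 0.0, 0.0, 3.0, 1.0]
shifted = x[3:] + x[:3]  # circular shift of the same signal

s_x = scattering(x)
s_shifted = scattering(shifted)
```

Under these assumptions the coefficients of `x` and `shifted` agree up to floating-point rounding, and the number of paths grows as 1 + J + J^2 for J scales and depth 2, which illustrates why keeping all filter combinations becomes costly.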
HONGLAK LEE ET AL: “Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations”, PROCEEDINGS OF THE 26TH ANNUAL INTERNATIONAL CONFERENCE ON MACHINE LEARNING, ICML '09, pp. 1-8, according to its abstract discloses that there has been much interest in unsupervised learning of hierarchical generative models such as deep belief networks. Scaling such models to full-sized, high-dimensional images remains a difficult problem. To address this problem, we present the convolutional deep belief network, a hierarchical generative model which scales to realistic image sizes. This model is translation-invariant and supports efficient bottom-up and top-down probabilistic inference. Key to our approach is probabilistic max-pooling, a novel technique which shrinks the representations of higher layers in a probabilistically sound way. Our experiments show that the algorithm learns useful high-level visual features, such as object parts, from unlabeled images of objects and natural scenes. We demonstrate excellent performance on several visual recognition tasks and show that our model can perform hierarchical (bottom-up and top-down) inference over full-sized images.
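As an illustration of the pooling idea, the sketch below follows a common reading of probabilistic max-pooling in which at most one detection unit per pooling block may be active, so the block behaves like a softmax with one extra "all off" state. This formulation and the name `probabilistic_max_pool` are assumptions made here and are not stated in the abstract above.

```python
import math


def probabilistic_max_pool(block_inputs):
    """Return (p_on, p_off) for one pooling block.

    p_on[i] is the probability that detection unit i is the single active unit;
    p_off is the probability that every unit in the block, and hence the
    pooling unit, is off.
    """
    # Subtract a common offset before exponentiating for numerical stability.
    m = max(0.0, max(block_inputs))
    exps = [math.exp(v - m) for v in block_inputs]
    off = math.exp(-m)  # the "all off" state has energy 0
    denom = off + sum(exps)
    return [e / denom for e in exps], off / denom


p_on, p_off = probabilistic_max_pool([0.0, 0.0, 0.0, 0.0])
```

The pooling unit is active with probability `1 - p_off`, so a block of detection units is summarized by a single binary unit.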
JOAN BRUNA ET AL: “Classification with Invariant Scattering Representations”, ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, N.Y., according to its abstract discloses that a scattering transform defines a signal representation which is invariant to translations and Lipschitz continuous relatively to deformations. It is implemented with a non-linear convolution network that iterates over wavelet and modulus operators. Lipschitz continuity locally linearizes deformations. Complex classes of signals and textures can be modeled with low-dimensional affine spaces, computed with a PCA in the scattering domain. Classification is performed with a penalized model selection. State of the art results are obtained for handwritten digit recognition over small training sets, and for texture classification.
MARC'AURELIO RANZATO ET AL: “Unsupervised Learning of Invariant Feature Hierarchies with Applications to Object Recognition”, CVPR '07. IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION; 18-23 Jun. 2007; MINNEAPOLIS, Minn., USA, IEEE, PISCATAWAY, N.J., USA, pp. 1-8, according to its abstract discloses an unsupervised method for learning a hierarchy of sparse feature detectors that are invariant to small shifts and distortions. The resulting feature extractor consists of multiple convolution filters, followed by a feature-pooling layer that computes the max of each filter output within adjacent windows, and a point-wise sigmoid non-linearity. A second level of larger and more invariant features is obtained by training the same algorithm on patches of features from the first level. Training a supervised classifier on these features yields 0.64% error on MNIST, and 54% average recognition rate on Caltech 101 with 30 training samples per category. While the resulting architecture is similar to convolutional networks, the layer-wise unsupervised training procedure alleviates the over-parameterization problems that plague purely supervised learning procedures, and yields good performance with very few labeled training samples.
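The three stages of the feature extractor described above (convolution filters, max pooling within adjacent windows, point-wise sigmoid) can be sketched in one dimension as follows. The filter values, window size, and function names are hypothetical, and the unsupervised training of the filters is not shown.

```python
import math


def correlate_valid(signal, kernel):
    """Slide the kernel over the signal without padding."""
    k = len(kernel)
    return [
        sum(kernel[j] * signal[i + j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


def max_pool(values, window):
    """Maximum of each non-overlapping, adjacent window."""
    return [
        max(values[i:i + window])
        for i in range(0, len(values) - window + 1, window)
    ]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def feature_layer(signal, kernels, window=2):
    """One level: filter bank, then max pooling, then point-wise sigmoid."""
    return [
        [sigmoid(v) for v in max_pool(correlate_valid(signal, k), window)]
        for k in kernels
    ]


edge = [-1.0, 1.0]  # hand-picked step detector (assumed, not learned)
a = [0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
b = [0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0]  # same step, shifted by one sample

features_a = feature_layer(a, [edge])
features_b = feature_layer(b, [edge])
```

In this toy case the one-sample shift stays inside a single pooling window, so both inputs produce the same features; a larger shift would move the response to the next window.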