First, the background of the processing architecture is explained. The concept of convergent hierarchical coding assumes that sensory processing in the brain can be organized in hierarchical stages, where each stage performs specialized, parallel operations that depend on input from earlier stages. The convergent hierarchical processing scheme can be employed to form neural representations which capture increasingly complex feature combinations, up to the so-called “grandmother cell”, which may fire only if a specific object is being recognized, perhaps even under specific viewing conditions. The main criticism of this type of hierarchical coding is that it may lead to a combinatorial explosion of the possibilities which must be represented, due to the large number of combinations of features which constitute a particular object under different viewing conditions (von der Malsburg, C. (1999), “The what and why of binding: The modeler's perspective”, Neuron, 24, 95-104).
In recent years, several authors have suggested approaches for achieving invariant recognition while avoiding such a combinatorial explosion. The main idea is to use intermediate stages in a hierarchical network to achieve higher degrees of invariance over responses that correspond to the same object, thus effectively reducing the combinatorial complexity.
Since the work of Fukushima, who proposed the Neocognitron as an early model of translation-invariant recognition, two major processing modes in the hierarchy have been emphasized. Feature-selective neurons are sensitive to particular features, which are usually local in nature. Pooling neurons perform a spatial integration over feature-selective neurons which are successively activated when an invariance transformation is applied to the stimulus. As was recently emphasized by Mel, B. W. & Fiser, J. (2000), “Minimizing binding errors using learned conjunctive features”, Neural Computation 12(4), 731-762, the combined stages of local feature detection and spatial pooling face what could be called a stability-selectivity dilemma. On the one hand, excessive spatial pooling leads to complex feature detectors with a very stable response under image transformations. On the other hand, the selectivity of the detector is greatly reduced, since wide-ranging spatial pooling may accumulate too much weak evidence, increasing the chance of an accidental appearance of the feature.
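The stability-selectivity dilemma can be illustrated with a minimal one-dimensional sketch. The template, the signals, the summing form of the pooling and the function names below are illustrative assumptions and are not taken from the cited works.

```python
# Toy 1D illustration of the stability-selectivity trade-off.
# All names and numbers are illustrative assumptions.

def feature_responses(signal, template):
    """Feature-selective stage: match a local template at every position."""
    k = len(template)
    return [
        sum(signal[i + j] * template[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]

def pooled_response(signal, template, center, radius):
    """Pooling stage: sum the feature responses within `radius` of `center`."""
    responses = feature_responses(signal, template)
    lo = max(0, center - radius)
    hi = min(len(responses), center + radius + 1)
    return sum(responses[lo:hi])

TEMPLATE = [1.0, 1.0]  # responds most strongly to two adjacent active inputs

# The same feature at two positions, and clutter made of isolated weak inputs.
ORIGINAL = [0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0]
SHIFTED = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0]
CLUTTER = [0, 0.5, 0, 0.5, 0, 0.5, 0, 0.5, 0, 0.5, 0, 0]

def narrow(signal):
    """Pooling unit that sees a single feature-map position."""
    return pooled_response(signal, TEMPLATE, center=4, radius=0)

def wide(signal):
    """Pooling unit that integrates over the whole feature map."""
    return pooled_response(signal, TEMPLATE, center=5, radius=5)

if __name__ == "__main__":
    for name, s in [("original", ORIGINAL), ("shifted", SHIFTED), ("clutter", CLUTTER)]:
        print(name, "narrow:", narrow(s), "wide:", wide(s))
```

In this sketch the narrowly pooled unit responds with 2.0 to the original stimulus, 0.0 to the shifted stimulus and 0.5 to the clutter, so it is selective but not stable. The widely pooled unit responds with 4.0 to both the original and the shifted stimulus, but with 5.0 to the clutter, so it is stable but no longer selective.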
Despite its conceptual appeal and the neurobiological evidence, the plausibility of the concept of hierarchical feed-forward recognition stands or falls with its successful application to sufficiently difficult real-world 3D invariant recognition problems. The central problem is the formulation of a feasible learning approach for optimizing the combined feature-detecting and pooling stages. Apart from promising results on artificial data and very successful applications in the realm of hand-written character recognition, applications to 3D recognition problems (Lawrence, S., Giles, C. L., Tsoi, A. C., & Back, A. D. (1997), “Face recognition: A convolutional neural-network approach”, IEEE Transactions on Neural Networks 8(1), 98-113) are exceptional. One reason is that the processing of real-world images requires network sizes that usually make the application of standard supervised learning methods like error backpropagation infeasible. The processing stages in the hierarchy may also contain network nonlinearities like Winner-Take-All, which do not allow similar gradient-descent optimization.
Of great importance for the processing inside a hierarchical network is the coding strategy employed. An important principle is redundancy reduction, that is, a transformation of the input which reduces the statistical dependencies among elements of the input stream. Wavelet-like features which resemble the receptive fields of V1 cells have been derived either by imposing sparse overcomplete representations (Olshausen, B. A. & Field, D. J. (1997), “Sparse coding with an overcomplete basis set: A strategy employed in V1”, Vision Research, 37, 3311-3325) or by imposing statistical independence as in independent component analysis (Bell, A. J. & Sejnowski, T. J. (1997), “The ‘independent components’ of natural scenes are edge filters”, Vision Research, 37, 3327-3338). These V1 cells perform the initial visual processing and are thus attributed to the initial stages in hierarchical processing.
Apart from understanding biological vision, these functional principles are also of great relevance for the field of technical computer vision. Although ICA (Independent Component Analysis) has been discussed for feature detection in vision by several authors, there are only a few references to its usefulness in invariant object recognition applications. Bartlett, M. S. & Sejnowski, T. J. (1997), “Viewpoint invariant face recognition using independent component analysis and attractor networks”, in M. C. Mozer, M. I. Jordan, & T. Petsche (Eds.), “Advances in Neural Information Processing Systems”, Volume 9, pp. 817, The MIT Press, showed that, for face recognition, ICA representations have advantages over PCA (Principal Component Analysis)-based representations with regard to pose invariance and classification performance.
The use of hierarchical networks for pattern recognition is explained next.
An essential problem for the application to recognition tasks is which coding principles are used for the transformation of information in the hierarchy and which local feature representation is optimal for representing objects under invariance. The two properties are not independent and must cooperate to reach the desired goal. In spite of its conceptual appeal, learning in deep hierarchical networks still faces some major drawbacks. The following review discusses the problems of the major approaches considered so far.
Fukushima, K. (1980), “Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position”, Biol. Cyb., 39, 139-202, introduced with the Neocognitron a principle of hierarchical processing for invariant recognition which is based on successive stages of local template matching and spatial pooling. The Neocognitron can be trained by unsupervised, competitive learning. However, applications like hand-written digit recognition have required a supervised manual training procedure. A particular disadvantage is the critical dependence of the performance on the appropriate manual selection of training patterns for the template matching stages (Lovell, D., Downs, T., & Tsoi, A. (1997), “An evaluation of the neocognitron”, IEEE Trans. Neur. Netw., 8, 1090-1105). The necessity of teacher intervention during the learning stages has so far made the training infeasible for more complex recognition scenarios like 3D object recognition.
Riesenhuber, M. & Poggio, T. (1999), “Are cortical models really bound by the ‘binding problem’?”, Neuron, 24, 87-93, emphasized the point that hierarchical networks with appropriate pooling operations may avoid the combinatorial explosion of combination cells. They proposed a hierarchical model with matching and pooling stages similar to those of the Neocognitron. A main difference lies in the nonlinearities which influence the transmission of feedforward information through the network. To reduce the superposition problem, a complex cell in their model focuses on the input of the presynaptic cell providing the largest input. The model has been applied to the recognition of artificial paper clip images and computer-rendered animal and car objects (Riesenhuber, M. & Poggio, T. (1999b), “Hierarchical models of object recognition in cortex”, Nature Neuroscience 2(11), 1019-1025) and uses a local enumeration scheme for defining intermediate combination features.
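The effect of letting a pooling cell follow only its strongest input, as opposed to summing all inputs, can be sketched as follows. The function names and the afferent activities are hypothetical and serve only to illustrate the superposition problem; they do not reproduce the cited model.

```python
# Sketch: summing versus maximum-selecting pooling over presynaptic activities.
# The activity values are illustrative assumptions.

def sum_pool(afferents):
    """Pooling cell that adds up all presynaptic activities."""
    return sum(afferents)

def max_pool(afferents):
    """Pooling cell that follows only the strongest presynaptic activity."""
    return max(afferents)

# One strongly matching feature, at two different positions in the pooling range.
PREFERRED_LEFT = [0.75, 0.0, 0.0, 0.0]
PREFERRED_RIGHT = [0.0, 0.0, 0.0, 0.75]
# Several weakly matching features superimposed in the same pooling range.
SUPERPOSITION = [0.25, 0.25, 0.25, 0.25]

if __name__ == "__main__":
    for name, a in [("left", PREFERRED_LEFT), ("right", PREFERRED_RIGHT),
                    ("superposition", SUPERPOSITION)]:
        print(name, "sum:", sum_pool(a), "max:", max_pool(a))
```

With these values both pooling variants respond identically to the preferred feature at either position, so both are invariant to the shift. The summing cell, however, responds more strongly to the superposition of weak inputs (1.0) than to the preferred feature (0.75), whereas the maximum-selecting cell responds with only 0.25.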
A multi-layer network is known from Y. Le Cun et al. (“Hand-written digit recognition with back-propagation network”, 1990, in Advances in Neural Information Processing Systems 2, pp. 396-404). An input image is scanned with a single neuron that has a local receptive field, and the states of this neuron are stored in corresponding locations in a layer called a feature map. This operation is equivalent to a convolution with a small-size kernel. The process can be performed in parallel by implementing the feature map as a plane of neurons whose weight vectors are constrained to be equal. That is, units in a feature map are constrained to perform the same operation on different parts of the image. In addition, a certain level of shift invariance is present in the system, as shifting the input will shift the result on the feature map but will leave it unchanged otherwise. Furthermore, it is proposed to have multiple feature maps extracting different features from the same image. According to this state of the art, the idea of local, convolutional feature maps can be applied to subsequent hidden layers as well, to extract features of increasing complexity and abstraction. Multi-layered convolutional networks have been widely applied to pattern recognition tasks, with a focus on optical character recognition (see LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998), “Gradient-based learning applied to document recognition”, Proceedings of the IEEE, 86, 2278-2324 for a comprehensive review). Learning of optimal features is carried out using the backpropagation algorithm, where constraints of translation invariance are explicitly imposed by weight sharing. Due to the deep hierarchies, however, the gradient learning takes considerable training time for large training ensembles and network sizes. Lawrence, S., Giles, C. L., Tsoi, A. C., & Back, A. D. (1997), “Face recognition: A convolutional neural-network approach”, IEEE Transactions on Neural Networks 8(1), 98-113, have applied the method augmented with a prior vector quantization based on self-organizing maps for dimensionality reduction and reported improved performance for a face classification setup.
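The scanning of an image with a single shared kernel, and the resulting shift behaviour of the feature map, can be sketched as follows. The image, the kernel values and the function names are illustrative assumptions; the kernel is applied without flipping (a cross-correlation, as is common in network implementations), and no squashing nonlinearity or training is included.

```python
# Sketch: a feature map computed with one shared kernel (weight sharing).
# Image and kernel values are illustrative assumptions.

def feature_map(image, kernel):
    """Apply the same kernel at every image position (valid region only)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

def shift_right(image, dx):
    """Shift the image content dx pixels to the right, padding with zeros."""
    width = len(image[0])
    return [[0] * dx + row[:width - dx] for row in image]

IMAGE = [
    [0, 0, 1, 1, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
KERNEL = [[1, -1],
          [1, -1]]  # responds to vertical intensity edges

if __name__ == "__main__":
    original = feature_map(IMAGE, KERNEL)
    shifted = feature_map(shift_right(IMAGE, 1), KERNEL)
    for row_a, row_b in zip(original, shifted):
        print(row_a, "->", row_b)
```

Every unit of the feature map uses the same four kernel weights, so the number of free parameters does not grow with the image size. Shifting the input by one pixel moves the pattern in the feature map by one position and leaves its values unchanged, apart from what enters or leaves at the image border.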
Applications of hierarchical models to the invariant recognition of objects are now briefly explained.
U.S. Pat. No. 5,058,179 relates to a hierarchical constrained automatic learning network for character recognition. Highly accurate, reliable optical character recognition is afforded by the hierarchically layered network, which has several layers of constrained feature detection for localized feature extraction followed by several fully connected layers for dimensionality reduction. The character classification is performed in the final fully connected layer. Each layer of parallel constrained feature detection comprises a plurality of constrained feature maps and a corresponding plurality of kernels, wherein a predetermined kernel is directly related to a single constrained feature map. Undersampling can be performed from layer to layer.
U.S. Pat. No. 5,067,164 also discloses a hierarchical constrained automatic learning neural network for recognition having several layers of constrained feature detection, wherein each layer of constrained feature detection includes a plurality of constrained feature maps and a corresponding plurality of feature reduction maps. Each feature reduction map is connected to only one constrained feature map in the layer for undersampling that constrained feature map. Units in each constrained feature map of the first constrained feature detection layer respond as a function of a corresponding kernel and of different portions of the pixel image of the character captured in a receptive field associated with the unit. Units in each feature map of the second constrained feature detection layer respond as a function of a corresponding kernel and of different portions of an individual feature reduction map or a combination of several feature reduction maps in the first constrained feature detection layer as captured in a receptive field of the unit. The feature reduction maps of the second constrained feature detection layer are fully connected to each unit of the final character classification layer. Kernels are automatically learned by the error backpropagation algorithm during network initialization or training. One problem of this approach is that learning must be done for all kernels in the hierarchy simultaneously, which makes learning too slow for large networks. This has so far precluded the application of this kind of convolutional network to more difficult problems of three-dimensional invariant object recognition.
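The undersampling of a single constrained feature map by its associated feature reduction map can be sketched as follows. The averaging over non-overlapping blocks, the factor of 2 and the function name are assumptions made for illustration; the text above states only that undersampling is performed.

```python
# Sketch: one feature reduction map undersampling one feature map.
# Block averaging with factor 2 is an illustrative assumption.

def undersample(feature_map, factor=2):
    """Average non-overlapping factor x factor blocks of one feature map."""
    out_h = len(feature_map) // factor
    out_w = len(feature_map[0]) // factor
    area = factor * factor
    return [
        [
            sum(feature_map[r * factor + i][c * factor + j]
                for i in range(factor) for j in range(factor)) / area
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

FEATURE_MAP = [
    [1, 3, 0, 0],
    [5, 7, 0, 4],
    [2, 2, 8, 8],
    [2, 2, 8, 8],
]

if __name__ == "__main__":
    for row in undersample(FEATURE_MAP):
        print(row)
```

Each unit of the reduced map depends on only one feature map, which corresponds to the one-to-one connection described above, and the reduced map has half the resolution in each dimension.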
U.S. Pat. No. 6,038,337 discloses a method and an apparatus for object recognition using a hybrid neural network system comprising local image sampling, a self-organizing map neural network for dimension reduction and a hybrid convolutional neural network. The hybrid convolutional neural network provides for partial invariance to translation, rotation, scale and deformation. It extracts successively larger features in a hierarchical set of layers. Face recognition of frontal views is given as an example application.