1. Field of the Invention
The present invention relates to an apparatus and a method mainly designed for recognizing a pattern or detecting a particular subject by using a parallel arithmetic device, such as a neural network.
2. Description of the Related Art
Hitherto, in the field of image recognition or speech recognition, the methods available for implementing such recognition have been roughly classified into two types. In one type, such recognition is implemented through serial arithmetic operations, with a recognition processing algorithm tailored to a specific object to be recognized being executed as a computer software program. In the other type, such recognition is implemented using a dedicated parallel image processor, such as a single instruction multiple data stream (SIMD) machine or a multiple instruction multiple data stream (MIMD) machine.
The following will describe typical examples of image recognition algorithms. First, examples wherein a feature quantity regarding the similarity to a model to be recognized is calculated include: a method in which the data regarding the model to be recognized is represented in the form of a template model, and either the similarity is calculated on the basis of matching or the like between an input image (or the feature vector thereof) and the template, or a high order correlation coefficient is calculated; a method in which an input pattern is mapped into an eigen-image function space obtained by performing the principal component analysis of a model image of an object so as to calculate the distance to the model in a feature space (Sirovich, et al., 1987, Low-dimensional procedure for the characterization of human faces, J.Opt.Soc.Am.[A], vol. 3, pp. 519–524); a method in which the relationship between a plurality of feature extraction results (feature vectors) and the spatial dispositions thereof is shown on a graph so as to calculate the similarity based on elastic graph matching (Lades et al. 1993, Distortion Invariant Object Recognition in the Dynamic Link Architecture, IEEE Trans. On Computers, vol. 42, pp. 300–311); and a method in which an input image is subjected to predetermined conversion to obtain an expression with position, rotation, and scale invariance, then checked against a model (Seibert, et al. 1992, Learning and recognizing 3D objects from multiple views in a neural system, in Neural Networks for Perception, vol. 1 Human and Machine Perception (H. Wechsler Ed.) Academic Press, pp. 427–444).
Methods for recognizing a pattern based on a neural network model inspired by the information processing mechanism of a living organism include a method based on hierarchical template matching (Japanese Examined Patent Publication No. 60-712, Fukushima & Miyake, 1982 Neocognitron: A new algorithm for pattern recognition tolerant of deformation and shifts in position, Pattern Recognition, vol. 15, pp. 455–469), a method in which an object-centered expression with scale and position invariance is obtained by a dynamic routing neural network (Anderson, et al. 1995, Routing Networks in Visual Cortex, in Handbook of Brain Theory and Neural Networks (M. Arbib, Ed.), MIT Press, pp. 823–826), and a method based on a multilayer perceptron and a radial basis function network.
As an attempt to faithfully model the information processing mechanism based on the neural network of a living organism, there have been proposed neural network model circuits for performing transmission representation using a train of pulses corresponding to action potentials (Murray et al., 1991 Pulse-Stream VLSI Neural Networks Mixing Analog and Digital Techniques, IEEE Trans. On Neural Networks, vol. 2, pp. 193–204; Japanese Patent Laid-Open No. 7-262157, Japanese Patent Laid-Open No. 7-334478, Japanese Patent Laid-Open No. 8-153148, the gazette of Patent No. 2879670, etc.).
In implementing the prior arts mentioned above by using circuitry or the like, especially concerning the application to image recognition, no method has been available for representing information regarding a two-dimensional pattern by utilizing dynamic characteristics on a time base, and for using the represented information for recognition or the like at any level, whether that of a processing device, a module, or a system, that makes up a function of each processing unit (e.g. the extraction of a feature).
More specifically, in many cases, it has been assumed that processing proceeds on the basis of a transition pattern formed of a finite number of states (typically composed of fire and non-fire) at a certain time of a processing element or module in which spatial pattern information has been spatially disposed. In addition, the application has been limited to processing in the domain of digital representation of information.
For the reason described above, the information processing capability has been limited, and implementation in the form of circuitry tends to inevitably involve an unduly large scale and high cost. In particular, the large amount of wiring required for the connections among neurons has occupied a considerably high proportion of the entire circuit area, and this has been posing a problem.
Therefore, as a solution to the wiring problem in neural networks, there has been proposed a method in which the addresses of pulse output neurons are encoded in an event driven manner known as an address event representation (hereinafter referred to as “AER”) (Lazzaro, et al. 1993, Silicon Auditory Processors as Computer Peripherals, In Tourestzky, D.(ed), Advances in Neural Information Processing Systems 5. San Mateo, Calif.:Morgan Kaufmann Publishers). According to this method, the IDs of the neurons outputting trains of pulses are encoded in a binary mode as addresses so as to allow the neurons that receive the addresses to automatically decode the addresses of the originating neurons even if the output signals from different neurons are temporally arranged on the same bus.
The AER, however, has been presenting a problem in that a device for sequentially coding and decoding the addresses of neurons is required, making a circuit configuration complicated.
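The address encoding and decoding described above can be illustrated by the following minimal sketch. It is not taken from the cited literature; the function names, the fixed-width binary word format, and the use of time-sorted tuples to stand in for a shared bus are all assumptions made for illustration.

```python
# Hypothetical sketch of address event representation (AER).
# All names and the word format are illustrative assumptions.

def encode_events(spikes, n_neurons):
    """Serialize (time, neuron_id) spikes into binary address words on one bus.

    Returns a time-ordered list of (time, address_bits) tuples.
    """
    # Number of bits needed to give every neuron a distinct address.
    width = max(1, (n_neurons - 1).bit_length())
    bus = []
    for t, neuron_id in sorted(spikes):
        if not 0 <= neuron_id < n_neurons:
            raise ValueError(f"neuron id {neuron_id} out of range")
        bus.append((t, format(neuron_id, f"0{width}b")))
    return bus


def decode_events(bus):
    """Recover (time, neuron_id) pairs from the address words on the bus."""
    return [(t, int(bits, 2)) for t, bits in bus]


spikes = [(0.003, 5), (0.001, 2), (0.002, 7)]
bus = encode_events(spikes, n_neurons=8)
print(bus)
print(decode_events(bus))
```

In this sketch every spike costs one encoding step and one decoding step, which corresponds to the sequential coding and decoding device noted above as a drawback of the AER.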
There is another method available for recognizing or detecting a specific object by a neural network formed of neurons generating trains of pulses. This method employs a model of a high order (a second order or higher) by Eckhorn, et al. that is based on linking inputs and feeding inputs (Eckhorn, et al. 1990, Feature linking via synchronization among distributed assemblies: Simulation of results from cat cortex, Neural Computation, Vol. 2, pp. 293–307), i.e., a pulse coupled neural network (hereinafter referred to as “PCNN”)(U.S. Pat. No. 5,664,065 and Broussard, et al. 1999, Physiologically Motivated Image Fusion for Object Detection using a Pulse Coupled Neural Network, IEEE Trans. On Neural Networks, vol. 10, pp. 554–563, etc.).
However, no literature, including the literature concerning PCNN mentioned above, has disclosed any specific neural network configuration for implementing a recognition function that utilizes analog information in the time base domain of a train of spikes, such as the interval between spikes, for the coding or the like of image information in a neural network model that carries out predetermined processing by inputting, outputting, or transmitting trains of spikes.
Regarding an image recognition algorithm, a system has been sought whose performance, in particular, is independent of the position, size, etc. of an object to be recognized on a screen. Many systems have been proposed in the past to respond to such needs. For example, recognition invariant against changes in scale or rotation can be achieved by carrying out “conformal mapping conversion” as preprocessing.
To be more specific, an image is converted into Log-Polar coordinates, that is, the logarithm of the distance from the central point of an object to be recognized in the image and the rotational angle about that point. This causes a change in size or rotation of the same object to be converted into a parallel movement on the coordinate system after the conversion. Thereafter, when a feature quantity (e.g. a correlation coefficient) is calculated, the object to be recognized will be detected in terms of the same feature quantity. The invariance of detection performance against positional changes can be obtained by sequentially shifting the central point of the conversion so as to perform detection at each position.
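The property that scaling and rotation become a parallel movement in Log-Polar coordinates can be checked numerically with the following minimal sketch. The function names, the sample points, and the choice of the origin as the central point are assumptions made for illustration.

```python
import math


def log_polar(x, y, cx=0.0, cy=0.0):
    """Map an image point to (log radius, angle) about the centre (cx, cy)."""
    dx, dy = x - cx, y - cy
    r = math.hypot(dx, dy)
    if r == 0.0:
        raise ValueError("log-polar coordinates are undefined at the centre")
    return math.log(r), math.atan2(dy, dx)


def scale_and_rotate(x, y, s, phi):
    """Scale by s and rotate by phi (radians) about the origin."""
    c, sn = math.cos(phi), math.sin(phi)
    return s * (c * x - sn * y), s * (sn * x + c * y)


s, phi = 2.0, math.pi / 6
points = [(1.0, 0.0), (0.5, 0.5), (3.0, 1.0)]
for x, y in points:
    rho0, theta0 = log_polar(x, y)
    rho1, theta1 = log_polar(*scale_and_rotate(x, y, s, phi))
    # Every point is displaced by the same offset (log s, phi).
    print(round(rho1 - rho0, 6), round(theta1 - theta0, 6))
```

The sample points are chosen so that the rotated angles stay within the range returned by `math.atan2`; in general the angular offset holds only modulo 2π. The constant offset also holds only when the transform is taken about the centre of scaling and rotation.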
Furthermore, the possibility has been pointed out of performing similar size-invariant detection by obtaining a multi-scale representation for each local region on a given image and further carrying out the conformal mapping conversion mentioned above (Wechsler, H. 1992, “Multi-scale and Distributed Visual Representations and Mappings for Invariant-Low-Level Perception” in Neural Networks for Perception, Vol. 1, Wechsler H. Ed., pp. 462–476, Academic Press, Boston).
Thus, the conventional method, in which predetermined mapping conversion (conformal mapping conversion, etc.) is performed to implement pattern recognition whose performance is invariant for objects of different scales, has been posing a problem in that it is difficult to obtain scale-invariant features unless the central point of conversion is properly set.
The following will describe examples of another type, wherein the feature quantity regarding the similarity to the model of an object to be recognized is calculated and the recognition can be achieved without relying on size. One such example is a method in which the model data of an object to be recognized is represented in varying scales as template models beforehand, and template matching with an input image or its feature vectors is carried out from coarse to fine (Rosenfeld and Vanderburg, 1977, Coarse to fine template matching, IEEE Trans. Systems, Man, and Cybernetics, vol. 2, pp. 104–107). In another method, an input pattern is mapped onto an eigen-image function space obtained by performing the principal component analysis of model images of objects in varying sizes, and the distance from models in a feature space is calculated (Japanese Patent Laid-Open No. 8-153198, Murase, Nayar, 1995, Image spotting of 3D object by multiple resolution and eigenspace representation, Information Processing Academy Proceedings, vol. 36, pp. 2234–2243; Murase and Nayar, 1997, Detection of 3D objects in cluttered scenes using hierarchical eigenspace, Pattern Recognition Letters, pp. 375–384). In yet another method, the position and size of a corresponding region are calculated and normalized on the basis of the distance image data on an object to be recognized, then matching is performed (Japanese Patent Laid-Open No. 5-108804). In still another method, the multiple resolution data regarding an object to be recognized is shifted in the order from a low resolution level to a high resolution level thereby to perform recognition, including matching (Japanese Patent Laid-Open No. 8-315141).
The method based on the template matching has been presenting a problem in terms of practicality in that, when performing matching with the template models in different scales that have been represented beforehand, high recognition performance cannot be achieved unless an object in an input image substantially matches with one of the scales, meaning that numerous different template models are required.
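The coarse-to-fine template matching mentioned above can be illustrated by the following minimal sketch. It is a one-dimensional simplification and not the procedure of the cited paper; the sum of squared differences as the similarity measure, the pairwise-averaging downsampling, the two-level search, and all function names are assumptions made for illustration.

```python
def ssd(signal, template, offset):
    """Sum of squared differences between the template and the signal at offset."""
    return sum((signal[offset + i] - t) ** 2 for i, t in enumerate(template))


def best_offset(signal, template, candidates):
    """Return the in-range candidate offset with the smallest SSD."""
    valid = [o for o in candidates if 0 <= o <= len(signal) - len(template)]
    return min(valid, key=lambda o: ssd(signal, template, o))


def downsample(values):
    """Halve the resolution by averaging adjacent pairs."""
    return [(values[i] + values[i + 1]) / 2 for i in range(0, len(values) - 1, 2)]


def coarse_to_fine(signal, template, radius=2):
    """Exhaustive search at half resolution, then refinement at full resolution."""
    coarse_sig, coarse_tpl = downsample(signal), downsample(template)
    coarse = best_offset(coarse_sig, coarse_tpl,
                         range(len(coarse_sig) - len(coarse_tpl) + 1))
    centre = 2 * coarse  # map the coarse position back to full resolution
    return best_offset(signal, template,
                       range(centre - radius, centre + radius + 1))


signal = [0, 0, 0, 0, 0, 1, 3, 5, 3, 1, 0, 0, 0, 0, 0, 0]
template = [1, 3, 5, 3]
print(coarse_to_fine(signal, template))  # the template sits at offset 5
```

The sketch searches over position only. Handling objects of different sizes would additionally require one template per scale, which is the practical burden described above.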
In the method disclosed in Japanese Patent Laid-Open No. 8-153198, which uses the parametric eigenspace obtained by performing the principal component analysis of model images of an object in a finite number of different sizes, the changes in size are represented by manifolds on the parametric eigenspace, so that objects in different sizes can be successively recognized. This method, however, has been presenting a problem in that the dimension of the covariance matrix is large (e.g., 16,384 dimensions in the case presented by Murase and Nayar in 1995), inevitably requiring enormously high cost for calculating eigenvectors. In order to obtain adequate accuracy to successfully deal with the changes in size, reference images having different sizes of about five steps of 1.1-fold, 1.2-fold, 1.3-fold, 1.4-fold, and 1.5-fold (=α) of a reference size must be prepared to calculate eigenvectors, then an input image must be converted into sizes of α⁻¹-fold, α⁻²-fold, α⁻³-fold, etc. This has been requiring an extremely large memory space and an enormously long time for computation to complete the processing.
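The scale of the covariance matrix mentioned above can be estimated with the following short calculation. The square image shape, the dense storage, and the 8-byte element size are assumptions made for illustration; only the figure of 16,384 dimensions comes from the text.

```python
import math

dim = 16_384                # dimension of each image vector, as cited in the text
side = math.isqrt(dim)      # side length if the image is square (assumption)
entries = dim * dim         # elements of a dense dim x dim covariance matrix
bytes_needed = entries * 8  # assuming 8-byte double-precision elements

print(side)                  # 128
print(entries)               # 268435456
print(bytes_needed / 2**30)  # 2.0 (GiB)
```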
According to the method disclosed in Japanese Patent Laid-Open No. 8-315141, the matching is performed from low resolution to high resolution in sequence on multi-resolution representation data regarding an object that has been prepared in advance. Therefore, to perform scale-invariant recognition, it is necessary to prepare a sufficiently large number of resolution levels beforehand, leading to poor processing efficiency. For this reason, the method may be suited for acquiring rough information by using less memory, but is unsuited for recognition or detection requiring high accuracy.
As a method employing time-series input images, there is one in which a plurality of hypotheses regarding an object to be recognized that compete with each other are generated from an image, the hypotheses are temporally accumulated, then input to a category classifier, such as ART2 by Carpenter et al. (Seibert, et al. 1992, Learning and recognizing 3D objects from multiple views in a neural system, in Neural Networks for Perception, vol. 1 Human and Machine Perception (H. Wechsler Ed.) Academic Press, pp. 427–444).
As a pattern recognizing method based on a neural network model inspired by the information processing mechanism of a living organism, there is one in which a dynamic routing network is used to obtain scale- and position-invariant representation centering around an object (Anderson, et al. 1995, Routing Networks in Visual Cortex, in Handbook of Brain Theory and Neural Networks (M. Arbib, Ed.), MIT Press, pp. 823–826, Olshausen et al. 1995, A Multiscale Dynamic Routing Circuit for Forming Size- and Position-Invariant Object Representations, J. Computational Neuroscience, vol. 2 pp. 45–62). According to this technique, a hierarchical representation (multi-resolution representation) based on a plurality of different resolutions is made in advance on image data, and information routing is performed through the intermediary of a control neuron that has a function for dynamically setting connection weight, thereby mapping the information at different resolutions onto a representation centering around an object.
The method based on the dynamic routing network (Anderson et al., 1995; Olshausen et al., 1995) requires a mechanism for dynamically setting the connection between nerve cell elements between predetermined scale levels by a local competing process among control neurons, thus presenting a problem in that the circuit configuration inevitably becomes complicated.
The method in which competing hypotheses are generated and input to a category classifier (Seibert et al. 1992) is based on time-series images, making it inherently difficult to accomplish scale-independent recognition from a single still picture.
In an image recognition algorithm, it is considered important to reduce the computation cost or load required for recognition processing, typically by selectively shifting an attended region during recognition, by analogy with processing in a biological system.
For instance, according to the hierarchical information processing method disclosed in Japanese Examined Patent Publication No. 6-34236, a plurality of descending signal pathways that are directed from an upper layer to a lower layer to match a plurality of ascending signal pathways directed from a lower layer to an upper layer are provided among a plurality of hierarchies that have feature extracting element layers and feature integration layers that provide outputs based on the outputs from feature extracting elements associated with the same feature. The transmission of ascending signals is controlled in response to the descending signals from an uppermost layer so as to perform segmentation by selectively extracting self-recollecting associative capability and a specific pattern, thereby setting a processing region or a fixation region for recognition.
First, the method disclosed in Japanese Examined Patent Publication No. 6-34236 is based on an assumption that there is a descending signal pathway paired with an ascending signal pathway in a hierarchical neural circuit configuration. Hence, a neural circuit approximately as large as the neural circuit corresponding to the ascending signal pathway is required as a circuit that forms the descending signal pathway, disadvantageously resulting in an extremely large circuit scale.
In addition, this method is provided with no mechanism for controlling the sequential change of fixation positions, posing a problem in that the operation is unstable when setting or changing an attended region due to the influence of noise or other factors. More specifically, interactions exist throughout all hierarchies between the elements of an intermediate layer of the ascending pathway and the elements of an intermediate layer of the descending pathway, and a fixation position is finally determined through all the interactions. This has been presenting a problem in that, if there are a plurality of objects that fall within the same category, then the positions of fixation points are not controlled to sequentially shift among the objects in a stable manner, causing a fixation position to oscillate only between particular objects or in the vicinity thereof.
There has been another problem in that, if there are a plurality of objects that fall within the same category as that of an object to be detected or recognized in input data, then subtle adjustment of network parameters must be made whenever processing for a plurality of objects (substantially non-attention processing) simultaneously occurs or whenever attention position updating is performed.
According to U.S. Pat. No. 4,876,731, the aforesaid ascending signal pathway and the descending signal pathway are controlled on the basis of contextual information (rule data base, probabilistic weighting) from an output layer, i.e., an uppermost layer.
According to U.S. Pat. No. 2,856,702, modifying recognition refers to attention. A pattern recognizing apparatus is provided with an attention degree determiner for selecting an attention degree for each part region in order to accomplish accurate recognition if a pattern cannot be identified, and an attention degree controller for performing recognition in which the selected attention degree has been reflected.
A system for controlling the setting of an attention region by selective routing, which has been proposed by Koch and Ullman in 1985 (Human Neurobiology, vol. 4, pp. 219–227), is provided with a salience level map extracting mechanism combining feature extraction and a selective mapping mechanism, an attention position selecting mechanism employing a “winner-take-all” (hereinafter referred to as “WTA”) neural network (refer to Japanese Patent Laid-Open No. 5-242069, U.S. Pat. No. 5,049,758, U.S. Pat. No. 5,059,814, U.S. Pat. No. 5,146,106, etc.), and a mechanism for inhibiting neural elements at a selected position.
In the system based on the selective routing described above, it is not easy to control attended positions efficiently because the system is equipped only with a mechanism for inhibiting a selected region. Hence, there have been some cases where the attended point remains concentrated on a particular object or a particular portion.
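The combination of WTA selection and inhibition of the selected position can be illustrated by the following minimal sketch. It is not the circuit of the cited system; the one-dimensional salience map, the permanent inhibition within a fixed radius, and the function name are assumptions made for illustration.

```python
def attend(salience, n_fixations, radius=1):
    """Pick fixation points by repeated winner-take-all on a 1-D salience map.

    After each selection, the winner and its neighbours within `radius`
    are inhibited so that the next winner is found elsewhere.
    """
    s = list(salience)  # work on a copy
    fixations = []
    for _ in range(n_fixations):
        winner = max(range(len(s)), key=s.__getitem__)  # WTA: index of the maximum
        if s[winner] == float("-inf"):
            break  # every position has already been inhibited
        fixations.append(winner)
        lo, hi = max(0, winner - radius), min(len(s), winner + radius + 1)
        for i in range(lo, hi):
            s[i] = float("-inf")  # inhibit the selected region
    return fixations


salience = [0.1, 0.9, 0.8, 0.2, 0.7, 0.3, 0.6]
print(attend(salience, 3))  # [1, 4, 6]
```

In this sketch the order of fixation is determined entirely by the salience values and the inhibition radius, with no further means of steering the sequence.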
According to the method based on the aforesaid dynamic routing network, information routing is performed through the intermediary of a control neuron that has a function for dynamically setting connection weight thereby to control attended regions and convert a feature representation of an object to be recognized that is centered on an observer into a representation centered on the object.
However, the system using the dynamic routing network reconfigures interlayer connection through the intermediary of many control neurons for which synaptic connection weight can be dynamically changed, so that the circuit configuration inevitably becomes complicated. In addition, the control neurons involve bottom-up processing based on the feature salience level of a lower layer, so that it has been difficult to achieve efficient control of attended positions if a plurality of objects of the same category are present.
In the method based on selective tuning (Culhane & Tsotsos, 1992, An Attentional Prototype for Early Vision. Proceedings of Second European Conference on Computer Vision, (G. Sadini Ed.), Springer-Verlag, pp. 551–560), search is carried out by a mechanism that activates only winners from a WTA circuit of an uppermost layer to a WTA circuit of a lower layer so as to decide the position of an overall winner of the uppermost layer, in a lowermost layer, which is the layer directly receiving input data. The selection of a position and the selection of a feature that are made in attention control are implemented by inhibiting a connection irrelevant to the position of an object and by inhibiting an element that detects a feature irrelevant to the object.
The system based on selective tuning hierarchically and dynamically performs pruning-like selection in which a connection not related to a selected object is merely pruned. This has been posing a problem in that, if a plurality of objects are present, then it is not easy to efficiently control the positions of fixation points.
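The top-down search from the uppermost WTA circuit to the lowermost layer can be illustrated by the following minimal sketch. It is not the model of the cited paper; the one-dimensional input, the use of a maximum over two children to build each upper layer, and the function names are assumptions made for illustration.

```python
def build_pyramid(inputs):
    """Stack layers in which each unit takes the maximum of its two children."""
    layers = [list(inputs)]
    while len(layers[-1]) > 1:
        below = layers[-1]
        layers.append([max(below[i:i + 2]) for i in range(0, len(below), 2)])
    return layers  # layers[0] is the input layer, layers[-1] the uppermost layer


def trace_winner(layers):
    """Follow the winner from the uppermost layer down to the input layer."""
    index = 0  # the single unit of the uppermost layer
    for below in reversed(layers[:-1]):
        children = range(2 * index, min(2 * index + 2, len(below)))
        # Local WTA among the children of the current winner; the other
        # branches are ignored (pruned) from here on.
        index = max(children, key=below.__getitem__)
    return index


layers = build_pyramid([0.2, 0.4, 0.1, 0.9, 0.3, 0.5, 0.6, 0.0])
print(trace_winner(layers))  # 3, the position of the overall maximum 0.9
```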
Furthermore, the prior arts described above have the following disadvantages in common.
Firstly, none of the systems described above are provided with mechanisms capable of dealing with different sizes of objects to be recognized. Therefore, if objects in different sizes simultaneously exist at a plurality of positions, then it has been required to tune network parameters to each of a plurality of scenes wherein the sizes of the objects to be recognized are different.
Secondly, if there are a plurality of objects that belong to the same category at a plurality of positions in input data, it has been impossible to evenly and efficiently shift or update attended positions across the objects.
Incidentally, it has been generally known that using analog circuit elements makes it possible to achieve a simplified circuit configuration, which means fewer elements, higher speed, and lower power consumption, as compared with a digital system. On the other hand, however, circuit configurations using analog circuit elements have been presenting problems with noise immunity and with the reliability of input/output characteristics, attributable to variations in the characteristics of individual elements.