Pattern recognition is an aspect of the field of artificial intelligence aiming at providing perceptions to “intelligent” systems, such as robots, programmable controllers, speech recognition systems, artificial vision systems, etc.
In pattern recognition, objects are classified according to some chosen criteria so as to allow these objects to be compared with each other, for example by computing a distance between the objects as a function of the chosen criteria. Accordingly, it is possible, for example, to quantify the similarity or dissimilarity between two objects, to remember an object and to recognize this object later on.
An object, as referred to hereinabove, is not restricted to a physical shape or a visual representation; it has to be understood that an object means any entity that can be represented by a signal.
In general, but not restrictively, the term “distance” can be construed as a mathematical function for measuring a degree of dissimilarity between two objects. For example, if the two objects are represented by two respective vectors, this distance can be the Euclidean norm of the difference between the two vectors. The distance could also be, for example, a probability, an error, a score, etc.
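By way of illustration only, the Euclidean case mentioned above can be sketched as follows. The function name `euclidean_distance` and the sample vectors are assumptions introduced for this example.

```python
import math

def euclidean_distance(a, b):
    """Euclidean norm of the difference between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Identical objects are at distance zero; dissimilar objects are farther apart.
print(euclidean_distance([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))
```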
Those of ordinary skill in the art of rule-based expert systems, statistical Markovian systems or second generation (formal) neural network systems are familiar with such a concept of “distance”.
Unfortunately, the evaluation of this “distance” is often a significant burden. Furthermore, object comparison is usually obtained by first comparing segments of the objects, which involves distance comparison. It has been found desirable to achieve such comparison with a more global approach. For example, comparing N signals would require:
    computing the distance for each combination of two signals among the N signals; and
    finding signals that are similar by sorting and comparing the distances obtained therefrom.
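The two steps listed above can be sketched as follows to make the burden concrete: N signals give N(N-1)/2 pairs, each requiring a distance evaluation before sorting. This is a minimal sketch assuming signals are equal-length vectors compared with the Euclidean distance; the name `pairwise_distances` and the sample data are hypothetical.

```python
import math
from itertools import combinations

def pairwise_distances(signals):
    """Return (distance, i, j) for every pair of signals, most similar first."""
    pairs = []
    for (i, a), (j, b) in combinations(enumerate(signals), 2):
        pairs.append((math.dist(a, b), i, j))
    pairs.sort()  # smallest distance, i.e. most similar pair, comes first
    return pairs

signals = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
ranked = pairwise_distances(signals)
# Four signals already require 4 * 3 / 2 = 6 distance evaluations.
print(len(ranked), ranked[0])
```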
Third generation neural networks, including spiking neural networks and pulsed neural networks, make it possible to alleviate this distance burden. Indeed, a properly designed spiking neural network allows pattern comparisons and similarity evaluation between different patterns without explicit score or distance computation. This is achieved by using spiking events that are temporally organized, as shown in FIG. 1A.
Various schemes are possible for coding temporally-organized spiking neural networks. Two possible schemes are listed below.
    i) Synchronization coding: as illustrated in FIGS. 1B and 1C, neurons not discharging at the same time are not synchronized. Conversely, when similar input stimuli are given to the neurons, they discharge synchronously. This is called neural synchronization.
    ii) Rank Order Coding: a neuron spikes only when a specific input sequence of spikes is received on its dendrites.
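As a deliberately simplified illustration of scheme ii), the sketch below models a neuron that responds only when its inputs spike in one preferred order. It is an assumption-laden toy: the names `firing_order` and `rank_order_neuron` are hypothetical, each input is reduced to the time of its first spike, and no membrane dynamics are modeled.

```python
def firing_order(spike_times):
    """Indices of the inputs sorted by time of first spike (earliest first)."""
    return sorted(range(len(spike_times)), key=lambda i: spike_times[i])

def rank_order_neuron(spike_times, preferred_order):
    """True only if the inputs spiked in exactly the preferred order."""
    return firing_order(spike_times) == list(preferred_order)

# Input 1 spikes first, then input 2, then input 0: the preferred sequence.
print(rank_order_neuron([2.0, 0.5, 1.0], preferred_order=[1, 2, 0]))
# Same spike times assigned to different inputs: the neuron stays silent.
print(rank_order_neuron([0.5, 2.0, 1.0], preferred_order=[1, 2, 0]))
```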
This transfer between conventional digital coding and spike sequence coding is efficient in terms of both distance criteria creation and comparison.
To summarize, a distance between two objects can be represented, for example, by:
    i) more or less similar spike timing between neurons; or
    ii) the process called “Rank Order Coding” characterized by the existence of pairs of excitatory neurons and inhibitory neurons and providing recognition of incoming sequences of spikes from other neurons when a spike is generated by a particular neuron.
Synchronization coding occurs when two groups of neurons appear spontaneously because of the plasticity of the interconnections between neurons. Thus, two neurons having similar inputs see their mutual synaptic connections grow, causing their outputs to be synchronous. Conversely, when the inputs of neurons are not similar, their mutual synaptic connections weaken, causing them to be desynchronized. In fact, the inputs of two neurons spiking simultaneously are relatively correlated.
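The qualitative behavior described above (mutual connections growing for similar inputs and weakening otherwise) can be illustrated with the toy update rule below. The rule itself, the function name `update_coupling`, and the parameters `rate` and `tolerance` are assumptions chosen for illustration; they are not taken from any particular plasticity model.

```python
def update_coupling(weight, input_a, input_b, rate=0.1, tolerance=0.2):
    """Strengthen the mutual connection when the two inputs are similar,
    weaken it otherwise; the weight is kept within [0, 1]."""
    similar = abs(input_a - input_b) <= tolerance
    delta = rate if similar else -rate
    return min(1.0, max(0.0, weight + delta))

w_similar = w_dissimilar = 0.5
for _ in range(3):
    w_similar = update_coupling(w_similar, 0.80, 0.85)        # similar inputs
    w_dissimilar = update_coupling(w_dissimilar, 0.80, 0.10)  # dissimilar inputs
print(w_similar, w_dissimilar)
```

Under this sketch, a strong mutual weight stands in for neurons that would tend to synchronize, and a weak one for neurons that would desynchronize.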
Source Separation
Separation of mixed signals is an important problem with many applications in the context of audio processing. It can be used, for example, to assist a robot in segregating multiple speakers, to ease an automatic transcription of video via audio tracks, to separate musical instruments before automatic transcription, to clean a signal before performing speech recognition, etc. The ideal instrumental setup is based on the use of an array of microphones during recording to obtain many audio channels. In fact, in that situation, very good separation can be obtained between noise and signal of interest [1] [2] [3] and experiments with great improvements have been reported in speech recognition [4] [5]. Further applications have been ported to mobile robots [6] [7] [7] and have also been developed to track multiple speakers [51].
A source separation process implies segregation and/or fusion (integration), usually based on methods such as correlation, statistical estimation and binding applied on features extracted by an analysis module.
Conventional approaches require training, explicit estimation, supervision, entropy estimation, huge signal databases [7], the AURORA database [10] [34], etc. Design and training of such systems are therefore tedious, time consuming and costly.
Moreover, in many situations, only one channel is available to an audio engineer who, nevertheless, has to solve the separation problem. In this case, automatic separation and segregation of the sources is particularly difficult.
Although some known monophonic systems perform reasonably well on specific signals (generally voiced speech), they fail to efficiently segregate a broad range of signals. These disappointing results could potentially be overcome by combining and exchanging expertise and knowledge between engineering, psychoacoustics, physiology and computer science.
Monophonic source separation systems can be seen as performing two main operations: first, analyzing a signal to yield a suitable representation and, second, clustering with segregation applied to that representation.
With at least two interfering speakers each generating voiced speech, it is observed that when there is a difference in their respective pitch, separation is relatively easy since spectral representations or auditory images exhibit different regions with structures dominated by a respective pitch. Then, amplitude modulation of cochlear filter outputs (or modulation spectrograms) is discriminative.
In situations where speakers have similar pitches, separation is more difficult to perform. Features, such as phase, have to be preserved when they undergo an analysis. Glottal opening time should be taken into account; otherwise, long term information such as intonation would be required. However, in the latter case, real-time treatment becomes problematic.
Using Bregman's terminology, bottom-up processing corresponds to primitive processing, and top-down processing means schema-based processing [10]. Auditory cues proposed by Bregman [10] for simple tones are not applicable directly to complex sounds. More sophisticated cues based on different auditory maps are thus desirable. For example, Ellis [11] uses sinusoidal tracks created by an interpolation of spectral peaks of the output of a cochlear filter bank, while Mellinger's model [50] uses partials. A partial is formed if an activity on the onset maps (the beginning of an energy burst) coincides with an energy local minimum of the spectral maps. Using these assumptions, Mellinger proposed a CASA (Computational Auditory Scene Analysis) system in order to separate musical instruments. Cooke [12] introduced harmony strands, which are a counterpart of Mellinger's cues in speech. The integration and segregation of streams is done using Gestalt and Bregman's heuristics. Berthommier and Meyer use Amplitude Modulation maps [4] [13] [14]. Gaillard [16] uses a more conventional approach by using the first zero crossing for the detection of pitch and harmonic structures in the frequency-time map. Brown proposes an algorithm [17] based on the mutual exclusivity Gestalt principle. Hu and Wang use a pitch tracking technique [18]. Wang and Brown [19] use correlograms in combination with bio-inspired neural networks. Grossberg [20] proposes a neural architecture that implements Bregman's rules for simple sounds. Sameti [9] uses HMMs (Hidden Markov Models), while Roweis [21] and Reyes-Gomez [22] use Factorial HMMs. Jang and Lee [22] use a technique based on the maximum a posteriori (MAP) criterion. Another probability-based CASA system is proposed by Cooke [23]. Irino and Patterson [24] propose an auditory representation that is synchronous to the glottis and preserves fine temporal information, which makes the synchronous segregation of speech possible.
Harding and Meyer [23] use a model of multi-resolution with parallel high-resolution and low-resolution representations of the auditory signal. They propose an implementation for speech recognition. Nix [25] performs a binaural statistical estimation of two speech sources by an approach that integrates temporal and frequency-specific features of speech. This approach tracks magnitude spectra and direction on a frame-by-frame basis.
A major drawback of the above-mentioned systems is that they require training and supervision.
An alternative to supervised systems is provided by autonomous bio-inspired and spiking neural networks.
Dynamic, nonlinear space and time filtering of spikes in neural networks, combined with neurotransmitter diffusion and the topographic organization of neurons, yields simultaneous signal processing and recognition. Moreover, spiking allows the encoding of information into a second time scale that is different from usual time. This second time scale encodes the relative timing of spiking neurons. Synchronization or generation of specific spiking temporal sequences becomes a self-organization criterion (Abeles [A1]). This is a feature that allows unsupervised training and has a strong impact on the pattern recognition aptitudes of spiking neural networks (Wang [A2]). Furthermore, neural networks with dynamic synapses and varying delays offer a greater computing capacity than those where only weights are changed (Schmitt [A3] and Maass [A4]). Autonomous bio-inspired and spiking neural networks therefore constitute an alternative to supervised systems (NN handbook [A5], Maass [A6]).
A well-known and remarkable characteristic of human perception is that recognition of stimuli is quasi-instantaneous, even though the information propagation speed in living neurons is slow [18] [26] [27]. This implies that neural responses are conditioned by previous events and states of a neural sub-network [7]. Understanding the underlying mechanisms of perception, in combination with those of the peripheral auditory system [28] [17] [29] [30], allows an analysis module to be designed.
In the context of a mathematical formalism of spiking neurons, it has been shown that networks of spiking neurons are computationally more powerful than models based on McCulloch-Pitts neurons [9]. Information about the result of a computation is already present in the current neural network state long before the complete spatiotemporal input patterns have been received by the neural network [7]. This suggests that neural networks use the temporal order of first spikes to yield ultra-rapid computation [31]. In addition, neural networks and dynamic synapses (including facilitation and depression) are equivalent to a given quadratic filter that can be approximated by a small neural system [32] [33]. It has been shown that any filter that can be characterized by a Volterra series can be approximated with a single layer of neurons. Also, spike coding in neurons is close to optimal, and plasticity under a Hebbian learning rule increases mutual information to close to optimal [34] [35] [36].
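For readers unfamiliar with the quadratic filters referred to above, the following sketch evaluates a discrete Volterra series truncated at second order: a constant term, a linear convolution, and a quadratic term over products of past samples. It illustrates only the filter form, not the neural approximation; the function name `volterra2`, the kernel names `h0`, `h1` and `h2`, and the sample values are assumptions.

```python
def volterra2(x, h0, h1, h2):
    """Second-order (quadratic) Volterra filter with memory len(h1).

    y[n] = h0 + sum_i h1[i] x[n-i] + sum_i sum_j h2[i][j] x[n-i] x[n-j]
    """
    m = len(h1)
    y = []
    for n in range(len(x)):
        # Current and past samples; samples before the start are taken as 0.
        past = [x[n - i] if n - i >= 0 else 0.0 for i in range(m)]
        linear = sum(h1[i] * past[i] for i in range(m))
        quadratic = sum(h2[i][j] * past[i] * past[j]
                        for i in range(m) for j in range(m))
        y.append(h0 + linear + quadratic)
    return y

print(volterra2([1.0, 2.0], h0=0.0, h1=[1.0, 0.5],
                h2=[[1.0, 0.0], [0.0, 0.0]]))
```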
For unsupervised systems, novelty detection facilitates autonomy. For example, it can allow robots to detect whether a stimulus is new or has already been encountered. When associated with conditioning, novelty detection can create autonomy of the system [10] [37].
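The notion of deciding whether a stimulus is new or already encountered can be sketched conventionally as below. This is only an illustrative, distance-based toy and not a spiking implementation; the class name `NoveltyDetector`, the `threshold` parameter and the sample stimuli are assumptions.

```python
import math

class NoveltyDetector:
    """Remembers stimuli and reports whether a new one has been seen before."""

    def __init__(self, threshold):
        self.threshold = threshold  # maximum distance still counted as "known"
        self.memory = []

    def observe(self, stimulus):
        """Return True if the stimulus is novel, and remember it if so."""
        novel = all(math.dist(stimulus, known) > self.threshold
                    for known in self.memory)
        if novel:
            self.memory.append(list(stimulus))
        return novel

detector = NoveltyDetector(threshold=0.5)
for stimulus in ([0.0, 0.0], [0.1, 0.0], [3.0, 3.0]):
    print(stimulus, detector.observe(stimulus))
```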
Sequence classification is particularly interesting for speech. Panchev and Wermter [46] have shown that synaptic plasticity can be used to perform recognition of sequences. Perrinet [?] and Thorpe [?] discuss the importance of sparse coding and rank order coding for classification of sequences.
Assemblies, or groups, of spiking neurons can be used to implement segregation and fusion, i.e. integration, of objects in an auditory image, in other words in a signal representation. Usually, in signal processing, correlations or distances between signals are implemented with delay lines, products and summations. Similarly, comparison between signals can be made with spiking neurons without implementation of delay lines. This is achieved by presenting images, i.e. signals, to spiking neurons with dynamic synapses. Then, a spontaneous organization appears in the network with sets of neurons firing in synchrony. Neurons with the same firing phase belong to the same auditory objects. Milner [38] and Malsburg [39] [40] propose a temporal correlation to perform binding. Milner and Malsburg have observed that synchrony is a crucial feature to bind neurons associated with similar characteristics. Objects belonging to the same entity are bound together in time. In other words, synchronization between different neurons and desynchronization among different regions perform the binding. To a certain extent, this property has been exploited to perform unsupervised clustering for recognition on images [41], for vowel processing with spike synchrony between cochlear channels [42], to propose pattern recognition with spiking neurons [43], and to perform cell assembly of spiking neurons using Hebbian learning with depression [44]. Furthermore, Wang and Terman [45] have proposed an efficient and robust technique for image segmentation and have studied its potential for CASA [19].
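Once a network has settled, reading out the binding amounts to grouping neurons that share a firing phase. The sketch below assumes each neuron is summarized by a single phase value in [0, 1) and ignores wrap-around at the end of the cycle; the function name `group_by_phase`, the `tolerance` parameter and the sample phases are hypothetical.

```python
def group_by_phase(phases, tolerance=0.05):
    """Group neuron indices whose firing phases are within `tolerance`
    of the previous neuron in phase order (no wrap-around handling)."""
    order = sorted(range(len(phases)), key=lambda i: phases[i])
    groups = []
    for i in order:
        if groups and phases[i] - phases[groups[-1][-1]] <= tolerance:
            groups[-1].append(i)   # same firing phase: same object
        else:
            groups.append([i])     # phase gap: start a new object
    return groups

# Neurons 0, 2 and 4 fire together; neurons 1 and 3 fire together later.
print(group_by_phase([0.10, 0.52, 0.11, 0.50, 0.12]))
```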
Pattern Recognition
Pattern recognition robust to noise, symmetry, homothety (size change with angle preservation), etc. has long been a challenging problem in artificial intelligence. Many solutions or partial solutions to this problem have been proposed using expert systems or neural networks. In general, three different approaches are used to perform invariant pattern recognition.
Normalization
In this approach, the analyzed object is normalized to a standard position and size by an internal transformation. Advantages of this approach include i) coordinate information (the “where” information) is retrievable at any stage of the processing and ii) there is a minimum loss of information. The disadvantage of this approach is that a network must find an object in a scene and then normalize it. This task is not as obvious as it may appear [46], [47].
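A minimal sketch of such a normalization, assuming the object has already been found and is given as a list of 2-D points, is shown below. The centroid and scale are returned alongside the normalized points so that the “where” information remains retrievable. The function name `normalize_shape` and the choice of the root-mean-square radius as the size measure are assumptions.

```python
import math

def normalize_shape(points):
    """Translate a 2-D point set to the origin and scale it to unit RMS radius.

    Returns (normalized_points, centroid, scale)."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    scale = math.sqrt(sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in points) / n)
    if scale == 0:
        raise ValueError("degenerate shape: all points coincide")
    normalized = [((x - cx) / scale, (y - cy) / scale) for x, y in points]
    return normalized, (cx, cy), scale

square = [(0, 0), (2, 0), (2, 2), (0, 2)]
moved = [(10 + 3 * x, -5 + 3 * y) for x, y in square]  # shifted and enlarged
print(normalize_shape(square)[0])
print(normalize_shape(moved)[0])
```

The two printed point sets agree up to rounding, while the returned centroid and scale differ and record where and how large each object was.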
Invariant Features
In this approach, some features that are invariant to the location and size of an object are extracted. A disadvantage of this approach is that the position of the object may be difficult to access because information, such as recognition, may be lost during the extraction process. The advantage is that this approach does not require knowledge of the position of the object and, unlike normalization, which must be followed by an operation of pattern recognition, the invariant features approach already does some pattern recognition by finding important features [48].
Invariance Learning from Temporal Input Sequences
The assumption is that primary sensory signals, which in general code for local properties, vary quickly while the perceived environment changes slowly. Extracting slow features from a quickly varying sensory signal is therefore likely to yield an invariant representation of the environment [6] [8].
Based on the normalization approach, a dynamic link matching (DLM) approach was first proposed by Konen et al. [46]. This approach consists of connecting two layers of neurons through synaptic connections that are constrained by a normalization. A known pattern is applied to one of the two layers, and the pattern to be recognized to the other layer. Dynamics of the neurons are chosen in such a way that “blobs” are formed randomly in the layers. If the features of the blobs in the two respective layers are similar enough, a weight strengthening and an activity similarity will be detected between the two layers, for example by correlation computation [49] [46]. These blobs may or may not correspond to a segmented region of a visual scene, since their size is fixed for the whole simulation period and is chosen by some parameters in the dynamics of the network [46]. The appearance of blobs in the network has been linked by the developers of the architecture to the attention process present in the brain.
The dynamics of the neurons used in the original DLM network are not dynamics of a spiking neuron. In fact, the behavior of neurons from a DLM network is based on rate coding, i.e. average neuron activity over time, and can be shown to be equivalent to an enhanced dynamic Kohonen Map in its Fast Dynamic Link Matching (FDLM) [46].
In summary, the systems described hereinabove are supervised and non-autonomous, or include two operating modes which are learning and recognition.
Other systems such as those described in U.S. Pat. No. 6,242,988 B1 (Sarpeshkar) issued on Jun. 5, 2001 and entitled “Spiking Neural Circuit”, and U.S. Pat. No. 4,518,866 issued to Clymer on May 21, 1985 and entitled “Method of and Circuit for Simulating Neurons”, make use of bio-inspired neural networks (or spiking neurons) including electronic circuitry to implement neurons, but do not provide any solution to spatio-temporal pattern recognition.
The following other United States patent documents describe solutions to spatio-temporal pattern recognition that do not use bio-inspired neural networks (spiking neurons). They use either conventional (non-spiking) neural networks or expert systems:
No.            Title                                                                      Issued           Inventor
5,664,065      Pulse-coupled automatic object recognition system dedicatory clause        Sep. 02, 1997    Johnson
5,255,348      Neural network for learning, recognition and recall of pattern sequences   Oct. 19, 1993    Nenov
6,067,536      Neural network for voice and pattern recognition                           May 23, 2000     Maruyama et al.
2003/0228054   Neurodynamic model of the processing of visual information                 Dec. 11, 2003    Deco