An automatic speech recognition process can be schematically described as a number of modules placed in series between a speech signal as input and a sequence of recognised words as output:
a first signal processing module, which acquires the input speech signal, converting it from analogue to digital and suitably sampling it;
a second feature extraction module, which computes a set of parameters that describe well the features of the speech signal that are relevant to its recognition. This module uses, for example, spectral analysis (DFT) followed by grouping into Mel bands and a discrete cosine transform (Mel-frequency cepstral coefficients);
a third module that uses temporal alignment algorithms and acoustic pattern matching; for example, a Viterbi algorithm is used for temporal alignment, that is to say it handles the time distortion introduced by different rates of speech, while for pattern matching it is possible to use prototype distances, the likelihood of Markovian states or a posteriori probabilities generated by neural networks;
a fourth linguistic analysis module for extracting the best word sequence (present only for recognition of continuous speech); for example, it is possible to use models with bigrams or trigrams of words, or regular grammars.
In the above model, neural networks are used in the third module, for acoustic pattern matching: they estimate the probability that a portion of the speech signal belongs to a phonetic class in a set given a priori, or constitutes a whole word in a predefined set of words.
Neural networks have an architecture that has certain similarities to the structure of the cerebral cortex, hence the name neural. A neural network is made up of many simple parallel computation units, called neurones, densely connected by a network of weighted connections, called synapses, that together constitute a distributed computation model. The activity of each individual unit is simple: it sums the weighted inputs arriving over its connections and transforms the result with a non-linear function. The power of the model lies in the configuration of the connections, in particular their topology and intensity.
Starting from the input units, which are provided with data on the problem to solve, the computation propagates in parallel in the network up to the output units that provide the result. A neural network is not programmed to execute a given activity, but is trained using an automatic learning algorithm, by means of a series of examples of the reality to be modelled.
The MLP or Multi-Layer Perceptron model currently covers a good percentage of neural network applications to speech. Each neurone of the MLP model sums its inputs, weighting each one by the intensity of the corresponding connection, passes this value to a non-linear (logistic) function and delivers the result as its output. The neurones are organised in levels: an input level, one or more internal levels and an output level. The connection between neurones of different levels is usually complete, whereas neurones of the same level are not interconnected.
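By way of illustration only, the computation performed by such a network can be sketched as follows. This is a minimal sketch in Python, not taken from any of the cited documents: the function names, the bias terms and the toy weights are assumptions introduced here for the example.

```python
import math


def logistic(x):
    """Logistic non-linearity applied by each neurone."""
    return 1.0 / (1.0 + math.exp(-x))


def layer_forward(inputs, weights, biases):
    """One fully connected level.

    Each neurone sums its inputs weighted by the connection intensities,
    adds a bias (an assumption of this sketch) and applies the logistic
    function. weights[i][j] is the connection from input j to neurone i.
    """
    return [
        logistic(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def mlp_forward(inputs, layers):
    """Propagate the input level by level up to the output level.

    `layers` is a list of (weights, biases) pairs, one per level.
    """
    activation = inputs
    for weights, biases in layers:
        activation = layer_forward(activation, weights, biases)
    return activation


# Hypothetical toy network: 3 inputs, 2 internal neurones, 2 outputs.
toy_layers = [
    ([[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]], [0.0, 0.1]),
    ([[1.0, -1.0], [-0.7, 0.4]], [0.0, 0.0]),
]
outputs = mlp_forward([0.2, 0.4, 0.6], toy_layers)
```

In this sketch every connection costs one product and one accumulation per input pattern, which is the quantity that dominates the execution cost of the network.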
With specific regard to neural networks for speech recognition, one recognition model in current use is illustrated in document EP 0 623 914. This document substantially describes a neural network incorporated in an automaton model of the patterns to be recognised. Each class is described in terms of left-to-right automata with cycles on states, and the classes may be whole words, phonemes or other acoustic units. A Multi-Layer Perceptron neural network computes the emission probabilities of the automaton states.
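The combination of a left-to-right automaton with per-frame state probabilities can be illustrated with a simplified Viterbi alignment. The sketch below is not the procedure of EP 0 623 914; it assumes a single automaton in which each state may only cycle on itself or advance to the next state, ignores transition probabilities, and takes the log emission probabilities as given (in a real system they would come from the neural network). The function name and the example values are hypothetical.

```python
import math


def viterbi_left_right(log_emissions):
    """Align frames to the states of a left-to-right automaton.

    log_emissions[t][s] is the log emission probability of state s at
    frame t. The path must start in state 0 and end in the last state.
    Returns (best log score, list with the state assigned to each frame).
    """
    n_frames = len(log_emissions)
    n_states = len(log_emissions[0])
    if n_frames < n_states:
        raise ValueError("need at least one frame per state")

    neg_inf = -math.inf
    score = [[neg_inf] * n_states for _ in range(n_frames)]
    back = [[0] * n_states for _ in range(n_frames)]
    score[0][0] = log_emissions[0][0]

    for t in range(1, n_frames):
        for s in range(n_states):
            stay = score[t - 1][s]  # cycle on the same state
            advance = score[t - 1][s - 1] if s > 0 else neg_inf
            if advance > stay:
                best, prev = advance, s - 1
            else:
                best, prev = stay, s
            score[t][s] = best + log_emissions[t][s]
            back[t][s] = prev

    # Trace the best path back from the final state.
    path = [n_states - 1]
    for t in range(n_frames - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return score[-1][-1], path


# Hypothetical emission probabilities for 4 frames and 2 states.
probs = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.1, 0.9]]
log_probs = [[math.log(p) for p in frame] for frame in probs]
best_score, best_path = viterbi_left_right(log_probs)
# best_path == [0, 0, 1, 1]
```

The cycles on states are what absorb differences in speaking rate: a slowly pronounced unit simply keeps the path in the same state for more frames.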
It is known, however, that neural network execution is very heavy in terms of the required computing power. In particular, a neural network used in a speech recognition system like the one described in the aforementioned document has efficiency problems when executed sequentially on a digital computer, due to the high number of connections to compute (each one requires the product of an input by the weight of the connection), which can be estimated at around 5 million products and accumulations for each second of speech.
An attempt at solving this problem, at least in part, was made in document EP 0 733 982, which illustrates a method for accelerating neural network execution, for processing correlated signals. This method is based on the principle that, since the input signal is sequential and evolves slowly over time in a continuous manner, it is not necessary to re-compute all the activation values of all the neurones for each input, but it suffices to propagate the differences with respect to the previous input in the network. In other words, the operation is not based on absolute values of neurone activation at time t, but on the difference with respect to the activation at time t-1. Therefore at each point of the network, if a neurone has, at time t, activation sufficiently similar to that of time t-1, it does not propagate any signal forward, limiting the activity exclusively to those neurones with an appreciable change in activation level.
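The principle of propagating differences rather than absolute values can be illustrated with a deliberately simplified sketch for a single level. It is not the method of EP 0 733 982 as such: the class name, the fixed threshold, the bias terms and the product counter are assumptions made here for the example. Each neurone keeps a running weighted sum, and an input contributes a correction only when it has changed appreciably since the value last propagated.

```python
import math


def logistic(x):
    """Logistic non-linearity applied by each neurone."""
    return 1.0 / (1.0 + math.exp(-x))


class DeltaLayer:
    """One fully connected level updated by differences (sketch)."""

    def __init__(self, weights, biases, threshold):
        self.weights = weights            # weights[i][j]: input j to neurone i
        self.threshold = threshold        # minimum change worth propagating
        self.net = list(biases)           # running weighted sum per neurone
        self.last_sent = [0.0] * len(weights[0])  # last propagated inputs
        self.products = 0                 # products actually computed

    def step(self, inputs):
        """Process the input at time t and return the activations."""
        for j, x in enumerate(inputs):
            delta = x - self.last_sent[j]
            if abs(delta) <= self.threshold:
                continue  # change too small: nothing is propagated
            for i, row in enumerate(self.weights):
                self.net[i] += row[j] * delta
                self.products += 1
            self.last_sent[j] = x
        return [logistic(v) for v in self.net]


# Hypothetical level with 3 inputs and 2 neurones.
layer = DeltaLayer([[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]], [0.0, 0.1], 0.05)
layer.step([0.2, 0.4, 0.6])   # first input: every input is propagated
layer.step([0.21, 0.4, 0.9])  # only the third input changed appreciably
products_used = layer.products
```

In this sketch the difference is taken with respect to the last value actually propagated rather than the immediately preceding input, so that small changes ignored at successive instants do not accumulate unnoticed. With a threshold of zero the result coincides with a full recomputation, up to rounding.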
However, the problem remains, especially in the case of small vocabularies that use only a small number of phonetic units. Indeed, in known systems each execution of the neural network involves computing all the output units, which places an evident computational load on the system.