Neural networks are parallel processing structures that reproduce, in very simplified form, the organization of the cerebral cortex. A neural network is formed by numerous processing units, called neurons, strongly interconnected through links of different intensity, called synapses or interconnection weights. Neurons are generally organized according to a layered structure, comprising an input layer, one or more intermediate layers and an output layer. Starting from the input units, which receive the signal to be processed, processing propagates to the subsequent layers in the network up to the output units, which provide the result. Various implementations of neural networks are described, for example, in the book by D. Rumelhart, "Parallel Distributed Processing", Vol. 1, Foundations, MIT Press, Cambridge, Mass., 1986.
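By way of illustration only, the layer-by-layer propagation described above can be sketched as follows. The sigmoid activation, the function names and the toy weights are assumptions made for this example and are not taken from the cited book.

```python
import math

def sigmoid(x):
    """Squashing function applied by each neuron (an assumed choice)."""
    return 1.0 / (1.0 + math.exp(-x))

def forward(layers, inputs):
    """Propagate an input vector through the layers, one after the other.

    layers: list of (weights, biases); weights[j][i] is the interconnection
    weight from unit i of the previous layer to unit j of this layer.
    """
    activations = list(inputs)
    for weights, biases in layers:
        activations = [
            sigmoid(sum(w * a for w, a in zip(row, activations)) + b)
            for row, b in zip(weights, biases)
        ]
    return activations

# Hypothetical toy network: 2 input units, 2 intermediate units, 1 output unit.
toy_layers = [
    ([[0.5, -0.5], [1.0, 1.0]], [0.0, -1.0]),  # input -> intermediate layer
    ([[1.0, -1.0]], [0.0]),                    # intermediate -> output layer
]
output = forward(toy_layers, [1.0, 1.0])
```

Here `output` holds the activation of the single output unit, obtained after the signal has passed through the intermediate layer.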
The neural network technique is applicable to many sectors and in particular to speech recognition, where a neural network is used to estimate the probability P(Q|X) of a phonetic unit Q, given the parametric representation X of a portion of the input speech signal. Words to be recognized are represented as a concatenation of phonetic units, and a dynamic programming algorithm is used to identify the word having the highest probability of being the one uttered.
Hidden Markov models are a classical speech recognition technique. A model of this type is formed by a number of states interconnected by the possible transitions. Transitions are associated with a probability of passing from the origin state to the destination state. Further, each state may emit symbols of a finite alphabet, according to a given probability distribution. In the case of speech recognition, each model represents an acoustic-phonetic unit through a left-to-right automaton in which it is possible to remain in each state with a cyclic transition or to pass to the next state. Furthermore, each state is associated with a probability density defined over X, where X represents a vector of parameters derived from the speech signal every 10 ms. Symbols emitted, according to the probability density associated with the state, are therefore the infinite possible parameter vectors X. This probability density is given by a mixture of Gaussian curves in the multidimensional space of the input vectors.
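As a minimal sketch of the model structure just described, the following code builds the transition matrix of a left-to-right automaton and evaluates a Gaussian mixture density for a parameter vector X. Diagonal covariances, the function names and the numerical values are assumptions made for this example.

```python
import math

def gaussian_diag(x, mean, var):
    """Density at vector x of a Gaussian with diagonal covariance (assumed form)."""
    log_p = 0.0
    for xi, mi, vi in zip(x, mean, var):
        log_p += -0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
    return math.exp(log_p)

def mixture_density(x, components):
    """components: list of (weight, mean, var); the weights should sum to 1."""
    return sum(w * gaussian_diag(x, m, v) for w, m, v in components)

def left_to_right_transitions(n_states, stay_prob):
    """Each state either remains in place (cyclic transition) or passes to the next state."""
    trans = [[0.0] * n_states for _ in range(n_states)]
    for s in range(n_states):
        if s + 1 < n_states:
            trans[s][s] = stay_prob
            trans[s][s + 1] = 1.0 - stay_prob
        else:
            trans[s][s] = 1.0  # last state only loops on itself in this sketch
    return trans

# Hypothetical three-state phonetic unit and a two-component mixture for one state.
transitions = left_to_right_transitions(3, 0.6)
state_mixture = [(0.5, [0.0, 0.0], [1.0, 1.0]), (0.5, [2.0, 2.0], [1.0, 1.0])]
density = mixture_density([1.0, 1.0], state_mixture)
```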
In the case of hidden Markov models too, words to be recognized are represented as a concatenation of phonetic units, and a dynamic programming algorithm (the Viterbi algorithm) is used to find the word most probably uttered, given the input speech signal.
More details about this recognition technique can be found, e.g., in L. Rabiner, B.-H. Juang, "Fundamentals of Speech Recognition", Prentice Hall, Englewood Cliffs, N.J. (USA).
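The dynamic programming search mentioned above can be sketched as follows for a single model, working with log probabilities. This is a generic textbook form of the Viterbi algorithm, not the implementation of any particular recognizer; the two-state model and its scores are hypothetical.

```python
import math

NEG_INF = float("-inf")  # log of a zero probability

def viterbi(log_init, log_trans, log_emit):
    """Return (best log score, best state sequence).

    log_emit[t][s] is the log emission score of frame t in state s.
    """
    n_states = len(log_init)
    score = [log_init[s] + log_emit[0][s] for s in range(n_states)]
    backptr = []
    for t in range(1, len(log_emit)):
        new_score, ptr = [], []
        for s in range(n_states):
            # Best predecessor of state s at frame t.
            best_prev = max(range(n_states), key=lambda p: score[p] + log_trans[p][s])
            new_score.append(score[best_prev] + log_trans[best_prev][s] + log_emit[t][s])
            ptr.append(best_prev)
        score = new_score
        backptr.append(ptr)
    last = max(range(n_states), key=lambda s: score[s])
    path = [last]
    for ptr in reversed(backptr):  # trace the best path backwards
        path.append(ptr[path[-1]])
    path.reverse()
    return score[last], path

# Hypothetical two-state left-to-right model observed over three frames.
log_init = [0.0, NEG_INF]
log_trans = [[math.log(0.5), math.log(0.5)], [NEG_INF, 0.0]]
log_emit = [
    [math.log(0.9), math.log(0.1)],
    [math.log(0.2), math.log(0.8)],
    [math.log(0.1), math.log(0.9)],
]
best_score, best_path = viterbi(log_init, log_trans, log_emit)
```

In a word recognizer, such a search would be run over the concatenated phonetic-unit models of each candidate word, and the word with the highest score would be retained.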
The method of this invention makes use of both the neural network technique and the Markov model technique through a two-step recognition and a combination of the results obtained by means of both techniques.
A recognition system in which the scores of different recognisers are combined to improve recognition accuracy is described in the paper "Speech recognition using segmental neural nets" by S. Austin, G. Zavaliagkos, J. Makhoul and R. Schwartz, presented at the ICASSP 92 Conference, San Francisco, March 23-26, 1992.
This known system performs a first recognition by means of hidden Markov models, providing a list of the N best recognition hypotheses (for instance, N = 20), i.e. of the N sentences that have the highest probability of being the sentence actually uttered, along with their likelihood scores. The Markov recognition stage also provides a phonetic segmentation of each hypothesis and transfers the segmentation result to a second recognition stage, based on a neural network. This stage performs recognition starting from the phonetic segments supplied by the first Markov step and provides in turn a list of hypotheses, each associated with a likelihood score, according to the neural recognition technique. Both scores are then linearly combined so as to form a single list, and the best hypothesis originating from such a combination is chosen as the recognised utterance.
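The linear combination of the two score lists can be sketched as follows. The weighting scheme, the function name and the example hypotheses and scores are assumptions made for illustration; the cited paper's actual weights are not reproduced here.

```python
def combine_nbest(hmm_scores, nn_scores, hmm_weight=0.5):
    """Linearly combine two score lists and sort the hypotheses, best first.

    hmm_scores, nn_scores: dicts mapping each hypothesis to its score.
    hmm_weight: hypothetical weight of the Markov score; the neural
    score receives 1 - hmm_weight.
    """
    nn_weight = 1.0 - hmm_weight
    combined = {
        hyp: hmm_weight * hmm_scores[hyp] + nn_weight * nn_scores[hyp]
        for hyp in hmm_scores
        if hyp in nn_scores
    }
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)

# Hypothetical log-likelihood scores for two competing hypotheses.
hmm_scores = {"call home": -10.0, "call Rome": -9.0}
nn_scores = {"call home": -2.0, "call Rome": -6.0}
ranked = combine_nbest(hmm_scores, nn_scores)
best_hypothesis = ranked[0][0]
```

In this toy case the Markov stage alone would prefer "call Rome", while the combined list ranks "call home" first.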
A system of this kind has some drawbacks. A first drawback is due to the second recognition step being performed starting from phonetic segments supplied by the first step: if segmentation is affected by time errors, the second step will in turn produce recognition errors that propagate to the final list. Furthermore, such a system is inadequate for isolated word recognition within large vocabularies, since it employs as a first stage the Markov recognizer, which under such particular circumstances is slightly less efficient than the neural one in terms of computational burden. Additionally, if one considers that the hypotheses provided by a Markov recognizer and a neural network recognizer show rather different score dynamics, a sheer linear combination of scores may lead to results which are not significant. Finally, the known system does not supply any reliability information about the recognition effected.
Availability of said information in systems exploiting isolated word recognition is on the other hand a particularly important feature: as a matter of fact, these systems generally request the user to confirm the uttered word, thus causing a longer procedure time. If reliability information is provided, the system can request confirmation only when recognition reliability falls below a given threshold, speeding up the procedure with benefits for both the user and the system operator.
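The use of reliability information to limit confirmation requests can be sketched as follows. The reliability scale between 0 and 1, the threshold value and the function names are assumptions made for this example.

```python
def needs_confirmation(reliability, threshold=0.8):
    """True when recognition reliability falls below the given threshold."""
    return reliability < threshold

def handle_recognition(word, reliability, threshold=0.8):
    """Return a confirmation prompt, or None when the word can be accepted directly."""
    if needs_confirmation(reliability, threshold):
        return f'Did you say "{word}"?'
    return None

# Hypothetical recognition results with their reliability values.
prompt_low = handle_recognition("Torino", 0.55)   # user is asked to confirm
prompt_high = handle_recognition("Milano", 0.93)  # accepted without confirmation
```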