The goal of a speech recognition apparatus is to produce the word string w with the highest a posteriori probability of occurring:
$$P(w \mid A) = \frac{P(A \mid w)\, P(w)}{P(A)}$$
where
P(w) is the probability of any one particular word string;
P(A) is the probability of the acoustic feature vectors;
P(w|A) is the probability of a string of words given some acoustic feature vectors; and
P(A|w) is the probability of the acoustic feature vectors given the word string.
The term P(w) is generally known as the Language Model and is not relevant to the instant invention. The term P(A|w) is referred to as the Acoustic Model and relates to the computation of the probability that a given word will produce a string of acoustic features or parameters (i.e., acoustic or speech feature vectors, or simply feature vectors). Put differently, the purpose of the Acoustic Model is to assign a probability that a given string of words would produce a particular set of feature vectors. The present invention is directed to certain aspects of this Acoustic Model.
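As an illustration of how the Language Model and the Acoustic Model combine, the following is a minimal sketch, not part of the disclosed apparatus. It assumes a small set of candidate word strings with already-computed values of P(A|w) and P(w). The candidate strings, the probability values, and the function name `best_word_string` are all hypothetical. Because P(A) is the same for every candidate, it can be omitted when selecting the most probable word string.

```python
# Toy illustration only: the candidate strings and every probability below
# are invented for this example and do not come from the specification.

def best_word_string(acoustic_likelihood, language_prior):
    """Return the word string w that maximizes P(A|w) * P(w).

    P(A) is identical for all candidates, so dividing by it would not
    change which candidate has the highest a posteriori probability.
    """
    return max(
        acoustic_likelihood,
        key=lambda w: acoustic_likelihood[w] * language_prior[w],
    )

# P(A|w): Acoustic Model score of each candidate for one fixed utterance A.
acoustic_likelihood = {
    "recognize speech": 0.020,
    "wreck a nice beach": 0.030,
}

# P(w): Language Model probability of each candidate word string.
language_prior = {
    "recognize speech": 0.010,
    "wreck a nice beach": 0.001,
}

best = best_word_string(acoustic_likelihood, language_prior)
print(best)
```

In this toy case the second candidate has the higher acoustic score, but the first candidate wins once the Language Model probability is taken into account.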
In brief, because pronunciation and emphasis vary whenever one speaks, no single string of speech feature vectors will always correspond to a given spoken output, particularly since the output from a speaker varies from day to day, and in fact from minute to minute. In addition, no feature vector of one utterance will exactly match a feature vector of a different utterance, even for the same word.
Thus, there is a need for an Acoustic Model that will indicate the probability that a given word will produce a given set of feature vectors, for every possible word in the vocabulary. Putting it simply, there is a need for an apparatus and a method of computing the probability that a particular word in fact could produce a string of feature vectors, if that string of feature vectors were presented to the apparatus.
One of the most successful techniques for constructing Acoustic Models employs Hidden Markov Models. The use of Hidden Markov Models is well known in the art of speech recognition and will not be described here. See, for example, "A Maximum Likelihood Approach to Continuous Speech Recognition," Lalit R. Bahl et al., IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. PAMI-5, No. 2, March 1983, incorporated into this application by reference. Note, however, one key aspect of interest in the use of Hidden Markov Model technology: the replacement of each feature vector by a label drawn from a small number (typically fewer than 500) of possible labels. This reduces the amount of data that the Hidden Markov Model component must deal with and thus simplifies later computational and modelling stages of the recognizer.
Assigning a label to a feature vector in the prior art has been achieved by noting that groups of feature vectors in n-dimensional space divide that space into a number of convex regions. The values of the feature vectors in the regions are averaged such that each region is represented by a prototype and each feature vector extracted from speech is identified with the prototype in space to which it is the closest. The feature vector is accordingly labelled with an identifier of that prototype.
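The prior art labelling step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of any particular recognizer: it assumes Euclidean distance as the measure of closeness (the text says only "closest"), uses two-dimensional vectors for readability, and the region contents, label names, and function names `make_prototype` and `label_feature_vector` are hypothetical.

```python
import math

def make_prototype(vectors):
    """Average the feature vectors of one region, component by component."""
    n = len(vectors)
    return tuple(sum(component) / n for component in zip(*vectors))

def label_feature_vector(vector, prototypes):
    """Return the identifier of the prototype closest to `vector`.

    Assumption: closeness is measured by Euclidean distance.
    """
    return min(
        prototypes,
        key=lambda label: math.dist(vector, prototypes[label]),
    )

# Hypothetical regions: each label maps to the feature vectors grouped in it.
regions = {
    "L1": [(0.0, 0.0), (2.0, 0.0)],
    "L2": [(10.0, 10.0), (12.0, 14.0)],
}

# One prototype per region, obtained by averaging the region's vectors.
prototypes = {label: make_prototype(vs) for label, vs in regions.items()}

# Each incoming feature vector is replaced by the label of its nearest prototype.
incoming = [(1.5, 0.5), (9.0, 11.0)]
labels = [label_feature_vector(v, prototypes) for v in incoming]
print(labels)
```

Note that this scheme uses only geometric proximity: nothing in it records which sound produced a given feature vector.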
The problem with the prior art is that for each of the regions to which a prototype has been assigned, there are, in addition to the feature vectors which correspond to a particular sound, a number of additional feature vectors that are associated with other sounds. Consequently, the prior art speech recognition technique results in a large number of recognition errors. For example, sounds such as "s", "f" and "sh" are sometimes all given the same label.
Thus, some technique, and an apparatus for effecting such technique, is needed to impart speech knowledge into speech recognition in order to reduce as far as possible the number of errors that are made with the prior art technique.