Currently, the most successful techniques for speech recognition are based on probabilistic models known as hidden Markov models (HMMs). A Markov chain comprises a plurality of states, wherein a transition probability is defined for each transition from each state to every other state, including self transitions. For example, referring to FIG. 1, a coin can be represented as having two states: a head state, labeled "h", and a tail state, labeled "t". Each possible transition from one state to the other state, or to itself, is indicated by an arrow. For a single fair coin, the transition probabilities are all 50% (i.e., the tossed coin is just as likely to land heads up as tails up). Of course, a system can have more than two states. For example, a Markov system of two coins, wherein the second coin is biased so as to provide 75% heads and 25% tails, can be represented by four states, labeled HH, HT, TH, and TT. In this example, since each state is labeled by an observed state, an observation is deterministically associated with a unique state.
In a hidden Markov model, an observation is probabilistically associated with a unique state. As an example, consider an HMM system of three coins, represented by three states, each state corresponding to one of the three coins. The first coin is fair, having equal probability of heads and tails. The second coin is biased 75% towards heads, and the third coin is biased 75% towards tails. Assume that the probability of transitioning from any one of the states to another state or to the same state is equal, i.e., each transition probability is one third. Since all of the transition probabilities are the same, if the sequence H,H,H,H,T,H,T,T,T,T is observed, the most likely state sequence is the one for which the probability of each individual observation is maximum. Thus, the most likely state sequence is 2,2,2,2,3,2,3,3,3,3, since an observed H is most likely to be a result of the toss of coin 2, while each T is most likely to result from a toss of coin 3. However, if the transition probabilities are not all the same, a more powerful technique, such as the Viterbi algorithm, is required.
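The three-coin example can be checked with a short sketch. The function name and the dictionary layout are illustrative assumptions; only the coin biases and the observed sequence come from the text. Because every transition probability is one third, each observation can be decoded independently by picking the coin that makes it most probable.

```python
# Emission probabilities P(observation | coin) for the three coins in the text.
EMISSION = {
    1: {"H": 0.50, "T": 0.50},  # fair coin
    2: {"H": 0.75, "T": 0.25},  # biased towards heads
    3: {"H": 0.25, "T": 0.75},  # biased towards tails
}


def most_likely_states_uniform(observations):
    """Decode each toss independently; valid only when all transitions are equal."""
    return [
        max(EMISSION, key=lambda coin: EMISSION[coin][obs])
        for obs in observations
    ]


observed = list("HHHHTHTTTT")
print(most_likely_states_uniform(observed))  # [2, 2, 2, 2, 3, 2, 3, 3, 3, 3]
```

This shortcut stops working as soon as the transition probabilities differ, because the best state at one time then depends on its neighbors.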
Referring to FIG. 2, a sequence of state transitions can be represented as a path through a trellis that represents all of the states of the HMM over a sequence of observation times. Thus, given an observation sequence, the most likely sequence of states in the HMM, i.e., the most likely path through the trellis, can be determined using the Viterbi algorithm.
In a hidden Markov model, each observation is probabilistically associated with a state according to a measure of probability, such as a continuous probability density. Thus, even if the state of the system is known with complete certainty at any one instant of time, an observation is still conditioned according to the probability density associated with the state. Again, given a sequence of observations, the Viterbi algorithm can be used to determine the most likely sequence of states, which is commonly represented as the most likely path through the trellis constructed from the states.
Speech can be viewed as being generated by a hidden Markov process. Consequently, HMMs can be used to model an observed sequence of speech spectra, where specific spectra are probabilistically associated with a state in an HMM. Therefore, for a given observed sequence of speech spectra, there is a most likely sequence of states in a corresponding HMM. Further, if each distinct sequence of states in the HMM is associated with a sub-word unit, such as a phoneme, then a most likely sequence of sub-word units can be found. Moreover, using models of how sub-word units combine to form words, and language models of how words combine to form sentences, complete speech recognition can be achieved to a high degree of certainty.
For example, referring to FIG. 3, a phonetic HMM of a phoneme is represented by a state diagram, where the phoneme may be represented by a network of states, as shown. The number of states may vary depending upon the phoneme, and the various paths for the same phoneme may include different numbers of states. States 16 and 18 are pseudo-states that indicate the beginning and end of the phoneme, respectively. Each of the states 20, 22, 24 is associated with a probability density, i.e., a probability distribution of possible acoustic vectors, such as cepstral vectors, that correspond to that state. State transition probabilities are determined for transitions between pairs of states, between a state and a pseudo-state, and for self transitions, all transition probabilities being indicated as arrows shown in FIG. 3. Pseudo-states are included to facilitate or simplify the organization of the overall HMM for speech, and are not essential to the model.
Since the probability densities of adjacent phonemes often overlap, as shown in FIG. 4, any given acoustic vector can be associated with more than one state. In FIG. 4, the horizontal axis represents acoustic vectors, and is more specifically related to the power spectrum (with each value modeled as a short vector with, for example, 14 dimensions), while the vertical axis is the probability density. For example, the probability densities 32, 34, and 36 overlap, even though they correspond to distinct acoustic states. Consequently, a sequence of acoustic vectors cannot be deterministically mapped to a sequence of acoustic states in the phonetic HMM. Thus, phoneme recognition involves finding the most likely sequence of acoustic states of a phonetic HMM that is consistent with a sequence of acoustic vectors.
The phonetic HMM is developed using supervised learning during a "training" phase. Speech sound and associated phoneme labeling is presented to a speech learning module that develops the phonetic HMM. The density distribution associated with each acoustic state of a phonetic HMM is determined by observing many samples of the phoneme to be modeled. The various state transition probabilities and probability densities associated with each state are adjusted by the speech learning module in accordance with the many different pronunciations that are possible for each phoneme.
An HMM of a word includes a network of phonemes. Just as there can be more than one state path in a phonetic HMM for representing multiple acoustic sequences that are to be considered as the same phoneme, an HMM of a word can have multiple phonetic sequences, for example, as shown in FIG. 5, for representing multiple pronunciations of the same word. In FIG. 5, each of the two pronunciations of the word is represented by a sequence of states, where the number of states may vary depending on the word and the various paths for the same word may include different numbers of states. Each of the states 38 through 50 is a phonetic HMM, as shown in FIG. 3. The word HMM of FIG. 5 also includes two pseudo-states 52 and 54 that indicate the beginning and end of the word. These pseudo-states are likewise included to enhance the organization of the overall HMM, and are not essential to the model.
To improve the accuracy of word recognition, language models are often used in conjunction with acoustic word HMMs. A language model specifies the probability of speaking different word sequences. Examples of language models include N-gram probabilistic grammars, formal grammars, such as context free or context dependent grammars, and N-grams of word classes (rather than words). For large vocabulary speech recognition, the N-gram probabilistic grammars are most suited for integration with acoustic decoding. In particular, bigram and trigram grammars, where N is 2 and 3, respectively, are most useful.
Thus, an HMM of speech can be hierarchically constructed, having an acoustic level for representing sub-word units, such as phonemes, a sub-word level for representing words, and a language model level for representing the likely sequences of words to be recognized. Nevertheless, the HMM of speech can be viewed as consisting solely of acoustic states and their associated probability densities and transition probabilities. It is the transition probabilities between acoustic states, and optionally pseudo-states, that embody information from the sub-word level and the language model level. For example, each state within a phoneme typically has only two or three transitions, whereas a pseudo-state at the end of a phoneme may have transitions to many other phonemes.
During a training phase, all parameters of the sub-word model and the language model are estimated. Specifically, training starts with a substantial amount of transcribed speech. From this speech and its transcription, the parameters of a corresponding hidden Markov model are estimated.
To determine the transition probabilities between states at the grammar level, the most likely sequence of words corresponding to an acoustic speech signal must be determined. In principle, the ideal way to find this most likely sequence of words is by considering every possible sequence of words.
For each sequence of words W, to compute the probability of that sequence given the observed acoustic speech signal A, it is useful to employ Bayes' rule, wherein the probability $P(W \mid A)$ of the word sequence W given the acoustic sequence A is factored into three parts:
$$P(W \mid A) = \frac{P(W)\,P(A \mid W)}{P(A)} \qquad (1)$$
wherein $P(W)$ is the probability of the hypothesized sequence of words W, $P(A \mid W)$ represents the acoustic model, being the probability of the observed acoustic sequence given the word sequence, and $P(A)$ is the probability of the observed acoustic sequence A.
Then, to solve the speech recognition problem, the word sequence W for which the probability $P(W \mid A)$ is highest is found. Because A is the same for all hypothesized word sequences, $P(A)$ in the denominator of equation (1) can be ignored, since it does not affect the relative ranking of the hypothesized word sequences. So, in practice, we choose the string W for which $P(W)\,P(A \mid W)$ is highest. Generally, $P(W)$ is referred to as the language model, which expresses the a priori probability of each possible sequence of words W. $P(W)$ is estimated by compiling statistics on a large body of text, which, as previously mentioned, can be established through a training phase.
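The decision rule can be illustrated with a minimal sketch. The hypotheses and all probability values below are made up purely for illustration, and the function name is an assumption; the point is only that ranking by the product of the language model score and the acoustic score requires no knowledge of P(A).

```python
def rank_hypotheses(hypotheses):
    """Sort word sequences by P(W) * P(A|W), highest first; P(A) is omitted."""
    return sorted(
        hypotheses,
        key=lambda w: hypotheses[w][0] * hypotheses[w][1],
        reverse=True,
    )


# Hypothetical scores: word sequence -> (P(W), P(A|W)). Values are invented.
HYPOTHESES = {
    "recognize speech": (1e-4, 2e-6),
    "wreck a nice beach": (1e-6, 5e-6),
}

print(rank_hypotheses(HYPOTHESES)[0])  # recognize speech
```

Dividing every product by the same positive P(A) would scale all scores equally, so the ordering would be unchanged.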
While the equation for speech recognition, stated above as equation (1), is theoretically complete, a practical solution is not suggested by the equation for $P(W \mid A)$ alone. First, it is not feasible to determine $P(W)$ for all possible sequences of words, since the number of possible word sequences is extremely large, growing as $V^L$, i.e., exponentially with the sequence length L, where V is the number of possible words in the vocabulary, and L is the number of words spoken in a particular sequence. For example, given a vocabulary of 10,000 words, there are $10^{20}$ (i.e., $10{,}000^5$) possible 5-word sequences. Second, even if these probabilities could reasonably be estimated, the exhaustive search over all possible word sequences for the most likely word sequence would require prohibitively large amounts of computation. Similarly, establishing the probability of each possible acoustic sequence, given any word sequence, is an intractable problem.
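The size of the search space quoted above is easy to verify. The helper name below is an illustrative assumption; the figures are the ones given in the text.

```python
def num_word_sequences(vocabulary_size, sequence_length):
    """Number of distinct word sequences of a given length: V ** L."""
    return vocabulary_size ** sequence_length


# A 10,000-word vocabulary and 5-word sequences, as in the text.
print(num_word_sequences(10_000, 5) == 10 ** 20)  # True
```

Each additional word in the sequence multiplies the count by the vocabulary size, which is why exhaustive enumeration is ruled out even for short utterances.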
Consequently, in practical speech recognition, simplifying assumptions are made so as to estimate the above probabilities, and then efficient search algorithms are used to find the most likely word sequence without considering each complete sequence explicitly.
The estimation of the probabilities is accomplished by making certain reasonable assumptions regarding the independence of sub-sequences of words with respect to other words within the complete sequence. Thus, for example, for the language model, it can be assumed that the probability of the entire sequence of words can be approximated by a limited order Markov chain model, which assumes that the probability of each word in the sequence depends only on, for example, the previous one or two words, so that the probability of the entire word sequence can be approximated as a product of these probabilities. In the case of a bigram grammar, the probability $P(W) = P(w_1, w_2, w_3, \ldots, w_n)$ of a word sequence $(w_1, w_2, w_3, \ldots, w_n)$ is approximated by:
$$P(w_1) \prod_{i=2}^{n} P(w_i \mid w_{i-1}) \qquad (2)$$
wherein the probability of the first word $w_1$ is multiplied by the product of the probability of each subsequent word $w_i$ given the previous word $w_{i-1}$; and in the case of a trigram grammar, $P(W) = P(w_1, w_2, w_3, \ldots, w_n)$ is approximated by:
$$P(w_1)\,P(w_2 \mid w_1) \prod_{i=3}^{n} P(w_i \mid w_{i-1}, w_{i-2}) \qquad (3)$$
wherein the probability of the first word $P(w_1)$ is multiplied by the probability of the second word given the first word $P(w_2 \mid w_1)$ times the product of the probabilities of each subsequent word $w_i$ given the previous two words $w_{i-1}$ and $w_{i-2}$.
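The bigram and trigram approximations of equations (2) and (3) can be sketched as follows. The function names, the table layout (dictionaries keyed by word tuples), and the toy probabilities are all assumptions made for illustration; a practical system would also need smoothing for unseen word pairs and would usually work with log probabilities.

```python
def bigram_probability(words, unigram, bigram):
    """Equation (2): P(w1) times the product of P(w_i | w_{i-1})."""
    p = unigram[words[0]]
    for prev, cur in zip(words, words[1:]):
        p *= bigram[(prev, cur)]
    return p


def trigram_probability(words, unigram, bigram, trigram):
    """Equation (3): P(w1) * P(w2 | w1) times the product of P(w_i | w_{i-1}, w_{i-2})."""
    p = unigram[words[0]]
    if len(words) > 1:
        p *= bigram[(words[0], words[1])]
    for i in range(2, len(words)):
        # Key order is (w_{i-2}, w_{i-1}, w_i): the two context words, then the word.
        p *= trigram[(words[i - 2], words[i - 1], words[i])]
    return p


# Toy probability tables with invented values.
UNIGRAM = {"the": 0.5}
BIGRAM = {("the", "cat"): 0.2, ("cat", "sat"): 0.5}
TRIGRAM = {("the", "cat", "sat"): 0.25}

sentence = ["the", "cat", "sat"]
print(bigram_probability(sentence, UNIGRAM, BIGRAM))
print(trigram_probability(sentence, UNIGRAM, BIGRAM, TRIGRAM))
```

With these tables the bigram score is 0.5 × 0.2 × 0.5 and the trigram score is 0.5 × 0.2 × 0.25.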
The bigram and trigram grammars are only two of many different available language models in which the language model probability can be factored into a plurality of independent probabilities. The problem of searching the vast space of possible word sequences for the most likely one is made easier by these assumptions regarding the independence of sub-sequences of words. For example, when a bigram language model is used, the complexity of the search is linear in V and in L.
For the acoustic model, similar simplifying assumptions can be made. The acoustic realization of each phoneme is known to depend substantially on preceding and subsequent phonemes, i.e., it is context-dependent. Typically, the word error rate (the percentage of words that are misrecognized) is halved when context-dependent models are used. For example, a triphone model of a phoneme depends on three phonemes: the phoneme itself and both the immediately preceding and immediately following phonemes. Thus, a triphone model assumes and represents the fact that the way a phoneme is pronounced depends more on its immediate neighboring phonemes than on other more temporally distant words or phonemes. Incorporating triphone models of phonemes in the HMM for speech significantly improves its performance.
Thus, the probability of an acoustic sequence given a word sequence W is given by:
$$P(A \mid W) = P(a_1 \ldots a_T \mid w_1 \ldots w_n) \qquad (4)$$
which is approximated by
$$\prod_{i=1}^{N} P(a_i \mid \mathrm{ph}_{i-1}, \mathrm{ph}_i, \mathrm{ph}_{i+1}) \qquad (5)$$
wherein $a_i$ is the acoustic observation sequence that is attributed to phoneme $\mathrm{ph}_i$, N is the number of phonemes in the sequence, and the product (5) is the product of the conditional probabilities of each acoustic subsequence given the preceding, current, and succeeding phoneme.
Biphone models are also possible, where a phoneme is modeled as being dependent on only the preceding or following phoneme. For example, a phoneme model that depends on the preceding phoneme is called a left-context model, while a phoneme model that depends on the succeeding phoneme is called a right-context model.
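The context units described above can be made concrete with a small sketch that lists the triphone and left-context biphone for each phoneme of a sequence. The function names, the "sil" boundary symbol used to pad the ends, and the example phoneme labels are assumptions for illustration; the text does not specify how utterance boundaries are handled.

```python
def triphone_contexts(phonemes, boundary="sil"):
    """Return (preceding, current, following) for each phoneme in the sequence."""
    padded = [boundary] + list(phonemes) + [boundary]
    return [tuple(padded[i - 1:i + 2]) for i in range(1, len(padded) - 1)]


def left_context_biphones(phonemes, boundary="sil"):
    """Return (preceding, current) for each phoneme: a left-context model."""
    padded = [boundary] + list(phonemes)
    return [(padded[i - 1], padded[i]) for i in range(1, len(padded))]


# Hypothetical phoneme labels for a single word.
phones = ["k", "ae", "t"]
print(triphone_contexts(phones))
# [('sil', 'k', 'ae'), ('k', 'ae', 't'), ('ae', 't', 'sil')]
print(left_context_biphones(phones))
# [('sil', 'k'), ('k', 'ae'), ('ae', 't')]
```

Each tuple identifies which context-dependent model would supply one factor of the product in equation (5).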
When actually processing an acoustic signal, the signal is sampled in sequential time intervals called frames. The frames typically include a plurality of samples and may overlap or may be contiguous. Nevertheless, each frame is associated with a unique time interval, and with a unique portion of the speech signal. The portion of the speech signal within each frame is spectrally analyzed to produce a sequence of acoustic vectors. During training, the acoustic vectors are statistically analyzed to provide the probability density associated with each state in the phonetic HMM models. During recognition, a search is performed for the state sequence most likely to be associated with the sequence of acoustic vectors.
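The division of a sampled signal into frames can be sketched as below. The function name and parameters are illustrative assumptions, and the integer samples stand in for real audio; spectral analysis of each frame is not shown. A frame shift smaller than the frame length yields overlapping frames, and a shift equal to the frame length yields contiguous ones.

```python
def split_into_frames(samples, frame_length, frame_shift):
    """Cut a sample sequence into fixed-length frames, advancing by frame_shift."""
    frames = []
    start = 0
    while start + frame_length <= len(samples):
        frames.append(samples[start:start + frame_length])
        start += frame_shift
    return frames


signal = list(range(10))  # placeholder samples

overlapping = split_into_frames(signal, frame_length=4, frame_shift=2)
contiguous = split_into_frames(signal, frame_length=4, frame_shift=4)
print(overlapping)  # [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
print(contiguous)   # [[0, 1, 2, 3], [4, 5, 6, 7]]
```

This sketch drops any trailing samples that do not fill a whole frame, which is one of several possible conventions.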
To find the most likely sequence of states corresponding to a sequence of acoustic vectors, the Viterbi algorithm is employed. In the Viterbi algorithm, computation starts at the first frame and proceeds one frame at a time in a time-synchronous manner. At each frame, a probability score $\alpha$ is computed for each state in the entire HMM for speech. The score $\alpha$ is the joint probability, i.e., the product of the individual probabilities, of all of the observed data up to the time of the frame, and the state transition sequence ending at the state. The score $\alpha$ at state i and time t is thus given by:
$$\alpha(i,t) = \max_{k} P(S(i,t,k), A_t) \qquad (6)$$
wherein $S(i,t,k)$ is the kth state sequence that begins at an initial state $s_1$ at time 1, and ends at a state $s_i$ at time t, and $A_t$ is the sequence of acoustic observations $a_1 \ldots a_t$ from time 1 to time t.
The above joint probability can be factored into two terms: the a priori probability of the particular state sequence, $P(S(i,t,k))$, and the conditional probability of the acoustic observation sequence $A_t$ given that state sequence, $P(A_t \mid S(i,t,k))$:
$$\alpha(i,t) = \max_{k} P(S(i,t,k))\,P(A_t \mid S(i,t,k)) \qquad (7)$$
Thus, when analyzing an acoustic signal, a cumulative $\alpha$ score is successively computed for each of the possible state sequences as the Viterbi algorithm analyzes the acoustic signal frame by frame. By the end of the utterance, the sequence having the highest $\alpha$ score produced by the Viterbi algorithm provides the most likely state sequence for the entire utterance. The most likely state sequence can then be converted into the corresponding spoken word sequence.
The independence assumptions used to facilitate the acoustic and language models also facilitate the Viterbi search. According to the Markov independence assumption, since the number of states corresponds to the number of independent parts of the model, the probability of the present state at any time depends only on the preceding state. Similarly, the probability of the acoustic observation at each time frame depends only on the current or present state. This leads to the familiar iteration used in the Viterbi algorithm:
$$\alpha(i,t) = \left[\max_{j} \alpha(j,t-1)\,P(i \mid j)\right] P(x(t) \mid i) \qquad (8)$$
wherein $P(i \mid j)$ is the probability of transition to state i given state j, and $P(x(t) \mid i)$ is the conditional probability of $x(t)$, the acoustic observation x at time t, given state i.
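The iteration of equation (8) can be sketched on the three-coin example from earlier. The function signature, the dictionary-based tables, the uniform initial distribution, and the "sticky" transition values (0.98 for staying, 0.01 for each switch) are assumptions chosen for illustration; the coin biases come from the text. Backpointers are stored so that the best path can be recovered at the end.

```python
def viterbi(observations, states, initial, transition, emission):
    """Return (most likely state path, its joint probability) via recursion (8)."""
    # alpha[t][i]: best joint score of any state sequence ending in state i at time t.
    alpha = [{i: initial[i] * emission[i][observations[0]] for i in states}]
    backpointers = []
    for t in range(1, len(observations)):
        prev, cur, back = alpha[-1], {}, {}
        for i in states:
            best_j = max(states, key=lambda j: prev[j] * transition[j][i])
            cur[i] = prev[best_j] * transition[best_j][i] * emission[i][observations[t]]
            back[i] = best_j
        alpha.append(cur)
        backpointers.append(back)
    # Trace back from the best final state.
    last = max(states, key=lambda i: alpha[-1][i])
    path = [last]
    for back in reversed(backpointers):
        path.append(back[path[-1]])
    path.reverse()
    return path, alpha[-1][last]


STATES = [1, 2, 3]
EMISSION = {
    1: {"H": 0.50, "T": 0.50},
    2: {"H": 0.75, "T": 0.25},
    3: {"H": 0.25, "T": 0.75},
}
INITIAL = {s: 1 / 3 for s in STATES}  # assumed uniform start
UNIFORM = {j: {i: 1 / 3 for i in STATES} for j in STATES}
STICKY = {j: {i: (0.98 if i == j else 0.01) for i in STATES} for j in STATES}

print(viterbi(list("HHHHTHTTTT"), STATES, INITIAL, UNIFORM, EMISSION)[0])
# [2, 2, 2, 2, 3, 2, 3, 3, 3, 3]
print(viterbi(list("HHTHH"), STATES, INITIAL, STICKY, EMISSION)[0])
# [2, 2, 2, 2, 2]
```

With uniform transitions the result matches per-observation decoding. With the sticky transitions, the single T is explained as an unlikely toss of coin 2 rather than as a brief switch to coin 3, because two switches cost more than one improbable observation.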
This algorithm is guaranteed to find the most likely sequence of states through the entire HMM given the observed acoustic sequence. Theoretically, however, this does not provide the most likely word sequence, because the probability of the input sequence given the word sequence is correctly computed by summing the probability over all possible state sequences belonging to any particular word sequence. Nevertheless, the Viterbi technique is most commonly used because of its computational simplicity.
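The difference between taking the maximum and summing over state sequences can be seen in a small sketch that runs the same recursion with either reduction. The function name, the `combine` parameter, and the uniform tables are assumptions for illustration; the sketch scores a single small model rather than the set of state sequences belonging to a word sequence.

```python
def sequence_score(observations, states, initial, transition, emission, combine):
    """Run the frame-by-frame recursion, reducing over predecessors with `combine`.

    combine=max gives the probability of the single best state sequence;
    combine=sum gives the total probability over all state sequences.
    """
    score = {i: initial[i] * emission[i][observations[0]] for i in states}
    for obs in observations[1:]:
        score = {
            i: combine(score[j] * transition[j][i] for j in states) * emission[i][obs]
            for i in states
        }
    return combine(score.values())


STATES = [1, 2, 3]
EMISSION = {
    1: {"H": 0.50, "T": 0.50},
    2: {"H": 0.75, "T": 0.25},
    3: {"H": 0.25, "T": 0.75},
}
INITIAL = {s: 1 / 3 for s in STATES}
UNIFORM = {j: {i: 1 / 3 for i in STATES} for j in STATES}

best_path = sequence_score(list("HT"), STATES, INITIAL, UNIFORM, EMISSION, max)
all_paths = sequence_score(list("HT"), STATES, INITIAL, UNIFORM, EMISSION, sum)
print(best_path, all_paths)  # approximately 0.0625 and 0.25
```

The summed score is never smaller than the best-path score, and two hypotheses can in principle be ranked differently by the two measures.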
Thus, the Viterbi algorithm reduces an exponential computation to one that is proportional to the number of states and transitions in the model and the length of the utterance. However, for a large vocabulary and grammar, the number of states and transitions becomes large and the computation needed to update the probability score $\alpha$ at each state in each frame for all possible state sequences takes many times longer than the duration of one frame, which is typically about 10 ms.
A technique called "beam searching" or "pruning" has been developed to greatly reduce the computation needed to determine the most likely state sequence by avoiding computation of the $\alpha$ probability score for state sequences that are very unlikely. This is accomplished by comparing, at each frame, each score $\alpha$ with the largest score $\alpha$ of that frame. As the $\alpha$ scores for the various state sequences are being computed, if the score $\alpha$ at a state for a particular partial sequence is sufficiently low compared to the maximum computed score at that point of time, it is assumed to be unlikely that the lower scoring partial state sequence will be part of the completed most likely state sequence. In theory, this method does not guarantee that the most likely state sequence will be found. In practice, however, the probability of a search error can be made extremely low.
Comparing each $\alpha$ score with the largest $\alpha$ score is accomplished by defining a minimum threshold wherein any partial state sequence having a score falling below the threshold is rendered inactive. The threshold is determined by dividing the largest score of a frame by a "beamwidth", which is obtained empirically so as to maximize the computational savings while minimizing error rate. For example, in a typical recognition experiment with a vocabulary of 20K words and a bigram grammar, the beam search technique reduces the number of "active" states (those states for which we perform the state update) in each frame from about 500,000 to about 25,000, thereby reducing computation by a factor of about 20. However, this number of active states is still much too large for real time operation, even when a beam search is employed.
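The thresholding step can be sketched in a few lines. The function name, the state labels, and the score values are assumptions for illustration; the rule itself (threshold equals the largest score of the frame divided by the beamwidth) is the one described above.

```python
def prune_states(scores, beamwidth):
    """Keep only states whose score is at least (best score / beamwidth)."""
    threshold = max(scores.values()) / beamwidth
    return {state: s for state, s in scores.items() if s >= threshold}


# Hypothetical per-state scores for one frame.
frame_scores = {"a": 1e-3, "b": 5e-4, "c": 1e-7}
active = prune_states(frame_scores, beamwidth=100.0)
print(sorted(active))  # ['a', 'b']
```

Only the surviving states would be extended in the next frame. A larger beamwidth keeps more states and lowers the risk of discarding the eventual best path, at the cost of more computation.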
Another well-known technique for reducing computational overhead is to represent the HMM of speech as a tree structure wherein all of the words likely to be encountered reside at the ends of branches or at nodes in the tree. Each branch represents a phoneme, and is associated with a phonetic HMM. All the words that share the same first phoneme share the same first branch, all words that share the same first and second phonemes share the same first and second branches, and so on. For example, the phonetic tree shown in FIG. 6 includes sixteen different words, but there are only three initial phonemes. In fact, the number of initial branches cannot exceed the total number of phonemes (about 40), regardless of the size of the vocabulary.
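A phonetic tree of this kind is a prefix tree over pronunciations, and can be sketched as follows. The function names, the nested-dictionary node layout, and the four-word lexicon with its phoneme labels are assumptions for illustration and are not the tree of FIG. 6.

```python
def build_phonetic_tree(lexicon):
    """Build a prefix tree in which each branch is one phoneme.

    lexicon maps each word to its phoneme sequence. Words are recorded at the
    node where their pronunciation ends.
    """
    root = {"children": {}, "words": []}
    for word, phonemes in lexicon.items():
        node = root
        for ph in phonemes:
            node = node["children"].setdefault(ph, {"children": {}, "words": []})
        node["words"].append(word)
    return root


def count_branches(node):
    """Count all branches (phoneme arcs) below a node."""
    return sum(1 + count_branches(child) for child in node["children"].values())


# Hypothetical pronunciations.
LEXICON = {
    "cat": ["k", "ae", "t"],
    "can": ["k", "ae", "n"],
    "cab": ["k", "ae", "b"],
    "dog": ["d", "ao", "g"],
}

tree = build_phonetic_tree(LEXICON)
print(len(tree["children"]))  # 2 initial branches: 'k' and 'd'
print(count_branches(tree))   # 8 branches, versus 12 phonemes in a flat lexicon
```

The three words beginning with "k ae" share their first two branches, so the models for those phonemes need to be evaluated only once for all three words.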
It is possible to consider the beginning of all words in the vocabulary, even if the vocabulary is very large, by evaluating the probability of each of the possible first phonemes (typically around 40). Using an approach like the beam search, many of the low-probability phoneme branches can be eliminated from the search. Consequently, at the second level in the tree, which has many more branches, the number of hypotheses is also reduced. Thus, all of the words in the vocabulary can be considered, while incurring a computational cost that grows only logarithmically with the number of words, rather than linearly. This is particularly useful for recognizing speech based upon very large vocabularies.
However, there are several limitations imposed when a phonetic tree is used. For example, if two words share the same first phoneme, but have a different second phoneme, then the first triphones of the two words are different. One possible solution is to construct a tree using triphones rather than phonemes. However, this greatly reduces the computational savings introduced by using a tree, since the number of unique branches at each level in the triphone tree would be equal to the number of branches at the following level in the simple phonetic tree.
Also, it is not possible to perform an exact bigram search with a single instantiation of a phonetic tree. In a bigram search, each state transition represents a pair of words, one word from the initial or previous state, and one word from the final or present state. Each pair of words indexes a bigram probability. By contrast, each state in a single instantiation of a phonetic tree is part of many different words that share at least one phoneme. Thus, the final grammar state of a bigram state transition to a state in a single phonetic tree is thereby indeterminate.
Further, the optimal Viterbi search algorithm requires that a separate copy of the path score be kept for every state in the entire HMM of speech, whereas for a single instantiation of a phonetic tree, since each state is associated with many words, many copies of the path score must be stored in each state.
In an attempt to solve this problem, Ney and Steinbiss (Arden House 1991, IEEE International Conference on Acoustics Speech and Signal Processing 1992) use an approach which can be termed a "forest search", wherein a separate phonetic tree is used for words following each different preceding word for bigram modeling. Each phonetic tree can therefore represent all possible present words and final grammar states of a bigram state transition (the present word) from the word ending state of each of a plurality of initial grammar states (the previous word). Thus, each state of any one of the phonetic trees is used following only a single preceding word. Consequently, the optimal Viterbi search can be employed, since a separate copy of the path score can be kept for every state in the entire HMM model.
However, the bigram probability for the word of the phonetic tree, given the word ending state of the previous word, can be applied only at the end of the word of the final state, because the identity of the word of the final state is not known until its last phoneme is reached.
Also, in principle, using a separate phonetic tree to represent the words following each of a plurality of preceding grammar states of a bigram state transition can result in as many trees as there are words in the vocabulary. However, in practice, a beam search is used to eliminate all but the most likely word ending states. That is, the scores of most word ending states are very low. Ney and Steinbiss report that upon each frame, there are typically only about ten words with word ending states having a sufficiently high score. As a result, there are typically only thirty trees with active states.
The states that are active in the different trees may not be the same. Thus, the total number of active states is typically between 10 and 30 times the number of active states in each tree. This means that much of the savings from using a tree is offset by duplicating the computation for several states that are in common among all the trees. Recall that in the original Viterbi algorithm, each state requires computation only once in each frame.
Ney and Steinbiss report a further computational savings by using a fast match algorithm for each phoneme upon each frame. The phoneme fast match looks at the next few frames of the speech to determine which phonemes match reasonably well. When only context-independent phonetic models are used, this information can be used throughout each of the phonetic trees to predict which paths will result in high scores. However, as stated above, using only context-independent phonetic models results in twice the word error relative to using context-dependent phonetic models. The fast match approach reduces computation by a factor of three. The same computational savings could be obtained if phonetic trees were not used. In fact, the computational savings obtained by using multiple trees, relative to a beam search, is only a factor of five.
According to the technique of Ney and Steinbiss, the "current" word for any state is not known for any but the final states in a phonetic tree. Consequently, the bigram probability of the current word given the previous word(s) cannot be used until the final state of a word is reached. As a result, the pruning of states within a phonetic tree cannot benefit from the grammar score, and must depend solely on the phonetic information of the tree.
In summary, the simple use of multiple phonetic trees suffers from several deficiencies: full triphone acoustic models cannot be used; grammar probabilities cannot be applied until the last state of a word is reached; and computation must be repeated over many trees having the same active states.