The present invention relates to computer speech recognition. More particularly, the present invention relates to computer speech recognition using a dynamically configurable acoustic model in the speech recognition system.
The most successful current speech recognition systems employ probabilistic models known as hidden Markov models (HMMs). A hidden Markov model includes a plurality of states, wherein a transition probability is defined for each transition from each state to every state, including transitions to the same state. An observation is probabilistically associated with each unique state. The transition probabilities between states (the probabilities that the model will transition from one state to the next) are not all the same. Therefore, a search technique, such as a Viterbi algorithm, is employed in order to determine the most likely state sequence, that is, the sequence for which the overall probability is maximum, given the transition probabilities between states and the observation probabilities.
A sequence of state transitions can be represented, in a known manner, as a path through a trellis diagram that represents all of the states of the HMM over a sequence of observation times. Therefore, given an observation sequence, a most likely path through the trellis diagram (i.e., the most likely sequence of states represented by an HMM) can be determined using a Viterbi algorithm.
In current speech recognition systems, speech has been viewed as being generated by a hidden Markov process. Consequently, HMMs have been employed to model observed sequences of speech spectra, where specific spectra are probabilistically associated with a state in an HMM. In other words, for a given observed sequence of speech spectra, there is a most likely sequence of states in a corresponding HMM.
This corresponding HMM is thus associated with the observed sequence. This technique can be extended, such that if each distinct sequence of states in the HMM is associated with a sub-word unit, such as a phoneme, then a most likely sequence of sub-word units can be found. Moreover, by using models of how sub-word units are combined to form words, and then language models of how words are combined to form sentences, complete speech recognition can be achieved.
When actually processing an acoustic signal, the signal is typically sampled in sequential time intervals called frames. The frames typically include a plurality of samples and may overlap or be contiguous. Each frame is associated with a unique portion of the speech signal. The portion of the speech signal represented by each frame is analyzed to provide a corresponding acoustic vector. During speech recognition, a search is performed for the state sequence most likely to be associated with the sequence of acoustic vectors.
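The framing step described above can be sketched as follows. This is only a minimal illustration: the function names, the frame length and hop parameters, and the two-element stand-in for an acoustic vector (mean energy and zero-crossing count) are assumptions made for the example and are not features taken from the system described here.

```python
from typing import List, Sequence


def split_into_frames(samples: Sequence[float], frame_len: int, hop: int) -> List[List[float]]:
    """Split a sampled signal into frames of frame_len samples each.

    hop == frame_len yields contiguous frames; hop < frame_len yields
    overlapping frames. A trailing partial frame is dropped.
    """
    if frame_len <= 0 or hop <= 0:
        raise ValueError("frame_len and hop must be positive")
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(list(samples[start:start + frame_len]))
        start += hop
    return frames


def toy_acoustic_vector(frame: Sequence[float]) -> List[float]:
    """Hypothetical stand-in for an acoustic vector: [mean energy, zero crossings]."""
    energy = sum(x * x for x in frame) / len(frame)
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    return [energy, float(crossings)]


if __name__ == "__main__":
    signal = [1.0, -1.0, 2.0, -2.0, 3.0, -3.0, 4.0, -4.0]
    for frame in split_into_frames(signal, frame_len=4, hop=2):
        print(frame, toy_acoustic_vector(frame))
```

A practical front end would compute spectral features for each frame rather than these two toy values; the sketch only shows how one vector is produced per frame.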
In order to find the most likely sequence of states corresponding to a sequence of acoustic vectors, an acoustic model is accessed and the Viterbi algorithm is employed. The Viterbi algorithm performs a computation which starts at the first frame and proceeds one frame at a time, in a time-synchronous manner. A probability score is computed for each state in the state sequences (i.e., the HMMs) being considered. Therefore, a cumulative probability score is successively computed for each of the possible state sequences as the Viterbi algorithm analyzes the acoustic signal frame by frame, based on the acoustic model. By the end of an utterance, the state sequence (or HMM or series of HMMs) having the highest probability score computed by the Viterbi algorithm provides the most likely state sequence for the entire utterance. The most likely state sequence is then converted into a corresponding spoken subword unit, word, or word sequence.
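The time-synchronous computation described above can be illustrated with a small Viterbi sketch over a single discrete HMM. The two-state model, its probabilities, and all names below are hypothetical values chosen for the example; log probabilities are used, as is common practice, so that cumulative scores are added rather than multiplied.

```python
import math
from typing import Dict, List, Sequence, Tuple


def viterbi(
    obs: Sequence[str],
    states: Sequence[str],
    log_start: Dict[str, float],
    log_trans: Dict[str, Dict[str, float]],
    log_emit: Dict[str, Dict[str, float]],
) -> Tuple[List[str], float]:
    """Return the most likely state sequence and its log probability."""
    # scores[s] is the best cumulative log score of any path ending in s.
    scores = {s: log_start[s] + log_emit[s][obs[0]] for s in states}
    backpointers: List[Dict[str, str]] = []
    for o in obs[1:]:  # proceed one frame (observation) at a time
        new_scores: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for s in states:
            best_prev = max(states, key=lambda p: scores[p] + log_trans[p][s])
            new_scores[s] = scores[best_prev] + log_trans[best_prev][s] + log_emit[s][o]
            pointers[s] = best_prev
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state to recover the full path.
    last = max(states, key=lambda s: scores[s])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, scores[last]


def to_log(table: Dict[str, Dict[str, float]]) -> Dict[str, Dict[str, float]]:
    """Convert a nested table of probabilities to log probabilities."""
    return {k: {j: math.log(p) for j, p in row.items()} for k, row in table.items()}


# Hypothetical toy model: two states, three observation symbols.
STATES = ("A", "B")
LOG_START = {"A": math.log(0.6), "B": math.log(0.4)}
LOG_TRANS = to_log({"A": {"A": 0.7, "B": 0.3}, "B": {"A": 0.4, "B": 0.6}})
LOG_EMIT = to_log({"A": {"x": 0.5, "y": 0.4, "z": 0.1}, "B": {"x": 0.1, "y": 0.3, "z": 0.6}})

if __name__ == "__main__":
    best_path, log_score = viterbi(["x", "y", "z"], STATES, LOG_START, LOG_TRANS, LOG_EMIT)
    print(best_path, math.exp(log_score))
```

Each frame costs on the order of the number of transitions, which is why the total work grows with the model size and the utterance length rather than exponentially.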
The Viterbi algorithm reduces an exponential computation to one that is proportional to the number of states and transitions in the model and the length of the utterance. However, for a large vocabulary, the number of states and transitions becomes large, and the computation required to update the probability score at each state in each frame for all possible state sequences takes many times longer than the duration of one frame, which is typically approximately 10 milliseconds.
Thus, a technique called pruning, or beam searching, has been developed to greatly reduce the computation needed to determine the most likely state sequence. This type of technique eliminates the need to compute the probability score for state sequences that are very unlikely. This is typically accomplished by comparing, at each frame, the probability score for each remaining state sequence (or potential sequence) under consideration with the largest score associated with that frame. If the probability score of a state for a particular potential sequence is sufficiently low (when compared to the maximum computed probability score for the other potential sequences at that point in time), the pruning algorithm assumes that such a low scoring state sequence is unlikely to be part of the completed, most likely state sequence. The comparison is typically accomplished using a minimum threshold value. Potential state sequences having a score that falls below the minimum threshold value are removed from the searching process. The threshold value can be set at any desired level, based primarily on the desired memory and computational savings and on the acceptable increase in error rate caused by those savings.
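The per-frame pruning step can be sketched as below. The sketch assumes log-domain scores and derives the minimum threshold as the best score in the frame minus a fixed beam width; that particular way of setting the threshold, and the names `beam_prune` and `beam_width`, are assumptions made for illustration.

```python
from typing import Dict


def beam_prune(scores: Dict[str, float], beam_width: float) -> Dict[str, float]:
    """Keep only hypotheses whose log score is within beam_width of the best.

    scores maps a hypothesis identifier (for example, a state in a potential
    state sequence) to its cumulative log score at the current frame.
    """
    if not scores:
        return {}
    threshold = max(scores.values()) - beam_width  # minimum score to survive
    return {hyp: s for hyp, s in scores.items() if s >= threshold}


if __name__ == "__main__":
    frame_scores = {"seq1": -1.0, "seq2": -3.0, "seq3": -10.0}
    # With a beam width of 5.0 the threshold is -6.0, so seq3 is removed.
    print(beam_prune(frame_scores, beam_width=5.0))
```

A narrower beam removes more hypotheses and saves more computation, at the risk of discarding a sequence that would later have become the best one.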
Another conventional technique for further reducing the magnitude of computation required for speech recognition includes the use of a prefix tree. A prefix tree represents the lexicon of the speech recognition system as a tree structure wherein all of the words likely to be encountered by the system are represented in the tree structure.
In such a prefix tree, each subword unit (such as a phoneme) is typically represented by a branch which is associated with a particular phonetic model (such as an HMM). The phoneme branches are connected, at nodes, to subsequent phoneme branches. All words in the lexicon which share the same first phoneme share the same first branch. All words which have the same first and second phonemes share the same first and second branches. By contrast, words which have a common first phoneme, but which have different second phonemes, share the same first branch in the prefix tree but have second branches which diverge at the first node in the prefix tree, and so on. The tree structure continues in such a fashion such that all words likely to be encountered by the system are represented by the end nodes of the tree (i.e., the leaves on the tree).
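The sharing of branches in such a prefix tree can be sketched as follows. The three-word lexicon and its phoneme symbols are hypothetical, and the class and function names are illustrative only. The sketch marks the node at which each word ends, which coincides with a leaf whenever no word is a prefix of another.

```python
from typing import Dict, List, Optional


class PrefixTreeNode:
    """A node in the lexicon tree; each child edge is one phoneme branch."""

    def __init__(self) -> None:
        self.children: Dict[str, "PrefixTreeNode"] = {}
        self.word: Optional[str] = None  # set on the node where a word ends


def build_prefix_tree(lexicon: Dict[str, List[str]]) -> PrefixTreeNode:
    """Build a tree in which words with common leading phonemes share branches."""
    root = PrefixTreeNode()
    for word, phonemes in lexicon.items():
        node = root
        for phoneme in phonemes:
            node = node.children.setdefault(phoneme, PrefixTreeNode())
        node.word = word
    return root


def count_branches(node: PrefixTreeNode) -> int:
    """Count the phoneme branches in the tree below the given node."""
    return sum(1 + count_branches(child) for child in node.children.values())


# Hypothetical lexicon with illustrative phoneme symbols.
LEXICON = {
    "bat": ["b", "ae", "t"],
    "bad": ["b", "ae", "d"],
    "bit": ["b", "ih", "t"],
}

if __name__ == "__main__":
    tree = build_prefix_tree(LEXICON)
    flat = sum(len(p) for p in LEXICON.values())
    print("branches without sharing:", flat)
    print("branches in prefix tree:", count_branches(tree))
```

In this toy lexicon all three words share the first branch, and two of them also share the second, so fewer phonetic models need to be evaluated than if each word were scored separately.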
It can be seen that several of the above-described techniques are attempts to simplify and streamline computation in a speech recognition system. However, a computationally intensive computer system is still required in order to achieve a reasonably high degree of accuracy and real time response in performing the speech recognition task.
One portion of the speech recognition system which requires a high degree of computational resources is the acoustic model, and the process by which the acoustic model is accessed in order to determine a likely output corresponding to an input utterance.
One acoustic model which has been used in the past includes a plurality of senones. Development of senones is described in greater detail in Hwang, M. and Huang, X., "SUBPHONETIC MODELING WITH MARKOV-STATES SENONE", IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol. I, 1992, pp. 33-36; and Hwang, M., Huang, X. and Alleva, F., "PREDICTING TRIPHONES WITH SENONES", IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol. II, 1993, pp. 311-314.
Briefly, a senone tree is grown for each Markov state in each hidden Markov model used to model a speech unit. The parameters in the acoustic model associated with each Markov state are grouped, or clustered, based upon answers to a hierarchy of linguistic questions arranged in a tree format. The resultant tree ends in leaves which include grouped or clustered parameters referred to as senones. There may typically be one senone tree in the speech recognition system for every hidden Markov state in every phoneme (or other phonetic sub-word unit). That may typically result in approximately 120 senone trees.
Where discrete hidden Markov models, or semi-continuous hidden Markov models, are used, each leaf in the senone tree is represented by a single, discrete output distribution with n entries. For continuous hidden Markov models, with a mixture of Gaussian density functions, each leaf on the senone tree is represented by m weighted Gaussian density functions. Each Gaussian density function is, in turn, parameterized by its mean vector and its covariance matrix. The acoustic model is typically trained using a maximum likelihood training technique, such as the Baum-Welch technique, utilizing a corpus of training data.
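For the continuous case, the score that a senone assigns to one acoustic vector can be sketched as a weighted sum of Gaussian densities, evaluated in the log domain. The sketch assumes diagonal covariance matrices (each Gaussian is given a vector of per-dimension variances), which is a simplification adopted for the example rather than something stated above; the function names are likewise illustrative.

```python
import math
from typing import List, Sequence


def log_gaussian_diag(x: Sequence[float], mean: Sequence[float], var: Sequence[float]) -> float:
    """Log density of a Gaussian with a diagonal covariance (assumed here)."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def senone_log_likelihood(
    x: Sequence[float],
    weights: Sequence[float],
    means: Sequence[Sequence[float]],
    variances: Sequence[Sequence[float]],
) -> float:
    """Log of the weighted sum of m Gaussian densities evaluated at x."""
    terms: List[float] = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    # Log-sum-exp keeps the sum numerically stable for very small densities.
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))


if __name__ == "__main__":
    # Hypothetical two-component mixture over two-dimensional vectors.
    score = senone_log_likelihood(
        x=[0.0, 0.0],
        weights=[0.5, 0.5],
        means=[[0.0, 0.0], [1.0, 1.0]],
        variances=[[1.0, 1.0], [1.0, 1.0]],
    )
    print(score)
```

Because every frame must be scored against many such mixtures, the number of Gaussians in the model directly drives both memory use and computation.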
In a relatively large, highly accurate, research speech recognition system, the senones in the acoustic model include approximately 120k Gaussians (including means and covariances) which consume approximately 30 megabytes of memory.
However, such an acoustic model is typically much too large to be practically implemented on many conventional desktop computers. In order to provide a speech recognition system of practical size, which requires practical computational resources in terms of memory and speed, smaller and simpler acoustic models have been provided. The smaller and simpler acoustic models have traditionally been retrained from the raw training corpus and supplied to the user. This has typically been done by the developer of the speech recognition system and the simpler and smaller acoustic model has been provided in its final form to the eventual user. The reason this has typically been done by the developer is that the raw training corpus is a very large data corpus. Also, training an acoustic model based on such a corpus can be very computationally intensive. Thus, a typical user's system is not configured to handle such a large raw training corpus or to handle complete retraining of an acoustic model based on that corpus.
However, having the developer train the smaller acoustic model and provide it to the eventual user reduces flexibility. For instance, many users may wish to allocate a higher percentage of their available computational resources to a speech recognition task. Further, the eventual users will typically not all have the same, or even similar, system configurations in terms of available memory capacity and processor speed. Therefore, a user who has abundant computational resources, and wishes to trade those for increased speech recognition accuracy, cannot do so. By the same token, a user who has quite limited computational resources, and wishes to trade off accuracy in order to conserve those resources, cannot do so.