In speech recognition it is well known to represent recognition vocabulary as a network of interconnected nodes. Branches between nodes may be parts of words, phonemes or allophones. Allophone models are context-dependent phoneme models. The allophones and phonemes are often represented by Hidden Markov Models (HMMs). Thus any vocabulary word can be represented as a chain of concatenated HMMs. Recognition of an unknown utterance involves the computation of the most likely sequence of states in the HMM chain. For medium to large vocabulary speech recognition systems, this computation represents a very large computational load.
The well known Viterbi method evaluates the probabilities for the vocabulary network by establishing a trellis. There is a trellis associated with each branch in the vocabulary network. The trellis has frame number as its abscissa and model state as its ordinate. The trellis has as many states associated with it as the number of states in the corresponding allophone model. For example, a ten-state allophone model contributes ten trellis states to every branch in the vocabulary network that carries that label. The total number of operations per frame for each trellis is proportional to the total number of transitions in the corresponding model. Thus, in the ten-state allophone model with 30 transitions, the total number of operations per frame involved in the Viterbi method is about 50 (30 sums for estimating 30 transitions plus 20 maximums to determine a best transition at each state).
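The per-frame operation count can be illustrated with a minimal Python sketch of one frame of Viterbi updating for a single trellis in the log domain. The function name, the data layout and the example topology are assumptions made for illustration only: the ten-state model's actual topology is not given here, so the sketch simply gives each of the ten states three incoming transitions, with two extra slots standing in for scores entering from a preceding branch.

```python
NEG_INF = float("-inf")


def viterbi_frame(prev, arcs, n_states):
    """One frame of Viterbi updating for a single trellis (log domain).

    prev     : scores from the previous frame, indexed by source slot
    arcs     : list of (src, dst, log_prob); log_prob is the log
               probability attached to that transition
    n_states : number of destination states in the trellis
    Returns (new_scores, n_sums, n_maxes).
    """
    new = [NEG_INF] * n_states
    seen = [False] * n_states
    n_sums = n_maxes = 0
    for src, dst, log_prob in arcs:
        cand = prev[src] + log_prob          # one sum per transition
        n_sums += 1
        if seen[dst]:
            new[dst] = max(new[dst], cand)   # one max per additional incoming arc
            n_maxes += 1
        else:
            new[dst] = cand
            seen[dst] = True
    return new, n_sums, n_maxes


def count_ops_ten_state():
    """Count operations for a hypothetical ten-state, 30-transition model."""
    n_states = 10
    # Assumed layout: slots 0 and 1 hold entry scores, state j sits in slot j + 2,
    # and state j is reached from slots j, j + 1 and j + 2.
    arcs = [(j + k, j, -1.0) for j in range(n_states) for k in range(3)]
    prev = [0.0] * (n_states + 2)
    _, n_sums, n_maxes = viterbi_frame(prev, arcs, n_states)
    return n_sums, n_maxes
```

Under these assumptions, `count_ops_ten_state()` reports 30 sums and 20 maximums, which matches the figure of about 50 operations per frame. The 20 maximums arise because each of the ten states needs two comparisons to choose among its three incoming transitions.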
The well known Viterbi method can be used for finding the most likely path through the vocabulary network for a given utterance. However, there are two problems associated with the Viterbi method. First, the method is computationally complex because it evaluates every transition in every branch of the entire vocabulary network and, therefore, the hardware cost is high or even prohibitive. Computational complexity translates directly into the cost per channel of speech recognition. Second, the Viterbi method provides only a single choice, and providing alternatives increases the computation and memory requirements even further. A single choice also eliminates the option of providing post-processing refinements to enhance recognition accuracy.
Proposals to reduce the computational load resulting from these 50 operations per model have been put forward. Bahl et al. (1989) (A Fast Approximate Acoustic Match for Large Vocabulary Speech Recognition. Proceedings of Eurospeech 89: European Conference on Speech Communication and Technology. Paris: 156-158), propose reordering the computation for each transition by using a single transition probability for each HMM (picking the transition with the highest likelihood). Thus, instead of adding different log observation probabilities at the three possible trellis transitions and then taking the maximum, the maximum is taken first over the three trellis values, then the log observation probability is added. This reduces the computation from 5 to 3 operations per state, or from 50 to 30 for the 10-state model. This is a reduction in computation, but it is not sufficient to allow a response after an acceptable delay.
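The reordering described above can be sketched in Python as follows. This is an illustrative reading of the description, not the implementation of Bahl et al.; the function names and the sample values are assumptions. The exact update performs three sums and two maximums (5 operations), while the reordered update performs two maximums and one sum (3 operations).

```python
def exact_update(prev_scores, log_obs):
    """Standard update: add a distinct log observation probability to each
    of the three incoming trellis values, then take the maximum."""
    candidates = [p + o for p, o in zip(prev_scores, log_obs)]  # 3 sums
    return max(candidates)                                      # 2 maximums


def reordered_update(prev_scores, model_log_obs):
    """Reordered update: take the maximum over the three trellis values
    first, then add the single log observation probability for the model."""
    return max(prev_scores) + model_log_obs   # 2 maximums, then 1 sum


# Assumed sample values for one state with three incoming transitions.
prev_scores = [-3.0, -1.0, -2.0]
log_obs = [-0.5, -2.0, -1.0]

exact = exact_update(prev_scores, log_obs)
# The single per-model value is the highest of the transition likelihoods;
# in practice it would be chosen once per model rather than once per update.
approx = reordered_update(prev_scores, max(log_obs))
```

Because the single value is the highest of the transition likelihoods, the reordered score can never fall below the exact score, and the two agree whenever all three log observation probabilities are equal. The saving therefore comes at the price of an approximate, optimistic score.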
Another proposal by Bahl et al. (1992) (Constructing Candidate Word Lists Using Acoustically Similar Word Groups. IEEE Transactions on Signal Processing, Vol. 40, 11:2814-2816), also attempts to reduce this computational load. This scheme uses a three-state model, rather than a more complex topology, for initial scoring with the Viterbi method, then uses the complex topology for rescoring. This proposal may actually increase the computational load. For example, if the retrained three-state model has as many mixtures as the complex topologies, then equal numbers of log observation probabilities must be computed twice, once for the three-state models and once for the complex topologies. Total memory requirements for storing the two sets of models would also increase.
The time taken to find the most likely path, thereby matching a vocabulary word to the unknown utterance, becomes the recognition delay of the speech recognition system. Responding within acceptable delays on cost-effective hardware computing platforms requires a less complex recognition method. A method that reduces the computational burden and the consequent time delay, without sacrificing recognition accuracy, would represent a significant advance over the state of the art.