In communication, data processing, and similar systems, a user interface using audio facilities is often advantageous, especially when the user is expected to be physically engaged in another activity (e.g., driving a car) while operating such a system. Techniques have been developed for recognizing human speech in such systems to perform certain tasks.
In accordance with one such technique, input speech is analyzed in signal frames, which are represented by feature vectors corresponding to the phonemes making up individual words. The phonemes are characterized by hidden Markov models (HMMs), and a Viterbi algorithm is used to identify a sequence of HMMs which best matches, in a maximum likelihood sense, a respective concatenation of phonemes corresponding to an unknown, spoken utterance. The Viterbi algorithm forms a plurality of sequences of tentative decisions as to what the uttered phonemes were. These sequences of tentative decisions define the so-called "survival paths." The theory of the Viterbi algorithm predicts that, going back in time, these survival paths merge into the "maximum-likelihood path." See G. D. Forney, "The Viterbi Algorithm," Proceedings of the IEEE, Vol. 61, No. 3, March 1973, pp. 268-278. In this instance, such a maximum-likelihood path corresponds to a particular concatenation of phonemes which maximizes a cumulative conditional probability that it matches the unknown, spoken utterance given the acoustic input thereof.
In practice, in each state where a tentative decision is made, a state observation likelihood (SOL) measure, indicating the probability that a respective phoneme was uttered during the signal frame period, is derived from an HMM. As the tentative decisions are made along a sequence, the SOL measures are accumulated. Based on the respective cumulative SOL measures of the tentative decision sequences, a dynamic programming methodology is used to identify the maximum-likelihood phoneme concatenation corresponding to the unknown, spoken utterance.
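The accumulation of SOL measures along tentative decision sequences may be illustrated by the following minimal sketch of a Viterbi search in the log domain. It is an illustration under stated assumptions rather than the implementation of any particular recognizer: the function name `viterbi`, the arguments `log_sol`, `log_trans` and `log_init`, and the two-state toy values are all hypothetical, and each state is taken to stand for one phoneme-level HMM state.

```python
import math


def viterbi(log_sol, log_trans, log_init):
    """Return (best_path, best_score) for a sequence of signal frames.

    log_sol[t][s]   : log SOL measure of state s at frame t (assumed given)
    log_trans[i][j] : log transition probability from state i to state j
    log_init[s]     : log probability of starting in state s
    """
    n_states = len(log_init)
    # score[s] is the cumulative log-likelihood of the survival path ending in s.
    score = [log_init[s] + log_sol[0][s] for s in range(n_states)]
    backptrs = []  # backptrs[t - 1][j] is the best predecessor of j at frame t

    for t in range(1, len(log_sol)):
        new_score, ptr = [], []
        for j in range(n_states):
            # Tentative decision: keep only the best path entering state j.
            best_i = max(range(n_states),
                         key=lambda i: score[i] + log_trans[i][j])
            ptr.append(best_i)
            new_score.append(score[best_i] + log_trans[best_i][j] + log_sol[t][j])
        score = new_score
        backptrs.append(ptr)

    # Trace the maximum-likelihood path back in time from the best final state.
    last = max(range(n_states), key=lambda s: score[s])
    path = [last]
    for ptr in reversed(backptrs):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, score[last]


# Toy example with two states and three frames (values are illustrative only).
log = math.log
demo_sol = [[log(0.8), log(0.2)], [log(0.6), log(0.4)], [log(0.1), log(0.9)]]
demo_trans = [[log(0.7), log(0.3)], [log(0.3), log(0.7)]]
demo_init = [log(0.5), log(0.5)]
demo_path, demo_score = viterbi(demo_sol, demo_trans, demo_init)
print(demo_path, demo_score)
```

Working in the log domain turns the product of probabilities along a path into a sum, which is why the SOL measures can simply be accumulated frame by frame. In the toy example the decoded state sequence is 0, 0, 1.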
The SOL measures may be derived from well-known continuous-density HMMs which offer high recognition accuracy. However, such a derivation requires intensive computations involving a large number of Gaussian kernels which are state dependent. As a result, the derivation incurs high computational cost, and substantial overheads in memory storage and access.
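To make the source of this cost concrete, the following sketch computes a log SOL measure for one state of a continuous-density HMM. It assumes, for illustration only, diagonal-covariance Gaussian kernels and a mixture stored as a list of `(weight, mean, variance)` triples; the names `log_gauss_diag`, `log_sum_exp` and `continuous_density_log_sol` are hypothetical.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian kernel at feature vector x."""
    return -0.5 * sum(math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
                      for xi, m, v in zip(x, mean, var))


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def continuous_density_log_sol(x, mixture):
    """Log SOL of one state; mixture is a list of (weight, mean, var) triples.

    Every state owns its own kernels, so each frame requires
    (number of states) x (kernels per state) Gaussian evaluations.
    """
    return log_sum_exp([math.log(w) + log_gauss_diag(x, m, v)
                        for w, m, v in mixture])


# Illustrative one-dimensional state with two kernels of its own.
demo_state = [(0.5, [0.0], [1.0]), (0.5, [2.0], [1.0])]
print(continuous_density_log_sol([0.0], demo_state))
```

Under these assumptions, both the number of kernel evaluations per frame and the number of stored means and variances grow with the number of states, which is the computational and memory cost referred to above.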
Attempts have been made to improve the efficiency of the derivation of the SOL measures. One such attempt involves use of tied-mixture HMMs, also known as semi-continuous HMMs. For details on the tied-mixture HMMs, one may refer to: X. Huang et al., "Semi-Continuous Hidden Markov Models for Speech Signals," Computer Speech and Language, vol. 3, 1989, pp. 239-251; and J. R. Bellegarda et al., "Tied Mixture Continuous Parameter Modeling for Speech Recognition," IEEE Trans. Acoustics, Speech, and Signal Processing, vol. 38, no. 12, 1990, pp. 2033-2045. Although they do not offer recognition accuracy as high as that of the continuous-density HMMs, the tied-mixture HMMs all share the same collection of Gaussian kernels, which are state independent. As a result, among other things, less storage for the Gaussian kernels is required for the SOL derivation using the tied-mixture HMMs.
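The sharing of state-independent kernels may be sketched as follows. This is a minimal illustration, not the method of the cited references: it assumes diagonal-covariance kernels held in a single shared `codebook` and per-state weight vectors in `state_weights`, and the function names are hypothetical.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian kernel at feature vector x."""
    return -0.5 * sum(math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
                      for xi, m, v in zip(x, mean, var))


def tied_mixture_log_sols(x, codebook, state_weights):
    """Return the log SOL of every state for one frame.

    codebook      : list of (mean, var) pairs shared by all states
    state_weights : state_weights[s][k] is the weight state s gives kernel k
    """
    # Each shared kernel is evaluated once per frame, whatever the state count.
    log_kernels = [log_gauss_diag(x, m, v) for m, v in codebook]

    log_sols = []
    for weights in state_weights:
        # Only the mixture weights are state dependent; zero weights are skipped.
        terms = [math.log(w) + lk for w, lk in zip(weights, log_kernels) if w > 0.0]
        peak = max(terms)
        log_sols.append(peak + math.log(sum(math.exp(t - peak) for t in terms)))
    return log_sols


# Illustrative one-dimensional codebook with two kernels shared by two states.
demo_codebook = [([0.0], [1.0]), ([2.0], [1.0])]
demo_weights = [[1.0, 0.0], [0.5, 0.5]]
print(tied_mixture_log_sols([0.0], demo_codebook, demo_weights))
```

In this arrangement the kernel means and variances are stored once, and each state contributes only a vector of weights, which reflects the reduced storage noted above.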