This invention relates to speech models using hidden Markov models in subword units, such as phones (or phonemes), and speech recognition using such speech models, and more particularly to enabling efficient speech recognition in response to pronunciation transformations (fluctuations).
Speech recognition utilizing Markov models treats recognition as a probabilistic problem. In recent years, systems have been proposed for large-vocabulary speech recognition and continuous speech recognition based on hidden Markov models in subword units, such as phones (or phonemes) and syllables.
In a representative conventional method, phonetic hidden Markov models are concatenated in series to represent a word to be recognized. The phonetic hidden Markov models to be concatenated are chosen on the basis of a description (baseform) in a pronunciation dictionary of the words to be recognized. However, since actual speech undergoes transformation depending on the types of preceding and subsequent phonemes, pronunciation speed, and accentuation, it is difficult to obtain a high recognition rate if phonetic hidden Markov models are concatenated without regard to such transformations.
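The baseform-driven concatenation described above can be illustrated by the following minimal sketch. It is not taken from any cited system: the phone inventory, the self-loop probabilities, the dictionary entries, and the names PhoneHMM, PHONES, BASEFORMS, and build_word_model are all hypothetical, and output probabilities are omitted for brevity. Each phone is assumed to be a three-state left-to-right model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhoneHMM:
    """A left-to-right phone model; only transition structure is kept here."""
    name: str
    self_loop: tuple  # self-loop probability of each state (hypothetical values)


# Hypothetical phone inventory and pronunciation dictionary (baseforms).
PHONES = {
    "k": PhoneHMM("k", (0.6, 0.5, 0.4)),
    "ae": PhoneHMM("ae", (0.7, 0.7, 0.6)),
    "t": PhoneHMM("t", (0.5, 0.4, 0.3)),
}
BASEFORMS = {"cat": ["k", "ae", "t"], "tack": ["t", "ae", "k"]}


def build_word_model(word):
    """Concatenate the phone models named in the word's baseform in series.

    Returns the state labels and a transition matrix with one extra column
    for the exit from the last state of the word.
    """
    states = []
    for phone in BASEFORMS[word]:
        hmm = PHONES[phone]
        for i, p in enumerate(hmm.self_loop):
            states.append((f"{phone}{i}", p))

    n = len(states)
    trans = [[0.0] * (n + 1) for _ in range(n)]
    for i, (_, p) in enumerate(states):
        trans[i][i] = p            # stay in the same state
        trans[i][i + 1] = 1.0 - p  # advance to the next state (or exit)
    return [label for label, _ in states], trans


labels, trans = build_word_model("cat")
```

Because the same phone model is reused wherever the phone appears in the dictionary, every occurrence of a phone shares one set of parameters regardless of its neighbors, which is why this scheme cannot reflect the transformations mentioned above.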
In another method, phonetic hidden Markov models are prepared for each phonetic environment (context), taking only the preceding and subsequent phonemes into account. Phonetic hidden Markov models are selected according to the phonetic environment described in a pronunciation dictionary of words to be recognized, and are then combined in series. This method is dealt with in detail in "Context-Dependent Modeling for Acoustic-Phonetic Recognition of Continuous Speech" (Proceedings of ICASSP '85, April 1985, R. Schwartz, Y. Chow, O. Kimball, S. Roucos, M. Krasner, J. Makhoul). Although this method can easily reflect a speech transformation for each phonetic environment, it must prepare a large number of phonetic hidden Markov models to handle various speech transformations, because the number of combinations of phonetic environments is extremely large, and it therefore requires a large amount of training speech data.
Moreover, in speaker-independent speech recognition, where pronunciation fluctuations differ markedly from one speaker to another, this method results in loose models, because each single phonetic hidden Markov model is required to cover all the pronunciation fluctuations attributable to the individual speakers, and the ability to distinguish phonemes is lowered accordingly.
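The growth in the number of models under the context-dependent method can be illustrated by the following sketch. The function names context_labels and count_context_models, the boundary symbol "#", and the inventory size of 40 phones are assumptions made for illustration only and do not come from the cited paper.

```python
def context_labels(baseform, boundary="#"):
    """Label each phone of a baseform with its preceding and subsequent phone.

    Word boundaries are marked with a hypothetical boundary symbol.
    """
    padded = [boundary] + list(baseform) + [boundary]
    return [
        (padded[i - 1], padded[i], padded[i + 1])
        for i in range(1, len(padded) - 1)
    ]


def count_context_models(n_phones):
    """Upper bound on distinct (preceding, phone, subsequent) combinations,
    ignoring word boundaries and phonotactic restrictions."""
    return n_phones ** 3


# One model would be selected per label instead of one per phone.
labels = context_labels(["k", "ae", "t"])

# With an assumed inventory of 40 phones, up to 64,000 models could be needed.
upper_bound = count_context_models(40)
```

Even if many combinations never occur in practice, each model that does occur has to be observed often enough in the training speech data for its parameters to be estimated, which is the source of the data requirement noted above.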
In yet another method, knowledge about transformations and fluctuations in speech is represented, for each word, by a network combining subword hidden Markov models. This method is dealt with in detail in "A Maximum Likelihood Approach to Continuous Speech Recognition" (IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume PAMI-5, No. 2, pp. 179-190, March 1983, L. R. Bahl, F. Jelinek, R. L. Mercer).
However, it is not easy to prepare such a network representation manually for each word, and it is not always possible to associate knowledge based on human perception precisely with individual physical phenomena.
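A word-level network of the kind described above might be sketched as follows. The network, its node names, the phone symbols, and the branch probabilities are hypothetical and hand-written for illustration; each arc stands for one subword hidden Markov model, and the internal structure of those models is omitted.

```python
# Hypothetical pronunciation network for one word. Each node maps to a list
# of arcs (subword label, branch probability, next node).
NETWORK = {
    "start": [("d", 1.0, "n1")],
    "n1": [("ey", 0.6, "n2"), ("ae", 0.4, "n2")],
    "n2": [("t", 0.7, "n3"), ("dx", 0.3, "n3")],
    "n3": [("ax", 1.0, "end")],
}


def enumerate_pronunciations(network, node="start", end="end"):
    """List every path through the network with its prior probability."""
    if node == end:
        return [([], 1.0)]
    results = []
    for phone, prob, next_node in network[node]:
        for tail, tail_prob in enumerate_pronunciations(network, next_node, end):
            results.append(([phone] + tail, prob * tail_prob))
    return results


variants = enumerate_pronunciations(NETWORK)
for phones, prob in variants:
    print(" ".join(phones), round(prob, 3))
```

In this sketch every branch and every probability had to be written by hand for a single word, which illustrates the manual effort noted above.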
Furthermore, there is another method in which the parameters (transition probabilities) of a network are trained and determined for each word. However, this method requires a large amount of training speech data to obtain a network representation of each word, so it is not easy to change the words to be recognized, even though subwords are adopted as units.