The present invention relates to a speech recognition method using Markov models and more particularly to a speech recognition method wherein speaker adaptation and circumstantial noise adaptation can be easily performed.
In speech recognition using Markov models, speech is recognized from a probabilistic viewpoint. In one method, for example, a Markov model is established for each word. Usually a plurality of states and transitions between the states are defined for each Markov model, occurrence probabilities are assigned to each state transition, and output probabilities of labels or symbols are assigned to each state or state transition. Unknown input speech is converted into a label string, and the probability of each word Markov model outputting that label string is then determined from the transition occurrence probabilities and the label output probabilities, which are hereafter referred to as parameters. The word Markov model having the highest probability of producing the label string is then determined, and recognition is performed according to this result. In speech recognition using Markov models, the parameters can be estimated statistically so that the recognition score is improved.
The details of the above recognition technique are described in the following articles.
(1) "A Maximum Likelihood Approach to Continuous Speech Recognition" by Lalit R. Bahl, Frederick Jelinek and Robert L. Mercer (IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. PAMI-5, No. 2, pp. 179-190, 1983).
(2) "Continuous Speech Recognition by Statistical Methods" by Frederick Jelinek (Proceedings of the IEEE Vol. 64, No. 4, 1976, pp. 532-556).
(3) "An Introduction to the Application of the Theory of Probabilistic Functions of a Markov Process to Automatic Speech Recognition" by S. E. Levinson, L. R. Rabiner and M. M. Sondhi (The Bell System Technical Journal, Vol. 62, No. 4, April 1983, pp. 1035-1074).
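The recognition procedure outlined above (scoring a label string against each word Markov model and choosing the most probable word) can be illustrated with a minimal sketch. It assumes, for simplicity, that output probabilities are attached to states rather than to transitions, that any state may be the final state, and that labels are small integers. The function names, the model layout (initial, transition and output probabilities) and the two toy word models are hypothetical and are not taken from the cited articles.

```python
def forward_probability(labels, init, trans, emit):
    """Probability that a discrete Markov model outputs the label string.

    init[i]     : probability of starting in state i (assumed layout)
    trans[i][j] : occurrence probability of the transition i -> j
    emit[j][k]  : probability of outputting label k in state j
    """
    n = len(init)
    if not labels:
        return 1.0
    # alpha[j] = probability of the labels seen so far, ending in state j
    alpha = [init[j] * emit[j][labels[0]] for j in range(n)]
    for label in labels[1:]:
        alpha = [
            sum(alpha[i] * trans[i][j] for i in range(n)) * emit[j][label]
            for j in range(n)
        ]
    return sum(alpha)


def recognize(labels, word_models):
    """Return the word whose model most probably produced the label string."""
    return max(
        word_models,
        key=lambda word: forward_probability(labels, *word_models[word]),
    )


# Two toy word models (hypothetical numbers) sharing the same topology.
WORD_MODELS = {
    "yes": ([1.0, 0.0], [[0.5, 0.5], [0.0, 1.0]], [[0.9, 0.1], [0.1, 0.9]]),
    "no": ([1.0, 0.0], [[0.5, 0.5], [0.0, 1.0]], [[0.1, 0.9], [0.9, 0.1]]),
}

if __name__ == "__main__":
    print(recognize([0, 1], WORD_MODELS))
```

A practical system would work with log probabilities or scaling to avoid underflow on long label strings; that is omitted here to keep the sketch short.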
Speech recognition using Markov models, however, needs a tremendous amount of speech data, and training requires much time. Furthermore, a system trained on a certain speaker often does not achieve sufficient recognition scores for other speakers. Moreover, even with the same speaker, when a long time elapses between training and recognition (that is, when there is a difference between the two circumstances), only poor recognition can be achieved. In addition, degradation of recognition accuracy due to circumstantial noise is another issue.
Recently, adaptation of trained Markov models to a speaker or a circumstance has often been proposed. These proposals can be classified into the following two types.
In the first type of proposal, the event frequencies used for estimating parameters of Markov models during initial training are retained, and further event frequencies are obtained for adaptation data. These event frequencies are then interpolated to estimate new parameters. Such proposals are described in:
(4) "Speaker Adaptation for A Hidden Markov Model", by Kazuhide Sugawara, Masafumi Nishimura and Akihiro Kuroda (Proceedings of ICASSP '86, April 1986, 49-11, pp. 2667-2670).
(5) Japanese Patent Application No. 61-65030 (corresponding to U.S. patent application Ser. No. 025,257, filed Mar. 12, 1987, and European Patent Application 243,009).
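The interpolation of event frequencies described for this first type of proposal might look like the following minimal sketch. The exact interpolation formula used in references (4) and (5) is not reproduced here; the linear weighting, the `weight` parameter, the function name and the dictionary layout (state name mapped to per-label counts) are all assumptions made for illustration.

```python
def interpolate_counts(initial_counts, adaptation_counts, weight):
    """Blend retained training counts with adaptation counts.

    initial_counts, adaptation_counts : dict mapping a state to a list of
        label event frequencies (hypothetical layout).
    weight : value in [0, 1]; 0 keeps the initial model unchanged and
        1 uses the adaptation data only (assumed linear interpolation).
    Returns a dict mapping each state to re-estimated output probabilities.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    params = {}
    for state, c0 in initial_counts.items():
        # States not observed in the adaptation data contribute zero counts.
        c1 = adaptation_counts.get(state, [0.0] * len(c0))
        mixed = [(1.0 - weight) * a + weight * b for a, b in zip(c0, c1)]
        total = sum(mixed)
        if total > 0:
            params[state] = [m / total for m in mixed]
        else:
            # No events at all: fall back to a uniform distribution.
            params[state] = [1.0 / len(mixed)] * len(mixed)
    return params


if __name__ == "__main__":
    initial = {"s1": [8.0, 2.0]}
    adaptation = {"s1": [1.0, 3.0]}
    print(interpolate_counts(initial, adaptation, 0.5))
```

Because the counts must be collected for every state of every word model, this kind of scheme needs adaptation utterances that cover the whole vocabulary.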
These proposals, however, require utterance of all the subject words for adaptation, and consequently impose a burden on users in large vocabulary speech recognition. Further, they require a large amount of computation time.
In the second type of proposal, Markov models produced by initial training are modified according to relations between parameters. These proposals are described in:
(6) "Isolated Word Recognition Using Hidden Markov Models", by Kazuhide Sugawara, Masafumi Nishimura, Kouichi Toshioka, Masaaki Okochi and Toyohisa Kaneko (Proceedings of ICASSP '85, March 1985, 1-1, pp. 1-4).
(7) "Rapid Speaker Adaptation Using A Probabilistic Spectral Mapping" by Richard Schwartz, Yen-Lu Chow, Francis Kubala (Proceedings of ICASSP '87, March 1987, 15-3, pp. 633-636).
In the technique described in article (6), dynamic programming (DP) matching is performed among labeled words, and a confusion matrix of labels is produced according to the relations between labels along an optimum path. The parameters of the Markov models are then modified using that confusion matrix. In this approach, DP matching is required in addition to Markov models, so that storage efficiency is poor. Further, a tremendous amount of speech data is required to obtain a confusion matrix of sufficient accuracy.
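The confusion-matrix idea can be illustrated with a minimal sketch. It assumes that DP matching has already produced a list of aligned label pairs (label from the initial training speech, label from the adaptation speech) along the optimum path, and that the matrix is applied to one state's label output probabilities by a simple weighted sum. Both function names and this particular update rule are hypothetical; the details of article (6) are not reproduced here.

```python
def confusion_matrix(aligned_pairs, num_labels):
    """Row-normalized confusion matrix from aligned label pairs.

    aligned_pairs : iterable of (old_label, new_label) integer pairs,
        assumed to come from a DP alignment computed elsewhere.
    matrix[i][j] estimates the probability that old label i corresponds
    to new label j.
    """
    counts = [[0.0] * num_labels for _ in range(num_labels)]
    for old, new in aligned_pairs:
        counts[old][new] += 1.0
    matrix = []
    for i, row in enumerate(counts):
        total = sum(row)
        if total > 0:
            matrix.append([c / total for c in row])
        else:
            # Label never observed in the alignment: leave it unchanged.
            matrix.append([1.0 if j == i else 0.0 for j in range(num_labels)])
    return matrix


def adapt_output_probs(output_probs, matrix):
    """Map one state's label output probabilities through the matrix."""
    n = len(matrix)
    return [
        sum(output_probs[i] * matrix[i][j] for i in range(n))
        for j in range(n)
    ]


if __name__ == "__main__":
    pairs = [(0, 0), (0, 1), (1, 1)]
    m = confusion_matrix(pairs, 3)
    print(adapt_output_probs([0.6, 0.3, 0.1], m))
```

Each row of the matrix is estimated from the pairs that happen to contain that label, which is why a sparse alignment gives unreliable rows and a large amount of speech data is needed.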
The technique described in article (7) directly introduces relation probabilities between labels into the output probabilities of conventional Markov models. This approach requires forward/backward calculation and, as a result, incurs tremendous computational and storage costs.
The following article describes adaptation of features for vector quantization.
(8) "Speaker Adaptation Through Vector Quantization", by Kiyohiro Shikano (Transactions of the Institute of Electronics and Communication Engineers of Japan, December 1986, SP86-65, pp. 33-40).