The invention relates to a speech recognition system utilizing Markov models, and more particularly to a system capable of highly accurate recognition without a significant increase in the amount of computation and the storage capacity.
Speech recognition utilizing Markov models is probabilistic. For example, one such technique establishes a Markov model for each word. Generally, a Markov model is defined by a plurality of states and transitions between those states. Each transition out of a state has a probability of occurrence, and each transition has a probability of producing a label (symbol) when it occurs.
After being frequency-analyzed for a predetermined cycle (called a "frame"), the unknown input speech is converted into a label stream through vector quantization. Then, the probability of each word Markov model generating the label stream is determined on the basis of the above-mentioned transition occurrence probabilities and label output probabilities (called "parameters" hereinafter). The input speech is recognized as the word whose Markov model has the highest label generating probability.
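The scoring described above can be sketched with the forward algorithm for a discrete Markov model. The model below (3 states, 4 label prototypes, uniform output probabilities) is a made-up toy example, not a model from the invention; it only illustrates how a label stream is scored against one word model.

```python
import numpy as np

# Hypothetical discrete word model: 3 states, 4 label prototypes.
# A[i, j]   : occurrence probability of the transition from state i to state j
# B[i, j, k]: probability that the transition i -> j outputs label k
A = np.array([[0.6, 0.4, 0.0],
              [0.0, 0.7, 0.3],
              [0.0, 0.0, 1.0]])
B = np.full((3, 3, 4), 0.25)  # uniform label output probabilities (assumption)

def label_stream_probability(labels, A, B):
    """Forward algorithm: P(label stream | word model)."""
    n_states = A.shape[0]
    alpha = np.zeros(n_states)
    alpha[0] = 1.0                       # assume the model starts in state 0
    for k in labels:                     # one vector-quantized label per frame
        # sum over all transitions, weighting each by its label-k output probability
        alpha = alpha @ (A * B[:, :, k])
    return alpha.sum()

p = label_stream_probability([0, 1, 2, 3], A, B)
```

Recognition would then take the word whose model yields the largest such probability for the input label stream. Because B is uniform here, each frame simply scales the total probability mass by 0.25.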
In speech recognition utilizing Markov models, the parameters may be statistically estimated from training data, thereby improving the recognition accuracy. This recognition technique is detailed in the following papers:
(1) "A Maximum Likelihood Approach to Continuous Speech Recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. PAMI-5, No. 2, pp. 179-190, 1983, by Lalit R. Bahl, Frederick Jelinek and Robert L. Mercer.
(2) "Continuous Speech Recognition by Statistical Methods," Proceedings of the IEEE, Vol. 64, pp. 532-556, 1976, by Frederick Jelinek.
(3) "An Introduction to the Application of the Theory of Probabilistic Functions of a Markov Process to Automatic Speech Recognition," The Bell System Technical Journal, Vol. 62, No. 4, pp. 1035-1074, Apr. 1983, by S. E. Levinson, L. R. Rabiner and M. M. Sondhi.
(4) "Speech Recognition by Probability Models," The Institute of Electronics, Information and Communication Engineers, Chapter 3, Section 3.3.5, pp. 79-80, 1988, by Seiichi Nakagawa.
(5) "HMM-Based Speech Recognition Using Multi-Dimensional Multi-Labeling," Proceedings of ICASSP '87, 37-10, Apr. 1987, by Masafumi Nishimura and Koichi Toshioka.
(6) "Acoustic Markov Models Used in the Tangora Speech Recognition System," Proceedings of ICASSP '88, S11-3, Apr. 1988, by L. R. Bahl, P. F. Brown, P. V. de Souza, R. L. Mercer and M. A. Picheny.
In speech perception, it has been noted that the transitional spectral pattern of speech is an important characteristic for speech recognition, especially for consonant recognition, and is relatively insensitive to noise. A typical Markov model, however, lacks the ability to describe such transitional characteristics.
Recently, several Markov models representing such transitional characteristics of speech have been proposed. However, these models involve a large number of parameters, which not only causes storage problems but also requires a large amount of training speech data for estimating the parameters.
For example, when it is intended to estimate models with the spectral pattern over m adjacent frames as the feature quantity, about N^m parameters must be estimated for each state, even when the label output probability is assigned to each state of the model, where N is the number of patterns for each frame (the number of label prototypes for the vector quantization). If m is a large number, the model cannot be constructed because of the enormous amount of storage required and the enormous amount of training speech needed for estimating the model parameters. Matrix quantization of the pattern over m frames may reduce the number of parameters to some degree. However, the number cannot be significantly reduced because of the quantization error. This technique also has a disadvantage in that the amount of calculation and storage required for the quantization becomes enormous.
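The N^m blow-up can be made concrete with a small calculation. The values of N and m below are illustrative assumptions, not figures from the text:

```python
# Illustrative parameter count for an m-frame spectral pattern per state.
# N and m are assumed values chosen only to show the growth rate.
N = 128   # label prototypes per frame (assumed codebook size)
m = 3     # number of adjacent frames in the spectral pattern

per_state = N ** m          # label output probabilities needed per model state
single_frame = N            # compare: parameters per state with m = 1
print(per_state)            # grows exponentially in m
```

With these assumed values, a single state already needs over two million label output probabilities, versus 128 for a single-frame feature, which is why a large m is impractical for both storage and training-data requirements.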
A method directly taking the transitional pattern into the Markov model formulation has also been suggested. In this method, the label output probability of the Markov model is P(L(t) | L(t-1), L(t-2), ..., L(t-m), S), where L(t) and S represent a label and a state at time t, respectively. In this technique, N^m parameters must still be estimated. This is described in:
On the other hand, there is a method in which two types of vector quantization are performed, one for the static spectrum of each frame and the other for the variation of the spectrum along the time axis, so that a transitional variation pattern of speech is represented by the resultant pair of labels. This is disclosed in:
Although, according to this method, the transitional variation pattern of speech may be expressed without a large increase in the amount of calculation and storage for the vector quantization, about N^2 parameters must be estimated for each state of the Markov model, where N is the number of patterns for each feature quantity. It is still difficult to accurately estimate all the parameters from a small amount of speech data, and the amount of storage required is also large.
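The dual-codebook labeling described above can be sketched as follows. The codebooks here are random placeholders and the codebook size N and feature dimension are assumptions; the sketch only shows how a frame becomes a (static, delta) label pair and why each state then needs about N^2 output parameters.

```python
import numpy as np

# Sketch of dual vector quantization: one codebook for the static spectrum
# of a frame, another for its time-axis variation (delta spectrum).
# Codebook contents are random placeholders, not trained prototypes.
rng = np.random.default_rng(0)
N, dim = 8, 12                         # assumed codebook size and feature dimension
static_codebook = rng.normal(size=(N, dim))
delta_codebook = rng.normal(size=(N, dim))

def quantize(vec, codebook):
    """Return the index of the nearest prototype (Euclidean distance)."""
    return int(np.argmin(np.linalg.norm(codebook - vec, axis=1)))

frame = rng.normal(size=dim)
prev_frame = rng.normal(size=dim)

# Each frame is represented by a pair of labels: (static label, delta label).
pair = (quantize(frame, static_codebook),
        quantize(frame - prev_frame, delta_codebook))

# A Markov model state must hold an output probability for every possible
# label pair, hence about N * N parameters per state.
n_output_params = N * N
```

Estimating all N^2 pair probabilities per state from limited speech data is the difficulty the passage points out, even though the quantization itself stays cheap.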