The present invention relates to a speech recognition system using fenonic Markov models, and particularly to such a system in which the vector quantization code book is easily and accurately adaptable.
Speech recognition using Markov models recognizes speech probabilistically. For example, in one such method, speech is first frequency-analyzed for each of a series of time periods (called "frames"), vector quantized, and converted into a label (symbol) train. One Markov model is set for each label. On the basis of the label train of the speech registered for a word, a Markov model train (word baseform) is given for each word.
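The conversion of frames into a label train can be illustrated with a minimal sketch. It assumes that frequency analysis has already produced one feature vector per frame and that squared Euclidean distance is the quantization measure; the function name `quantize`, the code book, and the frame values are hypothetical and are not taken from the literature cited here.

```python
def quantize(frames, codebook):
    """Map each frame's feature vector to the index (label) of its nearest code vector."""
    labels = []
    for frame in frames:
        # Squared Euclidean distance from this frame to every code vector.
        dists = [sum((f - c) ** 2 for f, c in zip(frame, code)) for code in codebook]
        labels.append(dists.index(min(dists)))
    return labels


# Hypothetical 2-dimensional code book with three code vectors (labels 0, 1, 2).
codebook = [(0.0, 0.0), (1.0, 1.0), (4.0, 0.0)]
# Hypothetical feature vectors, one per frame.
frames = [(0.1, -0.2), (0.9, 1.2), (3.5, 0.3), (1.1, 0.8)]

label_train = quantize(frames, codebook)
print(label_train)  # [0, 1, 2, 1]
```

In a fenonic system, each label in this train would then select the Markov model set for that label.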
Each Markov model has a plurality of states and transitions between the states. Each transition has a probability of occurrence. One or more labels may be output by each Markov model at each state or transition. Label output probabilities at each state or transition are assigned to the state or transition.
In the recognition process, unknown input speech is converted into a label train. The probabilities of producing this label train by the respective word Markov models specified by the word baseforms are determined based upon the foregoing transition probabilities and label output probabilities (called "parameters" hereinafter). The word Markov model having the maximum probability of producing the label train is the recognition result.
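The probability computation described above may be sketched with the forward algorithm. The sketch assumes a simplified topology in which each fenonic model has one state with a self-loop and a forward transition, each emitting exactly one label; transitions that emit no label are omitted. The names `word_likelihood`, `params`, and `baseforms`, and all probability values, are hypothetical.

```python
def word_likelihood(labels, baseform, params):
    """Probability that the chain of fenonic models in `baseform` produces `labels`.

    params[fenone] holds the transition probabilities "self" and "next" and the
    label output distribution "out" shared by both transitions (an assumption).
    """
    n = len(baseform)
    # alpha[s]: probability of having produced the labels so far and being in state s.
    alpha = [0.0] * (n + 1)
    alpha[0] = 1.0
    for label in labels:
        new_alpha = [0.0] * (n + 1)
        for s in range(n):
            p = params[baseform[s]]
            emit = p["out"].get(label, 0.0)
            new_alpha[s] += alpha[s] * p["self"] * emit      # stay in the same model
            new_alpha[s + 1] += alpha[s] * p["next"] * emit  # advance to the next model
        alpha = new_alpha
    return alpha[n]  # all labels consumed and the final state reached


# Hypothetical parameters for two fenonic models (labels 0 and 1).
params = {
    0: {"self": 0.5, "next": 0.5, "out": {0: 0.9, 1: 0.1}},
    1: {"self": 0.5, "next": 0.5, "out": {0: 0.2, 1: 0.8}},
}
# Hypothetical word baseforms: one train of fenonic models per word.
baseforms = {"A": [0, 1], "B": [1, 0]}

labels = [0, 0, 1]  # label train of the unknown input speech
scores = {word: word_likelihood(labels, bf, params) for word, bf in baseforms.items()}
best = max(scores, key=scores.get)
print(best)  # A
```

Because every occurrence of a given fenonic model reads the same entry of `params`, models corresponding to the same label share their parameters across all word baseforms.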
Such a Markov model for each label unit is called a "fenonic Markov model". Models corresponding to the same label are treated as a common model during learning and recognition. The details of fenonic Markov models are given in reference (1) of the following literature:
(1) "Acoustic Markov Models Used in the Tangora Speech Recognition System" (Proceedings of ICASSP'88, S11-3, L. R. Bahl, P. F. Brown, P. V. de Souza, R. L. Mercer and M. A. Picheny, pages 497-500, April 1988).
(2) "Speaker adaptation by vector quantization" (Electronics and Communication Institute Technical Research Report, SP-86-65, pages 33-40, by Kiyohiro Shikano, December 1986).
(3) "Speaker adaptation method without a teacher based upon clustering of spectrum space" (Japanese Acoustic Institute, Proceeding of Spring National Meeting of Showa 63, 2-2-16, by Sadaoki Furui, March 1988).
(4) "Speaker Adaptation Method for HMM-Based Speech Recognition" (Proceedings of ICASSP'88, S5-7, by M. Nishimura and K. Sugawara, April 1988).
In speech recognition using the foregoing Markov model, a large amount of speech data is required for the preparation of the vector quantization code book, the estimation of the Markov model, and the registration of the word baseform, and much time is also required for these operations. Many systems prepared with the speech data of predetermined speakers may not give sufficient recognition accuracy for other speakers. Even for the same speaker, recognition accuracy is degraded when the environment changes over a relatively long period of time. Environmental noise also degrades recognition accuracy.
In reference (1), although the learning time is greatly reduced by preparing word baseforms from predetermined speaker utterances, a large amount of speech data and much processing time are still required, since the quantization code book and the parameters of the Markov model are reevaluated for each speaker.
Recently, in order to solve these problems it has been proposed that the vector quantization code book and the Markov model for the predetermined speaker be adapted to different speakers and circumstances. The adaptation methods of the vector quantization code book may be divided into the following two types.
The first is to determine the correspondence between the learning utterance and the predetermined speaker utterance by DP matching, and to adapt the code book using that correspondence. This is disclosed in reference (2).
However, it is impossible to exactly determine the correspondence by this method when the distribution of the feature quantity changes greatly. Furthermore, it does not necessarily give the same evaluation as that on the Markov model because the correspondence is based upon the distance. It also results in degraded efficiency in the use of memory capacity, since DP processing is required in addition to Markov model processing.
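The first type of adaptation may be sketched as follows. This is an illustrative simplification, not the exact procedure of the cited literature: it assumes that DP matching is a standard dynamic time warping with squared Euclidean distance, and that each code vector is replaced by the mean of the learning frames aligned to reference frames carrying its label. The names `dp_match` and `adapt_codebook` and all data values are hypothetical.

```python
def dp_match(ref, new):
    """Align two frame sequences by DP matching; return a list of (ref_idx, new_idx)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    n, m = len(ref), len(new)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = dist(ref[i - 1], new[j - 1]) + min(
                cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])
    # Trace back from the end of both sequences along the cheapest predecessors.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return path


def adapt_codebook(codebook, ref_labels, new_frames, path):
    """Replace each code vector by the mean of the learning frames aligned to its label."""
    groups = {}
    for i, j in path:
        groups.setdefault(ref_labels[i], []).append(new_frames[j])
    adapted = list(codebook)  # code vectors with no aligned frames are left unchanged
    for label, members in groups.items():
        dim = len(members[0])
        adapted[label] = tuple(sum(v[d] for v in members) / len(members) for d in range(dim))
    return adapted


# Hypothetical 1-dimensional data: reference utterance, its labels, and a learning utterance.
codebook = [(0.0,), (1.0,)]
ref_frames = [(0.0,), (0.0,), (1.0,)]
ref_labels = [0, 0, 1]
new_frames = [(0.2,), (1.3,), (1.3,)]

path = dp_match(ref_frames, new_frames)
adapted = adapt_codebook(codebook, ref_labels, new_frames, path)
print(path)     # [(0, 0), (1, 0), (2, 1), (2, 2)]
print(adapted)  # [(0.2,), (1.3,)]
```

The alignment here depends only on frame-to-frame distance, which illustrates why its result need not agree with an evaluation made on the Markov model.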
The second method does not use the correspondence on the time axis, but prepares the adapted code book by clustering the learning speech with reference to one original code book. Such a method is described in reference (3).
These methods require a large amount of calculation and memory capacity, and may not provide a highly accurate adaptation since all correspondences on the time axis are neglected.
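The second type of adaptation may be sketched as a clustering of the learning frames that starts from the original code book. The sketch assumes plain k-means iterations with squared Euclidean distance, which is a stand-in and not necessarily the clustering used in the cited literature; the name `adapt_by_clustering` and all data values are hypothetical.

```python
def adapt_by_clustering(codebook, frames, iterations=10):
    """Cluster the learning frames, using the original code vectors as initial centroids."""
    centroids = [tuple(c) for c in codebook]
    for _ in range(iterations):
        # Assign every frame to its nearest centroid; frame order plays no role.
        clusters = [[] for _ in centroids]
        for frame in frames:
            dists = [sum((x - y) ** 2 for x, y in zip(frame, c)) for c in centroids]
            clusters[dists.index(min(dists))].append(frame)
        # Move each centroid to the mean of its cluster; keep it if the cluster is empty.
        updated = []
        for centroid, members in zip(centroids, clusters):
            if members:
                dim = len(centroid)
                updated.append(tuple(
                    sum(v[d] for v in members) / len(members) for d in range(dim)))
            else:
                updated.append(centroid)
        if updated == centroids:
            break  # assignments have stopped changing
        centroids = updated
    return centroids


# Hypothetical 1-dimensional original code book and learning frames.
codebook = [(0.0,), (1.0,)]
frames = [(0.2,), (0.4,), (1.2,), (1.4,)]

adapted = adapt_by_clustering(codebook, frames)
print(adapted)  # approximately [(0.3,), (1.3,)]
```

The frames are treated as an unordered set, so shuffling them would give the same clusters; this is the sense in which correspondences on the time axis are neglected.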
In addition, reference (4) discloses the adaptation of the Markov model parameters.