(1) Field of the Invention
The present invention relates to a speech recognition system using Markov models and in particular to a method for training the speech recognizer in the system.
(2) Description of the Prior Art
The best methods available for automatic machine recognition of speech are based on Hidden Markov Models (HMMs). HMMs are statistical models of the time variation, or temporal structure, of nonstationary time series such as spoken language. Applied to speech, the HMM methods have a training phase, in which the temporal structure of each of the different acoustic/phonetic speech components (e.g., phonemes, fricatives, etc.) is modeled by an HMM. Approximately 40 such speech units are used in spoken English. There are as many HMMs as there are speech acoustic/phonetic units, so that approximately M=40 HMMs need to be stored for spoken English. In the recognition phase, the speech signal is segmented by a separate process, and the previously developed HMMs are then used to decide which speech component gave rise to each segment.
One of the state-of-the-art HMM structures for speech recognition is set forth in the article "Maximum-Likelihood Estimation for Mixture Multivariate Stochastic Observations of Markov Chains," by B. J. Juang, AT&T Technical Journal, 64(1985), pp. 1235-1240. The HMMs in this model use mixtures of multivariate Gaussian components to model each of the state distributions internal to the HMM structure. Typically, there are about S=10 states in each HMM and approximately G=12 Gaussian components per state. During the training phase, there are M × S × G = 40 × 10 × 12 = 4800 covariance matrices that must be estimated (i.e., trained) and stored for later use in the speech recognition phase. The number of variates in the multivariate Gaussian components is typically on the order of L=10. Since a general (symmetric) covariance matrix of size L requires a minimum of L(L+1)/2 = 10 × (10+1)/2 = 55 floating point numbers, the total storage required by this approach is on the order of 55 × 4800 = 264,000 storage locations, or approximately one megabyte in a 32-bit computer. The required storage will vary as indicated with the size and number of HMMs and with the precision of the host computer's floating point number representation.
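The storage arithmetic above can be checked with a short sketch. The function name and its parameters are illustrative only and do not come from the cited article; the figure of 4 bytes per floating point number is an assumption corresponding to the 32-bit computer mentioned above.

```python
def full_covariance_storage(num_models=40, states=10, gaussians=12,
                            dim=10, bytes_per_float=4):
    """Storage needed for one full covariance matrix per Gaussian component.

    Defaults follow the text: M=40 HMMs, S=10 states, G=12 components,
    L=10 variates. bytes_per_float=4 is an assumed 32-bit representation.
    """
    # One covariance matrix per Gaussian component: M x S x G.
    num_matrices = num_models * states * gaussians
    # A symmetric L x L matrix has L(L+1)/2 distinct entries.
    floats_per_matrix = dim * (dim + 1) // 2
    total_floats = num_matrices * floats_per_matrix
    total_bytes = total_floats * bytes_per_float
    return num_matrices, floats_per_matrix, total_floats, total_bytes


print(full_covariance_storage())  # (4800, 55, 264000, 1056000)
```

With these values the covariance matrices alone occupy 1,056,000 bytes, which is the "approximately one megabyte" cited above.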
Two important limitations of Juang's fully heteroscedastic HMM structure for modeling the acoustic/phonetic units of speech are storage and training. Large storage requirements, together with the need for fast memory access times in the critical recognition phase, lead to increased hardware cost. Such cost is always an important factor in product marketability. For example, Juang in his computer code restricts himself to diagonal covariance matrices. See U.S. Pat. No. 4,783,804, issued Nov. 8, 1988. This restriction greatly decreases the storage requirements of fully heteroscedastic HMMs; however, Juang does not discuss this important issue.
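The effect of the diagonal restriction on storage can be illustrated with the same example dimensions (M=40, S=10, G=12, L=10). This is a sketch under the assumption that a diagonal covariance matrix stores only its L variances; the helper name is hypothetical and is not taken from the cited patent.

```python
def covariance_floats(dim, diagonal):
    """Distinct entries stored for one covariance matrix of size dim."""
    if diagonal:
        return dim                   # variances only (assumed storage scheme)
    return dim * (dim + 1) // 2      # full symmetric matrix


NUM_MATRICES = 40 * 10 * 12          # M x S x G from the example above
DIM = 10                             # L from the example above

full_total = NUM_MATRICES * covariance_floats(DIM, diagonal=False)
diag_total = NUM_MATRICES * covariance_floats(DIM, diagonal=True)

print(full_total, diag_total, full_total / diag_total)  # 264000 48000 5.5
```

Under these assumptions the diagonal restriction reduces covariance storage from 264,000 to 48,000 floating point numbers, a factor of 5.5.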
The training limitation of fully heteroscedastic HMM structures may be even more important than the hardware costs in some product applications. Obtaining reliable statistical estimates of very large HMM parameter sets requires enormous amounts of data. In the example discussed above, 264,000 parameters specify the set of covariance matrices alone, and this does not include the mixing proportions and mean vectors required for each Gaussian component. Clearly, it is very difficult and time-consuming to collect and process the extensive training sets required for estimating general heteroscedastic HMMs for each acoustic/phonetic unit. Not only does extensive data collection contribute to the final product cost, it also inhibits the ease of use of the speech recognizer product, especially in speaker-adaptive recognition applications.
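To give a rough sense of the size of the full parameter set, the following sketch counts the covariance, mean, and mixing-proportion parameters for the example dimensions used above. It assumes one mean vector of length L and one mixing proportion per Gaussian component, ignores the constraint that the mixing proportions of a state sum to one, and omits the Markov chain transition probabilities; the function name is illustrative only.

```python
def mixture_hmm_parameter_counts(num_models=40, states=10, gaussians=12, dim=10):
    """Rough parameter counts for Gaussian-mixture state distributions.

    Assumes one full covariance matrix, one mean vector, and one mixing
    proportion per component; transition probabilities are not counted.
    """
    components = num_models * states * gaussians
    counts = {
        "covariance": components * dim * (dim + 1) // 2,
        "mean": components * dim,
        "mixing": components,
    }
    counts["total"] = sum(counts.values())
    return counts


print(mixture_hmm_parameter_counts())
# {'covariance': 264000, 'mean': 48000, 'mixing': 4800, 'total': 316800}
```

Under these assumptions the covariance matrices account for 264,000 of roughly 316,800 parameters, so they dominate both the storage and the training burden.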
There are a number of patented speech recognition systems which employ hidden Markov models. One such system is illustrated in U.S. Pat. No. 4,852,180 to Levinson. In this system, Levinson uses a single Gaussian probability density function to model the random observation produced by a state in the Markov chain and a Gamma probability density function to model the length of time or duration the speech unit spends in this state of the Markov chain.
Another speech recognition system employing hidden Markov models is shown in U.S. Pat. No. 5,029,212 to Yoshida. This patent primarily deals with the recognition phase of speech recognition. The invention described therein is directed to a method of computing the likelihood that an observation is a particular speech unit. It uses discrete probability densities and not continuous probability density functions.
U.S. Pat. No. 5,031,217 to Nishimura uses vector quantization methods and discrete probability density functions in the hidden Markov models used to model speech units.
Accordingly, it is an object of the present invention to provide an improved method for training a speech recognizer.
It is a further object of the present invention to provide a method as above which has reduced storage requirements and which requires a reduced amount of training data.
It is yet a further object of the present invention to provide a method as above which has a reduced cost and enhanced consumer appeal and satisfaction.