The present invention is directed to speech recognition training. More particularly, the present invention is directed to speech recognition training using Hidden Markov Models.
A popular approach to performing speech recognition is to use Hidden Markov Models (HMMs). An HMM is a probabilistic function of a Markov chain and can be defined as $\{S, X, \Pi, A, B\}$, where $S = \{s_1, s_2, \ldots, s_n\}$ are the Markov chain states, $X$ denotes the HMM output (observation) set, $\Pi$ is a vector of state initial probabilities, $A = [a_{ij}]_{n,n}$ is a matrix of state transition probabilities ($a_{ij} = \Pr\{s_j \mid s_i\}$), and $B(x) = \mathrm{diag}\{b_j(x)\}$ is a diagonal matrix of the conditional probability densities of the output $x \in X$ in state $s_j$. If $X$ is discrete, $B(x)$ is a matrix of probabilities ($b_j(x) = \Pr\{x \mid s_j\}$). Without loss of generality, states are denoted by their indices ($s_i = i$).
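The definition above can be made concrete with a small discrete-output example. The following is a minimal sketch, not an implementation taken from this disclosure: the class name `DiscreteHMM`, its field names, and the numeric values are all hypothetical. It also assumes one common convention, in which the first output is emitted from the initial state, so that the likelihood of a sequence is the matrix product $\Pi B(x_1) A B(x_2) \cdots A B(x_T) \mathbf{1}$.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class DiscreteHMM:
    """Discrete-output HMM {S, X, Pi, A, B} with states indexed 0..n-1."""
    pi: List[float]        # pi[i]   = Pr{initial state is i}
    A: List[List[float]]   # A[i][j] = Pr{next state j | current state i}
    b: List[List[float]]   # b[j][x] = Pr{output x | state j}

    def validate(self, tol: float = 1e-9) -> None:
        """Check that pi and every row of A and b is a probability vector."""
        n = len(self.pi)
        if len(self.A) != n or len(self.b) != n:
            raise ValueError("pi, A and b must agree on the number of states")
        if any(len(row) != n for row in self.A):
            raise ValueError("A must be an n-by-n matrix")
        for row in [self.pi, *self.A, *self.b]:
            if any(p < 0 for p in row) or abs(sum(row) - 1.0) > tol:
                raise ValueError("each row must be a probability vector")

    def likelihood(self, xs: Sequence[int]) -> float:
        """Pr{x_1..x_T} under the assumed emission convention."""
        if not xs:
            return 1.0
        n = len(self.pi)
        # alpha[j] = Pr{x_1..x_t, state at time t is j}; start with Pi B(x_1).
        alpha = [self.pi[j] * self.b[j][xs[0]] for j in range(n)]
        for x in xs[1:]:
            # Row vector times A, then times the diagonal matrix B(x).
            alpha = [
                sum(alpha[i] * self.A[i][j] for i in range(n)) * self.b[j][x]
                for j in range(n)
            ]
        return sum(alpha)  # right-multiplication by the all-ones column


# Illustrative two-state, two-symbol model (values chosen arbitrarily).
hmm = DiscreteHMM(
    pi=[0.6, 0.4],
    A=[[0.7, 0.3], [0.2, 0.8]],
    b=[[0.9, 0.1], [0.3, 0.7]],
)
hmm.validate()
p = hmm.likelihood([0, 1, 1])
```

Because $B(x)$ is diagonal, multiplying by it reduces to an element-wise scaling of the row vector, which is why the inner loop needs only one multiplication per state.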
In order for a device to perform speech recognition, that device must first fit HMMs to experimental data, which entails generating model parameters. This process is referred to as "training" the speech recognition device.
There are a number of well-known ways of building a Hidden Markov Model for speech recognition. For example, as set forth in L. Rabiner et al., "Fundamentals of Speech Recognition", Chapter 6, Section 6.15, a simple isolated word recognition model can be created by assigning each word in a vocabulary a separate model, and estimating the model parameters ($A$, $B$, $\pi$) that optimize the likelihood of the training set observation vectors for that particular word. For each unknown word to be recognized, the system (a) carries out measurements to create an observation sequence X via feature analysis of the speech corresponding to the word; (b) calculates the likelihood for all possible word models; and (c) selects the word whose model likelihood is highest. Examples of other speech recognition systems using Hidden Markov Models can be found in Rabiner et al. and in U.S. Pat. Nos. 4,587,670 to Levinson et al. (reissued as Re33,597) and 4,783,804 to Juang et al., which are incorporated by reference herein.
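Steps (b) and (c) of the isolated word procedure can be sketched as follows. This is an illustration under stated assumptions rather than the method of any cited reference: step (a), the feature analysis, is assumed to have already produced a sequence of integer symbols; the function names `log_likelihood` and `recognize`, the two-word vocabulary, and all parameter values are hypothetical. The likelihood is computed with a scaled forward recursion in the log domain so that long sequences do not underflow.

```python
import math
from typing import Dict, List, Sequence, Tuple

# (pi, A, b) for a discrete-output HMM; b[j][x] = Pr{output x | state j}.
Model = Tuple[List[float], List[List[float]], List[List[float]]]


def log_likelihood(model: Model, xs: Sequence[int]) -> float:
    """Return log Pr{xs | model} using a scaled forward recursion."""
    if not xs:
        raise ValueError("observation sequence must be non-empty")
    pi, A, b = model
    n = len(pi)
    alpha = [pi[j] * b[j][xs[0]] for j in range(n)]
    log_p = 0.0
    for t in range(len(xs)):
        if t > 0:
            # Propagate through the transitions, then weight by the output.
            alpha = [
                sum(alpha[i] * A[i][j] for i in range(n)) * b[j][xs[t]]
                for j in range(n)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return -math.inf  # sequence is impossible under this model
        alpha = [a / scale for a in alpha]  # renormalize to avoid underflow
        log_p += math.log(scale)
    return log_p


def recognize(models: Dict[str, Model], xs: Sequence[int]) -> str:
    """Step (b): score every word model; step (c): pick the best one."""
    return max(models, key=lambda word: log_likelihood(models[word], xs))


# Hypothetical two-word vocabulary; each word has its own trained model.
transitions = [[0.9, 0.1], [0.1, 0.9]]
models: Dict[str, Model] = {
    "yes": ([0.5, 0.5], transitions, [[0.9, 0.1], [0.7, 0.3]]),
    "no": ([0.5, 0.5], transitions, [[0.1, 0.9], [0.3, 0.7]]),
}
best = recognize(models, [0, 0, 1, 0])
```

In this toy vocabulary the "yes" model favors symbol 0 in every state and the "no" model favors symbol 1, so a sequence dominated by symbol 0 is assigned to "yes".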
There are various known methods to perform training using HMMs by optimizing a certain criterion (e.g., a likelihood function, an a posteriori probability, an average discrimination measure, etc.). However, these known methods all have drawbacks. For example, known methods that use the Newton-Raphson algorithm or the Conjugate Gradient algorithm, both of which are disclosed in W. H. Press, et al., "Numerical Recipes in C", Cambridge University Press (1992), converge quickly in a small vicinity of the optimum, but are not robust. Therefore, if the parameter values are not very close to the optimum, these methods might not converge.
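The lack of robustness described above can be seen even in one dimension. The sketch below is purely illustrative and does not involve an HMM criterion: the objective $f(x) = x \arctan(x) - \tfrac{1}{2}\ln(1 + x^2)$ is a toy convex function chosen as an assumption because its derivatives are simple, and the function name `newton_minimize` and its thresholds are hypothetical.

```python
import math
from typing import Tuple


def newton_minimize(x0: float, steps: int = 20, blowup: float = 1e6) -> Tuple[float, bool]:
    """Newton-Raphson on f(x) = x*atan(x) - 0.5*log(1 + x*x), minimum at x = 0.

    Uses f'(x) = atan(x) and f''(x) = 1 / (1 + x*x).
    Returns (final x, whether the iteration converged).
    """
    x = x0
    for _ in range(steps):
        grad = math.atan(x)
        hess = 1.0 / (1.0 + x * x)
        x = x - grad / hess  # full Newton step, no line search
        if abs(x) < 1e-12:
            return x, True
        if abs(x) > blowup:
            return x, False  # iterates are running away from the optimum
    return x, False


near = newton_minimize(0.5)  # start close to the optimum
far = newton_minimize(2.0)   # start further away
```

Started near the minimum, the iteration reaches it in a handful of steps; started further away, each full Newton step overshoots by more than the last, even though the function is convex with a single minimum.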
Further, a known method that uses the Baum-Welch algorithm in conjunction with the forward-backward algorithm (the "Baum-Welch" method) is disclosed in L. E. Baum et al., "A Maximization Technique Occurring in the Statistical Analysis of Probabilistic Functions of Markov Chains", Ann. Math. Statist., 41, pp. 164-171 (1970). Training using this method converges slowly and requires a large amount of memory. Therefore, training using this method must be implemented on a powerful computer with a large amount of memory.
Various approaches are known that speed up the Baum-Welch method. For example, W. Turin, "Fitting Probabilistic Automata via the EM Algorithm", Commun. Statist.--Stochastic Models, 12, No. 3, (1996) pp. 405-424 discloses that the speed of the forward-backward algorithm can be increased if observation sequences have repeated patterns and, in particular, long stretches of repeated observations. S. Sivaprakasam et al., "A Forward-Only Procedure for Estimating Hidden Markov Models", GLOBECOM (1995) discloses that in the case of discrete observations, a forward-only algorithm can be used that is equivalent to the forward-backward algorithm. However, these known approaches apply only in specialized situations (i.e., long stretches of repeated observations and discrete observations, respectively).
Based on the foregoing, there is a need for a speech recognition training method and apparatus for generalized situations that is robust and does not require a large amount of memory.