I. Field of the Invention
The present invention relates to apparatus and method for training the statistics of a Markov Model speech recognizer to a subsequent speaker after the recognizer has been trained for a reference speaker.
II. Description of the Problem
One approach to speech recognition involves the use of Hidden Markov Models (HMM). Hidden Markov Models have been discussed in various articles such as: "Continuous Speech Recognition by Statistical Methods" by F. Jelinek, Proceedings of the IEEE, volume 64, number 4, 1976 and "A Maximum Likelihood Approach to Continuous Speech Recognition", by L. R. Bahl, F. Jelinek, and R. L. Mercer, IEEE Transactions on Pattern Analysis and Machine Intelligence, volume PAMI-5, number 2, March 1983. These articles are incorporated herein by reference.
In performing speech recognition based on Hidden Markov Models, successive intervals of speech are examined by an acoustic processor with respect to various predefined characteristics of speech. For example, respective amplitudes for each of various energy frequency bands are determined for each time interval. Each respective amplitude represents a component, or feature. Together, the components combine to form a feature vector.
The acoustic processor defines a finite set of prototype, or reference, vectors. Each prototype vector has a unique label which identifies it. The feature vector at each successive time interval is compared with each prototype vector. Based on a prescribed distance measure, the closest prototype is selected. Hence, for each time interval a prototype vector (which most closely represents the feature vector of the interval) is selected. As speech is uttered, the acoustic processor provides as output a string of labels.
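By way of illustration only, the labelling step described above may be sketched as follows. The sketch assumes a Euclidean distance as the "prescribed distance measure"; the function name `label_frames`, the prototype labels, and the two-component feature vectors are hypothetical and are chosen solely for the example.

```python
import math

def label_frames(feature_vectors, prototypes):
    """Return, for each feature vector, the label of the closest prototype.

    Assumption: Euclidean distance stands in for the prescribed
    distance measure, which the text does not specify.
    """
    labels = []
    for vec in feature_vectors:
        closest = min(prototypes, key=lambda name: math.dist(vec, prototypes[name]))
        labels.append(closest)
    return labels

# Hypothetical prototype vectors, each identified by a unique label.
prototypes = {"F1": (0.0, 0.0), "F2": (1.0, 1.0), "F3": (4.0, 0.0)}

# Hypothetical feature vectors, one per successive time interval.
frames = [(0.1, -0.2), (0.9, 1.2), (3.5, 0.4), (0.6, 0.7)]

labels = label_frames(frames, prototypes)
print(labels)
```

The output is one label per time interval, that is, the string of labels which the acoustic processor provides.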
In accordance with Markov model speech recognition, a set of Markov models is defined. Typically, such Markov models have corresponded one-to-one with phonetic elements. For eighty phonetic elements, then, there are eighty respective Markov models. The Markov models corresponding to successive phonetic elements of a word can be concatenated to form a Markov model baseform for the word.
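The concatenation of phonetic Markov models into a word baseform may be sketched, under stated assumptions, as follows. The sketch assumes that each model has a single initial state (numbered 0) and a single final state (the highest-numbered state), and that the final state of one model is merged with the initial state of the next. The two-state topology, the phone names, and the function name `concatenate` are hypothetical.

```python
def concatenate(phone_models):
    """Chain phone models into one baseform.

    Each model is (name, n_states, arcs), where arcs is a list of
    (from_state, to_state) pairs local to that model. Assumption: the
    final state of each model is shared with the initial state of the
    following model.
    """
    word_arcs = []
    offset = 0
    for name, n_states, arcs in phone_models:
        for i, j in arcs:
            word_arcs.append((i + offset, j + offset, name))
        offset += n_states - 1  # the boundary state is shared
    return offset + 1, word_arcs

# Hypothetical two-state skeleton: a self-loop and a forward transition.
skeleton = [(0, 0), (0, 1)]
phones = [("K", 2, skeleton), ("AE", 2, skeleton), ("T", 2, skeleton)]

word_states, word_arcs = concatenate(phones)
print(word_states, word_arcs)
```

Each arc of the resulting baseform retains the name of the phonetic model from which it came, so that the probabilities of that model can be applied to it.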
Each Markov model is characterized as having a plurality of states and a plurality of transitions. Each transition extends from a state to a state. At least some transitions represent a time interval during which a prototype vector label is selected by the acoustic processor. For each transition, there is an associated transition probability and, in some cases, there are associated output probabilities. The transition probabilities indicate the likelihood of following a given transition in a Markov model. The output probabilities indicate the likelihood of a certain output label (e.g., prototype vector label) being produced during a given transition.
For a certain transition A.sub.ij extending from state i to state j, there is an associated transition probability P(A.sub.ij) and, where there are 200 different prototype vectors, there are 200 associated output probabilities: P(F.sub.1 |A.sub.ij), P(F.sub.2 |A.sub.ij), . . . , P(F.sub.200 |A.sub.ij), where F.sub.k denotes the label of the kth prototype vector. Normally, but not necessarily, the skeletal structure of states with connecting transitions (without probability values assigned) is the same for each Markov model.
For a given speaker, the various respective Markov models for the different phonetic elements differ typically in the values of the probabilities associated therewith. In order to be operative, the various transition probabilities and output probabilities for each Markov model must be determined.
The physical implementation of the Markov model is referred to as a "phone machine", or Markov model phone machine. The phone machine for a corresponding phonetic element includes memory locations for storing the transition probabilities, output probabilities, shape of the phone machine, identifiers indicating which phonetic element is represented thereby, and other such information which characterizes the respective Markov model.
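The information stored for a phone machine may be pictured, purely as a sketch, with the following data structure. The class names `Transition` and `PhoneMachine`, the field names, and the probability values are hypothetical. The sketch further assumes that the transition probabilities leaving a state sum to one, and it represents a transition having no output probabilities by an empty table.

```python
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class Transition:
    source: int                      # state i
    target: int                      # state j
    prob: float                      # transition probability P(A_ij)
    # label -> P(label | A_ij); left empty for a transition without outputs
    output_probs: dict = field(default_factory=dict)

@dataclass
class PhoneMachine:
    phone: str                       # identifier of the phonetic element
    n_states: int
    transitions: list = field(default_factory=list)

    def is_consistent(self, tol=1e-9):
        """Check that each stored probability table sums to one."""
        leaving = defaultdict(float)
        for t in self.transitions:
            leaving[t.source] += t.prob
            if t.output_probs and abs(sum(t.output_probs.values()) - 1.0) > tol:
                return False
        return all(abs(total - 1.0) <= tol for total in leaving.values())

# Hypothetical phone machine with two labels instead of 200.
machine = PhoneMachine(
    phone="AE",
    n_states=2,
    transitions=[
        Transition(0, 0, 0.6, {"F1": 0.7, "F2": 0.3}),
        Transition(0, 1, 0.3, {"F1": 0.2, "F2": 0.8}),
        Transition(0, 1, 0.1),       # transition without output probabilities
    ],
)
print(machine.is_consistent())
```

With 200 prototype vectors, each `output_probs` table of a transition that produces labels would hold 200 entries.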
The process of determining the transition probabilities and output probabilities so that they may be stored for phone machines is referred to as "training."
Typically, a distinct set of transition probabilities and output probabilities must be determined for each speaker. That is, for each speaker, the speech recognizer stores data (e.g., transition probability values and output probability values) for a respective set of phone machines.
The conventional approach to training is for a speaker to utter a known sample text into an acoustic processor. The sample text represents a known sequence of phonetic elements and, hence, a known corresponding sequence of phone machines. The acoustic processor generates a string of prototype labels in response to the uttered speech input. Improved probability values can be determined from the string of prototype labels generated for the known sample text and from initially set values of the transition probabilities and output probabilities (which may not reflect actual speech characteristics). Specifically, a forward-backward algorithm, or Baum-Welch algorithm, is applied to produce transition counts and output counts; transition probabilities and output probabilities are derived therefrom; the forward-backward algorithm is applied again with the derived probabilities to produce updated counts; and so on over a number of iterations. The probability values after the last iteration are referred to herein as "basic" transition probabilities and "basic" output probabilities.
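The iterative procedure described above may be sketched as follows for a small model in which every transition produces one label. This is a minimal illustration under stated assumptions and not the implementation of any particular recognizer: the function name `reestimate`, the three-state model, the two labels, and the uniform initial values are hypothetical, and the model is assumed to start in state 0 and to end in the highest-numbered state.

```python
def reestimate(arcs, trans, out, labels, n_states):
    """One forward-backward pass: counts, then re-derived probabilities.

    arcs[a] = (i, j); trans[a] = P(A_ij); out[a] = {label: P(label | A_ij)}.
    Returns (new_trans, new_out, likelihood under the input probabilities).
    """
    T, final = len(labels), n_states - 1
    alpha = [[0.0] * n_states for _ in range(T + 1)]
    beta = [[0.0] * n_states for _ in range(T + 1)]
    alpha[0][0] = 1.0
    beta[T][final] = 1.0
    for t in range(T):                      # forward pass
        for a, (i, j) in enumerate(arcs):
            alpha[t + 1][j] += alpha[t][i] * trans[a] * out[a].get(labels[t], 0.0)
    for t in range(T - 1, -1, -1):          # backward pass
        for a, (i, j) in enumerate(arcs):
            beta[t][i] += trans[a] * out[a].get(labels[t], 0.0) * beta[t + 1][j]
    likelihood = alpha[T][final]

    # Expected transition counts and output counts.
    t_count = [0.0] * len(arcs)
    o_count = [dict() for _ in arcs]
    for t in range(T):
        for a, (i, j) in enumerate(arcs):
            g = (alpha[t][i] * trans[a] * out[a].get(labels[t], 0.0)
                 * beta[t + 1][j] / likelihood)
            t_count[a] += g
            o_count[a][labels[t]] = o_count[a].get(labels[t], 0.0) + g

    # Derive probabilities from the counts (old values kept if a count is zero).
    new_trans, new_out = [], []
    for a, (i, _) in enumerate(arcs):
        leaving = sum(t_count[k] for k, (src, _) in enumerate(arcs) if src == i)
        new_trans.append(t_count[a] / leaving if leaving > 0 else trans[a])
        if t_count[a] > 0:
            new_out.append({lab: c / t_count[a] for lab, c in o_count[a].items()})
        else:
            new_out.append(dict(out[a]))
    return new_trans, new_out, likelihood

# Hypothetical model: states 0, 1, 2 with a self-loop and a forward arc each.
arcs = [(0, 0), (0, 1), (1, 1), (1, 2)]
trans = [0.5, 0.5, 0.5, 0.5]                       # initially set values
out = [{"F1": 0.5, "F2": 0.5} for _ in arcs]       # initially set values
labels = ["F1", "F1", "F2", "F2"]                  # label string for known text

likelihoods = []
for _ in range(5):
    trans, out, likelihood = reestimate(arcs, trans, out, labels, n_states=3)
    likelihoods.append(likelihood)
print(likelihoods)
```

Each pass examines every transition at every label interval, so the amount of computation grows with the length of the training text.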
In order to generate reasonably accurate "basic" probabilities, it is necessary for a speaker to utter a relatively long sample text, extending, for example, over 20 minutes.
In accordance with prior technology, each speaker would be required to utter the 20-minute sample text in order to train the speech recognizer to his/her speech.
A required training period of 20 minutes per speaker may be undesirably long and inconvenient.
Moreover, the amount of computing the speech recognizer must perform for 20 minutes of training text in accordance with the forward-backward algorithm in order to determine "basic" probabilities is excessive.
Accordingly, a significant problem in speaker dependent Markov model speech recognition has involved the lengthy period during which each speaker must utter text and the computationally costly process of applying the forward-backward algorithm to the full text for each speaker.
In a co-pending patent application by S. De Gennaro et al. entitled "Speech Recognition System", (Docket No. YO984-108), Ser. No. 06/845,155, filed Mar. 27, 1986, assigned to International Business Machines Corporation, some of the transitions are grouped together to have common output probabilities applied thereto. Although this grouping reduces the required amount of training data, the sample text has nonetheless remained lengthy when multiple speakers are to be recognized.
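The effect of grouping transitions so that they share common output probabilities may be illustrated, in general terms only, by pooling the output counts of the grouped transitions before normalizing. The sketch below is not taken from the referenced application; the function name `tied_output_probs`, the transition names, the group names, and the counts are hypothetical.

```python
from collections import defaultdict

def tied_output_probs(output_counts, group_of):
    """Pool output counts over each group and give every transition
    in the group the same output probabilities."""
    pooled = defaultdict(lambda: defaultdict(float))
    for arc, counts in output_counts.items():
        for label, count in counts.items():
            pooled[group_of[arc]][label] += count
    group_probs = {}
    for group, counts in pooled.items():
        total = sum(counts.values())
        group_probs[group] = (
            {label: c / total for label, c in counts.items()} if total else {}
        )
    return {arc: group_probs.get(group_of[arc], {}) for arc in output_counts}

# Hypothetical output counts for three transitions.
output_counts = {
    "a1": {"F1": 3.0, "F2": 1.0},
    "a2": {"F1": 1.0, "F2": 3.0},
    "a3": {"F2": 4.0},
}
# Hypothetical grouping: a1 and a2 share one set of output probabilities.
group_of = {"a1": "g1", "a2": "g1", "a3": "g2"}

shared = tied_output_probs(output_counts, group_of)
print(shared)
```

Because the grouped transitions contribute to a single pooled table, fewer distinct probabilities must be estimated from the same amount of training data.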