Hierarchical Linear Regression (HLR) (e.g., MLLR [See C. J. Leggetter and P. C. Woodland, “Maximum likelihood linear regression for speaker adaptation of continuous density HMMs,” Computer Speech and Language, 9(2):171–185, 1995]) is now a common technique for transforming Hidden Markov Models (HMMs) for use in an acoustic environment different from the one in which the models were initially trained. The environment encompasses factors such as speaker accent, speaker vocal tract, background noise, recording device, and transmission channel. HLR substantially reduces word error rate (WER) by reducing the mismatch between the training and testing environments [See C. J. Leggetter cited above].
HLR is a process that creates a set of transforms that can be used to adapt any subset of an initial set of HMMs to a new acoustic environment. We refer to the new environment as the “target environment”, and to the adapted subset of HMMs as the “target models”. The HLR adaptation process requires that some adaptation speech data be collected from the new environment and converted into sequences of frames of speech parameter vectors using well-known techniques. For example, to create a set of transforms that adapts an initial set of speaker-independent HMMs to a particular speaker using a particular microphone, adaptation speech data must be collected from that speaker and microphone, and then converted into frames of parameter vectors, such as the well-known cepstral vectors.
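As an illustration of what applying such a transform to a subset of models involves, the following minimal sketch assumes the mean-only affine form used in MLLR, in which each Gaussian mean vector mu is replaced by A·mu + b. The state names, mean values, and transform values are hypothetical and chosen only for illustration; in practice the transform would be estimated from the adaptation speech data.

```python
def apply_affine(A, b, mu):
    """Return A @ mu + b, with A given as a list of rows and b, mu as lists."""
    return [sum(a_ij * m_j for a_ij, m_j in zip(row, mu)) + b_i
            for row, b_i in zip(A, b)]


def adapt_subset(means, target_states, A, b):
    """Adapt only the requested states; the initial means are left unchanged."""
    return {s: apply_affine(A, b, means[s]) for s in target_states}


# Hypothetical initial model set: state name -> Gaussian mean vector.
initial_means = {
    "s1": [1.0, 2.0],
    "s2": [0.0, -1.0],
    "s3": [3.0, 0.5],
}

# Hypothetical transform (values invented, not estimated from real data).
A = [[1.0, 0.5],
     [0.0, 2.0]]
b = [0.5, -1.0]

# The "target models" are the adapted subset of the initial set.
target_means = adapt_subset(initial_means, ["s1", "s3"], A, b)
print(target_means)
```

Because the adapted means are derived on demand from the initial means and the transform, only the states actually needed in the target environment have to be computed.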
There are two well-known HLR methods for creating a set of transforms. In the first method, the adaptation speech data is aligned to the states of the initial set of HMMs using well-known HMM Viterbi recognition alignment methods. A regression tree is formed which defines a hierarchical mapping from the states of the initial HMM set to linear transforms. The set of linear transforms is then determined that adapts the initial HMM set so as to increase the likelihood of the adaptation speech data. While this method improves speech recognition performance over the unadapted models, further improvement is possible.
The second method uses the fact that transforming the initial HMM set by the first set of linear transforms yields a second set of HMMs. This second set of HMMs can be used to generate a new alignment of the adaptation speech data. It is then possible to repeat the process of determining a set of linear transforms, this time one that further adapts the second HMM set so as to increase the likelihood of the adaptation data. This process can be repeated iteratively to continue improving the likelihood. However, it requires that after each iteration either a complete new set of HMMs be stored, or each new set of linear transforms be stored so that the new HMM set can be iteratively derived from the initial HMM set. This can be prohibitive in terms of memory storage resources.
The subject of this invention is a novel implementation of the second method in which only the initial HMM set and a single set of linear transforms must be stored, while maintaining exactly the performance improvement of the second method and reducing the processing required. This is important in applications where memory and processing time are critical and limited resources.
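The storage trade-off described above can be made concrete with a small sketch. It relies only on the general fact that two affine transforms applied in sequence are equivalent to one affine transform: A2(A1·mu + b1) + b2 = (A2·A1)·mu + (A2·b1 + b2). The sketch assumes mean-only affine transforms and that a given state uses the same regression class in both iterations; it is not presented as the implementation of the invention, and all names and values are hypothetical.

```python
import math


def mat_vec(A, v):
    """Matrix-vector product, with A given as a list of rows."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def mat_mat(A, B):
    """Matrix-matrix product A @ B for lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * c for a, c in zip(row, col)) for col in cols] for row in A]


def apply_affine(transform, mu):
    """Apply (A, b) to a mean vector: A @ mu + b."""
    A, b = transform
    return [y + b_i for y, b_i in zip(mat_vec(A, mu), b)]


def compose(second, first):
    """Return one (A, b) equivalent to applying `first`, then `second`."""
    A2, b2 = second
    A1, b1 = first
    A = mat_mat(A2, A1)
    b = [x + y for x, y in zip(mat_vec(A2, b1), b2)]
    return A, b


# Hypothetical transforms from two successive adaptation iterations.
first = ([[1.0, 0.5], [0.0, 2.0]], [0.5, -1.0])
second = ([[2.0, 0.0], [1.0, 1.0]], [0.0, 1.0])
mu = [1.0, 2.0]  # hypothetical mean from the initial HMM set

# Option 1: keep every iteration's transform and apply them in order.
step_by_step = apply_affine(second, apply_affine(first, mu))

# Option 2: keep a single accumulated transform.
single = compose(second, first)
accumulated = apply_affine(single, mu)

assert all(math.isclose(x, y) for x, y in zip(step_by_step, accumulated))
print(accumulated)
```

Under these assumptions the storage cost stays at one transform per regression class regardless of the number of iterations, whereas keeping every iteration's transforms grows linearly with the iteration count.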