The present invention relates to acoustic models in a speech recognition system. More specifically, the present invention relates to adaptation of compressed Gaussian models used in computer-implemented speech recognition.
A speech recognition system receives a speech signal and attempts to decode the speech signal to identify a string of words represented by the speech signal. Conventional speech recognizers include, among other things, an acoustic model and a language model, both usually formed from training data. The acoustic model models the acoustic features of speech units (such as phonemes) based on the training data. The language model models word order as found in the training data.
When the speech signal is received for speech recognition, acoustic features are extracted from the speech signal and compared against the models in the acoustic model to identify speech units contained in the speech signal. Potential words are compared against the language model to determine the probability that a word was spoken, given its history (or context).
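As a rough illustration of "the probability that a word was spoken, given its history", the sketch below estimates bigram probabilities from a toy word sequence. The bigram form, the toy training text, and the name `bigram_probability` are assumptions made for illustration only; the text above does not say which kind of language model is used.

```python
from collections import Counter

# Toy training data (hypothetical); a real language model uses far more text.
training_text = "the cat sat on the mat the cat ate".split()

# Count each word as a history (every word except the last has a successor).
history_counts = Counter(training_text[:-1])
# Count each adjacent (previous word, word) pair.
bigram_counts = Counter(zip(training_text, training_text[1:]))

def bigram_probability(word, previous):
    """Maximum-likelihood estimate of P(word | previous); 0.0 for an unseen history."""
    if history_counts[previous] == 0:
        return 0.0
    return bigram_counts[(previous, word)] / history_counts[previous]

p_cat = bigram_probability("cat", "the")  # "the" occurs 3 times as a history, 2 of them before "cat"
p_dog = bigram_probability("dog", "the")  # pair never seen in the toy data
```

In this sketch the history is a single preceding word; longer histories follow the same counting pattern.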
It is often desirable to design speech recognizers so that they can be used on computer systems with less processing power and/or less memory without losing speech recognition accuracy. One significant memory-intensive portion of a speech recognition system is the stored acoustic model. In a Hidden Markov Model (HMM) based speech recognition system, the acoustic model commonly consists of tens of thousands of multi-dimensional Gaussian probability distributions with diagonal covariance matrices. For example, the Gaussian distributions can each have 39 dimensions. Each dimension requires a mean and a variance. Therefore, if a model has 40,000 Gaussians of 39 dimensions, and each mean and each variance is stored as a four-byte floating-point value, as is typical, the model would take over ten megabytes (about 12.5 MB) to store.
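The storage estimate above can be checked with a few lines of arithmetic. The function and constant names below are hypothetical; the figures are the ones given in the example.

```python
# Figures taken from the example in the text.
NUM_GAUSSIANS = 40_000
DIMENSIONS = 39
PARAMS_PER_DIMENSION = 2  # one mean and one variance per dimension
BYTES_PER_FLOAT = 4       # four-byte floating-point value

def model_size_bytes(num_gaussians, dimensions, bytes_per_param):
    """Bytes needed to store every mean and variance of a diagonal-covariance model."""
    return num_gaussians * dimensions * PARAMS_PER_DIMENSION * bytes_per_param

uncompressed = model_size_bytes(NUM_GAUSSIANS, DIMENSIONS, BYTES_PER_FLOAT)
print(f"{uncompressed:,} bytes = {uncompressed / 1e6:.2f} MB")
# prints "12,480,000 bytes = 12.48 MB"
```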
Storing each mean and each variance in a single byte can be done with scalar quantization; this often results in no degradation in error rate and yields a factor of 4 compression (the model in the example above would shrink to about 3.1 MB). One such type of scalar quantization is linear scalar quantization, which can be done by finding the maximum and minimum value of each parameter and linearly quantizing the points in between.
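A minimal sketch of linear scalar quantization as described above, assuming 256 evenly spaced levels between the minimum and maximum of one parameter so that each value fits in a single byte. The function names and the sample values are hypothetical.

```python
def linear_quantize(values, levels=256):
    """Map each float to an integer code in [0, levels - 1], one byte per value."""
    assert 2 <= levels <= 256, "codes must fit in a single byte"
    lo, hi = min(values), max(values)
    # Width of one quantization step; fall back to 1.0 when all values are equal.
    step = (hi - lo) / (levels - 1) if hi > lo else 1.0
    codes = bytes(
        min(levels - 1, max(0, round((v - lo) / step))) for v in values
    )
    return codes, lo, step

def linear_dequantize(codes, lo, step):
    """Recover approximate floats from the byte codes."""
    return [lo + c * step for c in codes]

# Hypothetical values of one parameter (e.g. one mean dimension) across Gaussians.
means = [-3.2, -0.5, 0.0, 1.25, 4.8]
codes, lo, step = linear_quantize(means)
restored = linear_dequantize(codes, lo, step)
max_error = max(abs(a - b) for a, b in zip(means, restored))
```

Besides the byte codes, only `lo` and `step` need to be kept per parameter, and the reconstruction error of any value is at most half a step.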
Known clustering techniques can also be used to compress the acoustic model so that it takes less memory to store. Generally, this approach is referred to as subspace coding and involves grouping different dimensions together. Typically, the representative Gaussian distributions are stored in a codebook for each group of dimensions. The codebooks are stored to form the acoustic model and are accessed through an index during speech recognition to process an input signal.
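The codebook-and-index arrangement can be sketched as follows. This is a simplified illustration, not the clustering procedure of any particular system: it assumes plain Euclidean k-means, clusters only the means (variances are ignored), and uses hypothetical names, group boundaries, and codebook sizes.

```python
import random

def kmeans(points, k, iterations=20, seed=0):
    """Plain Euclidean k-means; returns (centroids, assignment)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        for i, p in enumerate(points):
            assignment[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Move each centroid to the average of the points assigned to it.
        for c in range(k):
            members = [points[i] for i, a in enumerate(assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignment

def subspace_code(means, groups, codebook_size):
    """Build one codebook per group of dimensions and index every Gaussian into it."""
    codebooks, indices = [], []
    for dims in groups:
        sub_vectors = [[m[d] for d in dims] for m in means]
        centroids, assignment = kmeans(sub_vectors, codebook_size)
        codebooks.append(centroids)
        indices.append(assignment)
    return codebooks, indices

def reconstruct_mean(g, groups, codebooks, indices, num_dims):
    """Look up the approximate mean of Gaussian g through its codeword indices."""
    mean = [0.0] * num_dims
    for dims, book, idx in zip(groups, codebooks, indices):
        for d, value in zip(dims, book[idx[g]]):
            mean[d] = value
    return mean

# Hypothetical toy model: 200 Gaussians, 4 dimensions, 2 groups of 2 dimensions.
rng = random.Random(1)
NUM_GAUSSIANS, NUM_DIMS = 200, 4
means = [[rng.gauss(0.0, 1.0) for _ in range(NUM_DIMS)] for _ in range(NUM_GAUSSIANS)]
groups = [(0, 1), (2, 3)]
codebooks, indices = subspace_code(means, groups, codebook_size=8)
approx = reconstruct_mean(0, groups, codebooks, indices, NUM_DIMS)
```

In this toy setup each Gaussian is stored as two small indices instead of four floating-point means, and the two eight-entry codebooks are shared by all 200 Gaussians.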
Also, conventionally, acoustic models are trained using many different speakers. Those speakers can be, for example, male and female with different accents and having different voice pitches. The speakers may speak quickly or slowly. The acoustic models are trained using all of these types of speakers to obtain a speaker-independent acoustic model which works well across a broad range of users.
However, it is widely recognized that speaker-dependent acoustic models are more accurate for a given speaker than are speaker-independent acoustic models. In the past, in order to adapt acoustic models, training data was collected from the speaker to whom the model was to be adapted. Model transformations were then estimated and applied against the acoustic model. There are a variety of known ways for adapting acoustic models. One conventional technique for adapting conventional acoustic models is set out in Leggetter and Woodland, SPEAKER ADAPTATION OF CONTINUOUS DENSITY HMM USING MULTIVARIATE REGRESSION, Computer Speech and Language, volume 9, pages 171-185 (1994).
However, when models are compressed into subspaces, as discussed above, the Gaussians in the acoustic models are quantized within each subspace. Conventional speaker adaptation procedures (such as maximum likelihood linear regression, or MLLR) cannot be applied directly to such models, because the adapted means would no longer be compressed and would therefore require more memory.
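One way to see the problem is with a toy example. MLLR-style adaptation updates each mean with an affine transform, mu' = A mu + b. When A couples dimensions that belong to different subspaces, Gaussians that previously shared a codeword end up with different adapted sub-vectors, so the shared codebook no longer describes them. The codebooks, the matrix `A`, and the offset `b` below are made-up values for illustration; in practice the transform is estimated from the speaker's adaptation data.

```python
def affine_transform(mean, A, b):
    """MLLR-style mean update, mu' = A mu + b, written with plain lists."""
    return [sum(a * m for a, m in zip(row, mean)) + offset for row, offset in zip(A, b)]

# Two subspaces of two dimensions each, with two codewords per codebook (hypothetical).
codebook_0 = [(0.0, 1.0), (2.0, 3.0)]
codebook_1 = [(5.0, 5.0), (-1.0, 4.0)]

# Each Gaussian is stored as a pair of codeword indices, one per subspace.
gaussians = [(0, 0), (0, 1), (1, 0), (1, 1)]
means = [list(codebook_0[i] + codebook_1[j]) for i, j in gaussians]

# Hypothetical full transform: the 0.1 entries couple the two subspaces.
A = [
    [1.0, 0.0, 0.1, 0.0],
    [0.0, 1.0, 0.0, 0.1],
    [0.1, 0.0, 1.0, 0.0],
    [0.0, 0.1, 0.0, 1.0],
]
b = [0.5, -0.5, 0.0, 0.0]

adapted = [affine_transform(m, A, b) for m in means]

# Count the distinct sub-vectors needed for the first subspace before and after adaptation.
distinct_before = {tuple(m[:2]) for m in means}
distinct_after = {tuple(m[:2]) for m in adapted}
print(len(distinct_before), len(distinct_after))
# prints "2 4"
```

Before adaptation, two codewords cover the first subspace for all four Gaussians. After adaptation, every Gaussian needs its own sub-vector, which is the loss of compression described above.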