The present invention relates to classifiers used in computer processing. More particularly, the present invention relates to compression of gaussian models used in computer processing, such as those used in speech recognition.
A speech recognition system receives a speech signal and attempts to decode the speech signal to identify a string of words represented by the speech signal. Conventional speech recognizers include, among other things, an acoustic model and a language model, both usually formed from training data. The acoustic model models the acoustic features of speech units (such as phonemes) based on the training data. The language model models word order as found in the training data.
When the speech signal is received for speech recognition, acoustic features are extracted from the speech signal and compared against the models in the acoustic model to identify speech units contained in the speech signal. Potential words are compared against the language model to determine the probability that a word was spoken, given its history (or context).
It is often desirable to design speech recognizers so that they may be used with computer systems having less processing power and/or less memory without losing speech recognition accuracy. One significant memory-intensive portion of a speech recognition system is the storage of the acoustic model. In a Hidden Markov Model (HMM) based speech recognition system, the acoustic model commonly consists of tens of thousands of multi-dimensional gaussian probability distributions with diagonal covariance matrices. For example, the gaussian distributions can each be 33 dimensions. Each dimension requires a mean and a variance. Therefore, if a model has 40,000 gaussians of 33 dimensions, and the mean and the variance of each dimension are each stored as a four-byte floating-point value, as is typical, the acoustic model would take over ten megabytes to store.
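The storage estimate above can be checked with a short calculation. The following is only an illustrative sketch; the function name `model_size_bytes` and its parameter names are hypothetical and do not come from any particular recognizer.

```python
def model_size_bytes(num_gaussians, dims, bytes_per_value, values_per_dim=2):
    """Bytes needed to store a diagonal-covariance gaussian model.

    Each dimension of each gaussian stores `values_per_dim` values
    (a mean and a variance), each taking `bytes_per_value` bytes.
    """
    return num_gaussians * dims * values_per_dim * bytes_per_value


# Example from the text: 40,000 gaussians, 33 dimensions, 4-byte floats.
float_model_bytes = model_size_bytes(40_000, 33, 4)
print(float_model_bytes)            # 10560000
print(float_model_bytes / 2**20)    # size in units of 2**20 bytes
```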
Storing each mean with a byte and each variance with a byte can be done with scalar quantization and often results in no degradation in error rate and a factor of 4 compression (the model in the example above would be 2.5 MB). One such type of scalar quantization is linear scalar quantization, which can be done by finding the maximum and minimum value of each parameter and linearly quantizing the points in between.
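Linear scalar quantization as described above might be sketched as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the helper names `linear_quantize` and `dequantize`, the choice of 256 levels (one byte per value), and the sample data are all hypothetical.

```python
def linear_quantize(values, levels=256):
    """Map each value to an integer code in [0, levels - 1].

    The minimum and maximum of `values` define the range, and the
    points in between are quantized on a uniform (linear) grid.
    """
    lo, hi = min(values), max(values)
    # Guard against a zero-width range (all values identical).
    step = (hi - lo) / (levels - 1) if hi > lo else 1.0
    codes = [min(levels - 1, max(0, round((v - lo) / step))) for v in values]
    return codes, lo, step


def dequantize(codes, lo, step):
    """Recover approximate values from their integer codes."""
    return [lo + c * step for c in codes]


# Hypothetical means for one dimension across several gaussians.
means = [-1.0, -0.3, 0.0, 0.42, 1.0]
codes, lo, step = linear_quantize(means)
restored = dequantize(codes, lo, step)
```

With 256 levels, each code fits in a single byte, and the reconstruction error of any value is at most half a quantization step.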
Known clustering techniques can be used to compress the acoustic model so that it takes less memory to store. Generally, this technique is referred to as subspace coding and involves grouping different components together. Typically, the representative gaussian distributions are stored in a codebook for each dimension. The codebooks are stored to form the acoustic model and accessed during speech recognition to process an input signal. Because representative gaussians are used, some accuracy will be lost for the benefit of a smaller acoustic model. The further the model is compressed, the more accuracy will be degraded. Current techniques use Euclidean distance, which significantly reduces accuracy as soon as more than one component is grouped together.
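By way of illustration only, a per-dimension codebook built with Euclidean distance might be sketched as follows. The text above refers only to "known clustering techniques"; the use of k-means here, the function name `build_codebook`, the deterministic initialization, and the sample data are all assumptions made for the sketch.

```python
def build_codebook(pairs, k, iterations=20):
    """Cluster (mean, variance) pairs for one dimension into k representatives.

    Returns the codebook (k representative pairs) and, for each input
    gaussian, the index of the codebook entry that stands in for it.
    """
    # Assumed initialization: pick k points spread evenly through the list.
    centroids = [pairs[i * len(pairs) // k] for i in range(k)]
    assignments = [0] * len(pairs)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for i, (m, v) in enumerate(pairs):
            assignments[i] = min(
                range(k),
                key=lambda c: (m - centroids[c][0]) ** 2
                + (v - centroids[c][1]) ** 2,
            )
        # Update step: move each centroid to the average of its members.
        for c in range(k):
            members = [p for p, a in zip(pairs, assignments) if a == c]
            if members:
                centroids[c] = (
                    sum(m for m, _ in members) / len(members),
                    sum(v for _, v in members) / len(members),
                )
    return centroids, assignments


# Hypothetical (mean, variance) pairs for one dimension of four gaussians.
pairs = [(0.0, 1.0), (0.1, 1.1), (5.0, 2.0), (5.2, 2.1)]
codebook, indices = build_codebook(pairs, k=2)
```

In such a scheme, only the codebook and the small per-gaussian indices need to be stored, rather than a separate mean and variance for every gaussian.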
An improved method for compressing gaussian distributions while maintaining accuracy is therefore beneficial. A smaller yet more accurate model is particularly beneficial to speech recognition; however, other applications may also yield improved performance.