1. Field of Invention
The invention relates to unsupervised, discriminative, sentence-level adaptation of Hidden Markov Models (HMMs) based on speech-silence classification.
2. Description of Related Art
A large part of the speech recognition literature deals with the problems that noise, distortion, and variability in the speech waveform cause for real-world recognition systems. Various algorithms have been proposed to deal with these problems, such as cepstral mean normalization, maximum likelihood (ML) cepstrum bias normalization, ML frequency warping, and ML linear regression. These transformation-based techniques produce good results with a limited amount of adaptation data. Alternatively, the acoustic models can be retrained using maximum a posteriori (MAP) adaptation, which typically requires a large amount of adaptation data. Algorithms have also been proposed for updating groups of HMM parameters or for smoothing the re-estimated parameter values, such as vector field smoothing, classification trees, or state-based clustering of distributions. Parallel model combination (PMC) has also been used to combat both additive noise distortion and multiplicative (channel) distortion.
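As an illustration of the simplest of these transformation-based techniques, the following is a minimal sketch of cepstral mean normalization, assuming that a fixed channel appears as a constant additive bias on every cepstral frame of an utterance. The function name, the toy cepstral values, and the bias vector are hypothetical and are not taken from any particular recognizer.

```python
def cepstral_mean_normalize(frames):
    """Subtract the per-utterance mean from each cepstral coefficient.

    `frames` is a list of equal-length lists, one cepstral vector per frame.
    """
    if not frames:
        return []
    n = len(frames)
    dim = len(frames[0])
    # Mean of each coefficient over all frames of the utterance.
    mean = [sum(f[k] for f in frames) / n for k in range(dim)]
    return [[f[k] - mean[k] for k in range(dim)] for f in frames]


# Toy utterance: three frames of two cepstral coefficients (made-up values).
clean = [[1.0, 2.0], [3.0, 0.0], [2.0, 4.0]]

# Assumption: a fixed channel adds the same bias to every cepstral frame.
channel_bias = [0.5, -1.0]
distorted = [[c + b for c, b in zip(f, channel_bias)] for f in clean]

normalized_clean = cepstral_mean_normalize(clean)
normalized_distorted = cepstral_mean_normalize(distorted)
```

Under this assumption, the normalized clean and distorted utterances coincide, because the constant bias is absorbed into the utterance mean. A distortion that varies over time within the utterance is not removed by this operation.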
Typically, the aforementioned algorithms perform well on simulated data, i.e., when additive or multiplicative distortion is added to the speech signal in the laboratory, but not equally well in field trials, where a multitude of sources with time-varying characteristics can distort the speech signal simultaneously. In many cases, very little data is available for adaptation. Further, the adaptation data might not be transcribed. Numerous publications have shown that discriminatively trained HMMs improve recognition accuracy. However, the training process assumes that the linguistic context of the utterances is known. Unsupervised adaptation using very few utterances is a very difficult problem because there is no guarantee that the adapted parameters will converge to globally optimal values.
In addition, acoustic mismatch between training and testing conditions results in significant accuracy degradation in HMM-based speech recognizers. Careful inspection of the recognition errors shows that word insertion and substitution errors often occur as a result of poor recognition scores for acoustic segments containing low-energy phones. The underlying problem is that channel and noise mismatch have a relatively greater influence on the low-energy (low-amplitude) portions of the speech signal. Various blind deconvolution and bias removal schemes address this problem only in the context of the general mismatch of the whole speech signal. The focus should instead lie on the critical regions of the acoustic speech signal, i.e., the regions where the signal characteristics of the background (representing non-speech segments) and of the speech signal (typically unvoiced portions) are similar.
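The greater influence of noise on low-energy portions can be illustrated with a short calculation. The following is a minimal sketch, assuming stationary additive noise of fixed power; the power values are hypothetical and chosen only to contrast a high-energy segment with a low-energy one.

```python
import math


def snr_db(signal_power, noise_power):
    """Signal-to-noise ratio in decibels for the given average powers."""
    return 10.0 * math.log10(signal_power / noise_power)


# Hypothetical average powers in arbitrary linear units.
noise_power = 1.0        # stationary background noise
vowel_power = 1000.0     # high-energy voiced segment
fricative_power = 4.0    # low-energy unvoiced segment

snr_vowel = snr_db(vowel_power, noise_power)
snr_fricative = snr_db(fricative_power, noise_power)
```

With these assumed values, the same noise leaves the high-energy segment at about 30 dB but the low-energy segment at only about 6 dB, so the low-energy segment lies much closer to the background in its signal characteristics.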
Thus, what is sought is an effective way to adapt HMM parameters, in an unsupervised mode and during the recognition process, so as to increase discrimination between the background model and the speech models for a particular sentence or set of sentences.