In most speech recognition systems, the speech input signal undergoes some form of acoustic preprocessing to reduce distortions caused by ambient noise. The actual extraction of speech information is then performed using a statistical speech model. In many cases, hidden Markov models (HMMs) are employed for this purpose. These hidden Markov models correspond to a first-order Markov process, the emission probabilities of which are modeled by a Gaussian mixture model (GMM). The parameters of the GMMs constitute the codebook of the speech recognizer.
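As an illustration of how a GMM yields an emission probability, the following minimal sketch evaluates the log-likelihood of a single feature vector under one mixture. It assumes diagonal covariance matrices, which the text does not specify, and the names `log_gaussian_diag`, `gmm_log_likelihood`, and `codebook`, as well as all numeric values, are hypothetical.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def gmm_log_likelihood(x, weights, means, variances):
    """Log emission probability of x under a Gaussian mixture model."""
    terms = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    # Log-sum-exp keeps the sum over components numerically stable.
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


# Hypothetical codebook entry: two components over 2-dimensional features.
codebook = {
    "weights": [0.6, 0.4],
    "means": [[0.0, 0.0], [3.0, 3.0]],
    "variances": [[1.0, 1.0], [1.0, 1.0]],
}

score = gmm_log_likelihood([0.0, 0.0], **codebook)
```

In this sketch the mixture weights, means, and variances together play the role of the codebook parameters; a feature vector close to a component mean receives a higher score than one far from all components.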
Speaker-independent speech recognizers work quite well in many cases. However, because they are not optimized for particular speakers, the recognition reliability is not always satisfactory.
For this reason, speaker training is sometimes performed to adapt the speech recognizer to a particular person. Often, this is supervised training in which the user undergoes a dedicated training phase. During this training phase, the user speaks a given list of utterances. From these training utterances, a speaker-dependent codebook is created. Such training generally increases the recognition rate for a particular user significantly.
Alternatively, an unsupervised adaptation method may be used. Here, a speaker does not undergo an explicit training phase; rather, the system adapts the codebook using the speech input during use. Conventional unsupervised adaptation methods, however, have the drawback that a speaker change is not always correctly detected. Furthermore, they suffer from ambient noise, which is usually present during use of the speech recognition system. In particular, there is always a risk that the statistical models employed are trained with respect to the acoustic environment and not with respect to the speaker characteristics.
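To make the idea of adapting a codebook during use concrete, the following sketch shows one widely used technique, maximum a posteriori (MAP) adaptation of the GMM means. It is given only as an illustration and is not taken from the text, which does not name a specific conventional method. Diagonal covariances are assumed, and the names `responsibilities`, `adapt_means`, and the `relevance` factor with its default value are hypothetical choices.

```python
import math


def responsibilities(x, weights, means, variances):
    """Posterior probability of each mixture component given frame x."""
    logs = []
    for w, m, v in zip(weights, means, variances):
        ll = math.log(w)
        for xi, mi, vi in zip(x, m, v):
            ll += -0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
        logs.append(ll)
    peak = max(logs)
    exps = [math.exp(l - peak) for l in logs]
    total = sum(exps)
    return [e / total for e in exps]


def adapt_means(frames, weights, means, variances, relevance=16.0):
    """Return MAP-adapted means; weights and variances stay unchanged."""
    k, dim = len(means), len(means[0])
    counts = [0.0] * k
    sums = [[0.0] * dim for _ in range(k)]
    # Accumulate soft counts and weighted feature sums per component.
    for x in frames:
        post = responsibilities(x, weights, means, variances)
        for i, p in enumerate(post):
            counts[i] += p
            for d in range(dim):
                sums[i][d] += p * x[d]
    adapted = []
    for i in range(k):
        # alpha grows with the amount of data assigned to component i.
        alpha = counts[i] / (counts[i] + relevance)
        if counts[i] > 0.0:
            data_mean = [s / counts[i] for s in sums[i]]
        else:
            data_mean = means[i]
        adapted.append(
            [alpha * dm + (1.0 - alpha) * m for dm, m in zip(data_mean, means[i])]
        )
    return adapted
```

Note that this update uses every incoming frame regardless of its origin. Frames dominated by ambient noise, or frames spoken by a different person, shift the means in the same way as clean speech from the intended speaker, which is the weakness of conventional unsupervised adaptation described above.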