While conventional speech recognizers based on hidden Markov models (HMMs) perform well when training and testing conditions match, their accuracy typically drops significantly in unknown operating environments. Some form of speaker or environment adaptation is usually used to combat this degradation. Obtaining adaptation data, however, is often expensive, at least in terms of data collection. Moreover, it is sometimes not possible to gather such data in advance, either because there are simply too many speakers and operating environments, or because they change continuously, as in telephony applications.
Most conventional unsupervised adaptation techniques use hypotheses generated by the speech recognizers as the adaptation transcriptions. For example, one popular unsupervised adaptation technique using this approach is maximum likelihood linear regression (MLLR). A more detailed discussion of the MLLR technique is presented, for example, in an article by C. Leggetter et al. entitled “Speaker Adaptation of Continuous Density HMMs Using Multivariate Linear Regression,” International Conference on Spoken Language Processing, pp. 451–454 (1994), which is incorporated herein by reference. The MLLR approach essentially adapts the mean vectors of HMMs by a set of affine transformation matrices to match speaker-specific testing utterances. Another conventional adaptation technique uses a maximum likelihood neural network (MLNN). The MLNN technique is described in detail, for example, in an article by D. Yuk et al. entitled “Adaptation to Environment and Speaker Using Maximum Likelihood Neural Networks,” Eurospeech, pp. 2531–2534 (September 1999), which is incorporated herein by reference. The MLNN approach can perform a nonlinear transformation of mean vectors and covariance matrices.
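As a rough illustration of the kind of mean adaptation MLLR performs, the following Python sketch applies one affine transformation, written here as A·mu + b, to every Gaussian mean vector. The function names, the toy values, and the use of a single shared transform are assumptions made for illustration only. Estimating A and b from adaptation data is not shown.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def adapt_mean(A: Matrix, b: Vector, mu: Vector) -> Vector:
    """Return A * mu + b for one Gaussian mean vector (hypothetical helper)."""
    return [
        sum(a_ij * m_j for a_ij, m_j in zip(row, mu)) + b_i
        for row, b_i in zip(A, b)
    ]


def adapt_all_means(A: Matrix, b: Vector, means: List[Vector]) -> List[Vector]:
    """Apply the same affine transform to every mean vector in the model."""
    return [adapt_mean(A, b, mu) for mu in means]


# Toy two-dimensional example; all values are illustrative assumptions.
A = [[1.0, 0.5],
     [0.0, 2.0]]
b = [0.5, -1.0]
means = [[1.0, 2.0], [0.0, 0.0]]

adapted = adapt_all_means(A, b, means)
print(adapted)  # [[2.5, 3.0], [0.5, -1.0]]
```

In practice, MLLR may use several such transforms, each shared by a group of Gaussians. The sketch uses one transform only to keep the example short.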
Although the MLLR and MLNN techniques show an improvement in many tasks, they are not suitable for incremental online adaptation for at least the following two reasons. First, because they use a set of matrices or complex neural networks as the transformation functions, all the parameters of those functions must be estimated from the adaptation data in an unsupervised manner, which requires relatively large amounts of data and computation time. Second, even after the parameters have been estimated, the adaptation process may be slow because every mean vector in the recognizer must be transformed.
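To give a sense of scale for these two costs, the following back-of-the-envelope Python sketch counts the free parameters of an affine transform and the multiply-add operations needed to apply it to every mean vector. The feature dimension of 39 and the 100,000 Gaussians are hypothetical figures chosen only for illustration, and the function name is likewise an assumption.

```python
def affine_adaptation_cost(dim: int, num_means: int, num_transforms: int = 1):
    """Estimate the cost of affine mean adaptation (illustrative only).

    Returns a pair (params, multiply_adds):
      params        - free parameters to estimate: each transform has a
                      dim x dim matrix plus a dim-element offset vector
      multiply_adds - multiply-add operations needed to push every mean
                      vector through one dim x dim matrix product
    """
    params = num_transforms * dim * (dim + 1)
    multiply_adds = num_means * dim * dim
    return params, multiply_adds


# Hypothetical recognizer size: 39-dimensional features, 100,000 Gaussians.
params, multiply_adds = affine_adaptation_cost(dim=39, num_means=100_000)
print(params)         # 1560
print(multiply_adds)  # 152100000
```

Under these assumed sizes, a single transform already has more than 1,500 parameters to estimate from unlabeled speech, and each update of the model touches every mean vector.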