The present invention relates to speech recognition. In particular, the present invention relates to speaker adaptation for speaker-independent speech recognition.
Acoustic variance across speakers is one of the main challenges of speaker independent (SI) speech recognition systems. A large amount of research has been conducted on adapting a system to a particular speaker in an attempt to deal with this problem. Conventional hidden Markov model (HMM) adaptation methods used for speaker independent speech recognition can be divided into three families: linear transformation, Bayesian learning, and speaker space. One form of linear transformation is maximum likelihood linear regression (MLLR), as discussed in C. J. Leggetter, P. C. Woodland, “Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models,” Computer Speech and Language, vol. 9, pp. 171-185, 1995. One form of Bayesian learning is maximum a posteriori (MAP) learning, as discussed in C.-H. Lee, C.-H. Lin and B.-H. Juang, “A study on speaker adaptation of the parameters of continuous density hidden Markov models,” IEEE Transactions on Signal Processing, vol. 9, pp. 806-814, 1991. One form of speaker space adaptation involves using eigenvoices, as discussed in R. Kuhn, J. C. Junqua, P. Nguyen and N. Niedzielski, “Rapid speaker adaptation in eigenvoice space,” IEEE Transactions on Speech and Audio Processing, vol. 8, pp. 695-707, 2000. Depending on the amount of data available from a test speaker, adaptation algorithms usually estimate a limited number of parameters to obtain a precise description of the speaker.
Recently, a promising speaker-adaptation method, speaker selection training (SST), has emerged in the literature. See, for example, D. Matrouf, O. Bellot, P. Nocera, G. Linares and J.-F. Bonastre, “A posteriori and a priori transformations for speaker adaptation in large vocabulary speech recognition systems,” in Proc. Eurospeech2001, vol. 2, pp. 1245-1248; M. Padmanabhan, L. Bahl, D. Nahamoo and M. Picheny, “Speaker clustering and transformation for speaker adaptation in speech recognition systems,” IEEE Transactions on Speech and Audio Processing, vol. 6, pp. 71-77, 1998; and S. Yoshizawa, A. Baba, K. Matsunami, Y. Mera, M. Yamada, A. Lee and K. Shikano, “Evaluation on unsupervised speaker adaptation based on sufficient HMM statistics of selected speakers,” in Proc. Eurospeech2001, vol. 2, pp. 1219-1222, 2001. SST selects a subset of cohort speakers from a set of training speakers and builds a speaker-adapted (SA) model based on these cohorts. In general, SST is a two-stage process: cohort speaker selection followed by model generation.
Speaker selection training can make efficient use of very limited adaptation data. For example, given a single adaptation utterance of several seconds, MLLR or MAP can achieve little improvement. However, the same data may be a good enough index for selecting acoustically similar speakers from a pool of training speakers, because speaker recognition can achieve excellent accuracy with a minimal amount of data. See D. A. Reynolds and R. C. Rose, “Robust text-independent speaker identification using Gaussian mixture speaker models,” IEEE Transactions on Speech and Audio Processing, vol. 3, pp. 72-83, 1995.
There are various practical implementations of SST. In the first stage, selecting cohorts, the key issue is to define a similarity measure. In M. Padmanabhan, L. Bahl, D. Nahamoo and M. Picheny, “Speaker clustering and transformation for speaker adaptation in speech recognition systems,” IEEE Transactions on Speech and Audio Processing, vol. 6, pp. 71-77, 1998 (“Padmanabhan”), adaptation data from a test speaker are fed to the speaker-adapted HMMs of all training speakers to calculate a likelihood score that is used as the similarity measure. In D. Matrouf, O. Bellot, P. Nocera, G. Linares and J.-F. Bonastre, “A posteriori and a priori transformations for speaker adaptation in large vocabulary speech recognition systems,” in Proc. Eurospeech2001, vol. 2, pp. 1245-1248 (“Matrouf”), and S. Yoshizawa, A. Baba, K. Matsunami, Y. Mera, M. Yamada, A. Lee and K. Shikano, “Evaluation on unsupervised speaker adaptation based on sufficient HMM statistics of selected speakers,” in Proc. Eurospeech2001, vol. 2, pp. 1219-1222, 2001 (“Yoshizawa”), likelihood scores from a Gaussian mixture model (GMM) are used instead.
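For illustration only, the GMM-based similarity measure mentioned above can be sketched as follows. This is a minimal sketch under stated assumptions, not the procedure of any cited reference: each training speaker is assumed to have a diagonal-covariance GMM stored as a list of (weight, mean, variance) tuples, the score is assumed to be the average per-frame log-likelihood, and the names `select_cohorts`, `speaker_gmms` and `n_cohorts` are hypothetical.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def gmm_avg_log_likelihood(frames, gmm):
    """Average per-frame log-likelihood of the frames under one speaker's GMM.

    gmm is a list of (weight, mean, var) components (assumed layout).
    """
    total = 0.0
    for x in frames:
        # Log-sum-exp over mixture components for numerical stability.
        terms = [math.log(w) + log_gaussian_diag(x, m, v) for w, m, v in gmm]
        peak = max(terms)
        total += peak + math.log(sum(math.exp(t - peak) for t in terms))
    return total / len(frames)

def select_cohorts(frames, speaker_gmms, n_cohorts):
    """Return ids of the n_cohorts training speakers with the highest scores."""
    scores = {
        spk: gmm_avg_log_likelihood(frames, gmm)
        for spk, gmm in speaker_gmms.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:n_cohorts]

# Toy data: three training speakers with two-dimensional features.
speaker_gmms = {
    "spk_a": [(1.0, [0.0, 0.0], [1.0, 1.0])],
    "spk_b": [(0.5, [1.0, 1.0], [1.0, 1.0]), (0.5, [1.2, 0.8], [1.0, 1.0])],
    "spk_c": [(1.0, [5.0, 5.0], [1.0, 1.0])],
}
adaptation_frames = [[0.9, 1.1], [1.1, 0.9], [1.0, 1.0]]
cohorts = select_cohorts(adaptation_frames, speaker_gmms, n_cohorts=2)
```

In this toy setting the adaptation frames lie closest to the components of `spk_b`, so that speaker ranks first. A practical system would score real acoustic feature vectors against GMMs trained on each training speaker's data.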
In the second stage, model generation, there are various options, such as HMM retraining, MAP adaptation, data transformation and model combination. These options are described in Matrouf, Padmanabhan, and Yoshizawa.
However, retraining a speaker independent model by using data from selected cohorts is very time-consuming. Model combination is much faster, because only pre-calculated statistics are used in techniques such as that in the Yoshizawa paper.
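The speed advantage of model combination can be illustrated with a short sketch. It assumes, hypothetically, that for a given Gaussian each training speaker's occupancy count, first-order sum and second-order sum have been stored in advance, so that a cohort model is obtained by pooling these numbers rather than by revisiting the training audio. The function name `combine_gaussian` and the dictionary layout are illustrative assumptions and are not taken from the Yoshizawa paper.

```python
def combine_gaussian(stats_list):
    """Pool per-speaker sufficient statistics for one diagonal Gaussian.

    Each entry holds 'count' (occupancy), 'sum_x' (per-dimension sum of
    features) and 'sum_x2' (per-dimension sum of squared features).
    Returns the pooled (mean, variance) per dimension.
    """
    total_count = sum(s["count"] for s in stats_list)
    dim = len(stats_list[0]["sum_x"])
    mean, var = [], []
    for d in range(dim):
        sum_x = sum(s["sum_x"][d] for s in stats_list)
        sum_x2 = sum(s["sum_x2"][d] for s in stats_list)
        m = sum_x / total_count
        mean.append(m)
        # Variance as E[x^2] - (E[x])^2 from the pooled statistics.
        var.append(sum_x2 / total_count - m * m)
    return mean, var

# Toy statistics for two selected cohort speakers (one-dimensional features).
cohort_stats = [
    {"count": 2.0, "sum_x": [2.0], "sum_x2": [4.0]},   # frames 0 and 2
    {"count": 2.0, "sum_x": [6.0], "sum_x2": [20.0]},  # frames 2 and 4
]
mean, var = combine_gaussian(cohort_stats)
```

The pooled result equals the mean and variance that would be computed directly from the four underlying frames (0, 2, 2, 4), yet only the stored statistics are touched, which is why this style of combination avoids the cost of retraining.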