1. Technical Field of the Invention
The present invention relates to speech recognition systems and, more particularly, to speaker adaptation using feedback.
2. Background Art
Speech recognition systems that use only Speaker Independent (SI) models are very sensitive to differences between speakers, because speaker characteristics vary. SI models typically use a Hidden Markov Model (HMM). Speaker adaptation is a process that adapts an SI model into a speaker dependent (SD) model in order to capture the physical characteristics of a given speaker. Speaker adaptation techniques can be used in supervised or unsupervised mode. In supervised mode, the correct transcription is known, while in unsupervised mode, no correct transcription is available.
For reliable and robust speaker adaptation, large amounts of adaptation data are often required in order to cover the linguistic units of a given language. However, for most practical applications, only a limited amount of adaptation data is available, so efficient use of the adaptation data becomes extremely important. The traditional adaptation schemes treat all the adaptation data indiscriminately, which results in some parts of the adaptation data being relatively under-trained or under-weighted. Usually, the underrepresented words are less likely to be recognized by the decoder.
The traditional adaptation scheme is as follows:
1. Given some adaptation enrollment data and an SI model, collect statistics on the enrollment data and perform speaker adaptation on the SI model.
2. Decode the test utterances with the adapted acoustic model.
Such a scheme uses the enrollment data only once and does not incorporate any feedback from decoding. It is fast in practice, but does not always provide good performance.
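The two steps above can be illustrated with a deliberately simplified sketch. It is not the method of any cited article or of this disclosure. It assumes each linguistic unit is modeled by a single one-dimensional Gaussian mean rather than an HMM, that the enrollment data is supervised (each feature comes with its correct unit label), and that adaptation is a MAP-style interpolation between the SI mean and the enrollment data. The names `collect_stats`, `adapt_means`, `decode`, and the prior weight `tau` are hypothetical.

```python
from collections import defaultdict

def collect_stats(enrollment):
    """Step 1a: accumulate a count and a feature sum for each labeled unit."""
    stats = defaultdict(lambda: [0, 0.0])
    for unit, x in enrollment:
        stats[unit][0] += 1
        stats[unit][1] += x
    return stats

def adapt_means(si_means, stats, tau=2.0):
    """Step 1b: MAP-style mean update; tau (assumed) weights the SI prior.

    Assumes every unit seen in enrollment also exists in si_means.
    """
    adapted = dict(si_means)
    for unit, (n, total) in stats.items():
        adapted[unit] = (tau * si_means[unit] + total) / (tau + n)
    return adapted

def decode(means, x):
    """Step 2: toy decoder that picks the unit whose mean is closest to x."""
    return min(means, key=lambda u: abs(x - means[u]))

# Toy SI model and enrollment data (illustrative values only).
si_means = {"a": 0.0, "b": 4.0}
enrollment = [("a", 1.5), ("a", 1.7), ("a", 1.6), ("b", 5.4)]

# The enrollment data is used exactly once; nothing from decoding is fed back.
sd_means = adapt_means(si_means, collect_stats(enrollment))

before = decode(si_means, 2.6)  # decision under the SI model
after = decode(sd_means, 2.6)   # decision under the adapted model
```

In this toy data, unit "b" has a single enrollment sample, so its mean moves less toward the speaker's data than the mean of unit "a", which loosely mirrors how sparsely covered units stay under-trained when all data is treated alike.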
Approaches to speaker adaptation include those described in J. L. Gauvain et al., “Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov Chain,” IEEE Trans. on Speech and Audio Processing, Vol. 2, pp. 291-298; L. R. Bahl et al., “A New Algorithm for the estimation of Hidden Markov Model Parameters,” IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 493-496, 1988; and C. L. Leggetter et al., “Maximum likelihood linear regression for speaker adaptation of continuous density HMMs,” Computer Speech and Language, Vol. 9, pp. 171-185, 1995. Some of these approaches do not consider errors made in recognizing a particular speaker's utterances. In a “corrective training” approach, such as that of the above-recited L. R. Bahl et al. article, an error in recognition of the utterance may be considered, but a very complicated technique is used to compensate for it. Background on expectation maximization (EM) and maximum likelihood (ML) estimation is provided in A. P. Dempster et al., “Maximum likelihood from incomplete data via the EM algorithm,” Journal of the Royal Statistical Society, Series B 39, pp. 1-38, 1977; and N. Laird, “The EM algorithm,” Handbook of Statistics, Vol. 9, Elsevier Science Publishers B. V., 1993.
One iterative technique in speech recognition is to recognize utterances using an SI model, create an SD model from the recognition results, apply that SD model to recognize the utterances again so as to create a more refined SD model, and so forth.
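The iterative loop described above might be sketched as follows. This is a minimal illustration under strong assumptions and not an implementation from any cited work: each unit is a single one-dimensional Gaussian mean instead of an HMM, the decoder picks the nearest mean, the decoder's own hypotheses serve as the transcription (unsupervised mode), and each round re-adapts from the SI means with a MAP-style update. The names `decode`, `reestimate`, `iterative_adapt`, `tau`, and `max_iters` are hypothetical.

```python
def decode(means, x):
    """Toy decoder: choose the unit whose mean is closest to the feature x."""
    return min(means, key=lambda u: abs(x - means[u]))

def reestimate(si_means, labeled, tau=2.0):
    """Build an SD model from (unit, feature) pairs, with the SI means as prior."""
    sums = {u: [0, 0.0] for u in si_means}
    for u, x in labeled:
        sums[u][0] += 1
        sums[u][1] += x
    return {u: (tau * si_means[u] + s) / (tau + n) for u, (n, s) in sums.items()}

def iterative_adapt(si_means, utterances, max_iters=10):
    """Alternate recognition and SD re-estimation until hypotheses stop changing."""
    model = dict(si_means)  # the first pass recognizes with the SI model
    hyps = None
    for _ in range(max_iters):
        new_hyps = [decode(model, x) for x in utterances]
        if new_hyps == hyps:  # no hypothesis changed, so the model is stable
            break
        hyps = new_hyps
        model = reestimate(si_means, zip(hyps, utterances))
    return model, hyps

# Toy data (illustrative values only).
si_means = {"a": 0.0, "b": 4.0}
utterances = [1.5, 1.7, 1.6, 2.4, 5.4, 5.0]
sd_model, hyps = iterative_adapt(si_means, utterances)
```

With this toy data, the fourth utterance is first recognized as "b" by the SI model and is relabeled "a" once the first SD model is applied, after which the hypotheses stop changing. Re-adapting from the SI means in every round, rather than from the previous SD model, is one possible design choice; the text above does not specify which is intended.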
There is a need for improved techniques for speaker adaptation. Such improved techniques are described in this disclosure.