This invention relates to automatic speech recognition and, more particularly, to a method and system for adapting the models used in a speech recognition system to a particular speaker.
This art presumes a basic familiarity with statistics and Markov processes, as well as familiarity with the state of the art in speech recognition systems using Hidden Markov Models. The state of the art was discussed at length in related U.S. patent application Ser. No. 08/276,742 filed Jul. 18, 1994 and that discussion is incorporated herein by reference including the discussion of all prior art references cited.
By way of example of the state of the art in the particular field of adapting speech recognition systems to particular speakers, reference is made to the following patents and publications, which have come to the attention of the inventors in connection with the present invention. Not all of these references may be deemed to be relevant prior art.
______________________________________
Inventor         U.S. Pat. No.    Issue Date
______________________________________
Bahl et al.      4,817,156        03/28/89
Kuroda et al.    4,829,577        05/09/89
______________________________________
Papers
L. R. Bahl, F. Jelinek and R. L. Mercer, "A Maximum Likelihood Approach to Continuous Speech Recognition," IEEE Trans. Pattern Analysis and Machine Intelligence, Vol. PAMI-5(2), pp. 179-190, March 1983.
J. Bellegarda, "Robust Speaker Adaptation Using a Piecewise Linear Acoustic Mapping," Proceedings ICASSP, pp. I-445-I-448, San Francisco, Calif., 1992.
P. Brown, C.-H. Lee and J. Spohrer, "Bayesian Adaptation in Speech Recognition," Proceedings ICASSP, pp. 761-764, Boston, Mass., 1983.
K. Choukri, G. Chollet and Y. Grenier, "Spectral Transformations through Canonical Correlation Analysis for Speaker Adaptation in ASR," Proceedings ICASSP, pp. 2659-2662, Tokyo, Japan, 1986.
S. Furui, "Unsupervised Speaker Adaptation Method Based on Hierarchical Speaker Clustering," Proceedings ICASSP, pp. 286-289, Glasgow, Scotland, 1989.
X. Huang and K.-F. Lee, "On Speaker-Independent, Speaker-Dependent and Speaker-Adaptive Speech Recognition," IEEE Trans. on Speech and Audio Processing, Vol. 1, No. 2, pp. 150-157, April 1993.
B.-H. Juang, "Maximum-Likelihood Estimation for Mixture Multivariate Stochastic Observations of Markov Chains," AT&T Technical Journal, Vol. 64, No. 6, July-August 1985.
C.-H. Lee, C.-H. Lin and B.-H. Juang, "A Study on Speaker Adaptation of the Parameters of Continuous Density Hidden Markov Models," IEEE Trans. on Acoust., Speech and Signal Proc., Vol. ASSP-39(4), pp. 806-814, April 1991.
R. Schwartz, Y. L. Chow and F. Kubala, "Rapid Speaker Adaptation Using a Probabilistic Spectral Mapping," Proceedings ICASSP, pp. 633-636, Dallas, Tex., 1987.
A recent trend in automatic speech recognition systems is the use of continuous-mixture-density Hidden Markov Models (HMMs). A system and method for using HMMs to recognize speech is disclosed in related U.S. patent application Ser. No. 08/276,742 assigned to the assignee of this application. Despite the good recognition performance that HMM systems achieve on average in large vocabulary applications, performance varies widely across individual speakers. Performance can degrade rapidly when the user's speech differs radically from that of the training population, as with a user who speaks with a heavy accent. One technique that can improve the performance and robustness of a speech recognition system is to adapt the system to the speaker and, more generally, to the channel and the task.
Two families of adaptation schemes have been proposed in the prior art. One is based on transforming an individual speaker's feature space so that it "matches" the feature space of the training population. This technique may be generally referred to as the Feature-Space Transformation-based approach (FST). This technique has the advantage of simplicity, and if the number of free parameters in the transformations is small, then this technique has the desirable characteristic of quick adaptation.
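By way of illustration only, the feature-space idea described above can be sketched with a very small transformation. The sketch assumes a per-dimension scale and offset (two free parameters per feature dimension) chosen so that the mean and standard deviation of the speaker's adaptation frames match those of the training population. The function names `fit_diagonal_affine` and `apply_transform` and the moment-matching estimator are hypothetical choices made for this example; they are not taken from any of the cited references.

```python
from statistics import mean, pstdev


def fit_diagonal_affine(speaker_frames, train_mean, train_std):
    """Fit y = scale * x + offset per dimension by matching first and second moments.

    Hypothetical illustration: speaker_frames is a list of feature vectors from
    the new speaker; train_mean and train_std describe the training population.
    """
    scales, offsets = [], []
    for d in range(len(train_mean)):
        column = [frame[d] for frame in speaker_frames]
        spk_mean = mean(column)
        spk_std = pstdev(column)
        # Fall back to a pure shift when the speaker data has no spread.
        scale = train_std[d] / spk_std if spk_std > 0 else 1.0
        scales.append(scale)
        offsets.append(train_mean[d] - scale * spk_mean)
    return scales, offsets


def apply_transform(frame, scales, offsets):
    """Map one speaker feature vector into the training feature space."""
    return [a * x + b for x, a, b in zip(frame, scales, offsets)]
```

Because only two parameters per dimension are estimated, a handful of adaptation frames is enough to fit them, which is the quick-adaptation property noted above.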
The second main family of adaptation methods follows a Bayesian approach of progressively transforming the HMMs so that the models best predict adaptation data from the individual speaker. In a Bayesian approach, model parameters are re-estimated using some prior knowledge of model parameter values. The Bayesian approach usually has desirable asymptotic properties, that is, the performance of the speaker-adaptive system will converge to the performance of a speaker-dependent trained system as the amount of adaptation speech increases. This method has the disadvantage that the adaptation rate is usually slow.
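The asymptotic behavior and the slow adaptation rate can both be seen in a minimal sketch of a Bayesian (maximum a posteriori) update of a single Gaussian mean. The sketch assumes a conjugate Gaussian prior centered on the speaker-independent mean and a known variance; the function name `map_adapt_mean` and the prior weight `tau` are hypothetical and chosen only for this example.

```python
def map_adapt_mean(prior_mean, frames, tau=10.0):
    """Interpolate between the prior mean and the speaker's sample mean.

    Hypothetical illustration: tau acts like a count of "virtual" prior
    observations, so the speaker data dominates only once len(frames) >> tau.
    """
    n = len(frames)
    if n == 0:
        # No adaptation data: keep the speaker-independent mean.
        return list(prior_mean)
    dims = len(prior_mean)
    sums = [sum(frame[d] for frame in frames) for d in range(dims)]
    return [(tau * prior_mean[d] + sums[d]) / (tau + n) for d in range(dims)]
```

With a prior mean of 0.0, `tau` of 10 and a speaker whose frames all equal 2.0, ten frames move the estimate only halfway (to 1.0), while 990 frames bring it to 1.98. The estimate approaches the speaker-dependent value as data accumulates, but only gradually.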
What is needed is a speaker adaptive method and system that has superior performance for individual speakers, including those who speak with very different accents from the training population, but that can adapt quickly to a particular speaker using a small amount of adaptation data.