In a typical speech recognition application, the user speaks into an input device such as a microphone or telephone set. If valid speech is detected, the speech recognition layer is invoked in an attempt to recognize the unknown utterance. In a commonly used approach, a first-pass search uses a fast match algorithm to select the top N orthography groups from a speech recognition dictionary. In a second pass, the individual orthographies from the selected groups are re-scored using computations on more precise speech models. The top orthography in each of the top two groups is then processed by a rejection algorithm that evaluates whether they are sufficiently distinct from one another for the top-choice candidate to be considered a valid recognition.
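The two-pass search and rejection step described above can be sketched as follows. This is a minimal illustration only: the function name `recognize`, the scoring callables, and the fixed score-margin rejection test are assumptions introduced here, not the rejection algorithm of any particular system.

```python
from typing import Callable, Dict, List, Optional, Tuple


def recognize(
    groups: Dict[str, List[str]],
    fast_score: Callable[[str], float],
    precise_score: Callable[[str], float],
    n_best: int = 3,
    min_margin: float = 0.1,
) -> Optional[str]:
    """Return the top orthography, or None if the result is rejected.

    groups maps a group name to its orthographies. fast_score rates a
    group name, precise_score rates an orthography; higher is better.
    The margin test is a hypothetical stand-in for a rejection algorithm.
    """
    # First pass: rank groups with the cheap score and keep the top N.
    top_groups = sorted(groups, key=fast_score, reverse=True)[:n_best]

    # Second pass: re-score every orthography in the kept groups with the
    # more precise score and keep the best orthography of each group.
    best_per_group: List[Tuple[float, str]] = []
    for name in top_groups:
        best = max(groups[name], key=precise_score)
        best_per_group.append((precise_score(best), best))
    best_per_group.sort(reverse=True)

    if not best_per_group:
        return None
    if len(best_per_group) == 1:
        return best_per_group[0][1]

    # Rejection: accept the top choice only if it is sufficiently distinct
    # from the best orthography of the runner-up group.
    (top_score, top_word), (second_score, _) = best_per_group[:2]
    return top_word if top_score - second_score >= min_margin else None
```

In this sketch a larger `min_margin` rejects more utterances, trading missed recognitions for fewer false acceptances.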
Speech recognition systems fall into two distinct categories, namely speaker-specific (or speaker-dependent) and speaker-independent. These categories differ primarily in the manner in which the systems are trained and used.
Training of a speech recognition system establishes a reference memory and speech models to which speech labels are assigned. For speaker-independent systems, training is performed by collecting samples from a large pool of users. For a given speech recognition task, a speaker-specific (SS) system generally performs better than a speaker-independent (SI) system. Typically, a speaker-independent system uses a single speech model set for all speakers, while in a speaker-specific system each user is assigned a respective speech model set. Speaker-specific systems are trained by collecting samples from the end user. For example, a voice dictation system, in which a user speaks and the device converts the spoken words into text, will most likely be trained by the end user (speaker-specific), since this manner of training can achieve higher recognition performance. In the event that someone other than the original user wants to use the same device, the device can be retrained, or an additional set of models can be trained and stored for the new user. When the data needed to train a speaker-specific system is not readily available, speaker-independent systems tend to be used instead. In addition, as the number of users becomes large, storing a separate speaker-specific speech model set for each user becomes prohibitive in terms of memory requirements. Therefore, as the number of users becomes large, speech recognition systems tend to be speaker-independent.
A common approach to improving the performance of speaker-independent speech recognition systems is adaptation: adjusting either the speech models or the features in a manner appropriate to the current speaker and environment. A typical adaptation technique is model adaptation. Generally, speaker adaptation starts with speaker-independent speech models derived from one or more speakers and then, based on a small amount of speech from a new speaker, creates new speaker-dependent models so that the recognition of the new speaker is improved. For a more detailed explanation of model adaptation, the reader is invited to consult R. Schwartz and F. Kubala, Hidden Markov Models and Speaker Adaptation, Speech Recognition and Understanding: Recent Advances, Eds: P. Laface and R. De Mori, Springer-Verlag, 1992; L. Neumeyer, A. Sankar and V. Digalakis, A Comparative Study of Speaker Adaptation Techniques, Proc. of EuroSpeech '95, pp. 1127-1130, 1995; J.-L. Gauvain, C.-H. Lee, Maximum a Posteriori Estimation for Multivariate Gaussian Mixture Observations of Markov Chains, IEEE Trans. on Speech and Audio Processing, Vol. 2, April 1994, pp. 291-298; and C. J. Leggetter, P. C. Woodland, Maximum Likelihood Linear Regression for Speaker Adaptation of Continuous Density Hidden Markov Models, Computer Speech and Language, Vol. 9, 1995, pp. 171-185. The content of these documents is hereby incorporated by reference.
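As an illustration of model adaptation, the following sketch applies a maximum a posteriori style update to the mean of a single Gaussian. It is a simplified example under stated assumptions: every adaptation frame is assumed to be assigned to this one Gaussian, only the mean is adapted, and the function name `map_adapt_mean` and the prior weight `tau` are hypothetical names introduced here.

```python
from typing import List, Sequence


def map_adapt_mean(
    prior_mean: Sequence[float],
    frames: Sequence[Sequence[float]],
    tau: float = 10.0,
) -> List[float]:
    """Interpolate a speaker-independent mean with new-speaker data.

    prior_mean is the speaker-independent mean, frames are feature
    vectors from the new speaker (all assumed to belong to this
    Gaussian), and tau weights the prior against the data.
    """
    dim = len(prior_mean)
    count = len(frames)

    # Accumulate the per-dimension sum of the adaptation frames.
    sums = [0.0] * dim
    for frame in frames:
        for d in range(dim):
            sums[d] += frame[d]

    # With few frames the result stays close to the prior mean; with
    # many frames it approaches the sample mean of the new speaker.
    return [(tau * prior_mean[d] + sums[d]) / (tau + count) for d in range(dim)]
```

With no adaptation frames the function returns the prior mean unchanged, so the adapted model only moves away from the speaker-independent model as adaptation data accumulates.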
A deficiency in the above-described methods is that they require a relatively large amount of adaptation data, since the amount of data needed generally grows with the number of model parameters to be estimated. While humans seem to be able to adapt to a new speech environment in just a few syllables, such speech recognition system adaptation requires considerably more adaptation data, which may not be available.
A common approach to reducing the amount of adaptation data required makes use of a single set of transformation parameters shared by a set of speech models. In this manner, the reduced number of transformation parameters permits effective adaptation with a reduced amount of data. A deficiency in tying the transformation parameters is that any model-specific transformation will not be reflected in the single set of parameters.
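A tied transformation of this kind can be sketched as follows. The example assumes that the shared transformation is an affine transform (matrix `A` and offset `b`) applied to the mean vector of every model; the function name and the data layout are hypothetical, and the estimation of `A` and `b` from adaptation data is not shown.

```python
from typing import Dict, List, Sequence


def apply_shared_transform(
    means: Dict[str, Sequence[float]],
    A: Sequence[Sequence[float]],
    b: Sequence[float],
) -> Dict[str, List[float]]:
    """Apply one affine transform, mean' = A * mean + b, to every model."""
    adapted: Dict[str, List[float]] = {}
    for name, mu in means.items():
        # The same A and b are reused for all models, so only
        # len(b) * (len(mu) + 1) parameters are needed in total.
        adapted[name] = [
            sum(A[i][j] * mu[j] for j in range(len(mu))) + b[i]
            for i in range(len(b))
        ]
    return adapted
```

Because every model receives the same `A` and `b`, a shift that is appropriate for only one model cannot be represented, which is the deficiency noted above.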
Another adaptation method is described in Kuhn R. et al. (1998), “Eigenvoices for speaker adaptation,” Proc. ICSLP '98, vol. 5, Sydney, pp. 1771-1774. The content of this document is hereby incorporated by reference. This adaptation method requires less adaptation data than the methods mentioned previously. A deficiency of the method presented by Kuhn R. et al. (1998) is that it provides improved performance mainly for speakers for whom training data is available: the acoustic characteristics of specific speakers who are not part of the training data are generally not captured by this method.
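The idea of constraining a new speaker's model to a space spanned by a few basis vectors derived from training speakers can be sketched as follows. This is a simplified illustration and not the estimator of the cited paper: it assumes orthonormal basis vectors, assumes a rough estimate `observed` of the new speaker's parameter vector is available, and obtains the weights by plain projection. All names are hypothetical.

```python
from typing import List, Sequence, Tuple


def eigenvoice_adapt(
    mean_voice: Sequence[float],
    eigenvoices: Sequence[Sequence[float]],
    observed: Sequence[float],
) -> Tuple[List[float], List[float]]:
    """Project a rough speaker estimate onto the eigenvoice space.

    Returns the weights and the adapted parameter vector
    mean_voice + sum_k weights[k] * eigenvoices[k].
    Assumes the eigenvoices are orthonormal.
    """
    # Remove the average voice before projecting.
    centered = [o - m for o, m in zip(observed, mean_voice)]

    # One weight per eigenvoice: far fewer values to estimate than the
    # full parameter vector, hence the reduced data requirement.
    weights = [sum(c * e for c, e in zip(centered, ev)) for ev in eigenvoices]

    adapted = list(mean_voice)
    for w, ev in zip(weights, eigenvoices):
        for d in range(len(adapted)):
            adapted[d] += w * ev[d]
    return weights, adapted
```

Any component of `observed` that lies outside the span of the eigenvoices is discarded by the projection, which illustrates why characteristics of speakers unlike those in the training data are not captured.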
Consequently, there is a need in the industry for a method and apparatus providing a speech recognition system capable of adapting speech models on the basis of a minimal amount of data, preferably from the first word uttered by a given speaker.