The present invention relates generally to speech recognition and more particularly to speaker adaptation techniques for adapting the speech model of a speech recognizer to the speech of an individual user. The adaptation technique is unsupervised; that is, the adaptation system does not have a priori knowledge of the adaptation speech. Adaptation data is extracted from the adaptation speech using the speaker independent models available to the recognizer.
Speech recognizers in popular use today employ speech models that contain data derived from training speakers. In many cases training speech from these speakers is collected in advance, and used to generate speaker independent models representing a cross section of the training speaker population. Later, when the speech recognizer is used, data extracted from speech of the new speaker is compared with the speaker independent models and the recognizer identifies the words in its lexicon that represent the best match between the new speech and the speech models.
More often than not, the new speaker did not participate in the original speech model training. Thus the new speaker's speech may not be accurately represented by the speaker independent speech models. If the new speaker's speech patterns are sufficiently similar to those of the training population, then the recognizer will do a reasonably good job of recognizing speech provided by that new speaker. However, if the new speaker has a strong regional accent or other speech idiosyncrasies not reflected in the training population, then recognition accuracy falls off significantly.
To enhance the reliability of the speech recognizer, many recognition systems implement an adaptation process whereby adaptation speech is provided by the new speaker, and that adaptation speech is used to adjust the speech model parameters so that they more closely represent the speech of the new speaker. Some systems require a significant quantity of adaptation speech. New speakers are instructed to read long passages of text so that the adaptation system can extract the adaptation data needed to adapt the speech models. Having the new speaker read text that is known by the adaptation system in advance is referred to as "supervised" adaptation. It is generally easier to devise adapted models under supervised conditions because the adaptation system knows what to expect and can more readily ascertain how the new speaker's utterance differs from the expected utterance.
However, in many applications it is not convenient for the new speaker to participate in a lengthy adaptation session, and in some applications it is simply not feasible to ask the user to speak any adaptation sentences before using the system. These applications thus dictate "unsupervised" adaptation.
Performing unsupervised adaptation is considerably more difficult because the content of the adaptation data is not known in advance. More precisely, transcriptions of the adaptation data (labels associated with the adaptation data) are not known in advance. The recognizer must therefore attempt to provide its own transcriptions of the input utterance using its existing speech models. Depending on the quality of the models used to recognize speech, many errors can be introduced into the transcriptions. These errors, in turn, may propagate through the adaptation system, resulting in adapted speech models that do not accurately reflect the new speaker's speech. The adapted models may be no better than the speaker independent models, or they may even be worse.
The present invention provides a vehicle for speaker adaptation that is particularly well-suited to the task of unsupervised adaptation where only a small amount of adaptation data has been provided. Using the invention, adaptation data is supplied to the recognizer, which generates the N-best solutions (instead of merely generating the single best solution). These N-best solutions are then processed to extract reliable information by means of either a weighting technique or a non-linear threshold technique. This reliable information is then used to modify how the model adaptation system operates on the speech models. The speech models can be adapted using a variety of techniques, including transformation-based adaptation techniques such as Maximum Likelihood Linear Regression (MLLR) and Bayesian techniques such as Maximum A Posteriori (MAP).
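By way of illustration only, the two ways of extracting reliable information from the N-best solutions might be sketched as follows. The sketch assumes that each solution carries a log-likelihood score. The function names, the exponential weighting formula, and the `scale` and `margin` parameters are hypothetical choices made for this example and are not taken from the specification.

```python
import math
from typing import List, Tuple

# Each N-best solution: (transcription, log-likelihood score).
# This representation is an assumption made for the sketch.
Hypothesis = Tuple[str, float]


def soft_weights(nbest: List[Hypothesis], scale: float = 1.0) -> List[float]:
    """Weighting technique (illustrative): give each solution a weight that
    decays exponentially with its score distance from the best solution,
    normalized so that the weights sum to 1."""
    best = max(score for _, score in nbest)
    raw = [math.exp(scale * (score - best)) for _, score in nbest]
    total = sum(raw)
    return [r / total for r in raw]


def threshold_weights(nbest: List[Hypothesis], margin: float = 2.0) -> List[float]:
    """Non-linear threshold technique (illustrative): keep a solution
    (weight 1.0) only if its score lies within `margin` of the best score;
    otherwise discard it (weight 0.0)."""
    best = max(score for _, score in nbest)
    return [1.0 if best - score <= margin else 0.0 for _, score in nbest]
```

In either case the resulting weights could then determine how strongly each solution's transcription contributes to the statistics used by an MLLR or MAP update, so that unreliable transcriptions have little or no influence on the adapted models.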
Although the reliable information extracted from the N-best solutions can be used in a single pass model adaptation system, the technique can also be performed iteratively. The iterative technique derives a first adapted model, as outlined above, and then uses the adapted model in a subsequent recognition cycle performed upon the adaptation data. The adaptation cycle can be repeated multiple times. Each time, the N-best solutions are determined and reliable information is extracted from those solutions to adjust how the model adaptation process is performed. A convergence testing mechanism monitors the N-best solutions to determine when to halt further iterations.
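The iterative procedure may be pictured with the following minimal sketch. The callables `recognize_nbest` and `adapt_model` are hypothetical stand-ins for the recognizer and the model adaptation system. The convergence test shown here, which halts when the N-best transcriptions no longer change between cycles, is only one assumed criterion, as is the `max_iters` safeguard.

```python
from typing import Callable, List, Optional, Tuple, TypeVar

Model = TypeVar("Model")
NBest = List[Tuple[str, float]]  # (transcription, score), assumed format


def adapt_iteratively(
    model: Model,
    recognize_nbest: Callable[[Model], NBest],
    adapt_model: Callable[[Model, NBest], Model],
    max_iters: int = 10,
) -> Tuple[Model, int]:
    """Repeat recognition and adaptation on the same adaptation data until
    the N-best transcriptions stop changing (assumed convergence test) or
    max_iters cycles have run. Returns the model and the cycle count."""
    previous: Optional[List[str]] = None
    cycles = 0
    for _ in range(max_iters):
        # Recognize the adaptation data with the current model.
        nbest = recognize_nbest(model)
        labels = [text for text, _ in nbest]
        # Halt when the N-best list is identical to the previous cycle's.
        if labels == previous:
            break
        # Otherwise adapt the model using the current N-best solutions.
        model = adapt_model(model, nbest)
        previous = labels
        cycles += 1
    return model, cycles
```

Other convergence tests, such as monitoring the change in N-best scores rather than in the transcriptions themselves, would fit the same loop structure.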
For a more complete understanding of the invention, its objects and advantages, refer to the following specification and to the accompanying drawings.