1. Technical Field
The present invention relates generally to automatic speech recognition (ASR) and, more particularly, to the process of tuning a speech recognition engine to improve its accuracy.
2. Description of the Related Art
Speech recognition is an imperfect art. Achieving high accuracy is difficult because of the many variables involved, including differences in microphones, speech accents, and speaker abilities. When automatic speech recognition is performed over a telephone network, the task is even more difficult, owing to the noise and bandwidth limitations imposed on the speech signal.
It is known in the prior art to tune a speech recognition engine to increase the engine's level of accuracy. In the simplest example, speaker adaptation, such tuning is effected in a completely supervised manner, with the user of the system being prompted to read given text over a period of time. During this process, the speech recognizer is adapted to the user's voice. Examples of this approach are found in many commercial products, such as Dragon Dictate. These techniques generally require sessions of several minutes between the user and the system, and they are therefore inappropriate for telephone-based ASR, where most interactions last only a few utterances and the user identity usually cannot be saved for future sessions.
For larger, speaker-independent systems, tuning the recognizer to individual speakers is neither practical nor desirable. The goal of tuning such systems is to arrive at generally applicable models and algorithms. Nor is it possible in these systems to conduct any supervised sessions with the user population. In such cases, ASR providers tune their algorithms using human intervention. In particular, after the recognizer is deployed, a large quantity of speech data is collected. Human listeners then transcribe this speech data. Transcription requires careful and skilled listening to each utterance in the database, as well as excellent typing ability. Using the speech data and the human-provided transcriptions, the ASR provider then tunes the recognition engine as necessary and re-deploys the application. This type of tuning is not economical, and it is often not rapid enough to be useful in deploying large-vocabulary ASR systems. Indeed, as the size of the vocabulary increases, such “supervised” tuning techniques become increasingly inefficient and can fail to bring the system up to the desired level of accuracy within a practical amount of development time and expense. The present invention addresses this problem.