A number of digitization techniques have been developed to convert input information into a computer-readable format. Automatic speech recognition (ASR) systems, for example, convert speech into text. In addition, optical character recognition (OCR) systems and automatic handwriting recognition (AHR) systems convert the textual portions of a document into a computer-readable format. In each case, the input information, such as speech segments or character segments, is recognized as strings of words or characters in some computer-readable format, such as ASCII.
Generally, a speech recognition engine, such as the ViaVoice™ speech recognition system, commercially available from IBM Corporation of Armonk, N.Y., generates a textual transcript using a combination of acoustic and language model scores to determine the best word or phrase for each portion of the input audio stream. Speech recognition systems are typically guided by three components, namely, a vocabulary, a language model and a set of pronunciations for each word in the vocabulary. A vocabulary is a set of words that is used by the recognizer to translate speech to text. As part of the recognition process, the recognizer matches the acoustics from the speech input to words in the vocabulary. Therefore, the vocabulary defines the words that can be transcribed.
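By way of illustration only, the selection of a best word from combined acoustic and language model scores can be sketched as follows. The function name `pick_best_word`, the weighted log-score combination, the `lm_weight` parameter and all numeric scores are assumptions made for this sketch; they are not taken from the ViaVoice™ system or any other particular recognizer.

```python
def pick_best_word(candidates, vocabulary, lm_weight=1.0):
    """Return the in-vocabulary candidate with the highest combined score.

    candidates maps each hypothesized word to a pair
    (acoustic_log_score, lm_log_score). The weighted sum used here is an
    assumed combination rule, chosen only for illustration.
    """
    combined = {
        word: acoustic + lm_weight * lm
        for word, (acoustic, lm) in candidates.items()
        if word in vocabulary  # only vocabulary words can be transcribed
    }
    if not combined:
        return None
    return max(combined, key=combined.get)


# Hypothetical scores for one portion of the audio stream.
vocabulary = {"two", "too", "to"}
candidates = {"two": (-4.0, -2.5), "too": (-3.9, -1.0), "to": (-4.2, -0.5)}
best = pick_best_word(candidates, vocabulary)
```

In this sketch, the acoustic scores alone slightly favor "too", while the language model scores shift the selection to "to", which shows how the two score types jointly determine the transcribed word.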
A language model is a domain-specific database of sequences of words in the vocabulary, together with a set of probabilities of the words occurring in a specific order. When the language model is operative, the output of the recognizer is biased towards the high-probability word sequences. Thus, correct speech recognition is a function of whether the user speaks a sequence of words that has a high probability within the language model; when the user speaks an unusual sequence of words, speech recognition performance will degrade. Recognition of an individual word is based entirely on the pronunciation of the word, i.e., its phonetic representation. For best accuracy, domain-specific language models must be used. The creation of such a language model requires large textual corpora to compute probabilities of word histories. The quality of a language model can vary greatly depending, for example, on the size of the training corpus and on how well the training corpus fits the domain in which the speech recognition is performed.
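As a minimal sketch of how word-sequence probabilities may be derived from a training corpus, the following example builds a bigram model with add-one smoothing. The bigram order, the smoothing method, the function names and the two-sentence corpus are all assumptions chosen for brevity; practical language models are trained on far larger corpora and may use different estimation techniques.

```python
import math
from collections import Counter


def train_bigram_model(corpus):
    """Count word histories and bigrams in a list of tokenized sentences."""
    histories, bigrams, vocab = Counter(), Counter(), set()
    for sentence in corpus:
        tokens = ["<s>"] + sentence  # "<s>" marks the sentence start
        vocab.update(sentence)
        histories.update(tokens[:-1])
        bigrams.update(zip(tokens, tokens[1:]))
    return histories, bigrams, vocab


def sequence_log_prob(words, histories, bigrams, vocab):
    """Add-one smoothed log-probability of a word sequence."""
    tokens = ["<s>"] + words
    total = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        numerator = bigrams[(prev, cur)] + 1
        denominator = histories[prev] + len(vocab)
        total += math.log(numerator / denominator)
    return total


# Hypothetical, tiny domain-specific training corpus.
corpus = [
    ["the", "patient", "has", "a", "fever"],
    ["the", "patient", "has", "a", "cough"],
]
histories, bigrams, vocab = train_bigram_model(corpus)
usual = sequence_log_prob(["the", "patient", "has", "a", "fever"],
                          histories, bigrams, vocab)
unusual = sequence_log_prob(["fever", "a", "has", "patient", "the"],
                            histories, bigrams, vocab)
```

Under these assumptions, the word order seen in the training corpus receives a higher log-probability than the same words in an unusual order, which corresponds to the bias towards high-probability word sequences described above.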
While such domain-specific language models improve the accuracy of speech recognition engines, the accuracy of the transcribed text can nonetheless be degraded due to certain speech characteristics, such as fast speech, speech with background noise or speech with background music. Generally, conventional transcription processes utilize a single speech recognizer for all speech. Fast speech, however, contributes to additional errors in the transcription process. It is difficult to segment fast speech properly, since the time metrics vary for different speakers and words. Similar problems have been observed for other types of speech characteristics as well, such as speech with background noise and speech with music. For a discussion of the impact of such speech characteristics on the transcription process, see, for example, Matthew A. Singer, “Measuring and Compensating for the Effects of Speech Rate in Large Vocabulary Continuous Speech Recognition,” Thesis, Carnegie Mellon University (1995), incorporated by reference herein.
Thus, if input speech has certain characteristics that may degrade the transcription process, certain words or phrases may be improperly identified. A need therefore exists for a digitization system that reduces the error rate by using recognition techniques that have improved performance for certain characteristics on subsets of the input information that exhibit such characteristics.