The present invention relates generally to a system and method for automatic speech recognition and, more specifically, to a system and method for automatically identifying, predicting, and implementing desired edits to the output of automatic speech recognition applications.
Even when there is no speech recognition error, natural speech does not always correspond to the desired content and format of written documents. Mismatches between the spoken input and the desired document arise from speech recognition errors, from differing conventions for spoken and formal written language, and from modifications made during the editing and proofreading process. These mismatches also are often repetitive.
Conventional speech recognition systems interpret speech by applying a variety of speech models, including acoustic models (AM) and linguistic models (LM). These speech models are essentially statistical models based on the combination of patterns of sounds, words, and even phrases. AMs are based on particular patterns of sounds or other acoustic units, while LMs are based on specific patterns of words or phrases.
Because natural speech does not always correspond closely to conventional speech models, typical speech recognition systems are prone to errors that must later be corrected. These errors often are attributable to speaker-related phenomena. As such, many errors in the speech recognition process are repetitive. That is, speech recognition systems tend to commit the same errors with certain words or phrases on a consistent basis.
Some errors and mismatches between speech and written output are attributable to the user's inability to speak the native language or to differences between the conventions of written and dictated language styles. These errors and mismatches recur as the user continues to repeat words or phrases that fail to match the acoustic and linguistic models or the written language style. For example, a user speaking Cantonese, a dialect of Chinese, inherently will trigger certain errors as the speech recognition software attempts to reconcile the Cantonese dialect with standard Chinese.
Other commonly repeated errors or mismatches arise from the industry in which the speech recognition engine is used. Speakers in technical industries typically have frequently used terms or jargon that may not appear in ordinary conversation and, therefore, are not readily understood by speech recognition systems. Other such jargon may be correctly recognized but may not be appropriate for final documents. As these terms are common to a particular industry, the speech recognition system continues either to misinterpret the terms or to print jargon that requires more formal wording, thereby propagating the same errors or mismatches throughout the interpreted speech. For instance, the medical or health care industry has scores of specialized terms not found in conversational language. The acoustic and linguistic models applied by the speech recognition system may lead to the improper interpretation of certain industry-specific terms. Alternatively, speakers may use shorthand or a telegraphic style in speech that must be written out more explicitly in final reports. As these terms may be used numerous times during the transcription of medical records, the errors and mismatches from the speech recognition system will be repeated document after document.
Still other recurrent errors arise from limitations in the speech recognition system itself, including both the speech recognition device and speech recognition applications. As a speech recognition system uses specific devices with specific applications, which are based on specific acoustic and linguistic models, any words or phrases that are improperly interpreted by the speech recognition system may be improperly interpreted on subsequent occasions, thereby repeating the same error.
A number of improved speech recognition systems have been developed; however, these systems have had limited success. For instance, several systems have been developed with more robust speech recognition models in an effort to eliminate errors altogether. These improved speech recognition systems use so-called larger “N-grams” in place of the trigrams used by more conventional acoustic and linguistic models to detect and interpret speech commands. Larger N-grams are more comprehensive than trigrams and, as such, consume considerably more space in the system's memory. Yet even the most advanced speech models, such as those with larger N-grams, provide only marginally improved speech recognition capabilities, as these models reduce only those errors stemming from the speech recognition device itself. Mismatches and errors resulting from the user and the industry continue to occur repeatedly, as larger N-grams do not address these limitations.
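The memory cost of larger N-grams noted above can be illustrated with a short calculation. The following Python sketch is illustrative only: it computes the worst-case number of distinct N-grams over a vocabulary, which is the vocabulary size raised to the power N. The function name `possible_ngrams` and the vocabulary size of 20,000 words are assumptions chosen for the example and are not taken from any particular system.

```python
def possible_ngrams(vocab_size: int, n: int) -> int:
    """Worst-case number of distinct n-grams: every ordered sequence of n words."""
    return vocab_size ** n


# Assumed vocabulary size, chosen only to make the arithmetic concrete.
ASSUMED_VOCAB_SIZE = 20_000

trigram_bound = possible_ngrams(ASSUMED_VOCAB_SIZE, 3)
fivegram_bound = possible_ngrams(ASSUMED_VOCAB_SIZE, 5)

print(f"trigram upper bound: {trigram_bound:.1e}")
print(f"5-gram upper bound:  {fivegram_bound:.1e}")
print(f"growth factor:       {fivegram_bound // trigram_bound:.1e}")
```

A practical model stores only the N-grams actually observed in its training text, which is far fewer than this bound, but the calculation suggests why the storage requirement rises steeply as N increases.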
Many speech recognition systems have attempted to advance the art by learning from the specific user. By comparing the user's speech against known expressions, these systems are able to adjust or improve upon conventional speech models. In doing so, the speech recognition system can fine-tune the speech models to the specific user or industry, thereby reducing future errors and mismatches. This process, often referred to as learning from so-called “positive evidence,” has had only modest success. Most notably, learning from positive evidence is a slow process, requiring considerable training. Additionally, specific errors or mismatches may continue to be repeated because the speech recognition system modifies the speech models based only on the positive evidence and does not address specific mismatches, errors, or types of errors.
There are relatively few speech recognition systems that are adapted to learn from so-called “negative evidence.” That is, few systems actually are configured to learn from actual errors or mismatches, particularly those which are systematically repeated. Additionally, known adaptive techniques are unable to account for the acoustic and speaker-related phenomena discussed above, particularly errors arising from the user's inability to speak the native language.
Accordingly, there is a need in the art for a speech recognition system with automatic error and mismatch correction capabilities for detecting and resolving systematically repeated errors and mismatches.
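By way of illustration only, the notion of learning from “negative evidence” discussed above can be sketched as a comparison of recognizer output against the corresponding edited text, with a tally of the substitutions that recur. The following Python sketch is a hypothetical example and does not represent the method of any particular system. The sample sentences, the function names `substitutions` and `recurrent_substitutions`, and the threshold `min_count` are all assumptions. Word-level alignment is performed with `difflib.SequenceMatcher` from the Python standard library.

```python
from collections import Counter
from difflib import SequenceMatcher


def substitutions(recognized: str, edited: str):
    """Yield (recognized phrase, edited phrase) pairs where an editor replaced words."""
    rec, ed = recognized.split(), edited.split()
    matcher = SequenceMatcher(a=rec, b=ed, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace":
            yield " ".join(rec[i1:i2]), " ".join(ed[j1:j2])


def recurrent_substitutions(pairs, min_count=2):
    """Return the substitutions seen at least min_count times across all pairs."""
    counts = Counter()
    for recognized, edited in pairs:
        counts.update(substitutions(recognized, edited))
    return {sub: n for sub, n in counts.items() if n >= min_count}


# Hypothetical recognizer output paired with its edited form.
SAMPLE_PAIRS = [
    ("patient has high per tension", "patient has hypertension"),
    ("history of high per tension noted", "history of hypertension noted"),
    ("pt denies fever", "patient denies fever"),
    ("pt reports high per tension", "patient reports hypertension"),
    ("follow up in to weeks", "follow up in two weeks"),
]

recurrent = recurrent_substitutions(SAMPLE_PAIRS)
for (wrong, right), n in sorted(recurrent.items(), key=lambda item: -item[1]):
    print(f"{wrong!r} -> {right!r} seen {n} times")
```

In this sketch, a misrecognition (“high per tension”) and a stylistic mismatch (“pt” for “patient”) both surface as recurrent substitutions, while a substitution seen only once falls below the assumed threshold.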