1. Field of the Invention
The invention relates to the field of speech processing and, more particularly, to speech recognition systems.
2. Description of the Related Art
A speech recognition system can recognize speech and render a text corresponding to the recognized speech. In general, a speech recognition system can identify features in a spoken utterance, and based on the identified features, distinguish the utterance from other words or phrases of a defined vocabulary. The speech recognition system can identify words, phonemes, morphemes, or other sub-word units of speech by evaluating the identified features during a speech recognition task. These units of speech can be associated with a text or a phonetic string that corresponds to the spoken utterance.
Speech recognition systems and natural language understanding systems can also include grammars. The grammars can define the rules of interaction among the units of speech during the recognition of a word or phrase. For a particular vocalization, or utterance, processed by such a system, the utterance may contain a word or phrase that matches one in an active grammar set, and that the system correctly recognizes as a match, thereby yielding a correct acceptance decision by the system. The utterance also may contain a word or phrase that does not have a match in the active grammar, and that the system correctly rejects, yielding a correct rejection decision by the system.
However, speech recognition systems can yield recognition errors. Certain words and phrases may be confused for similarly sounding words or phrases based on the grammars or features. One type of error relating to an active grammar set is the false acceptance of a word or phrase that is incorrectly interpreted as matching one in an active set. Another type of error is a false rejection, which occurs when a word or phrase that has a match in the active set is not recognized. Still another type of error can occur when a word or phrase of an utterance has a match in the active set, but is incorrectly interpreted as matching a different word or phrase, this type of error typically being characterized as a “false acceptance—grammar.” The speech recognition system may not be aware of such errors. However, the system can learn from the errors if the system is made aware of the errors.
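By way of illustration only, the acceptance and rejection decisions described above can be sketched as a small classification routine. The sketch assumes that the words actually spoken are known (for example, from a manual transcription), that the recognizer returns either a phrase from the active grammar set or nothing, and that exact string matching is sufficient. The names `Outcome` and `classify_outcome` are hypothetical and are not taken from any particular speech recognition system.

```python
from enum import Enum
from typing import Optional, Set


class Outcome(Enum):
    CORRECT_ACCEPTANCE = "correct acceptance"
    CORRECT_REJECTION = "correct rejection"
    FALSE_ACCEPTANCE = "false acceptance"
    FALSE_REJECTION = "false rejection"
    FALSE_ACCEPTANCE_GRAMMAR = "false acceptance (grammar)"


def classify_outcome(
    spoken: str, recognized: Optional[str], active_grammar: Set[str]
) -> Outcome:
    """Classify one recognition decision.

    spoken:         what was actually said (assumed known, e.g. transcribed).
    recognized:     phrase returned by the recognizer, or None if rejected.
    active_grammar: phrases in the active grammar set (assumed exact strings).
    """
    in_grammar = spoken in active_grammar

    if recognized is None:
        # The system rejected the utterance.
        return Outcome.FALSE_REJECTION if in_grammar else Outcome.CORRECT_REJECTION

    if not in_grammar:
        # The system accepted something that has no match in the active set.
        return Outcome.FALSE_ACCEPTANCE

    if recognized == spoken:
        return Outcome.CORRECT_ACCEPTANCE

    # The utterance has a match in the active set, but a different
    # phrase was returned.
    return Outcome.FALSE_ACCEPTANCE_GRAMMAR
```

For example, with an active set of {"check balance", "transfer funds"}, an utterance of "check balance" that is returned as "transfer funds" would fall into the last category, while an utterance of "order pizza" that is rejected would be a correct rejection.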
Manual transcription is a process in which a person transcribes an audio recording of a spoken utterance to textual form. With regard to speech recognition systems that convert spoken utterances to text, a manual transcription of the spoken utterance can be referenced for identifying text errors in the speech recognition results. For example, the person can compare the manual transcription of the spoken utterance to the text produced by the speech recognition system. Results can be validated by identifying those utterances that were incorrectly recognized. It should be understood that the validation does not require a direct comparison of the manual transcription against the text results from the speech recognition system. The recognition results need not be used as a guide or starting point for the person performing the transcription. The person performing the transcription can simply write down the text he or she hears being spoken in the utterance.
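As a non-limiting sketch of how incorrectly recognized utterances might be identified from a manual transcription, the following routine compares each transcription with the corresponding recognition text. It assumes that both are keyed by a shared utterance identifier and that differences in case, punctuation, and spacing should be ignored; the function names `normalize` and `find_misrecognized` are hypothetical.

```python
import string
from typing import Dict, List


def normalize(text: str) -> str:
    """Lowercase, strip ASCII punctuation, and collapse whitespace."""
    stripped = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(stripped.split())


def find_misrecognized(
    transcripts: Dict[str, str], results: Dict[str, str]
) -> List[str]:
    """Return identifiers of utterances whose recognition text differs
    from the manual transcription after normalization.

    An utterance with no recognition result is compared against an empty
    string, so it is reported whenever its transcription is non-empty.
    """
    return [
        utterance_id
        for utterance_id, truth in transcripts.items()
        if normalize(truth) != normalize(results.get(utterance_id, ""))
    ]
```

The list returned by such a routine could then serve as the set of utterances presented for correction or considered for retraining.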
In one aspect, incorrectly recognized utterances can be used for retraining the speech recognition system. The validation process can also reveal which grammars need to be re-tuned or updated. In practice, a person listens to a spoken utterance and determines whether the recognition result is correct. For example, the spoken utterance can be presented in an audible format and the recognition result can be presented as corresponding text. The person can determine whether the text correctly corresponds to the audible spoken utterance. If a recognition result is incorrect, the person can manually update the recognition result with the correct transcription. In general, the person edits the text to correct mistakes during transcription.
Manual transcription, however, is typically a tedious process that requires human input to validate and manually correct recognition results. In addition, speech recognition systems may process hundreds or even thousands of utterances, creating enormous amounts of data. The user may not know which utterances were recognized less accurately than others, or which utterances should be used to retrain or tune the speech recognition system. A need therefore exists for improving the efficiency with which manual transcription validates recognition results so that the performance of a speech recognition system can be enhanced.