The present invention relates to automatic speech recognition and, more particularly, relates to hierarchical transcription and display of input speech.
Transcription of words based on Automatic Speech Recognition (ASR) is a well-known method that helps to improve the communication ability of the hearing impaired. A problem with this approach is that if the recognition error rate is relatively high, the transcription is not effective for hearing impaired children who are still learning a language, as these children can be easily confused by wrongly decoded words. An approach that addresses this problem is displaying phonetic output rather than words. This approach is, however, not optimal because reading correctly recognized words is easier and more efficient than reading phonetic output.
The use of ASR to teach hearing impaired people to read is also a well-known method. In this approach, a reference text is displayed for a user, and the ASR decodes the user's speech while he or she reads the text aloud and compares the decoded output with the reference text. One reference that explains the use of ASR for this purpose is “Reading Tutor Using an Automatic Speech,” Technical Disclosure Bulletin, Volume 36, Number 8, 08-93, pp. 287-290, the disclosure of which is hereby incorporated by reference. A problem with this approach is that any error in speech recognition will make the user think that he or she has misspoken a word, when the error is actually the fault of the program.
Another problem with ASR occurs in noisy environments, with a difficult channel such as a telephone, or when speech is riddled with disfluencies. In these situations, a substantial number of errors is likely to occur. Although errors can sometimes be identified by the user because of the context, the resulting confusion and increased difficulty of interpretation may offset the benefits of word-based display. This is especially true when the user is a child who is in the process of learning the language. In this case, virtually no errors should be allowed.
While this problem is particularly acute for children who are learning to speak properly, high ASR error rates are also a general problem. As a person dictates into an ASR system, the system makes transcription decisions based on probabilities, and some of those decisions may rest on low probabilities. If the user does not immediately catch an incorrect transcription, the correct transcription may be hard to determine, even when the context is known.
Thus, what is needed is a way of limiting or solving the problems of a high recognition error rate when using ASR to improve the communication ability, the reading skills, or both, of hearing impaired people, or when using ASR for other speech recognition purposes.
Generally, the present invention provides the ability to present a mixed display of a transcription to a user. The mixed display is preferably organized in a hierarchical fashion. Words, syllables and phones can be placed on the same display by the present invention, and the present invention can select the appropriate symbol transcription based on the parts of speech that meet minimum confidences. A word is displayed if it meets a minimum confidence; otherwise, the syllables that make up the word are displayed. Additionally, if a syllable does not meet a predetermined confidence, then the phones that make up the syllable may be displayed. A transcription, in one aspect of the present invention, may also be described as a hierarchical transcription, because a unique confidence is derived that accounts for mixed word/syllable/phone data.
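By way of illustration only, the word/syllable/phone fallback described above may be sketched as follows. This is a minimal sketch and not the claimed implementation: the names `Word`, `Syllable`, and `mixed_display`, the threshold values 0.8 and 0.6, and the sample confidences and phone symbols are all hypothetical and are assumed here solely for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Syllable:
    text: str
    confidence: float                       # assumed to lie in [0, 1]
    phones: List[str] = field(default_factory=list)


@dataclass
class Word:
    text: str
    confidence: float                       # assumed to lie in [0, 1]
    syllables: List[Syllable] = field(default_factory=list)


def mixed_display(words: List[Word],
                  word_min: float = 0.8,        # hypothetical threshold
                  syllable_min: float = 0.6     # hypothetical threshold
                  ) -> List[str]:
    """Return the symbols to display, choosing the coarsest unit that
    meets its minimum confidence: word, then syllable, then phone."""
    shown: List[str] = []
    for word in words:
        if word.confidence >= word_min:
            shown.append(word.text)             # confident word: show as a word
            continue
        for syllable in word.syllables:
            if syllable.confidence >= syllable_min:
                shown.append(syllable.text)     # confident syllable
            else:
                shown.extend(syllable.phones)   # fall back to its phones
    return shown


# Hypothetical decoder output for the utterance "speech today".
sample = [
    Word("speech", 0.95),
    Word("today", 0.50, [
        Syllable("to", 0.90, ["T", "AH"]),
        Syllable("day", 0.30, ["D", "EY"]),
    ]),
]

print(mixed_display(sample))
```

In this sketch, the confidently recognized word is shown whole, the first syllable of the uncertain word is shown as a syllable, and the uncertain second syllable is shown as its phones, so that a single display mixes all three levels.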
A more complete understanding of the present invention, as well as further features and advantages of the present invention, will be obtained by reference to the following detailed description and drawings.