1. Field of the Invention
This invention relates to the field of speech recognition systems. More specifically, this invention relates to user interfaces for speech recognition systems, and yet more specifically to a method and apparatus for assisting a user in reviewing transcription results from a speech recognition dictation system.
2. Description of the Related Art
Text processing systems, e.g. word processors with spell checkers, such as Lotus WordPro™ and Word Perfect™ by Novell, can display misspelled words (i.e. words not recognized by a dictionary internal to the word processor) in a color different from that of the normal text. As a variant, Microsoft Word™ underlines misspelled words in a color different from that of the normal text. In these cases, it is simple to ascertain the validity of a word by checking it against dictionaries. Either a word is correctly spelled or it is not. However, these aspects of known text processing systems deal only with possible spelling errors. Additionally, because spellcheckers in text processing systems use only a binary, true/false criterion to determine whether a word is correctly (or possibly incorrectly) spelled, these systems will choose one of two colors in which to display the word. In other words, there are no shades of gray. The word is merely displayed in one color if it is correctly spelled and in a second color if the system suspects the word is incorrectly spelled. Grammar checking systems operate similarly, in that the system will choose one of two colors in which to display the text depending upon whether the system determines that correct grammar has been used.
By contrast, the inventive method and apparatus deal with speech recognition errors, and in particular with levels of confidence that a speech recognition system has in recognizing words that are spoken by a user. With the method and apparatus of the present invention, an indication is produced, which is correlated to a speech recognition engine's calculated probability that it has correctly recognized a word. Whether or not a word has been correctly recognized, the displayed word will always be correctly spelled. Additionally, the inventive system supports multiple levels of criteria in determining how to display a word by providing a multilevel confidence display.
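The contrast between a binary criterion and a multilevel confidence display can be illustrated with a minimal sketch. The threshold values, color names, and function names below are hypothetical and chosen only for illustration; the text above does not specify how many levels are used or where their boundaries lie.

```python
from bisect import bisect_right

# Hypothetical confidence boundaries and display colors (assumptions,
# not taken from the source). Colors run from lowest to highest confidence.
THRESHOLDS = (0.5, 0.8)
COLORS = ("red", "orange", "black")


def binary_color(in_dictionary: bool) -> str:
    """Spell-checker style display: exactly two possible outcomes."""
    return "black" if in_dictionary else "red"


def multilevel_color(confidence: float) -> str:
    """Map a recognizer's confidence in [0, 1] to one of several colors."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    # bisect_right counts how many thresholds the score meets or exceeds,
    # which is the index of the matching display level.
    return COLORS[bisect_right(THRESHOLDS, confidence)]
```

Under these assumed thresholds, a word recognized with confidence 0.3 would be shown in red, one at 0.65 in orange, and one at 0.95 in black, whereas the binary scheme can only ever distinguish two cases.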
In another area, known data visualization systems use color and other visual attributes to communicate quantitative information. For example, an electroencephalograph (EEG) system may display a color contour map of the brain, where color is an indication of amplitude of electrical activity. Additionally, meteorological systems display maps where rainfall amounts or temperatures may be indicated by different colors. Contour maps display altitudes and depths in corresponding ranges of colors. However, such data visualization systems have not been applied to text, or more specifically, to text created by a speech recognition/dictation system.
In yet another area, several speech recognition dictation systems have the capability of recognizing a spoken command. For example, a person dictating text may dictate commands, such as "Underline this section of text", or "Print this document". In these cases, when the match between the incoming acoustic signal and the decoded text has a low confidence score, the spoken command is flagged as being unrecognized. In such a circumstance, the system will display an indication over the user interface, e.g. a question mark or some comment such as "Pardon Me?". However, such systems merely indicate whether a spoken command is recognized and are, therefore, binary, rather than multilevel, in nature. In the example just given, the system indicates that it is unable to carry out the user's command. Thus, the user must take some action. Such systems fail to deal with the issue of displaying text in a manner that reflects the system's varying level of confidence in its ability to comply with a command.
In yet another area, J. R. Rhyne and G. C. Wolf's chapter entitled "Recognition Based User Interfaces," published in Advances in Human-Computer Interaction, 4:216-218, Ablex, 1993, R. Hartson and D. Hix, editors, states "the interface may highlight the result just when the resemblance between the recognition alternatives are close and the probability of a substitution error is high." However, this is just another instance of using binary criteria and is to be contrasted with the multilevel confidence display of the present invention. Furthermore, this reference deals only with substitution errors and lacks user control, unlike the present invention, which addresses not only substitution errors but also deletion and insertion errors, and additionally provides for user control.
Traditionally, when users dictate text using speech recognition technology, recognition errors are hard to detect. The user typically has to read the entire dictated document carefully word by word, looking for insertions, deletions and substitutions. For example, the sentence "there are no signs of cancer" can become "there are signs of cancer" through a deletion error. This type of error can be easy to miss when quickly proofreading a document.
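The three error categories can be made concrete with a short sketch that aligns what was spoken against what was transcribed, using Python's standard `difflib` module. This is offered only to illustrate the terminology: it assumes a reference transcript of the spoken words is available, which is not the case during actual dictation, and the function name `classify_errors` is hypothetical.

```python
from difflib import SequenceMatcher

# Map difflib opcode tags onto the error names used in the text.
ERROR_NAMES = {
    "delete": "deletion",       # spoken words missing from the transcript
    "insert": "insertion",      # transcript words that were never spoken
    "replace": "substitution",  # spoken words transcribed as other words
}


def classify_errors(spoken: str, transcribed: str):
    """Return (error kind, spoken words, transcribed words) per mismatch."""
    ref, hyp = spoken.split(), transcribed.split()
    matcher = SequenceMatcher(a=ref, b=hyp, autojunk=False)
    errors = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            errors.append((ERROR_NAMES[tag], ref[i1:i2], hyp[j1:j2]))
    return errors


print(classify_errors("there are no signs of cancer",
                      "there are signs of cancer"))
# prints [('deletion', ['no'], [])]
```

A single dropped word is the only difference reported, yet it reverses the meaning of the sentence, which is why such errors are costly to miss when proofreading.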
It would be desirable to provide a system that displays transcribed text in accordance with the system's level of confidence that the transcription is accurate. It also would be desirable if such a system could display more than a binary indication of its level of confidence.