Speech is perhaps the oldest form of human communication, and many scientists now believe that the ability to communicate through speech is inherent in the biology of the human brain. Thus, it has been a long-sought goal to allow users to communicate with computers using a Natural User Interface (NUI), such as speech. In fact, great strides have recently been made toward attaining this goal. For example, some computers now include speech recognition applications that allow a user to verbally input both commands for operating the computer and dictation to be converted into text. These applications typically operate by periodically recording sound samples taken through a microphone, analyzing the samples to recognize the phonemes being spoken by the user, and identifying the words made up by the spoken phonemes.
While speech recognition is becoming more commonplace, there are still some disadvantages to using conventional speech recognition applications that tend to frustrate the experienced user and alienate the novice user. One such disadvantage involves the interaction between the speaker and the computer. For example, in human interaction, people tend to control their speech based upon the reaction that they perceive in a listener. As such, during a conversation a listener may provide feedback by nodding or making vocal responses, such as “yes” or “uh-huh”, to indicate that he or she understands what is being said. Additionally, if the listener does not understand what is being said, the listener may take on a quizzical expression, lean forward, or give other vocal or non-vocal cues. In response to this feedback, the speaker will typically change the way he or she is speaking. In some cases, the speaker may speak more slowly, speak more loudly, pause more frequently, or even repeat a statement, usually without the listener even realizing that the speaker is changing the way he or she is interacting with the listener. Thus, feedback during a conversation is a very important element that informs the speaker as to whether or not he or she is being understood by the listener. Unfortunately, however, conventional voice recognition applications are not yet able to provide this type of “Natural User Interface (NUI)” feedback response to speech inputs/commands facilitated by a man-machine interface.
Currently, voice recognition applications have achieved an accuracy rate of approximately 90% to 98%. This means that when a user dictates into a document using a typical voice recognition application, the user's speech will be accurately recognized by the voice recognition application approximately 90% to 98% of the time. Thus, out of every one hundred (100) letters recorded by the voice recognition application, approximately two (2) to ten (10) letters will have to be corrected. In particular, existing voice recognition applications tend to have difficulty recognizing certain letters, such as “s” (i.e., “ess”) and “f” (i.e., “eff”). One approach existing voice recognition applications use to address this problem involves giving the user the ability to use predefined mnemonics to clarify which letter he or she is pronouncing. For example, a user has the ability to say “a as in apple” or “b as in boy” when dictating.
Unfortunately, however, this approach has disadvantages associated with it that tend to limit the user friendliness of the voice recognition application. One disadvantage involves the use of the predefined mnemonics for each letter, which tend to be the standard military alphabet (e.g., alpha, bravo, charlie, . . . ). This is because, even though a user may be given a list of mnemonics to say when dictating (e.g., “I as in igloo”), users tend to form their own mnemonic alphabet (e.g., “I as in India”) and ignore the predefined mnemonic alphabet. As can be expected, because the voice recognition applications do not recognize non-predefined mnemonics, letter recognition errors become commonplace. Another disadvantage involves the fact that while some letters have a small set of predominant mnemonics (i.e., >80%) associated with them (A as in Apple, A as in Adam; D as in Dog, D as in David; or Z as in Zebra, Z as in Zulu), other letters have no predominant mnemonics associated with them (e.g., L, P, R and S). This makes the creation of a suitable generic language model not merely very difficult, but virtually impossible. As such, communicating language to a speech recognition software application still produces a relatively high number of errors. These errors not only tend to create frustration in frequent users, but also tend to discourage novice users, possibly resulting in the user refusing to continue employing the voice recognition application.
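The limitation of a fixed mnemonic list can be illustrated with a minimal sketch. It assumes, purely for illustration, that the recognizer has already produced a text transcript of the utterance and that the predefined list is a simple word-to-letter table; the names PREDEFINED_MNEMONICS and resolve_letter and the table contents are hypothetical and are not taken from any actual voice recognition application.

```python
import re

# Hypothetical predefined mnemonic list (illustrative subset only).
# Each predefined mnemonic word maps to the letter it stands for.
PREDEFINED_MNEMONICS = {
    "apple": "a",
    "boy": "b",
    "igloo": "i",
}

# Matches the "<letter> as in <word>" form and captures the mnemonic word.
AS_IN_PATTERN = re.compile(r"\bas\s+in\s+([a-z]+)\b")


def resolve_letter(utterance, table=PREDEFINED_MNEMONICS):
    """Return the letter for a predefined mnemonic, or None if unrecognized.

    The spoken letter itself is ignored because it is the part assumed to be
    unreliable; only the mnemonic word is looked up in the fixed table.
    """
    match = AS_IN_PATTERN.search(utterance.lower())
    if match is None:
        return None  # utterance is not in "<letter> as in <word>" form
    return table.get(match.group(1))  # None when the word is not predefined


if __name__ == "__main__":
    for phrase in ("a as in apple", "I as in igloo", "I as in India"):
        print(phrase, "->", resolve_letter(phrase))
```

In this sketch, “I as in igloo” resolves to the letter i, while “I as in India” returns None because “India” is not in the predefined table, which mirrors the letter recognition failure described above when a user substitutes his or her own mnemonic.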