Text entry on relatively small mobile devices, such as cellular telephones and personal digital assistants, is growing in popularity due to an increase in the use of applications for such devices. Some such applications include electronic mail (e-mail) and short message service (SMS).
However, mobile phones, personal digital assistants (PDAs), and other such mobile devices, in general, do not have a keyboard which is as convenient as that on a desktop computer. For instance, mobile phones tend to have only a numeric keypad on which multiple letters are mapped to the same key. Some PDAs have only touch sensitive screens that receive inputs from a stylus or similar item.
Thus, such devices currently provide interfaces that allow a user to enter text, through the numeric keypad or touch screen or other input device, using one of a number of different methods. One such method is a deterministic interface known as a multi-tap interface. In the multi-tap interface, the user depresses a numbered key a given number of times, based upon which corresponding letter the user desires. For example, when a keypad has the number “2” key corresponding to the letters “abc”, the keystroke “2” corresponds to “a”, the keystrokes “22” correspond to “b”, the keystrokes “222” correspond to “c”, and the keystrokes “2222” correspond to the number “2”. In another example, the keystroke entry 8 44 444 7777 would correspond to the word “this”.
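By way of illustration only, the multi-tap mapping described above can be sketched in Python as follows. The sketch assumes the conventional 12-key letter layout, assumes that the key presses for successive letters are separated by spaces (standing in for a pause or timeout between letters), and assumes that pressing a key more times than it has symbols wraps around to the first letter. The names KEYPAD and decode_multitap are hypothetical and are not taken from any particular device.

```python
# Conventional 12-key letter layout (assumed for this sketch).
KEYPAD = {
    "2": "abc", "3": "def", "4": "ghi", "5": "jkl",
    "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz",
}


def decode_multitap(keystrokes: str) -> str:
    """Decode space-separated runs of a repeated key into text.

    Each run selects one symbol: the letters on the key in order,
    followed by the digit itself (so "2222" yields "2").
    """
    decoded = []
    for run in keystrokes.split():
        key = run[0]
        if key not in KEYPAD or set(run) != {key}:
            raise ValueError(f"not a valid multi-tap run: {run!r}")
        cycle = KEYPAD[key] + key
        # Assumption: extra presses wrap around to the start of the cycle.
        decoded.append(cycle[(len(run) - 1) % len(cycle)])
    return "".join(decoded)


word = decode_multitap("8 44 444 7777")
```

With the keystroke entry from the example above, word holds the string "this", at a cost of ten key presses for four letters.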
Another known type of interface is a predictive system, known as the T9 interface by Tegic Communications. The T9 interface allows a user to tap the key corresponding to a desired letter once, and uses the previous keystroke sequence to predict the desired word. Although this reduces the number of key presses, this type of predictive interface suffers from ambiguity that results from words that share the same key sequence. For example, the key sequence “4663” could correspond to the words “home”, “good”, “gone”, “hood”, or “hone”. In these situations, the interface displays a list of predicted words generated from the key sequence and the user presses a “next” key to scroll through the alternatives. Further, since words outside the dictionary, or outside the vocabulary of the interface, cannot be predicted, T9-type interfaces are often combined with other fallback strategies, such as multi-tap, in order to handle out-of-vocabulary words.
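The ambiguity described above can be illustrated with a minimal Python sketch of a one-press-per-letter dictionary lookup. This is not the T9 implementation itself; it is a simple index from key sequences to words, built over a small hypothetical vocabulary, and candidates are returned in vocabulary order rather than by any frequency ranking a real interface might apply. The names key_sequence, build_index, and VOCABULARY are assumptions made for illustration.

```python
from collections import defaultdict

# Conventional 12-key letter layout (assumed for this sketch).
KEYPAD = {
    "2": "abc", "3": "def", "4": "ghi", "5": "jkl",
    "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz",
}
LETTER_TO_KEY = {ch: key for key, letters in KEYPAD.items() for ch in letters}


def key_sequence(word: str) -> str:
    """Return the digits pressed when each letter is tapped exactly once."""
    return "".join(LETTER_TO_KEY[ch] for ch in word.lower())


def build_index(vocabulary):
    """Map each key sequence to every vocabulary word that produces it."""
    index = defaultdict(list)
    for entry in vocabulary:
        index[key_sequence(entry)].append(entry)
    return index


# Hypothetical vocabulary; a real interface would ship a full dictionary.
VOCABULARY = ["home", "good", "gone", "hood", "hone", "this"]
index = build_index(VOCABULARY)

candidates = index.get("4663", [])   # five words share this sequence
unknown = index.get("9999", [])      # out of vocabulary: no candidates
```

Here candidates contains all five words from the example, so the user must still scroll to the intended one, while unknown is empty, which is the case in which a fallback such as multi-tap is needed.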
Some current interfaces also provide support for word completion and word prediction. For example, based on an initial key sequence of “466” (which corresponds to the letters “goo”), one can predict the word “good”. Similarly, from an initial key sequence of “6676” (which corresponds to the letters “morn”), one can predict the word “morning”. Likewise, one can predict the word “a” as the next word following the word sequence “this is” based on an n-gram language model prediction.
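Word completion and next-word prediction of the kind just described can be sketched in Python as follows. The sketch assumes that completion candidates are ranked by unigram counts and that next-word prediction uses a trigram model (an n-gram model with n = 3). The counts in TOY_COUNTS and the sentences in TOY_CORPUS are invented solely for this example, and all function names are hypothetical.

```python
from collections import Counter, defaultdict

# Conventional 12-key letter layout (assumed for this sketch).
KEYPAD = {
    "2": "abc", "3": "def", "4": "ghi", "5": "jkl",
    "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz",
}
LETTER_TO_KEY = {ch: key for key, letters in KEYPAD.items() for ch in letters}

# Invented counts and sentences, for illustration only.
TOY_COUNTS = {"good": 50, "home": 40, "gone": 20, "morning": 30, "this": 60}
TOY_CORPUS = ["this is a test", "this is a phone", "this is the end"]


def key_sequence(word: str) -> str:
    """Return the digits pressed when each letter is tapped exactly once."""
    return "".join(LETTER_TO_KEY[ch] for ch in word.lower())


def complete(key_prefix: str, counts: dict) -> list:
    """Words whose key sequence starts with key_prefix, most frequent first."""
    matches = [w for w in counts if key_sequence(w).startswith(key_prefix)]
    return sorted(matches, key=lambda w: -counts[w])


def train_trigrams(sentences):
    """Count which word follows each pair of consecutive words."""
    model = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for w1, w2, w3 in zip(words, words[1:], words[2:]):
            model[(w1, w2)][w3] += 1
    return model


def predict_next(model, w1: str, w2: str):
    """Most frequent follower of (w1, w2), or None for an unseen context."""
    followers = model.get((w1, w2))
    return followers.most_common(1)[0][0] if followers else None


model = train_trigrams(TOY_CORPUS)
best_for_466 = complete("466", TOY_COUNTS)[0]
best_for_6676 = complete("6676", TOY_COUNTS)[0]
next_word = predict_next(model, "this", "is")
```

Under these toy counts, the key prefix “466” matches “good”, “home”, and “gone”, so it is the assumed frequency ranking, not the key sequence alone, that places “good” first.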
None of these interfaces truly lends itself to rapid text entry. In fact, novice users of these methods often achieve text entry rates of only 5-10 words per minute.
In order to increase the information input bandwidth on such communication devices, some devices implement speech recognition. Speech has a relatively high communication bandwidth, which is estimated at approximately 250 words per minute. However, the bandwidth for text entry using conventional automatic speech recognition systems is much lower in practice, due to the time spent by the user in checking for, and correcting, the speech recognition errors that are inevitable with current speech recognition systems.
In particular, some current speech based text input methods allow users to enter text into cellular telephones by speaking an utterance with a slight pause between each word. The speech recognition system then displays a recognition result. Since direct dictation often results in errors, especially in the presence of noise, the user must select mistakes in the recognition result and then correct them using an alternatives list or fallback entry method.
Isolated word recognition requires the user to speak only one word at a time. That one word is processed and output. The user then corrects that word. Although isolated word recognition does improve recognition accuracy, an isolated word recognition interface is unnatural and reduces the data entry rate over that achieved using continuous speech recognition, in which a user can speak an entire phrase or sentence at one time.
However, error correction in continuous speech recognition presents problems. Traditionally, speech recognition results for continuous speech recognition have been presented by displaying the best hypothesis for the entire phrase or sentence. To correct errors, the user then selects the misrecognized word and chooses an alternative from a drop-down list. Since errors often occur in groups and across word boundaries, many systems allow for correcting entire misrecognized phrases. For example, the utterance “can you recognize speech” may be incorrectly recognized as “can you wreck a nice beach”. In this case, it is simply not possible to correct the recognition a word at a time, because the word segmentation itself is incorrect. Thus, the user is required to select the phrase “wreck a nice beach” and choose an alternative for the entire phrase.
While such an approach may work well when recognition accuracy is high and a pointing device, such as a mouse, is available, it becomes cumbersome on mobile devices that lack a pointer and on which recognition accuracy cannot be assumed, given typically noisy environments and limited processor capabilities. On a device with only hardware buttons, a keypad, a touch screen, or the like, it is difficult to design an interface that allows users to select a range of words for correction while keeping keystrokes to a reasonable number.
The discussion above is merely provided for general background information and is not intended to be used as an aid in determining the scope of the claimed subject matter.