Technical Field
The invention relates to user entry of information into a system with an input device. More particularly, the invention relates to speech recognition combined with disambiguating systems for text input.
Description of the Prior Art
For many years, portable computers have been getting smaller and smaller. The principal size-limiting component in the effort to produce a smaller portable computer has been the keyboard. If standard typewriter-size keys are used, the portable computer must be at least as large as the standard keyboard. Miniature keyboards have been used on portable computers, but the miniature keyboard keys have been found to be too small to be manipulated easily or quickly by a user. Incorporating a full-size keyboard in a portable computer also hinders true portable use of the computer. Most portable computers cannot be operated without placing the computer on a flat work surface to allow the user to type with both hands. A user cannot easily use a portable computer while standing or moving.
Presently, tremendous growth in the wireless industry has spawned reliable, convenient, and very popular mobile devices available to the average consumer, such as cell phones, PDAs, etc. Thus, handheld wireless communications and computing devices requiring text input are becoming smaller still. Recent advances in cellular telephones and other portable wireless technologies have led to a demand for small and portable two-way messaging systems. Most manufacturers of wireless communications devices also wish to offer consumers devices that a user can operate with the same hand that is holding the device.
Speech recognition has long been expected to be the best means for text input, both as an enhancement to productivity on the desktop computer and as a solution for the size limitations of mobile devices. A speech recognition system typically includes a microphone to detect and record the voice input. The voice input is digitized and analyzed to extract a speech pattern. Speech recognition typically requires a powerful system to process the voice input. Some speech recognition systems with limited capability have been implemented on small devices, such as command and control on cellular phones, but for voice-controlled operations a device only needs to recognize a few commands. Even for such a limited scope of speech recognition, a small device may not have satisfactory speech recognition accuracy because voice patterns vary dramatically across speakers and environmental noise adds complexity to the signal.
Suhm et al. discuss a particular problem of speech recognition in the paper Multimodal Error Correction for Speech User Interfaces, in ACM Transactions on Computer-Human Interaction (2001). The “repair problem” is that of correcting the errors that occur due to imperfect recognition. They found that using the same modality (re-speaking) was unlikely to correct the recognition error, due in large part to the “Lombard” effect, in which people speak differently than usual after they are initially misunderstood, and that using a different modality, such as a keyboard, was a much more effective and efficient remedy. Unfortunately, mobile devices in particular lack the processing power and memory to offer full speech recognition capabilities, resulting in even higher recognition error rates, and lack the physical space to offer full keyboard and mouse input for efficiently correcting the errors.
Disambiguation
Prior development work has considered use of a keyboard that has a reduced number of keys. As suggested by the keypad layout of a touch-tone telephone, many of the reduced keyboards have used a 3-by-4 array of keys. Each key in the array of keys contains multiple characters. There is therefore ambiguity as a user enters a sequence of keys because each keystroke may indicate one of several letters. Several approaches have been suggested for resolving the ambiguity of the keystroke sequence. Such approaches are referred to as disambiguation.
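The ambiguity described above can be illustrated with a minimal sketch. It assumes the standard touch-tone letter layout and a tiny word list invented for illustration; the names `KEYPAD`, `key_sequence`, and `candidates` are hypothetical and are not taken from any cited system.

```python
# Standard touch-tone letter assignment (assumed layout for illustration).
KEYPAD = {
    "2": "abc", "3": "def", "4": "ghi", "5": "jkl",
    "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz",
}
LETTER_TO_KEY = {ch: key for key, letters in KEYPAD.items() for ch in letters}


def key_sequence(word):
    """Return the digit sequence a user would press to enter `word`."""
    return "".join(LETTER_TO_KEY[ch] for ch in word.lower())


def candidates(sequence, vocabulary):
    """Return every vocabulary word that produces the given key sequence."""
    return [word for word in vocabulary if key_sequence(word) == sequence]


# Hypothetical vocabulary, chosen so that several words share one sequence.
VOCABULARY = ["good", "home", "gone", "hood", "hello"]

# One four-key sequence matches four different words.
print(candidates("4663", VOCABULARY))  # ['good', 'home', 'gone', 'hood']
```

Because each keystroke stands for three or four letters, a single sequence can match several words, and the system must either choose among them or ask the user to do so.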
Some suggested approaches for determining the correct character sequence that corresponds to an ambiguous keystroke sequence are summarized by J. Arnott and M. Javad in their paper Probabilistic Character Disambiguation for Reduced Keyboards Using Small Text Samples, in the Journal of the International Society for Augmentative and Alternative Communication.
T9® Text Input is the leading commercial product offering word-level disambiguation for reduced keyboards such as telephone keypads, based on U.S. Pat. No. 5,818,437 and subsequent patents. Ordering the ambiguous words by frequency of use reduces the efficiency problems identified in earlier research, and the ability to add new words makes it even easier to use over time. Input sequences may be interpreted simultaneously as words, word stems and/or completions, numbers, and unambiguous character strings based on stylus tap location or keying patterns such as multi-tap.
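As a rough illustration of frequency ordering and of adding new words, the sketch below indexes words by key sequence and lists the matches most frequent first. It is not the T9 implementation: the word counts are invented, and `build_index` and the policy of giving a newly added word a count of 1 are assumptions made only for this example.

```python
from collections import defaultdict

# Standard touch-tone letter assignment (assumed layout for illustration).
KEYPAD = {
    "2": "abc", "3": "def", "4": "ghi", "5": "jkl",
    "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz",
}
LETTER_TO_KEY = {ch: key for key, letters in KEYPAD.items() for ch in letters}


def key_sequence(word):
    """Return the digit sequence a user would press to enter `word`."""
    return "".join(LETTER_TO_KEY[ch] for ch in word.lower())


def build_index(word_counts):
    """Map each key sequence to its matching words, most frequent first."""
    index = defaultdict(list)
    for word in word_counts:
        index[key_sequence(word)].append(word)
    for words in index.values():
        # Higher count first; ties broken alphabetically for a stable order.
        words.sort(key=lambda w: (-word_counts[w], w))
    return dict(index)


# Invented usage counts, not measured data.
WORD_COUNTS = {"good": 500, "home": 400, "gone": 150, "hood": 20}
print(build_index(WORD_COUNTS)["4663"])  # ['good', 'home', 'gone', 'hood']

# Hypothetical policy: a user-added word starts with the lowest count.
WORD_COUNTS["hone"] = 1
print(build_index(WORD_COUNTS)["4663"])
```

With this ordering the most common word is offered first, so the user needs extra selection steps only for less frequent words.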
T9 and similar products are also available on reduced keyboard devices for languages with ideographic rather than alphabetic characters, such as Chinese. These products typically take one of two approaches: basic handwritten strokes or stroke categories are mapped to the available keys, and the user enters the strokes for the desired character in a traditional order; or a phonetic alphabet is mapped to the keys and the user enters the phonetic spelling of the desired character. In either case, the user then has to locate and select the desired character among the many that match the input sequence. The input products often benefit from the context of the previously entered character to improve the ordering of the most likely characters displayed, as two or more ideographic characters are often needed to define a word or phrase.
Unfortunately, mobile phones are being designed with ever-smaller keypads, with keys that are more stylish but also more difficult to type on quickly and accurately. The disambiguation of ambiguous keystroke sequences could also benefit from further improvement. For example, the syntactic or application context is not typically taken into account when disambiguating an entered sequence or when predicting the next one.
Another commonly used keyboard for small devices consists of a touch-sensitive panel on which some type of keyboard overlay has been printed, or a touch-sensitive screen with a keyboard overlay displayed. Depending on the size and nature of the specific keyboard, either a finger or a stylus can be used to interact with the panel or display screen in the area associated with the key or letter that the user intends to activate. Due to the reduced size of many portable devices, a stylus is often used to attain sufficient accuracy in activating each intended key. The small overall size of such keyboards results in a small area being associated with each key so that it becomes quite difficult for the average user to type quickly with sufficient accuracy.
A number of built-in and add-on products offer word prediction for touch-screen keyboards like those just mentioned. After the user carefully taps on the first letters of a word, the prediction system displays a list of the most likely complete words that start with those letters. If there are too many choices, however, the user has to keep typing until the desired word appears or the user finishes the word. Switching visual focus between the touch-screen keyboard and the list of word completions after every letter tends to slow text entry rather than accelerate it.
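The prefix-based prediction described above can be sketched as follows. This is a minimal illustration, not the behavior of any particular product: the word counts are invented, and the function name `complete` and the three-item list limit are assumptions.

```python
def complete(prefix, word_counts, limit=3):
    """Return up to `limit` words starting with `prefix`, most frequent first."""
    matches = [word for word in word_counts if word.startswith(prefix)]
    matches.sort(key=lambda w: (-word_counts[w], w))
    return matches[:limit]


# Invented usage counts, not measured data.
WORD_COUNTS = {
    "the": 900, "there": 300, "they": 280,
    "then": 250, "theory": 40, "thermal": 10,
}

# After two letters, the short list is crowded with more frequent words.
print(complete("th", WORD_COUNTS))    # ['the', 'there', 'they']

# The user must keep typing before a rarer word such as "theory" appears.
print(complete("theo", WORD_COUNTS))  # ['theory']
```

The second call shows the cost noted above: when many words share a prefix, a less frequent word reaches the list only after most of it has been typed.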
The system described in U.S. Pat. No. 6,801,190 uses word-level auto-correction to resolve the accuracy problem and permit rapid entry on small keyboards. Because tap locations are presumed to be inaccurate, there is some ambiguity as to what the user intended to type. The user is presented with one or more interpretations of each keystroke sequence corresponding to a word such that the user can easily select the desired interpretation. This approach enables the system to use the information contained in the entire sequence of keystrokes to resolve what the user's intention was for each character of the sequence. When auto-correction is enabled, however, the system may not be able to offer many word completions since it does not presume that the first letters are accurate, cannot determine whether the user is typing the entire word, and there may be many other interpretations of the key sequence to display.
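The idea of using the whole tap sequence rather than each tap in isolation can be sketched as below. This is not the method of the cited patent; it is a simplified stand-in that assumes a hypothetical QWERTY geometry (unit-wide keys, each row offset by half a key), a tiny invented vocabulary, and a score equal to the summed distance between each tap and the center of the corresponding letter's key.

```python
import math

ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
# Hypothetical key centers: unit-wide keys, each row shifted by half a key.
KEY_CENTER = {
    ch: (col + 0.5 * row, float(row))
    for row, letters in enumerate(ROWS)
    for col, ch in enumerate(letters)
}


def tap_distance(tap, ch):
    """Euclidean distance from a tap location to the center of key `ch`."""
    kx, ky = KEY_CENTER[ch]
    return math.hypot(tap[0] - kx, tap[1] - ky)


def nearest_key(tap):
    """Letter whose key center is closest to a single tap."""
    return min(KEY_CENTER, key=lambda ch: tap_distance(tap, ch))


def interpretations(taps, vocabulary, limit=3):
    """Rank same-length words by total tap-to-key distance, closest first."""
    same_length = [w for w in vocabulary if len(w) == len(taps)]
    same_length.sort(
        key=lambda w: sum(tap_distance(t, ch) for t, ch in zip(taps, w))
    )
    return same_length[:limit]


# Invented taps: the second one lands slightly closer to "g" than to "h".
TAPS = [(4.4, 0.2), (4.9, 0.8), (2.3, 0.1)]
VOCABULARY = ["the", "tie", "toe", "she", "them"]

print("".join(nearest_key(t) for t in TAPS))  # tap-by-tap reading: "tge"
print(interpretations(TAPS, VOCABULARY)[0])   # word-level reading: "the"
```

Read one tap at a time, the input yields a non-word, while scoring whole words recovers the intended one. The sketch also shows the limitation noted above: it compares only words of the same length as the tap sequence, so it offers no completions.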
Handwriting recognition is another approach that has been taken to solve the text input problem on small devices that have a touch-sensitive screen or pad that detects motion of a finger or stylus. Writing on the touch-sensitive panel or display screen generates a stream of data input indicating the contact points. The handwriting recognition software analyzes the geometric characteristics of the stream of data input to determine each character or word.
Unfortunately, current handwriting recognition solutions have many problems:
1) Handwriting is generally slower than typing;
2) On small devices, memory limitations reduce handwriting recognition accuracy; and
3) Individual handwriting styles may differ from those used to train the handwriting software.
It is for these reasons that many handwriting or ‘graffiti’ products require the user to learn a very specific set of strokes for the individual letters. This specific set of strokes is designed to simplify the geometric pattern recognition process of the system and increase the recognition rate. These strokes may be very different from the natural way in which the letter is written. This results in very low product adoption.
Handwriting on mobile devices introduces further challenges to recognition accuracy: the orientation of handwriting while trying to hold the device may vary or skew the input; and usage while on the move, e.g. the vibration or bumpiness during a bus ride, causes loss of contact with the touch-screen resulting in “noise” in the stream of contact points.
Therefore, current ambiguous and recognizer-based systems for text input, while compensating somewhat for the constraints imposed by small devices, have limitations that reduce their speed and accuracy to a level that users might consider unacceptable.
In Suhm's paper, “multimodal error correction” is defined as using an alternate (non-speech) modality to re-enter the entire word or phrase that was misrecognized. This is found to be more efficient than re-speaking in part because the speech modality has already been shown to be inaccurate. That the alternate input modality has its own recognition accuracy problems is considered by the user in deciding which modality to use next, but each of the modalities is operated independently in an attempt to complete the text entry task.
It would be advantageous to provide an apparatus and method for speech recognition that offers smart editing of speech recognition output.
It would be advantageous to provide an apparatus and method for speech recognition that maximizes the benefits of an alternate input modality in correcting recognition errors.
It would be advantageous to provide an apparatus and method for speech recognition that offers an efficient alternate input modality when speech recognition is not effective or desirable given the current task or environment.