The problem of entering text into devices having small form factors (like cellular phones, personal digital assistants (PDAs), RIM Blackberry, the Apple iPod, and others) using multimodal interfaces (especially using speech) has existed for a while now. This problem is of specific importance in many practical mobile applications that include text-messaging (short messaging service or SMS, multimedia messaging service or MMS, Email, instant messaging or IM), wireless Internet browsing, and wireless content search.
Although many attempts have been made to address the above problem using “Speech Recognition”, there has been limited practical success. The inventors have determined that this is because existing systems use speech as an independent mode, not simultaneous with other modes like the keypad; for instance, a user needs to push-to-speak (perhaps using a Camera button on a mobile device) to activate the speech mode and then speak a command or dictate a phrase. Furthermore, these systems allow only limited-vocabulary command-and-control speech input (for instance, saying a song title, a name, or a phrase to search for a ringtone). Finally, these systems require users to learn a whole new interface design without providing them the motivation to do so.
In general, contemporary systems have several limitations. Firstly, when the vocabulary list is large (for instance, 1000 isolated words, 5000 small phrases as in names/commands, or 20000 long phrases as in song titles), (a) these systems' recognition accuracies are not satisfactory even when they are implemented using a distributed client-server architecture, (b) the multimodal systems are not practical to implement on current hardware platforms, (c) the systems are not fast enough given their implementation, and (d) these systems' recognition accuracies degrade in the presence of even slight background noise.
Secondly, the activation (and de-activation) of the speech mode requires a manual control input (for instance, a push-to-speak camera button on a mobile device). The primary reason for doing so is to address problems arising from background noise: with the popular push-to-speak method, the system's microphone is ON only when the user is speaking, so non-speech signals are filtered out.
Thirdly, these methods design a speech mode that is independent of, and different from, existing designs of non-speech modes, thus introducing a significant behavioral change for users: (a) to input a name (Greg Simon) from a contact list on a mobile device, a standard keypad interface may include “Menu-Insert-Select Greg Simon”, whereas standard voice-dial software requires the user to say “Call Greg Simon”, which has a different interface design; (b) to input a phrase, a user enters the words of that phrase one by one using either triple-tap or predictive text input, whereas speech dictation software requires the user to push and hold a button and speak the entire message continuously, once again a different interface design.