Mobile devices have become ubiquitous in the everyday life of the general consumer. No longer are cellular phones, electronic personal data assistants, and Internet-connected hand-held devices reserved for the elite. As these devices become ingrained in consumers' daily routines, they are increasingly used in situations in which safety, convenience and even appropriateness have become issues.
Users frequently need to interrupt or “barge in” during the operation of a software application in order to issue a command or to pause or stop the application, and frequently need to do so without touching or looking directly at the device running the software. As one example, drivers routinely attempt to text, email and talk on their phones while driving. In other instances, the device may be out of reach (e.g., across the room, or inside a clothing pocket, backpack, briefcase or purse). The device may be purposefully hidden to avoid theft or snooping, such as when being used in a crowd or on a subway. The user may wish to avoid touching the device due to risks of dirtying, ruining or breaking the device, such as when jogging, biking, cooking, or performing manual labor. In some instances, the user may be visually impaired or otherwise physically challenged and need to rely on other means for interacting with the device.
The present state of the art, however, does not provide a simple and accurate method for a user to effectively “barge in” while a device and/or an application on the device is in operation. While “hands-free” operation has made use of these devices somewhat more acceptable in certain instances, the user experience is less than ideal.
Attempts to solve this challenge using voice-response and voice-recognition applications have not fully addressed the challenges users encounter during actual use of their devices. Speakerphone systems, such as are common in automobiles for hands-free cell phone use, exacerbate the problem because their microphones do not discriminate among sounds in the acoustic environment. The microphone of the mobile communication device may therefore pick up not only the voice of the user but also voices other than that of the user, as well as other background sounds such as road noise. Hence, “barging in” or “getting the attention” of a device by speaking a specifically predefined voice command is ineffective by virtue of the limited accuracy of conventional speech recognition.
Additionally, some mobile voice-controlled applications deliver audio content themselves, such as simulated speech (text-to-speech), music or video. Because many devices place the speaker used for delivering the audio content in close proximity to the microphone used for detecting voice commands, it is even more difficult for the device to hear and distinguish the user's voice over audio content it is delivering.
For “barging in” or “getting the attention” of a device via user voice command, the device must be constantly listening to all acoustic signals in the environment and attempting to detect a specifically predefined “barge in” command. Today's mobile devices and the applications that run on such devices perform much of their speech recognition using cloud-based services, so constantly listening can be prohibitively expensive for the user from a data usage standpoint under current mobile data plans offered by cell carriers. There is also a latency associated with transmitting every utterance and sound from the mobile device to the cloud for speech recognition processing, which makes the mobile device inadequately responsive to such spoken barge-in commands. Moreover, “barge-in” commands are useless if the data connection between the mobile device and the network is lost, and constant listening takes a dramatic toll on battery life.
What is needed, therefore, is a system and techniques that allow a user to effectively and reliably interrupt or “get the attention” of an application under a variety of acoustic conditions and in the presence of competing audio signals or noise, and without requiring access to a data network.