Speech recognition in the current art is used as an input for computational and electronic devices to free the user's hands, so that keyboards and other input devices are not necessarily required. Speech as an input modality has been increasing in popularity and is often deployed in electronic devices to reduce the number of keyboard, mouse, or touch events required to perform an action. In many cases, the ability to use speech as an input greatly reduces the amount of alternate input required; an example would be speaking a search query instead of typing it. In other cases, the amount of speech required to perform an action is long, and users do not like interacting with computational and electronic devices in this way. An example would be stringing multiple commands together by voice, such as asking the system to select a specific object and perform an action with it, or to create something and tell the system where to create it.
The problems with managing and efficiently handling multiple modalities of user input into devices and systems increase significantly when a user is managing multiple complex object types and menu and/or command hierarchies while interacting with complex systems that may contain large interactive displays, multi-user inputs, and busy collaborative environments.
Traditionally, methods in the prior art utilize a wake word, such as is used in auto-assistants and computer-driven voice command systems. The wake word creates a waking trigger event so that the system captures the audio dialog that follows the wake word and then acts on it by parsing and identifying commands that are relevant to the device. The use of a “wake word” to trigger the start of speech recognition adds an additional word to speak that is not relevant to the actions required, which adds overhead to the interactive workflow the user wants to accomplish.
Speech input also has limitations when it comes to additional context for a command. A user might use speech to invoke a command, but what the command should be applied to, or where the end result should be, is either not present or the system needs to have additional input to properly satisfy the intent of the user.
Computers and command-based systems, such as in-vehicle GPS and audio systems, require a touch event to tell the device that it should expect further touch commands and/or voice commands. In speech deployments where physical buttons or software interface buttons are used to initiate automatic speech recognition (ASR), this type of trigger does not lend itself to large displays or to multitasking environments, because such buttons are difficult to reach when interacting with large interactive surfaces and do not permit a trigger to be placed anywhere on the graphical user interface. The systems in the prior art typically have preassigned touch buttons to trigger the touch speech interaction, which limits the flexibility to allow touch speech interactions in dynamic graphical and multitasking environments.
A drawback that may be present in both scenarios is that a triggering event is needed to wake the device so that it listens, initializes, and then looks for commands. This reduces the utility and efficiency of prior-art devices in handling and anticipating complex multimodal commands that occur in dynamic environments with single or multiple users of complex interactive systems.
Patent Application No. US20020077830 A1 describes a process for activating speech recognition in a terminal, and includes automatically activating speech recognition when the terminal is used, and turning the speech recognition off after a time period has elapsed after activation. The process also takes the context of the terminal into account when the terminal is activated and defines a subset of allowable voice commands which correspond to the current context of the device.
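By way of illustration only, the behavior summarized above (automatic activation on use, deactivation after a time period, and a context-dependent subset of allowable commands) may be sketched as follows. This is not code from the cited application. The class name TimedRecognizer, the 5-second timeout, and the contexts and command words in ALLOWED_COMMANDS are all hypothetical and chosen only for the example.

```python
import time

# Hypothetical mapping from terminal context to allowable voice commands.
# The cited application does not enumerate contexts or commands.
ALLOWED_COMMANDS = {
    "phonebook": {"call", "delete", "edit"},
    "messages": {"reply", "delete", "forward"},
}


class TimedRecognizer:
    """Listens only for a fixed period after the terminal is used."""

    def __init__(self, timeout_s=5.0, clock=time.monotonic):
        self.timeout_s = timeout_s      # assumed value, not from the source
        self.clock = clock              # injectable so the timeout can be tested
        self.context = None
        self.activated_at = None

    def on_terminal_used(self, context):
        # Any use of the terminal activates recognition and records the context.
        self.context = context
        self.activated_at = self.clock()

    def is_listening(self):
        # Recognition turns itself off once the time period has elapsed.
        if self.activated_at is None:
            return False
        return self.clock() - self.activated_at < self.timeout_s

    def handle(self, word):
        # Accept a word only while listening and only if the current
        # context allows it; otherwise return None.
        if not self.is_listening():
            return None
        allowed = ALLOWED_COMMANDS.get(self.context, set())
        return word if word in allowed else None
```

In this sketch a spoken word is ignored both when the time period has elapsed and when the word falls outside the subset for the current context, so the user must use the terminal again before speech is accepted.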
Patent Application No. US20100312547 A1 describes techniques and systems for implementing contextual voice commands. On a device, a data item in a first context is displayed. On the device, a physical input (selecting the displayed data item in the first context) is received. On the device, a voice input that relates the selected data item to an operation in a second context is received. The operation is performed on the selected data item in the second context.
Patent Application No. US20140222436 A1 discloses a method for operating a voice trigger. In some implementations, the method is performed at an electronic device including one or more processors and memory storing instructions for execution by the one or more processors. The method includes receiving a sound input. The sound input may correspond to a spoken word or phrase, or a portion thereof. The method includes determining whether at least a portion of the sound input corresponds to a predetermined type of sound, such as a human voice. The method includes, upon a determination that at least a portion of the sound input corresponds to the predetermined type, determining whether the sound input includes predetermined content, such as a predetermined trigger word or phrase.
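By way of illustration only, the two-stage determination summarized above may be sketched as follows: the sound input is first checked for a predetermined type of sound, and only then checked for predetermined content. This is not code from the cited application. The mean-energy test is a crude stand-in for a real human-voice detector, the transcript field stands in for the output of a recognizer, and the names SoundInput, looks_like_voice, contains_trigger, and voice_trigger, the energy threshold, and the trigger phrase "hello device" are all hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class SoundInput:
    samples: Sequence[float]  # audio samples, assumed normalized to [-1, 1]
    transcript: str           # stand-in for what a recognizer would produce


def looks_like_voice(sound, energy_threshold=0.01):
    # Stage 1 (placeholder): treat the input as the predetermined type of
    # sound when its mean energy exceeds a threshold. A real detector would
    # use spectral or model-based features.
    if not sound.samples:
        return False
    energy = sum(s * s for s in sound.samples) / len(sound.samples)
    return energy > energy_threshold


def contains_trigger(sound, trigger="hello device"):
    # Stage 2: check for the predetermined trigger phrase.
    return trigger in sound.transcript.lower()


def voice_trigger(sound):
    # The content check runs only after the type check succeeds.
    if not looks_like_voice(sound):
        return False
    return contains_trigger(sound)
```

Ordering the checks this way means the content check runs only on input that has already passed the type check.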
The present invention is intended to overcome one or more of the problems discussed above.