In recent years, voice command-and-control has become a popular feature on electronic devices such as smartphones, tablets, media streaming devices, and smart speakers. Generally speaking, this feature allows a user to interact with the device in a hands-free manner in order to access information and/or to control operation of the device. For example, according to one implementation, the user can say a predefined trigger phrase, immediately followed by a query or command phrase. The device will typically be listening for the predefined trigger phrase (using, e.g., conventional phrase spotting/speech recognition techniques) in an always-on, low-power modality. Upon detecting an utterance of the trigger phrase, the device can cause the query or command phrase that follows to be processed, either locally on the device or remotely in the cloud. The device can then cause an appropriate action to be performed based on the content of the query or command phrase and can return a response to the user.
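By way of illustration only, the trigger-then-command interaction described above can be sketched as follows. This minimal sketch assumes the audio has already been transcribed to text, so it omits the always-on, low-power audio listening itself. The trigger phrase "hello device" and the function name `extract_command` are hypothetical and are not taken from any particular system.

```python
from typing import Optional

TRIGGER_PHRASE = "hello device"  # hypothetical trigger phrase, for illustration only


def extract_command(transcript: str, trigger: str = TRIGGER_PHRASE) -> Optional[str]:
    """Return the phrase that follows the trigger phrase.

    Returns None if the trigger phrase is absent or nothing follows it.
    """
    words = transcript.lower().split()
    trigger_words = trigger.lower().split()
    n = len(trigger_words)
    # Slide a window over the transcript looking for the trigger words.
    for i in range(len(words) - n + 1):
        if words[i:i + n] == trigger_words:
            command = " ".join(words[i + n:])
            return command or None
    return None
```

In such a sketch, the returned command phrase would then be handed to whatever local or cloud-based component processes queries and commands.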
One limitation of existing voice command-and-control systems is that they rely solely on audio information to detect the trigger phrase, and thus can be confused by background noise, multiple individuals speaking simultaneously, and other factors. This, in turn, can cause such systems to generate a significant number of false accepts and/or false rejects over time. A “false accept” in this context occurs when the trigger phrase is detected although it has not been uttered, and a “false reject” occurs when the trigger phrase is not detected although it has been uttered. Accordingly, it would be desirable to have techniques that improve the accuracy of voice command-and-control.
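For clarity, the two error types defined above can be expressed as a simple tally. The sketch below assumes each trial is recorded as a pair of booleans (whether the trigger phrase was actually uttered, and whether the system reported a detection). The function name `count_errors` and this trial format are hypothetical.

```python
from typing import Iterable, Tuple


def count_errors(trials: Iterable[Tuple[bool, bool]]) -> Tuple[int, int]:
    """Count errors over trials given as (uttered, detected) pairs.

    Returns a tuple (false_accepts, false_rejects).
    """
    false_accepts = 0
    false_rejects = 0
    for uttered, detected in trials:
        if detected and not uttered:
            # Trigger phrase detected although it was not uttered.
            false_accepts += 1
        elif uttered and not detected:
            # Trigger phrase uttered but not detected.
            false_rejects += 1
    return false_accepts, false_rejects
```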