Voice interfaces determine whether audio input includes a speech command and how to behave in response. For instance, a person may use the voice interface of a virtual assistant on a smartphone to find out the weather by verbally asking: “What's the weather like today?” The virtual assistant receives and analyzes this question and returns an answer. This process is resource intensive, so such processing is sometimes performed at a server that is remote from the device that initially received the utterance. Although offloading the processing to a server conserves computing resources at the device, it also adds undesirable delay. In addition, relying on a server for the processing causes problems when the server is unreachable, such as when a mobile device loses its network connection.
Wake word detection modules are sometimes implemented as a gate in front of additional processing components. Such a gate prevents audio input from being provided to an automatic speech recognizer, which generates speech recognition results, unless a wake word is detected.
US 2018/0012593 describes a detection model used to determine whether a wake word has been uttered. The detection model uses features derived from an audio signal and contextual information to generate a detection score. If the detection score exceeds a threshold, automatic speech recognition and natural language understanding modules are activated so that the speech processing system can generate speech recognition results.
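The score-and-threshold gating described above can be illustrated with a minimal Python sketch. It is not the implementation of the cited publication: the class name `WakeWordGate`, the toy scoring function, the context key `recent_interaction`, and the threshold value are all hypothetical and chosen only to show the control flow.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class WakeWordGate:
    """Hypothetical gate: runs downstream processing only above a threshold."""
    # Assumed scoring function: maps (audio features, context) to a score in [0, 1].
    score_fn: Callable[[Sequence[float], dict], float]
    # Assumed stand-in for the ASR and NLU modules that the gate activates.
    downstream: Callable[[Sequence[float]], str]
    threshold: float = 0.5

    def process(self, features: Sequence[float], context: dict) -> Optional[str]:
        score = self.score_fn(features, context)
        if score > self.threshold:
            # Score exceeds the threshold: activate downstream processing.
            return self.downstream(features)
        # Otherwise the audio never reaches the recognizer.
        return None


def toy_score(features: Sequence[float], context: dict) -> float:
    # Toy model (an assumption): mean of the features, nudged upward
    # when the context suggests the user interacted recently.
    base = sum(features) / len(features)
    bonus = 0.1 if context.get("recent_interaction") else 0.0
    return min(1.0, base + bonus)


def toy_downstream(features: Sequence[float]) -> str:
    # Placeholder for speech recognition and language understanding.
    return "recognized speech"


gate = WakeWordGate(toy_score, toy_downstream, threshold=0.6)
low = gate.process([0.2, 0.4], {})
high = gate.process([0.7, 0.9], {})
borderline_plain = gate.process([0.5, 0.6], {})
borderline_context = gate.process([0.5, 0.6], {"recent_interaction": True})
```

In this sketch the same borderline features are rejected without context and accepted with it, which is the role contextual information plays in the detection score.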
U.S. Pat. No. 9,098,467 describes a voice-controlled device that operates in at least two states. In a first state, a microphone captures sound, and the sound is processed by an automatic speech recognition component. The results of automatic speech recognition are, in turn, compared to a wake word. If a wake word is detected, the device transitions into a second state in which the device provides audio signals to a network-based computing platform that identifies commands from the speech indicated by the audio signals using automatic speech recognition.
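The two-state behavior described above can be sketched as a small state machine. This is an illustrative sketch only, not the patented implementation: the names `State`, `VoiceDevice`, `local_asr`, and `send_remote` are hypothetical, the wake word "computer" is invented, and the substring match against the transcript is a simplifying assumption. The sketch also assumes that the chunk containing the wake word is not itself forwarded, and it does not model a return to the first state, because the description above does not specify one.

```python
from enum import Enum, auto
from typing import Callable, List


class State(Enum):
    LOCAL_LISTENING = auto()   # first state: on-device ASR checks for the wake word
    REMOTE_STREAMING = auto()  # second state: audio is sent to the network platform


class VoiceDevice:
    """Hypothetical two-state device driven by incoming audio chunks."""

    def __init__(
        self,
        local_asr: Callable[[bytes], str],
        send_remote: Callable[[bytes], None],
        wake_word: str,
    ) -> None:
        self.local_asr = local_asr
        self.send_remote = send_remote
        self.wake_word = wake_word.lower()
        self.state = State.LOCAL_LISTENING

    def on_audio(self, chunk: bytes) -> None:
        if self.state is State.LOCAL_LISTENING:
            # Compare the local recognition result to the wake word.
            transcript = self.local_asr(chunk).lower()
            if self.wake_word in transcript:
                self.state = State.REMOTE_STREAMING
        else:
            # Second state: forward audio for network-based recognition.
            self.send_remote(chunk)


# Toy stand-ins (assumptions): "audio" is UTF-8 text, and the remote
# platform is a list that records what was sent.
sent: List[bytes] = []
device = VoiceDevice(
    local_asr=lambda chunk: chunk.decode("utf-8"),
    send_remote=sent.append,
    wake_word="computer",
)
device.on_audio(b"some background chatter")
device.on_audio(b"hey computer")
device.on_audio(b"what's the weather like today")
```

After these three calls, only the final chunk has been forwarded, and the device remains in the second state.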