Humans may engage in human-to-computer dialogs with interactive software applications referred to herein as “automated assistants” (also referred to as “digital agents,” “chatbots,” “interactive personal assistants,” “intelligent personal assistants,” “assistant applications,” “conversational agents,” etc.). For example, humans (which when they interact with automated assistants may be referred to as “users”) may provide commands and/or requests to an automated assistant using spoken natural language input (i.e., utterances), which may in some cases be converted into text and then processed, and/or by providing textual (e.g., typed) natural language input. An automated assistant responds to a request by providing responsive user interface output, which can include audible and/or visual user interface output.
As mentioned above, many automated assistants are configured to be interacted with via spoken utterances. To preserve user privacy and/or to conserve resources, a user must often explicitly invoke an automated assistant before the automated assistant will fully process a spoken utterance. The explicit invocation of an automated assistant typically occurs in response to certain user interface input being received at a client device. The client device includes an assistant interface that provides, to a user of the client device, an interface for interfacing with the automated assistant (e.g., receives spoken and/or typed input from the user, and provides audible and/or graphical responses), and that interfaces with one or more additional components that implement the automated assistant (e.g., remote server device(s) that process user inputs and generate appropriate responses).
Some user interface inputs that can invoke an automated assistant via a client device include a hardware and/or virtual button at the client device for invoking the automated assistant (e.g., a tap of a hardware button, a selection of a graphical interface element displayed by the client device). Many automated assistants can additionally or alternatively be invoked in response to one or more spoken invocation phrases, which are also known as “hot words/phrases” or “trigger words/phrases”. For example, a spoken invocation phrase such as “Hey Assistant,” “OK Assistant”, and/or “Assistant” can be spoken to invoke an automated assistant.
Often, a client device that includes an assistant interface includes one or more locally stored models that the client device utilizes to monitor for an occurrence of a spoken invocation phrase. Such a client device can locally process received audio data utilizing the locally stored model, and discards any audio data that does not include the spoken invocation phrase. However, when local processing of received audio data indicates an occurrence of a spoken invocation phrase, the client device will then cause that audio data and/or following audio data to be further processed by the automated assistant. For instance, if a spoken invocation phrase is “Hey, Assistant”, and a user speaks “Hey, Assistant, what time is it”, audio data corresponding to “what time is it” can be processed by an automated assistant based on detection of “Hey, Assistant”, and utilized to provide an automated assistant response of the current time. If, on the other hand, the user simply speaks “what time is it” (without first speaking an invocation phrase), no response from the automated assistant will be provided as a result of “what time is it” not being preceded by an invocation phrase.
Although models exist for monitoring for occurrence of a spoken invocation phrase, many such models suffer from one or more drawbacks. For example, some models may perform poorly in environments with strong background noise (e.g., noise from a television, from playing music, from other conversations). For instance, some models may lack desired robustness and/or accuracy in the presence of background noise. This can lead to a failure to detect an actually spoken invocation phrase and/or can lead to errant detection of an invocation phrase.