The reality of a speech-enabled home or other environment—that is, one in which a user need only speak a query or command out loud and a computer-based system will field and answer the query and/or cause the command to be performed—is upon us. A speech-enabled environment (e.g., home, workplace, school, etc.) can be implemented using a network of connected microphone devices distributed throughout the various rooms or areas of the environment. Through such a network of microphones, a user has the power to orally query the system from essentially anywhere in the environment without the need to have a computer or other device in front of him/her or even nearby. For example, while cooking in the kitchen, a user might ask the system “how many milliliters in three cups?” and, in response, receive an answer from the system, e.g., in the form of synthesized voice output. Alternatively, a user might ask the system questions such as “when does my nearest gas station close,” or, upon preparing to leave the house, “should I wear a coat today?”
Further, a user may pose a query to the system, and/or issue a command, that relates to the user's personal information. For example, a user might ask the system “when is my meeting with John?” or command the system “remind me to call John when I get back home.”
In a speech-enabled environment, a user's manner of interacting with the system is designed to be primarily, if not exclusively, by means of voice input. Consequently, a system that potentially picks up all utterances made in the environment, including those not directed to the system, must have some way of discerning when any given utterance is directed at the system as opposed, e.g., to being directed at an individual present in the environment. One way to accomplish this is to use a “hotword” (also referred to as an “attention word” or “voice action initiation command”), which by agreement is reserved as a predetermined term that is spoken to invoke the attention of the system.
In one example environment, the hotword used to invoke the system's attention is the word “Google.” Consequently, each time the word “Google” is spoken, it is picked up by one of the microphones and conveyed to the system, which performs speech recognition techniques to determine whether the hotword was spoken and, if so, awaits an ensuing command or query. Accordingly, utterances directed at the system take the general form [HOTWORD] [QUERY], where “HOTWORD” in this example is “Google” and “QUERY” can be any question, command, declaration, or other request that can be speech recognized, parsed and acted on by the system, either alone or in conjunction with a server over a network.
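The [HOTWORD] [QUERY] form can be illustrated with a minimal sketch. It assumes the utterance has already been converted to a text transcript and that the hotword, if present, is the first word; both are simplifications for illustration, and a real system may instead detect the hotword directly in the audio. The function name extract_query and the constant HOTWORD are hypothetical.

```python
import re
from typing import Optional

# Example hotword taken from the text; any reserved term could be used.
HOTWORD = "google"


def extract_query(transcript: str, hotword: str = HOTWORD) -> Optional[str]:
    """Split a transcript of the form [HOTWORD] [QUERY].

    Returns the query text if the transcript begins with the hotword,
    or None if the utterance does not appear to be directed at the system.
    Assumption: the hotword must be the first word of the transcript.
    """
    words = transcript.strip().split()
    if not words:
        return None
    # Strip punctuation such as a trailing comma, then compare case-insensitively.
    first = re.sub(r"[^\w]", "", words[0]).lower()
    if first != hotword.lower():
        return None
    # Everything after the hotword is treated as the query (may be empty).
    return " ".join(words[1:])
```

Under these assumptions, extract_query("Google, when is my meeting with John?") yields the query text, while an utterance such as "Did you google it?" is ignored because the hotword does not lead the utterance. An empty query (the hotword alone) would correspond to the system awaiting an ensuing command.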