Computing devices are commonly used to help users perform any of a variety of desired tasks. In some cases, the computing devices control equipment that is able to perform the desired task. For example, some computing devices are configured to turn a light switch on or off, adjust playback on an audio device, initiate a call on a mobile handset, adjust the temperature of an air conditioning unit, and the like. Voice control of such computing devices may be particularly helpful and convenient because it allows a user to perform a task without having to use his or her hands or physically operate elements of a user interface (e.g., switches, keyboards, buttons, etc.). In some cases, computing devices with voice control listen for one or more keywords before performing the desired task. For example, a computing device may listen for the keywords “lights on” or “lights off” to turn a light switch on or off, “play song” to activate an audio player, “call” to initiate a phone call, “increase temperature” or “decrease temperature” to control an air conditioning unit, and the like.
Generally, such computing devices identify a keyword using various models that include information relevant to the control of particular devices. Such models can include a keyword model and a background model. The keyword model can include a sequence of one or more states (e.g., hidden Markov model (HMM) states, etc.) that together represent the keyword. Comparing an utterance with the keyword model (e.g., by aligning feature vectors derived from the utterance with the states of the keyword model, or by another process, as described below) yields a score that represents how likely it is that the utterance corresponds to the keyword. Similarly, the background model can include a sequence of one or more states that together represent words other than the keyword, as described further below. Comparing the utterance with the background model (e.g., by aligning feature vectors derived from the utterance with the states of the background model, or by another process, as described below) yields a score that represents how likely it is that the utterance corresponds to a generic word. The computing device can compare the two scores to determine whether the keyword was spoken.
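As a non-limiting illustration, the two-score comparison described above might be sketched as follows. The sketch is not drawn from any particular implementation. It assumes that each state is a diagonal-covariance Gaussian, that states are traversed left to right with transition probabilities ignored, and that the keyword is accepted when the difference between the two log scores exceeds a threshold. The function names (`log_gauss`, `viterbi_score`, `is_keyword`), the toy one-dimensional feature vectors, and the threshold value are all hypothetical.

```python
import math

def log_gauss(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def viterbi_score(frames, states):
    """Best-path log score of frames aligned to a left-to-right state sequence.

    Each state is a (mean, var) pair. Transition probabilities are ignored,
    which is a simplifying assumption made only for this illustration.
    The alignment must start in the first state and end in the last state.
    """
    neg_inf = float("-inf")
    n = len(states)
    # prev[j]: best score of any alignment that ends in state j at the previous frame
    prev = [neg_inf] * n
    prev[0] = log_gauss(frames[0], *states[0])
    for x in frames[1:]:
        cur = [neg_inf] * n
        for j in range(n):
            stay = prev[j]
            advance = prev[j - 1] if j > 0 else neg_inf
            best = max(stay, advance)
            if best > neg_inf:
                cur[j] = best + log_gauss(x, *states[j])
        prev = cur
    return prev[-1]

def is_keyword(frames, keyword_states, background_states, threshold=0.0):
    """Accept the keyword when its score beats the background score by a margin."""
    keyword_score = viterbi_score(frames, keyword_states)
    background_score = viterbi_score(frames, background_states)
    return (keyword_score - background_score) > threshold

# Toy models (hypothetical values): a two-state keyword and a one-state, broad background.
KEYWORD_STATES = [([0.0], [1.0]), ([5.0], [1.0])]
BACKGROUND_STATES = [([2.5], [25.0])]

# Toy utterances: one that follows the keyword's state sequence and one that does not.
keyword_like = [[0.1], [-0.2], [4.9], [5.2]]
generic_like = [[2.5], [2.4], [2.6], [2.5]]

print(is_keyword(keyword_like, KEYWORD_STATES, BACKGROUND_STATES))
print(is_keyword(generic_like, KEYWORD_STATES, BACKGROUND_STATES))
```

In this sketch the background model is deliberately broad, so it assigns a moderate score to almost any utterance, while the keyword model assigns a high score only to utterances that match its state sequence. The sketch also assumes that an utterance contains at least as many frames as the model has states.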
In some cases, this approach adequately identifies keywords. For example, a word that is clearly different from the keyword is unlikely to be identified as the keyword because the degree of similarity between the word and the keyword is likely to be less than the degree of similarity between the word and a generic word. However, in other cases, the computing device will falsely identify certain words as the keyword. For example, a word that is acoustically similar to the keyword might be erroneously identified as the keyword when the degree of similarity between the word and the keyword is greater than the degree of similarity between the word and a generic word.