Although speech recognition has been around for decades, the quality of speech recognition software and hardware has only recently reached a high enough level to appeal to a large number of consumers. One area in which speech recognition has become very popular in recent years is the smartphone and tablet computer industry. Using a speech recognition-enabled device, a consumer can perform such tasks as making phone calls, writing emails, and navigating with GPS using only voice commands.
Speech recognition in such devices is far from perfect, however. When using a speech recognition-enabled device for the first time, the user may need to “train” the speech recognition software to recognize his or her voice. Even after training, the speech recognition functions may not work well in all sound environments. For example, the presence of background noise can decrease speech recognition accuracy.
In an always-on audio (AOA) system, a speech recognition-enabled device continuously listens for the occurrence of a trigger phrase, which is also referred to as a “hotword”. The trigger phrase, when detected, alerts the device that the user is about to issue a voice command or a sequence of voice commands, which are then processed by a speech recognition engine in the device. By continuously listening for the trigger phrase, the system frees the user from having to manually signal to the device that the voice command mode is being entered, eliminating the need for an action such as pressing a physical button or a virtual button or control via the device touch screen.
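The listening behavior described above can be pictured as a two-state machine. The following is only a minimal sketch under stated assumptions: utterances arrive as already-transcribed strings (a stand-in for an acoustic trigger phrase detector), each trigger is followed by a single command rather than a sequence, and the names `Mode`, `route_utterances`, and the phrase "ok device" are hypothetical and not taken from any particular device.

```python
from enum import Enum, auto


class Mode(Enum):
    LISTENING = auto()  # waiting for the trigger phrase
    COMMAND = auto()    # trigger phrase heard; the next utterance is a command


def route_utterances(utterances, trigger_phrase="ok device"):
    """Return the utterances that would be passed to the speech recognition engine.

    Assumption: text strings stand in for audio, and the trigger phrase is
    matched by a case-insensitive comparison.
    """
    mode = Mode.LISTENING
    commands = []
    for text in utterances:
        if mode is Mode.LISTENING:
            # Everything except the trigger phrase is ignored in this state.
            if text.strip().lower() == trigger_phrase:
                mode = Mode.COMMAND
        else:
            # Hand the utterance to the recognizer, then resume listening.
            commands.append(text)
            mode = Mode.LISTENING
    return commands
```

In this sketch, speech that is not preceded by the trigger phrase never reaches the recognizer, which is what removes the need for a button press to enter the voice command mode.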
In the AOA system, it is advantageous for the user to train the trigger phrase recognizer for the user's voice. This allows the trigger phrase recognizer to adapt the trigger phrase recognition models to the user's voice, thus improving the trigger phrase recognizer accuracy, and also to employ speaker recognition to help reject the trigger phrase when it is spoken by a person other than the user. For these advantages to be realized, the user must go through an enrollment process to adapt the trigger phrase model to the user's voice. The enrollment process, in an example, involves the user being prompted to say the trigger phrase multiple times (e.g., three times) while in an acoustically quiet environment. The three utterances of the trigger phrase, captured by a microphone in the device, are digitally sampled and used for trigger phrase model training. For the training to yield high quality trigger phrase models tailored to the user's voice, the three recordings of the trigger phrase made by the user in the enrollment process should ideally have a low background noise level with preferably stationary (i.e., not fluctuating with respect to time) characteristics, and should not include tongue clicks, device handling noise, or other spurious non-speech sounds, such as pops or clicks. If the enrollment recordings of the trigger phrase do not satisfy the above requirements, the trigger phrase models adapted to the user will be of poor quality, resulting in degraded trigger phrase recognition accuracy.
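The recording requirements above (a low and stationary background noise level, with no pops or clicks) can be illustrated with a simple frame-energy check. This is only a sketch under stated assumptions, not a method described in the text: it assumes a noise-only segment of the recording is available as floating-point samples in the range -1.0 to 1.0, it measures stationarity as the spread between the loudest and quietest frame, and the function names and the thresholds of -50 dB and 6 dB are hypothetical values chosen for illustration.

```python
import math


def frame_levels_db(samples, frame_len=160):
    """Return the RMS level of each full frame, in dB relative to full scale (1.0)."""
    levels = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        mean_square = sum(x * x for x in frame) / frame_len
        # The floor avoids log10(0) on digital silence.
        levels.append(10.0 * math.log10(max(mean_square, 1e-12)))
    return levels


def check_background(noise_samples, frame_len=160,
                     max_level_db=-50.0, max_spread_db=6.0):
    """Report whether a noise-only segment looks quiet and stationary.

    Thresholds are illustrative assumptions, not values from the source.
    """
    levels = frame_levels_db(noise_samples, frame_len)
    if not levels:
        raise ValueError("need at least one full frame of samples")
    mean_level = sum(levels) / len(levels)
    # A pop, click, or fluctuating noise shows up as a large level spread.
    spread = max(levels) - min(levels)
    return {
        "mean_level_db": mean_level,
        "spread_db": spread,
        "quiet": mean_level <= max_level_db,
        "stationary": spread <= max_spread_db,
    }
```

A recording whose segment fails either check could be rejected and the user prompted to repeat the trigger phrase, although the choice of measure and thresholds would depend on the device and microphone.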