As is known in the art, computer speech recognition (also known as automatic speech recognition, or ASR) is the process of automatically converting spoken words into text by a computer. Illustrative applications of ASR include speech transcription, speech translation, and voice control of devices. Speech recognition systems operate by matching the acoustics of the input audio against the acoustic signatures of words. These acoustic signatures, also known as acoustic models, are trained using a large amount of training data. Generally, this training data is collected from a large number of different speakers so that the resulting ASR system can recognize audio from a wide range of speakers (a so-called speaker-independent ASR system). It is known that such generic acoustic models, though performing well across a wide range of users, may not perform as well for a given user as an acoustic model trained on just that user's speech. To match the acoustic model to a specific user, in practice, an ASR system may adapt its generic acoustic model using a small amount of audio data from the target speaker to create a speaker-specific acoustic model that performs significantly better for that speaker than the generic acoustic model. This process is referred to as acoustic model adaptation or speaker adaptation.
Acoustic model adaptation can be either supervised or unsupervised. In both cases, the ASR system uses audio files from the target user(s) and corresponding transcriptions. In supervised adaptation, the correctness of the transcription is verified by a human, explicitly or implicitly. In unsupervised adaptation, the system uses a transcription that is generated automatically, without explicit human verification. Because such a transcription may be incorrect, and adapting on an incorrect transcription can degrade performance, minimizing incorrect adaptation is one challenge for unsupervised adaptation.
Today, one application of speech recognition technology is to allow voice commands to “wake up” a “sleeping” device. Some of today's devices, such as smartphones and televisions, are designed to enter a sleep mode to conserve power when not actively used for some period of time. Once such devices go into sleep mode, they must first be “woken up” to perform a task, such as making a call in the case of a smartphone, or showing a particular channel in the case of a television. Traditionally, a device is woken up by pressing a button. In voice-based wakeup, a device can be woken up using a voice command. The advantage of using voice to wake up a device is that the user does not need to physically locate and touch the device. For example, for a television, the user can just say “Wake up TV” and the television wakes up, and then the user can say “Show CNN” without having to power on the television explicitly. In this case, “Wake up TV” is the wakeup phrase.
In a voice-based wakeup task the device, though sleeping, is constantly listening to the ambient audio for a pre-specified phrase or set of wakeup phrases. When the device detects a wakeup phrase, it wakes up and is ready to perform tasks.
There are a number of possible outcomes in a voice-based wakeup system:
1) Correct Accept (CA): the user speaks a wakeup phrase, and the device correctly recognizes it.
2) False Accept (FA): non-wakeup audio is recognized as a wakeup, and the device falsely wakes up.
3) Correct Reject (CR): non-wakeup audio is correctly rejected.
4) False Reject (FR): the system fails to recognize a wakeup request from the user.
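The four outcomes above form a two-by-two table over whether a wakeup phrase was actually spoken and whether the device woke up. The following is a minimal sketch of that mapping, together with the error rates derived from it. The function names `classify_outcome` and `outcome_rates` and the event representation are assumptions made for illustration and do not come from any particular ASR system.

```python
from collections import Counter


def classify_outcome(wakeup_spoken: bool, device_woke: bool) -> str:
    """Map one audio event to CA, FA, CR, or FR (hypothetical helper)."""
    if wakeup_spoken:
        # The user issued a wakeup request: either recognized or missed.
        return "CA" if device_woke else "FR"
    # Non-wakeup (background) audio: either falsely accepted or rejected.
    return "FA" if device_woke else "CR"


def outcome_rates(events):
    """Count outcomes over (wakeup_spoken, device_woke) pairs.

    Returns the counts, the false-reject rate over wakeup events, and
    the false-accept rate over non-wakeup events.
    """
    counts = Counter(classify_outcome(spoken, woke) for spoken, woke in events)
    wakeups = counts["CA"] + counts["FR"]
    background = counts["FA"] + counts["CR"]
    fr_rate = counts["FR"] / wakeups if wakeups else 0.0
    fa_rate = counts["FA"] / background if background else 0.0
    return counts, fr_rate, fa_rate
```

In this sketch the false-reject rate is normalized by the number of wakeup events and the false-accept rate by the number of non-wakeup events. Deployed systems may instead report false accepts per hour of listening, which is a different normalization.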
One of the challenges of voice-based wakeup systems is that the ratio of wakeup audio to background audio can be very small. For example, in a typical scenario, a system can be listening for several hours before a single wakeup is issued. For the single instance of wakeup audio that needs to be detected, there are several hours of background audio that must be rejected. Such systems are therefore tuned to reject aggressively in order to minimize false accepts (FAs): anything that does not closely match the acoustic signature of the wakeup phrase is rejected. However, this can result in high false-reject (FR) rates, especially for non-native users or in noisy conditions, because the user's rendition of the wakeup phrase may not closely match the acoustic signature in the acoustic model.
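The trade-off described above can be illustrated with a small sketch. It assumes, purely for illustration, that the recognizer assigns each audio segment a match score between 0 and 1 and accepts the segment when the score meets a threshold. The scores, the thresholds, and the helper `count_errors` are hypothetical and are not taken from any real system.

```python
def count_errors(scored_segments, threshold):
    """Return (false_accepts, false_rejects) at a given accept threshold.

    scored_segments is a list of (score, is_wakeup) pairs; a segment is
    accepted when its score is at least the threshold.
    """
    false_accepts = sum(
        1 for score, is_wakeup in scored_segments
        if score >= threshold and not is_wakeup
    )
    false_rejects = sum(
        1 for score, is_wakeup in scored_segments
        if score < threshold and is_wakeup
    )
    return false_accepts, false_rejects


# Made-up scores: three wakeup utterances (the third poorly matched, as
# might happen in noise) and six background segments.
segments = [
    (0.92, True), (0.81, True), (0.58, True),
    (0.10, False), (0.35, False), (0.62, False),
    (0.20, False), (0.05, False), (0.44, False),
]

for threshold in (0.3, 0.5, 0.7):
    fa, fr = count_errors(segments, threshold)
    print(f"threshold={threshold}: false accepts={fa}, false rejects={fr}")
```

With these made-up scores, raising the threshold removes the false accepts but eventually rejects the poorly matched wakeup utterance, which is the behavior the paragraph above attributes to aggressive rejection.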
It is known that acoustic model adaptation to the target user yields a significant reduction in FRs. In many current systems using voice-based wakeup, acoustic model adaptation takes place during a supervised user enrollment session. In supervised enrollment, the system prompts the user to speak a particular wakeup phrase a few times (typically three). Using the audio examples provided by the user, the system adapts the recognition models, improving the wakeup performance significantly for that user. This adaptation is supervised in the sense that the user speaks the phrase prompted by the system. (In addition, an automatic rejection scheme will prevent the system from triggering on non-speech events.)
However, a supervised enrollment method such as this has various limitations. For example, it requires explicit user interaction with the device, which may not be preferred by all users. On some devices, the required interface may not be present. In addition, supervised enrollment is feasible only for a small set of phrases. Enrolling many phrases to obtain the gain from speaker adaptation on each of them may be relatively user-unfriendly and time consuming. Further, supervised enrollment often happens in a single session and captures only a single acoustic environment, and the gains are greatest for matched acoustic conditions. That is, if a user enrolled using a specific prosody, or in specific noise conditions, then the enrolled models will not perform as well under mismatched conditions. For example, if the enrollment happened in a quiet environment and the user tries to wake the system in a noisy car, the wakeup may not work as well as it would in a quiet, clean environment. Supervised enrollment may also be clumsy when multiple users need to be enrolled, such as for a TV, where multiple family members may use the system.