I. Technical Field
Embodiments described herein relate to isolated word training and detection.
II. Background Art
For mobile phones and the wearables market, an ultra-low power stand-by mode is critical. At the same time, these devices are increasingly becoming always-on sensor hubs, collecting data from multiple device sensors to constantly monitor and determine location, temperature, orientation, etc. A wake-up feature, by which a device that has been placed in a very low power stand-by mode may be woken up by the utterance of a “wake-up” phrase, lends itself nicely to this environment: it too is an always-on feature, and it enables the user to wake up the device and enable further device capabilities via a voice interface, thus avoiding the need for cumbersome keyboard or touch interfaces. Some devices already have this wake-up phrase detection feature. For example, the Galaxy S4® by Samsung may be woken by the utterance of the phrase “Hello Galaxy”. Certain Google devices may be woken by the utterance of “OK Google Now”. However, each solution comes with a pre-loaded set of wake-up phrases, and users do not have the capability to use their own phrase. An investigation of discussion forums regarding the popularity of the wake-up feature found that most users wanted the ability to use their own phrase.
From a support standpoint, pre-loaded phrases pose other issues. Different device vendors may require their own unique wake-up phrases. Devices shipped to different countries or regions of the world will need to support wake-up phrases in various languages. The underlying speech recognition engine is typically Hidden Markov Model (HMM) based and requires hundreds of example utterances from dozens of speakers in order to properly train the model. FIG. 1 illustrates a diagram of a typical training flow 100 for modeling a word or phrase. Supporting a user-selected wake-up phrase would require the user to tediously record from thirty to one hundred example training utterances 102 of a word or phrase for typical offline HMM training 104 to generate a final HMM word model 106. This is impractical, and hence current solutions employ pre-loaded phrase support only. Since the wake-up phrases requested by different device vendors are not known ahead of time, each time there is a new device vendor to support, dozens of speakers must be recruited to make the recordings necessary to perform the offline training and obtain the HMM models. In addition, the logistics become even more complex if the phrase is in a foreign language.