The present invention relates to speech recognition, and in particular, to systems and methods for word spotting in speech recognition to create hands-free control of consumer and other electronics products.
Speech recognition systems are used to recognize acoustic inputs, such as spoken words or phrases. Consumer electronic devices have begun embracing speech recognition as a means to command and control the product. Some products, such as Bluetooth headsets with speech interfaces, utilize a voice user interface (VUI) to interact with the product. However, such consumer electronic speech recognition products always require a button press to trigger the recognizer.
It is desirable to have a completely hands-free approach to trigger the recognizer to listen. This would require the recognizer to be on and listening to all spoken words in its environment for the single trigger word that activates the recognizer to begin listening for words to recognize and understand. Several voice-activated light switch products have come to market, but their performance was never good enough to support mass-market volumes.
Hands-free voice triggers can make two types of errors. They can fail to recognize the right command (false reject), or they can mistakenly recognize the wrong command as the right one (false accept). Speech recognition technologies can make tradeoffs between the false accept rate and the false reject rate, but speech technologies for consumer products have generally been unable to reach a happy medium that is acceptable to users.
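The tradeoff between the two error types can be illustrated with a minimal sketch. It assumes, for illustration only, that the recognizer emits a confidence score per utterance and accepts the utterance when the score meets a threshold; the function name `error_rates` and all scores are hypothetical and are not taken from any particular recognizer.

```python
def error_rates(target_scores, nontarget_scores, threshold):
    """Return (false_reject_rate, false_accept_rate) at a given threshold.

    target_scores: hypothetical confidences for utterances that DO contain
    the trigger word; nontarget_scores: confidences for utterances that do not.
    An utterance is accepted when its score is >= threshold.
    """
    false_rejects = sum(1 for s in target_scores if s < threshold)
    false_accepts = sum(1 for s in nontarget_scores if s >= threshold)
    return (false_rejects / len(target_scores),
            false_accepts / len(nontarget_scores))


# Made-up scores, for illustration only.
target_scores = [0.9, 0.8, 0.7, 0.4]
nontarget_scores = [0.1, 0.3, 0.5, 0.6]

for threshold in (0.2, 0.5, 0.8):
    frr, far = error_rates(target_scores, nontarget_scores, threshold)
    print(f"threshold={threshold}: false reject={frr:.2f}, false accept={far:.2f}")
```

Under these assumptions, raising the threshold lowers the false accept rate while raising the false reject rate, and lowering it does the reverse.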
Hands-free voice trigger technology has been particularly difficult to master because it must work in noisy home environments (e.g., TV on, people talking, music playing, babies screaming, etc.). In addition to the noise being loud, the listening device is often quite far from the user, which results in a very low signal-to-noise (S/N) ratio.
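The effect of distance on the S/N ratio can be approximated with a short sketch. It assumes idealized free-field propagation, in which the speech level falls by about 6 dB for each doubling of distance, and a noise level that stays constant at the device; real rooms add reverberation, so this is only a rough guide. The function names and the example levels are hypothetical.

```python
import math


def snr_db(signal_power, noise_power):
    """S/N ratio in decibels from linear signal and noise powers."""
    return 10.0 * math.log10(signal_power / noise_power)


def speech_level_drop_db(distance_m, reference_m=1.0):
    """Free-field drop in speech level (dB) relative to a reference distance.

    Assumes inverse-square-law spreading; ignores room reverberation.
    """
    return 20.0 * math.log10(distance_m / reference_m)


# Hypothetical levels: speech is 5 dB above the noise at 1 m from the talker.
snr_at_1m = 5.0
for distance in (1.0, 2.0, 4.0):
    snr = snr_at_1m - speech_level_drop_db(distance)
    print(f"{distance} m: S/N is about {snr:.1f} dB")
```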
Some prior art techniques perform speech recognition for an internet appliance using a remotely located speech recognition application. For example, speech recognition may be performed on a server, which is accessed by storing and sending the voice data. This approach assumes that it is not cost effective to embed the speech recognizer in the consumer device. It thus requires some sort of manual request for internet access, or recording of the speech for transmission over the internet, and therefore cannot create truly hands-free devices.
In some speech recognition systems, acoustic inputs may be processed against acoustic models of sounds to determine which sounds are in the acoustic input. FIG. 1 illustrates prior art approaches to speech recognition. In FIG. 1, the acoustic input 101 in the top row is illustrated as silence (“sil”) followed by the sound “yi ye yes m”. In this example, it may be desirable for the recognizer to recognize “yes”—the target sound. According to one existing approach for performing recognition, a recognizer is configured with a model (or grammar) that indicates what sounds to recognize. For example, the word to be recognized may be modeled at 102 as “sil” followed by the sounds “y” and “e” and “s” and ending in “sil” (i.e., “yes”). This is referred to as a non-word spotting approach. However, this approach may not provide satisfactory results. As the recognition process receives the acoustic input, it listens for “y”, “e”, “s” and tries to line up the received sounds with the sounds it is looking for as it attempts to recognize the target sound. As this example illustrates, the recognizer may be misled by the input signal and may improperly classify the received sounds, which may result in a very low confidence in the final recognized result, or no recognized result at all. In FIG. 1, the backslash lines represent low confidence recognition, the vertical lines represent moderate confidence recognition, and the forward-slash lines represent high confidence recognition. A recognition process using model 102 may have very low confidence results, with only the “s” being recognized with high confidence, for example.
One approach for recognizing sounds buried in other sounds is referred to as word spotting. In a typical word spotting recognition process, the target sound may be modeled as any sound (or background sounds, “any”) followed by “y” and “e” and “s” and ending in “any”. One challenge with this approach is building an accurate background model for “any”: such a model may need to cover many sounds, including noise, silence, and many spoken sounds, for example. As illustrated in FIG. 1, the background model “any” may perform poorly, and when the target word is spoken (here, “yes”) it may only be recognized with moderate confidence.
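The difference between the non-word spotting grammar and the word spotting grammar can be sketched with a toy example. It treats the input as a list of already-labeled sounds and matches them symbolically, which is a deliberate simplification: an actual recognizer scores acoustic frames against acoustic models and produces confidences rather than exact matches. The segmentation of “yi ye yes m” into individual sounds and the function names `matches_strict` and `spot` are assumptions made for illustration.

```python
def matches_strict(sounds, target):
    """Non-word spotting grammar: 'sil', the target sounds, 'sil', nothing else."""
    return sounds == ["sil"] + target + ["sil"]


def spot(sounds, target):
    """Word spotting grammar: 'any' + target + 'any'.

    The 'any' background absorbs whatever precedes or follows the target.
    Returns the index where the target starts, or -1 if it is not present.
    """
    n = len(target)
    for start in range(len(sounds) - n + 1):
        if sounds[start:start + n] == target:
            return start
    return -1


# Assumed segmentation of the FIG. 1 input: silence, then "yi ye yes m".
acoustic_input = ["sil", "y", "i", "y", "e", "y", "e", "s", "m"]
target = ["y", "e", "s"]

print(matches_strict(acoustic_input, target))  # strict grammar rejects the input
print(spot(acoustic_input, target))            # word spotting locates the target
```

In this sketch the “any” model is a perfect wildcard. The difficulty described above arises because a real background model must be learned from noise, silence, and speech, and is therefore only an approximation.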
Accordingly, it would be advantageous to have a recognition system and method for more accurately performing word spotting recognition.