Speech recognition has the potential to provide a significant leap in the application of computing technology. One of the barriers to the adoption of speech recognition is its inability to distinguish the relevant spoken commands intended for the computer from the otherwise irrelevant speech common throughout the day, such as passing conversations, muttering, and background chatter. As a result, most speech recognition systems require the user to continually indicate to the computer when to start or stop listening, so that the system does not interpret speech intended for other listeners.
Humans, however, are quite adept at determining what speech is directed at them, and use a number of techniques to guide them in this, such as:
1. Specific keywords (such as our names);
2. Body contact (such as a tap on the shoulder);
3. Proximity of the noise (relative volume); and
4. Visual clues (such as establishing eye contact, or pointing while one is moving their mouth).
In order to provide speech recognition systems with a human-like level of functionality, speech user interfaces have thus far focused on the first two techniques mentioned above. For instance, analogous to item 1 above, many speech recognition engines or units provide the ability to specify an "attention phrase" to wake up the computer and a "sleep" phrase to force an end to speech recognition. Most interface paradigms also provide a "toggle to talk" button, similar to a tap on the shoulder. These approaches alone, however, have limitations. Attention words are often missed, so considerable time can pass before speech recognition is eventually turned on or off. Toggle to talk buttons require user proximity, undermining speech's inherent advantage of operating without having to be in physical contact with the speech recognition system.
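The attention phrase, sleep phrase, and toggle to talk button described above can be pictured as a simple gate in front of the recognizer. The following is only a minimal sketch under stated assumptions: the class name `PhraseGate`, its method names, and the example phrases are hypothetical and are not taken from any particular speech recognition engine, and utterances are assumed to arrive already transcribed as text.

```python
from dataclasses import dataclass, field


@dataclass
class PhraseGate:
    """Hypothetical gate that decides whether an utterance is meant for the computer."""

    attention_phrase: str = "computer wake up"  # assumed example phrase
    sleep_phrase: str = "go to sleep"           # assumed example phrase
    listening: bool = False
    accepted: list = field(default_factory=list)

    def toggle(self) -> None:
        """Model the 'toggle to talk' button: flip the listening state."""
        self.listening = not self.listening

    def hear(self, utterance: str) -> bool:
        """Return True if the utterance is passed on as a command."""
        # Normalize case and whitespace before comparing phrases.
        text = " ".join(utterance.lower().split())
        if not self.listening:
            # While asleep, only the attention phrase has any effect.
            if text == self.attention_phrase:
                self.listening = True
            return False
        if text == self.sleep_phrase:
            self.listening = False
            return False
        self.accepted.append(text)
        return True
```

In this sketch, speech heard before the attention phrase is simply discarded. A missed attention phrase therefore leaves the gate closed until the phrase is repeated, which is the limitation noted above.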
Another problem is the inability of speech recognition systems to home in on a specific audio source location. Recent microphone array research has, however, yielded the ability to home in on a specific audio source location, thus providing the ability to filter extraneous, irrelevant sounds from the input audio stream. For example, using two microphones, one on each side of a speech recognition system (such as on the left and right side of the monitor of a PC-based system), background noise can be eliminated by using the microphone array to narrow in on the words emanating from the user's mouth. The speech recognition algorithm can thus obtain a much cleaner audio source to use, increasing both its accuracy and its robustness in harsh (i.e., real-world) audio environments. A problem with the microphone arrays, however, is that the user rarely sits still, making it difficult to determine the source point to home in on. This is especially so when speech recognition is performed in non-traditional PC uses (such as in a living room to control a television). Worse yet, if the speech recognition is performed via a hand held pad, the microphone itself is also moving.
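One common way a two-microphone array can favor a given source location is delay-and-sum beamforming, sketched below. This is offered only as an illustration of the general idea. The text above does not state which array technique is used, and the function names, the far-field geometry, the whole-sample delays, and the speed-of-sound constant are all assumptions made for this example.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, approximate value in room-temperature air (assumed)


def steering_delay_samples(mic_spacing_m, angle_deg, sample_rate_hz):
    """Whole-sample lag of the left mic behind the right mic for a far-field
    source at angle_deg (0 = straight ahead, positive = toward the right mic)."""
    tau = mic_spacing_m * math.sin(math.radians(angle_deg)) / SPEED_OF_SOUND
    return round(tau * sample_rate_hz)


def shift(signal, n):
    """Delay a signal by n samples (n may be negative), zero-padded, same length."""
    length = len(signal)
    if n >= 0:
        return ([0.0] * n + list(signal))[:length]
    return (list(signal)[-n:] + [0.0] * (-n))[:length]


def delay_and_sum(left, right, left_lag):
    """Delay the right channel by left_lag samples so both channels line up
    for the steered direction, then average them sample by sample."""
    aligned_right = shift(right, left_lag)
    return [(l + r) / 2.0 for l, r in zip(left, aligned_right)]
```

When the steering lag matches the true source position, the two copies of the speech add coherently, while sound from other directions is misaligned and partly averaged out. If the user or the microphones move, the lag must be re-estimated, which is the difficulty described above.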
As described below, the present invention provides a variety of embodiments that address the limitations of speech recognition systems noted above.