Currently available speech recognition systems determine the beginning and end of utterances by responding only to the presence and absence of acoustic energy having a spectrum associated with the utterances. If a microphone associated with the speech recognition system is in an acoustically noisy environment including, for example, speakers other than the speaker whose voice is to be recognized, or operating machinery, including telephones (particularly ringing telephones), the noise limits system performance. Such a speech recognition system attempts to correlate the acoustic noise with words it has learned for a particular speaker, with the result that the system produces an output unrelated to any utterance of the speaker whose voice is to be recognized. In addition, the speech recognition system may respond to the acoustic noise in a manner that adversely affects its speech learning capabilities.
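By way of illustration only, the energy-only endpoint detection described above might be sketched as follows. This is a minimal sketch rather than the implementation of any particular system: the function names, frame length, threshold and synthetic test tones are all assumptions introduced here, and the sketch omits the spectral filtering a practical system would apply, which in any case would not reject in-band noise such as a second talker.

```python
import math

def frame_energies(samples, frame_len):
    """Mean squared amplitude of each non-overlapping frame."""
    return [
        sum(x * x for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]

def detect_utterances(samples, frame_len=160, threshold=0.01, min_frames=3):
    """Return (start_frame, end_frame) pairs, end exclusive, where the
    frame energy stays at or above the threshold for min_frames or more.

    The decision uses acoustic energy alone, so it cannot distinguish
    the intended speaker from any other sufficiently loud sound.
    """
    energies = frame_energies(samples, frame_len)
    segments = []
    start = None
    for idx, energy in enumerate(energies):
        if energy >= threshold:
            if start is None:
                start = idx          # energy appeared: candidate start
        elif start is not None:
            if idx - start >= min_frames:
                segments.append((start, idx))  # energy vanished: end
            start = None
    if start is not None and len(energies) - start >= min_frames:
        segments.append((start, len(energies)))
    return segments

def tone(freq_hz, n, amplitude, rate=8000):
    """Hypothetical test signal: a sine tone of n samples."""
    return [amplitude * math.sin(2 * math.pi * freq_hz * t / rate)
            for t in range(n)]

silence = [0.0] * 800
speech = tone(200, 1600, 0.5)    # stand-in for the intended utterance
ringing = tone(440, 1600, 0.5)   # stand-in for a ringing telephone
signal = silence + speech + silence + ringing + silence

segments = detect_utterances(signal)
print(segments)
```

With these assumed values the detector reports two segments, one for the simulated utterance and one for the simulated ring, so the noise would be passed on to the recognizer as if it were speech.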
We are aware that the prior art has addressed the problems associated with an acoustically noisy environment by detecting both the acoustic energy and the facial characteristics of a speaker whose voice is to be recognized. For example, Maekawa et al., U.S. Pat. No. 5,884,257, and Stork et al., U.S. Pat. No. 5,621,858, disclose voice recognition systems that respond to acoustic energy of a speaker, as well as to facial characteristics associated with utterances by the speaker. In Maekawa et al., lip movement is detected by a visual system including a light source and a light detector. The system includes a speech period detector which derives a speech period signal by detecting the strength and duration of the movement of the speaker's lips. The system also includes a voice recognition system and an overall judgment section which determines the content of an utterance based on the acoustic energy in the utterance and the movement of the lips of the speaker. In Stork et al., lip, nose and chin movements are detected by a video camera. A spectrum analyzer responsive to acoustic energy and a position vector generator responsive to the video camera supply output signals to a speech classifier trained to recognize a limited set of speech utterances based on those output signals.
In both Maekawa et al. and Stork et al., complete speech recognition is performed in parallel with image recognition. Consequently, the speech recognition processes of these prior art devices would appear to be somewhat slow and complex, and to require a significant amount of power, such that the devices do not appear to be particularly well-suited for use as remote control devices for controlling equipment.