There has long been a desire to have machines capable of responding to human speech, such as machines capable of obeying human commands and machines capable of transcribing human dictation. Such machines would greatly increase the speed and ease with which people communicate with computers and the speed and ease with which they record and organize their words and thoughts.
Due to recent advances in computer technology and speech recognition algorithms, speech recognition machines have begun to appear in the past several decades, and have become increasingly powerful and less expensive. For example, the assignee of the present application has publicly demonstrated speech recognition software which runs on popular personal computers and which requires little extra hardware. This system is capable of providing speaker-dependent, discrete word recognition for vocabularies of up to two thousand words at any one time, and many of its features are described in U.S. patent application Ser. No. 797,249. This prior application (hereinafter referred to as application Ser. No. 797,249), which is entitled "Speech Recognition Apparatus and Method", is assigned to the assignee of the present application and is incorporated herein by reference.
One of the problems encountered in most speech recognition systems is that of varying levels of background noise. Many speech recognition systems determine which portion of an audio signal contains speech to be recognized by using speech detection apparatus, such as the speech detection apparatus described in the above-mentioned application Ser. No. 797,249. Many such speech detecting apparatuses compare the amplitude of an audio signal with amplitude thresholds to detect the start or end of an utterance to be recognized. Such methods work well when there is little background sound, or where the background sound is relatively constant in amplitude. But if the amplitude of the background sound rises or falls relative to the level for which the start of utterance and end of utterance thresholds are set, the system is likely to make mistakes in detecting the beginning and end of utterances.
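The threshold comparison described above can be illustrated with a minimal sketch. It is not the detection apparatus of application Ser. No. 797,249; the function name `detect_utterance`, the per-frame amplitude lists, and the threshold values are all hypothetical and chosen only for illustration.

```python
def detect_utterance(amplitudes, start_threshold, end_threshold):
    """Return (start, end) frame indices of the first detected utterance.

    Hypothetical fixed-threshold detector: the utterance starts at the
    first frame whose amplitude reaches start_threshold and ends at the
    first later frame whose amplitude falls below end_threshold. If no
    end is found, the utterance runs to the end of the signal. Returns
    None when no frame reaches start_threshold.
    """
    start = None
    for i, amplitude in enumerate(amplitudes):
        if start is None:
            if amplitude >= start_threshold:
                start = i
        elif amplitude < end_threshold:
            return (start, i)
    if start is None:
        return None
    return (start, len(amplitudes))


# Illustrative frame amplitudes: a short word spoken over quiet background.
quiet = [2, 2, 3, 40, 55, 50, 3, 2]

# The same word with the background level raised by 25 units
# (a simplification: amplitudes are simply summed).
noisy = [a + 25 for a in quiet]

print(detect_utterance(quiet, start_threshold=20, end_threshold=10))  # (3, 6)
print(detect_utterance(noisy, start_threshold=20, end_threshold=10))  # (0, 8)
```

With the quiet background the thresholds bracket the word correctly. Once the background rises above both thresholds, the same settings mark the entire signal as one utterance, which is the kind of mistake described above.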
Changes in background sound also tend to decrease the reliability of speech recognition itself. Many speech recognition systems, such as that described in application Ser. No. 797,249, recognize words by comparing them to acoustic models of vocabulary words or of parts of vocabulary words. Such acoustic models usually contain information about the amplitude of the sounds they represent. Since background sounds are added to speech sounds which are spoken over them, changes in the background sound change the amplitudes of sounds heard by the recognizer during speech, and thus can decrease the accuracy with which the recognizer matches speech sounds against the amplitude descriptions contained in the acoustic models.
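The effect on matching can be sketched with a toy example. The two word models, the sum-of-absolute-differences score, and the direct addition of background amplitude to speech amplitude are all simplifying assumptions made for illustration; they are not the acoustic models or scoring of application Ser. No. 797,249.

```python
def amplitude_distance(observed, model):
    """Sum of absolute amplitude differences, frame by frame (toy score)."""
    return sum(abs(o - m) for o, m in zip(observed, model))


def recognize(observed, models):
    """Return the name of the model closest to the observed amplitudes."""
    return min(models, key=lambda name: amplitude_distance(observed, models[name]))


# Hypothetical amplitude templates for two vocabulary words.
models = {
    "A": [10, 40, 60, 30],
    "B": [30, 55, 70, 45],
}

# The speaker says word "A" exactly as modeled.
speech = [10, 40, 60, 30]

# With no background sound the recognizer hears the speech unchanged.
heard_quiet = [s + 0 for s in speech]

# With a background level of 15 added to every frame, the heard
# amplitudes drift away from model "A" and toward model "B".
heard_noisy = [s + 15 for s in speech]

print(recognize(heard_quiet, models))  # A
print(recognize(heard_noisy, models))  # B
```

In this sketch the added background moves the heard amplitudes 60 units from model "A" but only 10 units from model "B", so the same spoken word is matched to the wrong model.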