Generally, speech recognition systems are highly sensitive to background noise. Such noise can include background speakers, keyboard strikes and ringing phones, as well as breathiness from the actual speaker. Without the use of a close-talking microphone (i.e., a headset or handheld microphone, wherein the microphone is placed close to the mouth) the problem may become unmanageable, and even when a close-talking microphone is used, problems are still possible.
Conventional efforts have tended to lie on two fronts, namely, microphone design and speech recognition confidence scoring. For instance, “beamforming” microphones for hands-free systems have been designed to separate desired speech from interference based on the direction of acoustic arrival. Here, signals arriving from directions or areas other than the “main beam”, or the area from which the voice originates, are suppressed. IBM has such a product, which provides hands-free desktop array microphone input to the ViaVoice speech recognizer (Millennium Pro Elite). The Philips corporation also has similar products (see [http://]www.research.philips.com/password/pw4/pw4—10b.html).
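By way of illustration only, the direction-based suppression described above can be sketched with a simple delay-and-sum beamformer. This is a generic textbook technique and is not asserted to be the method used in any of the products mentioned; the function names, the four-microphone geometry, the integer sample delays and the test tone are all assumptions made for the example.

```python
import math

def delay_and_sum(channels, delays):
    """Align each microphone channel by its integer sample delay, then average.

    channels: list of equal-length sample lists, one per microphone (assumed layout).
    delays:   per-microphone arrival delay, in samples, for the main-beam direction.
    """
    n = len(channels[0])
    out = [0.0] * n
    for samples, d in zip(channels, delays):
        for t in range(n):
            src = t + d  # advance the channel to undo its arrival delay
            if 0 <= src < n:
                out[t] += samples[src]
    return [v / len(channels) for v in out]

def rms(x):
    """Root-mean-square level of a sample list."""
    return math.sqrt(sum(v * v for v in x) / len(x))

def simulate(period, arrival_delays, n=400):
    """Hypothetical test signal: a sine that reaches each microphone after its delay."""
    return [[math.sin(2 * math.pi * (t - d) / period) for t in range(n)]
            for d in arrival_delays]

steer = [0, 2, 4, 6]                     # assumed delays for the main beam
on_axis = simulate(16, steer)            # source located in the main beam
off_axis = simulate(16, [0, 6, 12, 18])  # source arriving from another direction

print("on-axis level: ", rms(delay_and_sum(on_axis, steer)))
print("off-axis level:", rms(delay_and_sum(off_axis, steer)))
```

In this sketch the on-axis channels add coherently once aligned, while the off-axis channels are left a quarter period apart and largely cancel. Real arrays must contend with broadband speech, fractional delays and reverberation, so suppression is imperfect in practice.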
Much work has also been done on speech recognition confidence scoring, including the publication “Estimating Confidence Using Word Lattices”, Proceedings of Eurospeech, Rhodes, September 1997, pp. 827-830. Such approaches, however, do not take into account the strength of the audio signal.
The aim here is to merge the two ideas, which requires communication between what the signal processor observes and what the acoustic models do with the signal. Problems have often been encountered in connection with the conventional efforts described above. For instance, among the limitations of array microphone design is the sensitivity of any corresponding acoustic models that are used, since even very low energy signals often exhibit a strong acoustic model match. Thus, though robust acoustic models are certainly desirable for low recognition error, it is often the case that acoustic models, when presented with low energy signals (e.g., as the result of imperfect suppression by the beamforming array or close-talking microphone), will produce nonsensical words. The acoustic models also tend to model “disfluencies”, which can swallow noises such as lip smacks, breath, or other non-speech events. Low-level background speech and constant-frequency hums from machinery such as air conditioners do not fall under the disfluency category. Thus, these types of noises will also tend to produce nonsensical words.
Accordingly, in view of the foregoing, a need has been recognized in connection with improving the performance of speech recognition in noisy environments.