The recognition performance of automatic speech recognition systems can seriously degrade in noisy acoustic environments. One source of degradation is environmental sounds that are mistaken for speech sounds, leading to recognition errors such as so-called insertions. Some of these insertions can be prevented by training specific non-speech models for various environmental sounds (such as the slamming of doors or the barking of dogs) and running these models in parallel with the actual speech and silence models during recognition. Another technique is to train so-called garbage models from generic speech. This allows garbage models to reject not only some non-speech events but also out-of-vocabulary speech.
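As a purely illustrative sketch of running non-speech and garbage models in parallel with the speech and silence models, the following toy example scores each audio segment against every model and discards segments for which a rejecting model scores best. The single scalar feature, the Gaussian parameters, and all names (`MODELS`, `REJECT`, `classify_segment`, `keep_for_decoding`) are assumptions made for illustration only; practical recognizers typically use multi-dimensional features and far richer acoustic models.

```python
import math


def gaussian_log_likelihood(frames, mean, var):
    """Sum of per-frame log-likelihoods under a 1-D Gaussian."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)
        for x in frames
    )


# Hypothetical toy models: (mean, variance) of one scalar feature per frame.
MODELS = {
    "speech": (5.0, 1.0),
    "silence": (0.0, 0.5),
    "door_slam": (12.0, 2.0),  # specific non-speech model
    "garbage": (8.0, 9.0),     # broad model meant to absorb everything else
}

# Segments won by any of these models are not passed on as speech.
REJECT = {"silence", "door_slam", "garbage"}


def classify_segment(frames):
    """Return the name of the model with the highest log-likelihood."""
    scores = {
        name: gaussian_log_likelihood(frames, mean, var)
        for name, (mean, var) in MODELS.items()
    }
    return max(scores, key=scores.get)


def keep_for_decoding(segments):
    """Keep only segments whose best-scoring model is not a rejecting one."""
    return [seg for seg in segments if classify_segment(seg) not in REJECT]
```

In this sketch, a segment is rejected only when a non-speech or garbage model outscores the speech model, so any sound whose features resemble the speech model is still passed on to decoding.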
However, both of the above approaches may lose their effectiveness in situations where the acoustic environment includes another person who is speaking in-vocabulary words in the background. This scenario occurs, for example, when a dictation system is used on a mobile device.