Speech recognition is an important aspect of furthering man-machine interaction. The end goal in developing speech recognition systems is to replace the keyboard interface to computers with voice input. To this end, several systems have been developed; however, these systems typically concentrate on improving the transcription error rate on relatively clean data in a controlled and steady-state environment, that is, one in which the speaker speaks relatively clearly in a quiet environment. Though these are not impractical assumptions for applications such as transcribing dictation, there are several real-world situations in which they are not valid, i.e., the ambient conditions are noisy, change rapidly, or both. Because research in speech recognition ultimately aims at the universal use of speech-recognition systems in real-world situations (e.g., information kiosks, transcription of broadcast shows, etc.), it is necessary to develop speech-recognition systems that operate under these non-ideal conditions. For instance, in the case of broadcast shows, segments of speech from the anchor and the correspondents (which are either relatively clean or have music playing in the background) are interspersed with music and with interviews with people (possibly over a telephone, and possibly under noisy conditions).
A speech recognition system designed to decode clean speech could be used to decode these different classes of data, but it would yield a very high error rate when transcribing every data class other than clean speech. For instance, if this system were used to decode a segment of pure music, it would produce a string of words even though the input contains no speech, leading to a high insertion error rate. One way to address this problem is to use a "mumble-word" model in the speech recognizer. This mumble-word model is designed to match the noise-like portions of the acoustic input, and hence can eliminate some of the insertion errors. However, the performance improvement obtained with this technique is limited.
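The notion of an insertion error can be made concrete with a small scoring sketch. The Python code below is an illustration only, not part of any system described here: it aligns a reference transcript with a recognizer output by minimum edit distance and counts substitutions, deletions, and insertions. The function names, the example word strings, and the `<mumble>` token are all hypothetical, and dropping that token before scoring is an assumed convention for how a mumble-word output might be handled.

```python
def count_errors(reference, hypothesis):
    """Return (substitutions, deletions, insertions) for two word lists,
    taken from a minimum-edit-distance alignment."""
    n, m = len(reference), len(hypothesis)
    # cost[i][j] = (total, subs, dels, ins) for reference[:i] vs hypothesis[:j]
    cost = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):
        cost[i][0] = (i, 0, i, 0)      # all reference words deleted
    for j in range(1, m + 1):
        cost[0][j] = (j, 0, 0, j)      # all hypothesis words inserted
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t, s, d, k = cost[i - 1][j - 1]
            if reference[i - 1] == hypothesis[j - 1]:
                diag = (t, s, d, k)            # match, no error
            else:
                diag = (t + 1, s + 1, d, k)    # substitution
            t, s, d, k = cost[i - 1][j]
            dele = (t + 1, s, d + 1, k)        # deletion
            t, s, d, k = cost[i][j - 1]
            ins = (t + 1, s, d, k + 1)         # insertion
            cost[i][j] = min(diag, dele, ins)  # lowest total error wins
    return cost[n][m][1:]


def drop_mumble(words, token="<mumble>"):
    """Assumed convention: mumble-word outputs are removed before scoring."""
    return [w for w in words if w != token]


# A pure-music segment has an empty reference transcript.
music_reference = []
clean_speech_output = "the of a and".split()       # hypothetical output
mumble_output = ["<mumble>", "<mumble>"]           # hypothetical output

print(count_errors(music_reference, clean_speech_output))
print(count_errors(music_reference, drop_mumble(mumble_output)))
```

With an empty reference, every word the recognizer emits is counted as an insertion, which is why a music segment decoded by a clean-speech system inflates the insertion error rate; output that is absorbed by the mumble token contributes no errors under the assumed convention.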