ASR technologies enable microphone-equipped computing devices to interpret speech and thereby provide an alternative to conventional human-to-computer input devices such as keyboards or keypads. A typical ASR system includes several basic elements. A microphone and an acoustic interface receive an utterance of a word from a user and digitize the utterance into acoustic data. An acoustic pre-processor parses the acoustic data into information-bearing acoustic features. A decoder uses acoustic models to decode the acoustic features into utterance hypotheses. The decoder generates a confidence value for each hypothesis, reflecting the degree to which that hypothesis phonetically matches a subword of the utterance, and uses the confidence values to select a best hypothesis for each subword. Using language models, the decoder concatenates the subwords into an output word corresponding to the user-uttered word.
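The selection and concatenation steps performed by the decoder can be illustrated with a minimal Python sketch. It is an illustration only, under stated assumptions: the `Hypothesis` type, the `select_best` and `decode_word` functions, the 0-to-1 confidence scale, and the sample values are all hypothetical and do not come from any particular ASR system. The sketch also joins the best subwords directly, whereas the decoder described above uses language models for that step.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hypothesis:
    """One candidate decoding of a subword (hypothetical type)."""
    subword: str
    confidence: float  # assumed scale: 0.0 (poor match) to 1.0 (close match)


def select_best(hypotheses: List[Hypothesis]) -> Hypothesis:
    """Return the hypothesis with the highest confidence value."""
    return max(hypotheses, key=lambda h: h.confidence)


def decode_word(per_subword: List[List[Hypothesis]]) -> str:
    """Pick the best hypothesis for each subword and join the results.

    Assumption: plain string concatenation stands in for the
    language-model-driven concatenation of a real decoder.
    """
    return "".join(select_best(h).subword for h in per_subword)


# Made-up candidates for a two-subword utterance.
candidates = [
    [Hypothesis("di", 0.81), Hypothesis("ti", 0.42)],
    [Hypothesis("al", 0.77), Hypothesis("ol", 0.35)],
]
word = decode_word(candidates)
```

With these sample values, the highest-confidence hypotheses are "di" and "al", so `word` is "dial". A transient noise event that lowered the confidence of a correct hypothesis below that of a competitor would produce a substitution in the output word.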
One problem encountered with ASR is that input audio contains not only the speech utterances of a user but also undesirable noise. Such noise can include ambient noise, like continuous vehicle road noise, and transient noise, like that from windshield wiper operation or non-speech vocalizations such as coughing. Receipt of such transient noise by an ASR system may lead to rejection errors, where speech cannot be recognized, or to insertion or substitution errors in the acoustic data that lead to misrecognition of speech. This problem is even more prevalent with speakers of tonal languages, like Mandarin, where digits are uttered as single syllables. For instance, when driving at highway speeds with windows open, wind buffeting at a vehicle microphone causes transient noise so severe that Mandarin digit dialing is nearly impossible.