Automatic speech recognition (ASR) systems try to determine a representative meaning (e.g., text) corresponding to speech inputs. Typically, the speech input is processed into a sequence of digital frames, which are multi-dimensional vectors that represent various characteristics of the speech signal present during a short time window of the speech. In a continuous speech recognition system, variable numbers of frames are organized as “utterances” representing a period of speech followed by a pause, which in real life loosely corresponds to a spoken sentence or phrase. The ASR system compares the input utterances to statistical acoustic models to find those that best match the vector sequence characteristics, and determines the representative text associated with those acoustic models. More formally, given some input observations A, the probability that some string of words W was spoken is represented as P(W|A), where the ASR system attempts to determine the most likely word string:
$$\hat{W} = \arg\max_{W} P(W \mid A)$$
Given a system of statistical acoustic models, this formula can be re-expressed as:
$$\hat{W} = \arg\max_{W} P(W)\, P(A \mid W)$$
where P(A|W) corresponds to the acoustic models and P(W) represents the value of a statistical language model reflecting the probability of a given word in the recognition vocabulary occurring.
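As a minimal sketch of this decision rule, the following Python snippet selects the word string that maximizes P(W)P(A|W), working in the log domain. The candidate strings and their probabilities are invented for illustration, and `best_hypothesis` is a hypothetical helper name rather than part of any particular recognizer.

```python
import math


def best_hypothesis(candidates):
    """Return the word string W maximizing log P(W) + log P(A|W).

    `candidates` maps each word string to a pair
    (language-model probability P(W), acoustic likelihood P(A|W)).
    """
    def log_score(item):
        _, (p_w, p_a_given_w) = item
        # Sum logs instead of multiplying small probabilities to avoid underflow.
        return math.log(p_w) + math.log(p_a_given_w)

    word_string, _ = max(candidates.items(), key=log_score)
    return word_string


# Invented scores for a single utterance (assumption, not measured data).
candidates = {
    "recognize speech": (0.020, 1.0e-5),
    "wreck a nice beach": (0.001, 3.0e-5),
}
print(best_hypothesis(candidates))
```

With these illustrative numbers, the acoustic likelihood alone favors the second string, but the language model prior tips the combined score toward the first.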
The acoustic models are typically probabilistic state sequence models such as hidden Markov models (HMMs) that model speech sounds using mixtures of probability distribution functions (Gaussians). Acoustic models often represent phonemes in specific contexts, referred to as PELs (Phonetic Elements), e.g., triphones or phonemes with known left and/or right contexts. State sequence models can be scaled up to represent words as connected sequences of acoustically modeled phonemes, and phrases or sentences as connected sequences of words. When the models are organized together as words, phrases, and sentences, additional language-related information is also typically incorporated into the models in the form of a statistical language model.
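To make the idea of an HMM with Gaussian-mixture state distributions concrete, the sketch below scores frames against per-state mixtures and finds the best state sequence with Viterbi search, a standard HMM decoding method that the text above does not itself name. The diagonal-covariance form, the two-state toy model, all parameter values, and the function names are assumptions chosen for illustration only.

```python
import math

NEG_INF = float("-inf")


def log_gaussian(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at frame x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def log_gmm(x, mixture):
    """Log likelihood of frame x under a list of (weight, mean, var) components."""
    terms = [math.log(w) + log_gaussian(x, m, v) for w, m, v in mixture]
    top = max(terms)
    # Log-sum-exp keeps the mixture sum numerically stable.
    return top + math.log(sum(math.exp(t - top) for t in terms))


def viterbi(frames, log_init, log_trans, mixtures):
    """Return the most likely HMM state sequence for the given frames."""
    n = len(mixtures)
    score = [log_init[s] + log_gmm(frames[0], mixtures[s]) for s in range(n)]
    backptrs = []
    for x in frames[1:]:
        new_score, ptr = [], []
        for s in range(n):
            # Best predecessor state for arriving in state s at this frame.
            prev = max(range(n), key=lambda p: score[p] + log_trans[p][s])
            ptr.append(prev)
            new_score.append(score[prev] + log_trans[prev][s] + log_gmm(x, mixtures[s]))
        score = new_score
        backptrs.append(ptr)
    # Trace back from the best final state.
    state = max(range(n), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(backptrs):
        state = ptr[state]
        path.append(state)
    return path[::-1]


# Toy left-to-right model with one-dimensional frames (all values invented).
mixtures = [
    [(1.0, [0.0], [1.0])],                        # state 0: single Gaussian
    [(0.5, [5.0], [1.0]), (0.5, [6.0], [1.0])],   # state 1: two-component mixture
]
log_init = [0.0, NEG_INF]                          # always start in state 0
log_trans = [[math.log(0.6), math.log(0.4)],
             [NEG_INF, 0.0]]                       # state 1 cannot return to state 0
frames = [[0.1], [-0.2], [4.8], [5.3]]
path = viterbi(frames, log_init, log_trans, mixtures)
print(path)
```

For this toy input the decoded path stays in state 0 for the first two frames and moves to state 1 for the last two. A practical recognizer would use many more states, higher-dimensional frames, and trained parameters.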
The words or phrases associated with the best matching model structures are referred to as recognition candidates or hypotheses. A system may produce a single best recognition candidate—the recognition result—or multiple recognition hypotheses in various forms such as an N-best list, a recognition lattice, or a confusion network. Further details regarding continuous speech recognition are provided in U.S. Pat. No. 5,794,189, entitled “Continuous Speech Recognition,” and U.S. Pat. No. 6,167,377, entitled “Speech Recognition Language Models,” the contents of which are incorporated herein by reference.
Some ASR systems pre-process the input speech frames (observation vectors) to account for channel effects and noise, for example, using explicit models of noise, channel distortion, and their interaction with speech. Many interesting and effective approximate modeling and inference techniques have been developed to represent these acoustic entities and the reasonably well understood but complicated interactions between them. While many results show the promise of these techniques on less sophisticated systems trained on small amounts of artificially mixed data, there has been little evidence that these techniques can improve state-of-the-art large-vocabulary ASR systems.
There are a number of fundamental challenges to designing noise-robust ASR systems. Efficient modeling and inference are needed that balance the trade-off between computational complexity and performance. System modeling also needs to be robust, improving ASR performance in noisy conditions without degrading performance in clean (low-noise) conditions. Robust adaptation is also desired that improves system performance in noise conditions not seen during system training.
Dynamic noise adaptation (DNA) is a model-based technique for improving ASR performance in the presence of noise. See Rennie et al., Dynamic Noise Adaptation, Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing 2006, 14-19 May 2006; Rennie and Dognin, Beyond Linear Transforms: Efficient Non-Linear Dynamic Adaptation For Noise Robust Speech Recognition, in Proceedings of the 9th International Conference of Interspeech 2008, Brisbane, Australia, Sep. 23-26, 2008; Rennie et al., Robust Speech Recognition Using Dynamic Noise Adaptation, in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 2011, Prague, Czech Republic, May 22-27, 2011; all incorporated herein by reference. DNA is designed to compensate for mismatch between training and testing conditions, and DNA has recently been shown to improve the performance of even commercial-grade ASR systems trained on large amounts of data. However, new investigations with yet more data and yet stronger baseline systems have revealed that conventional DNA can sometimes harm ASR performance, especially when the existing noise conditions are well characterized by the back-end acoustic models. Such issues could be mitigated by applying the model-based approach to the recognizer itself and training acoustic models of speech that recover a canonical representation of speech, together with a noise model, which could be adapted. But this paradigm is not yet fully mature.