An automatic speech recognition (ASR) system determines a semantic meaning of input speech. Typically, the input speech is processed into a sequence of digital speech feature frames. Each speech feature frame can be thought of as a multi-dimensional vector that represents various characteristics of the speech signal present during a short time window of the speech. In a continuous recognition system, variable numbers of speech frames are organized as “utterances” representing a period of speech followed by a pause, which in real life loosely corresponds to a spoken sentence or phrase.
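The framing step described above can be sketched as follows. This is a minimal illustration only, not the front end of any particular recognizer: the function names `split_into_frames` and `frame_features`, the window and hop sizes, and the two toy features (mean energy and zero-crossing count) are assumptions made for the example. Production systems typically compute richer spectral features for each window.

```python
from typing import List, Sequence


def split_into_frames(samples: Sequence[float], frame_len: int, hop: int) -> List[List[float]]:
    """Slice a sampled signal into fixed-length, possibly overlapping windows."""
    if frame_len <= 0 or hop <= 0:
        raise ValueError("frame_len and hop must be positive")
    frames = []
    # Only full windows are kept; a trailing partial window is dropped.
    for start in range(0, len(samples) - frame_len + 1, hop):
        frames.append(list(samples[start:start + frame_len]))
    return frames


def frame_features(frame: Sequence[float]) -> List[float]:
    """Toy 2-dimensional feature vector (hypothetical): mean energy and zero-crossing count."""
    energy = sum(x * x for x in frame) / len(frame)
    zero_crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    return [energy, float(zero_crossings)]


if __name__ == "__main__":
    signal = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
    for frame in split_into_frames(signal, frame_len=4, hop=2):
        print(frame_features(frame))
```

Each returned feature list plays the role of one multi-dimensional speech feature frame; an utterance is then simply a variable-length list of such frames.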
The ASR system compares the input speech frames to a database of statistical models to find the models that best match the speech feature characteristics and determine a corresponding representative text or semantic meaning associated with the models. Modern statistical models are state sequence models such as hidden Markov models (HMMs) that model speech sounds (usually phonemes) using mixtures of Gaussian distributions. Often these statistical models represent phonemes in specific contexts, referred to as PELs (Phonetic Elements), e.g. triphones or phonemes with known left and/or right contexts. State sequence models can be scaled up to represent words as connected sequences of acoustically modeled phonemes, and phrases or sentences as connected sequences of words. When the statistical models are organized together as words, phrases, and sentences, additional language-related information is also typically incorporated into the models in the form of language modeling.
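As a rough illustration of how a single feature frame might be scored against one HMM state modeled by a mixture of Gaussians, the sketch below computes the log-likelihood of a frame under a diagonal-covariance mixture. The diagonal-covariance restriction, the function names, and the parameter layout are assumptions made for the example; the text above does not specify them.

```python
import math
from typing import Sequence


def log_gaussian_diag(x: Sequence[float], mean: Sequence[float], var: Sequence[float]) -> float:
    """Log density of x under a Gaussian with diagonal covariance (assumed form)."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def log_gmm(x, weights, means, variances) -> float:
    """Log-likelihood of frame x under a Gaussian mixture, via log-sum-exp for stability."""
    terms = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


if __name__ == "__main__":
    weights = [0.6, 0.4]
    means = [[0.0, 0.0], [3.0, 3.0]]
    variances = [[1.0, 1.0], [2.0, 2.0]]
    print(log_gmm([0.2, -0.1], weights, means, variances))
```

A decoder would combine such per-frame state scores with HMM transition probabilities and language model scores to rank competing state sequences.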
The words or phrases associated with the best matching model structures are referred to as recognition candidates or hypotheses. A system may produce a single best recognition candidate—the recognition result—or a list of several hypotheses, referred to as an N-best list. Further details regarding continuous speech recognition are provided in U.S. Pat. No. 5,794,189, entitled “Continuous Speech Recognition,” and U.S. Pat. No. 6,167,377, entitled “Speech Recognition Language Models,” the contents of which are incorporated herein by reference.
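The relationship between a single recognition result and an N-best list can be illustrated with a short sketch. It assumes, purely for the example, that each hypothesis is a pair of text and a combined log score where higher is better; the function name `n_best` and the sample hypotheses are hypothetical.

```python
import heapq
from typing import List, Tuple

Hypothesis = Tuple[str, float]  # (text, combined log score); layout assumed for illustration


def n_best(hypotheses: List[Hypothesis], n: int) -> List[Hypothesis]:
    """Return the n highest-scoring hypotheses, best first."""
    return heapq.nlargest(n, hypotheses, key=lambda h: h[1])


if __name__ == "__main__":
    candidates = [
        ("wreck a nice beach", -14.0),
        ("recognize speech", -12.5),
        ("recognise peach", -20.1),
    ]
    print(n_best(candidates, 1))  # the single recognition result
    print(n_best(candidates, 2))  # a 2-best list
```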
There are various established techniques for specializing the acoustic models of a speaker-independent speech recognizer to the speech characteristics of a single speaker, a certain group of speakers, or a specific acoustic channel. Well-known and popular acoustic model parameter adaptation methods include linear-transform-based approaches such as maximum likelihood linear regression (MLLR), and model-space approaches such as maximum a posteriori (MAP) estimation and discriminative acoustic model parameter refinement methods. See M. J. F. Gales, Maximum Likelihood Linear Transformations For HMM-Based Speech Recognition, Technical Report TR. 291, Cambridge University, 1997; J.-L. Gauvain, and C.-H. Lee, Maximum A Posteriori Estimation for Multivariate Gaussian Mixture Observations of Markov Chains, IEEE Transactions on Speech and Audio Processing, 1994; D. Povey, P. C. Woodland, and M. J. F. Gales, Discriminative MAP for Acoustic Model Adaptation, ICASSP, 2003; all of which are incorporated herein by reference. These methods are used in many speech recognition applications to improve performance for a specific speaker.
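To make the idea of MAP adaptation more concrete, the sketch below applies the commonly used MAP update for a single Gaussian mean, in which the adapted mean interpolates between the speaker-independent prior mean and the occupancy-weighted average of the speaker's frames. The function name, the prior weight `tau`, and the per-frame occupancies are assumptions made for the example; variance and mixture-weight updates are omitted.

```python
from typing import List, Sequence


def map_adapt_mean(
    prior_mean: Sequence[float],
    frames: Sequence[Sequence[float]],
    occupancies: Sequence[float],
    tau: float,
) -> List[float]:
    """MAP estimate of one Gaussian mean.

    adapted = (tau * prior_mean + sum_t gamma_t * x_t) / (tau + sum_t gamma_t)
    where gamma_t is the occupancy of this Gaussian for frame x_t and tau > 0
    controls how strongly the prior is trusted (hypothetical parameter name).
    """
    dim = len(prior_mean)
    weighted_sum = [0.0] * dim
    total_occ = 0.0
    for frame, gamma in zip(frames, occupancies):
        total_occ += gamma
        for d in range(dim):
            weighted_sum[d] += gamma * frame[d]
    return [
        (tau * prior_mean[d] + weighted_sum[d]) / (tau + total_occ)
        for d in range(dim)
    ]


if __name__ == "__main__":
    speaker_frames = [[2.0, 1.0]] * 10
    print(map_adapt_mean([0.0, 0.0], speaker_frames, [1.0] * 10, tau=10.0))
```

With little adaptation data the result stays close to the prior mean, and with more data it moves toward the speaker's own average, which is why this style of update suits incremental speaker adaptation.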
In online speech recognition applications such as command and control, dictation, and voice search, an acoustic model can be cumulatively adapted for a particular speaker based on speech samples obtained during multiple sessions with the speaker. The adaptation may include accumulating adaptation statistics after each utterance recognition, based on the speech input of the utterance and the corresponding recognition result. An adaptation transform may be updated after every M utterance recognitions using T seconds' worth of recognition statistics. (See, e.g., U.S. Patent Publication 2008/0004876, which is incorporated herein by reference.) The model can be modified during the session or after the session is terminated. Upon termination of the session, the modified model is stored in association with an identification of the speaker. During subsequent remote sessions, the speaker is identified and the modified acoustic model is then used to recognize the speaker's speech. (See, e.g., U.S. Pat. No. 6,766,295, which is incorporated herein by reference.)
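The update schedule described above, in which statistics are accumulated after each utterance and the transform is re-estimated every M utterances from T seconds' worth of statistics, might be organized along the following lines. This is a sketch under several assumptions that are not taken from the cited publications: the class name `OnlineAdapter` is hypothetical, the "transform" is reduced to a single scalar bias, the per-utterance statistic is a sum of residuals between observed features and model means, and the T-second window is taken to cover the most recent utterances.

```python
from collections import deque


class OnlineAdapter:
    """Accumulates per-utterance statistics and refreshes a toy transform every M utterances."""

    def __init__(self, update_every_m: int, window_seconds_t: float) -> None:
        self.m = update_every_m
        self.t = window_seconds_t
        self.history = deque()  # entries: (duration_s, residual_sum, frame_count)
        self.utterances_since_update = 0
        self.bias = 0.0  # stand-in for a real adaptation transform (assumption)

    def observe(self, duration_s: float, residual_sum: float, frame_count: int) -> bool:
        """Record one recognized utterance; return True if the transform was updated."""
        self.history.append((duration_s, residual_sum, frame_count))
        # Drop the oldest utterances while the remainder still covers at least T seconds.
        total = sum(d for d, _, _ in self.history)
        while len(self.history) > 1 and total - self.history[0][0] >= self.t:
            total -= self.history.popleft()[0]
        self.utterances_since_update += 1
        if self.utterances_since_update >= self.m:
            self._update()
            self.utterances_since_update = 0
            return True
        return False

    def _update(self) -> None:
        """Re-estimate the bias as the mean residual over the retained window."""
        frames = sum(n for _, _, n in self.history)
        if frames:
            self.bias = sum(r for _, r, _ in self.history) / frames


if __name__ == "__main__":
    adapter = OnlineAdapter(update_every_m=2, window_seconds_t=10.0)
    for stats in [(4.0, 8.0, 4), (4.0, 0.0, 4), (4.0, 4.0, 4), (4.0, 4.0, 4)]:
        updated = adapter.observe(*stats)
        print(updated, adapter.bias)
```

In a deployed system the retained statistics and the resulting transform would be stored against the speaker's identity at the end of the session, so that a later session can resume from them.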
In state-of-the-art speech transcription systems that perform offline (batch mode) speech recognition, it is common practice to apply acoustic model adaptation techniques to improve recognition accuracy, but there are problems in implementing such adaptation techniques in large-scale real-time server-based speech recognition. For example, acoustic model adaptation cannot be applied in a fully unconstrained manner because keeping millions of acoustic models available at low switching time is infeasible. In addition, it is not feasible in large-scale real-time server-based speech recognition to keep millions of sets of user-dependent acoustic model adaptation statistics available or to re-estimate the user-dependent statistics after each application usage.