An automatic speech recognition (ASR) system determines a semantic meaning of input speech. Typically, the input speech is processed into a sequence of digital speech feature frames. Each speech feature frame can be thought of as a multi-dimensional vector that represents various characteristics of the speech signal present during a short time window of the speech. In a continuous recognition system, variable numbers of speech frames are organized as “utterances” representing a period of speech followed by a pause, which in real life loosely corresponds to a spoken sentence or phrase.
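The framing step described above can be sketched as follows. This is a minimal illustration, not the front end of any particular recognizer: the function names `frame_signal` and `toy_features`, the window and hop sizes, and the two-dimensional feature vector (log energy and zero-crossing rate) are all assumptions chosen for brevity. Production systems typically compute richer features such as cepstral coefficients.

```python
import math
from typing import List, Sequence


def frame_signal(samples: Sequence[float], frame_len: int, hop: int) -> List[List[float]]:
    """Split a sampled signal into overlapping short-time windows.

    frame_len and hop are given in samples; both are hypothetical
    parameters for this sketch.
    """
    if frame_len < 1 or hop < 1:
        raise ValueError("frame_len and hop must be positive")
    frames = []
    # Only full-length windows are kept; a trailing partial window is dropped.
    for start in range(0, len(samples) - frame_len + 1, hop):
        frames.append(list(samples[start:start + frame_len]))
    return frames


def toy_features(frame: Sequence[float]) -> List[float]:
    """Map one window to a small feature vector: [log energy, zero-crossing rate]."""
    energy = sum(x * x for x in frame) / len(frame)
    log_energy = math.log(energy + 1e-10)  # small floor avoids log(0) on silence
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    zcr = crossings / max(len(frame) - 1, 1)
    return [log_energy, zcr]


if __name__ == "__main__":
    signal = [math.sin(0.3 * n) for n in range(400)]
    feature_frames = [toy_features(f) for f in frame_signal(signal, frame_len=80, hop=40)]
    print(len(feature_frames), "frames of dimension", len(feature_frames[0]))
```

Each element of `feature_frames` plays the role of one speech feature frame: a fixed-length vector summarizing a short window of the signal.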
The ASR system compares the input speech frames against statistical models to find those that best match the speech feature characteristics, and determines the corresponding representative text or semantic meaning associated with those models. Modern statistical models are state sequence models such as hidden Markov models (HMMs) that model speech sounds (usually phonemes) using mixtures of Gaussian distributions. Often these statistical models represent phonemes in specific contexts, referred to as PELs (Phonetic Elements), e.g., triphones or phonemes with known left and/or right contexts. State sequence models can be scaled up to represent words as connected sequences of acoustically modeled phonemes, and phrases or sentences as connected sequences of words. When the statistical models are organized together as words, phrases, and sentences, additional language-related information is also typically incorporated into the models in the form of language modeling.
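To make the idea of matching frames against Gaussian mixture models concrete, the sketch below scores feature frames under diagonal-covariance Gaussian mixtures and picks the best-scoring model. It is a simplified illustration under stated assumptions: the names `log_gaussian_diag`, `log_gmm`, and `best_model` are hypothetical, the diagonal covariance is an assumption, and the sketch deliberately omits HMM state transitions and language modeling, so it is not a full decoder.

```python
import math
from typing import Dict, List, Sequence, Tuple

# A mixture is (weights, means, variances); one mean/variance vector per component.
Mixture = Tuple[Sequence[float], Sequence[Sequence[float]], Sequence[Sequence[float]]]


def log_gaussian_diag(x: Sequence[float], mean: Sequence[float], var: Sequence[float]) -> float:
    """Log density of a Gaussian with diagonal covariance."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def log_gmm(x: Sequence[float], weights, means, variances) -> float:
    """Log likelihood of one frame under a Gaussian mixture (log-sum-exp for stability)."""
    terms = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))


def best_model(frames: List[Sequence[float]], models: Dict[str, Mixture]) -> str:
    """Return the name of the model with the highest total log likelihood.

    Frames are treated as independent here; a real HMM would also
    account for state sequences and transition probabilities.
    """
    def total(mix: Mixture) -> float:
        weights, means, variances = mix
        return sum(log_gmm(f, weights, means, variances) for f in frames)

    return max(models, key=lambda name: total(models[name]))
```

For example, given two single-component models centered at 0 and 5, frames clustered near 5 would be assigned to the second model.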
The words or phrases associated with the best matching model structures are referred to as recognition candidates or hypotheses. A system may produce a single best recognition candidate—the recognition result—or a list of several hypotheses, referred to as an N-best list. Further details regarding continuous speech recognition are provided in U.S. Pat. No. 5,794,189, entitled “Continuous Speech Recognition,” and U.S. Pat. No. 6,167,377, entitled “Speech Recognition Language Models,” the contents of which are incorporated herein by reference.
In cloud-based (client-server) ASR, the speech recognizer at the server is exposed to speech data acquired from many different devices, in various acoustic environments, and from different applications such as messaging or voice search. Device type, microphone type (and position on the device), and acoustic environment all influence the observed audio. To a somewhat lesser degree, application type also influences the audio, because it affects speaking style and the way users generally hold and operate the device. Together, these effects produce significant variation among the individual input channels in cloud-based ASR systems. Desktop ASR dictation applications face similar issues.
Speech recognition systems typically employ a technique called Cepstral Mean Normalization (CMN) on the input sequence of speech features in order to improve robustness to mismatches in input channel conditions. In general terms, CMN involves calculating the cepstral mean across the utterance and then subtracting it from each frame. CMN can be implemented in many different ways; for example, ASR systems that run online in real time (i.e., with minimal latency) use a filter or windowing approach rather than waiting for the complete utterance.
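The two flavors of CMN mentioned above can be sketched as follows. This is an illustrative sketch only: the function names `cmn_utterance` and `cmn_running`, the smoothing constant `alpha`, and the choice of an exponentially decaying running mean as the "filter" are assumptions made for this example, not details taken from any specific system.

```python
from typing import List, Optional, Sequence

Frame = List[float]


def cmn_utterance(frames: Sequence[Sequence[float]]) -> List[Frame]:
    """Batch CMN: subtract the per-dimension mean computed over the whole utterance."""
    if not frames:
        return []
    n = len(frames)
    dim = len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    return [[f[d] - mean[d] for d in range(dim)] for f in frames]


def cmn_running(
    frames: Sequence[Sequence[float]],
    alpha: float = 0.995,
    initial_mean: Optional[Sequence[float]] = None,
) -> List[Frame]:
    """Online CMN: subtract a running mean updated frame by frame.

    The mean is a first-order recursive filter, mean = alpha * mean + (1 - alpha) * x,
    so each frame can be normalized as soon as it arrives. alpha is a
    hypothetical tuning constant for this sketch.
    """
    out: List[Frame] = []
    mean = list(initial_mean) if initial_mean is not None else None
    for f in frames:
        if mean is None:
            mean = list(f)  # with no prior estimate, start from the first frame
        else:
            mean = [alpha * m + (1.0 - alpha) * x for m, x in zip(mean, f)]
        out.append([x - m for x, m in zip(f, mean)])
    return out
```

A useful property of the batch variant is that adding a constant offset to every frame, which is how a fixed channel effect appears in the cepstral domain, leaves the normalized output unchanged. The running variant only approximates this, trading accuracy in the first frames for low latency.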