An automatic speech recognition (ASR) system determines a semantic meaning of input speech. Typically, the input speech is processed into a sequence of digital speech feature frames. Each speech feature frame can be thought of as a multi-dimensional vector that represents various characteristics of the speech signal present during a short time window of the speech. In a continuous recognition system, variable numbers of speech frames are organized as “utterances” representing a period of speech followed by a pause, which in real life loosely corresponds to a spoken sentence or phrase.
The ASR system compares the input speech frames against statistical models to find those that best match the speech feature characteristics, and determines a corresponding representative text or semantic meaning associated with those models. Modern statistical models are state sequence models such as hidden Markov models (HMMs) that model speech sounds (usually phonemes) using mixtures of Gaussian distributions. Often these statistical models represent phonemes in specific contexts, referred to as PELs (Phonetic Elements), e.g., triphones or phonemes with known left and/or right contexts. State sequence models can be scaled up to represent words as connected sequences of acoustically modeled phonemes, and phrases or sentences as connected sequences of words. When the statistical models are organized together as words, phrases, and sentences, additional language-related information is also typically incorporated into the models in the form of language modeling.
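As an illustration of how a single speech feature frame can be matched against Gaussian mixture models, the following is a minimal sketch. It assumes diagonal covariances and strictly positive mixture weights, neither of which is specified above, and the names `log_gaussian_diag`, `gmm_log_likelihood`, and `best_matching_model` are hypothetical. It scores one frame only and leaves out the HMM state sequence and the language model.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of x under a Gaussian with diagonal covariance (assumed here)."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )

def gmm_log_likelihood(x, weights, means, variances):
    """Log likelihood of frame x under a mixture of diagonal Gaussians.

    Weights are assumed to be strictly positive and to sum to one.
    """
    terms = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    # Log-sum-exp keeps the sum stable when the individual terms are very small.
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))

def best_matching_model(x, models):
    """Return the name of the model that gives frame x the highest log likelihood.

    models maps a (hypothetical) model name to a (weights, means, variances) tuple.
    """
    return max(models, key=lambda name: gmm_log_likelihood(x, *models[name]))
```

A real recognizer accumulates such per-frame scores along state sequences rather than choosing a model for each frame independently.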
The words or phrases associated with the best matching model structures are referred to as recognition candidates or hypotheses. A system may produce a single best recognition candidate—the recognition result—or a list of several hypotheses, referred to as an N-best list. Further details regarding continuous speech recognition are provided in U.S. Pat. No. 5,794,189, entitled “Continuous Speech Recognition,” and U.S. Pat. No. 6,167,377, entitled “Speech Recognition Language Models,” the contents of which are incorporated herein by reference.
To compensate for audio channel effects, speech recognition systems typically employ techniques such as Cepstral Mean Normalization (CMN) and Cepstral Variance Normalization (CVN) on the input sequence of speech features in order to map the speech features into a more unified space and to reduce channel dependence. There are many variations for implementing CMN and/or CVN effectively; for example, ASR systems that run online in real time (i.e., with minimal latency incurred) use a filter or windowing approach. In addition or alternatively, a separate normalization can be done for speech and silence portions of the speech input. One disadvantage of conventional CMN (and CVN) is that the information about the absolute energy level of the speech features is lost. An alternative normalization approach could make use of this information.
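For illustration, conventional CMN and CVN computed over a whole utterance can be sketched as follows. This is a minimal sketch under stated assumptions: frames are plain lists of floats, the statistics are taken over the entire utterance (not a filter or window), and the function name `cmvn` and the `eps` guard against zero variance are hypothetical choices, not part of any particular system.

```python
import math

def cmvn(frames, normalize_variance=True, eps=1e-8):
    """Per-utterance cepstral mean (and optionally variance) normalization.

    frames: list of equal-length feature vectors (lists of floats).
    Returns a new list of normalized feature vectors.
    """
    if not frames:
        return []
    n = len(frames)
    dim = len(frames[0])
    # Mean of each feature dimension over the whole utterance.
    means = [sum(f[d] for f in frames) / n for d in range(dim)]
    if normalize_variance:
        # Population standard deviation of each feature dimension.
        stds = [
            math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / n)
            for d in range(dim)
        ]
    else:
        stds = [1.0] * dim
    # eps (an assumption of this sketch) avoids division by zero for constant features.
    return [
        [(f[d] - means[d]) / max(stds[d], eps) for d in range(dim)]
        for f in frames
    ]
```

Adding the same constant offset to every frame leaves the output of this sketch unchanged up to rounding. That is the channel-compensating effect, and it is also why the absolute level of the features cannot be recovered after normalization.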
Each normalization approach comes with its own specific advantages and disadvantages. For example, a longer window gives a better mean estimate, but may not track changes in the channel as well. Alternatively, the mean may be estimated only from speech-containing portions of the input and not during silences (e.g., using a voice activity detector (VAD)). This makes the normalization more invariant against the noise level of the signal, which improves recognizer performance, but it is important that the VAD functions well; when it does not, performance can degrade. There are also various ways to normalize other speech features besides the mean.
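The windowed and VAD-gated variants described above can be sketched together as follows. This is a sketch under stated assumptions, not a definitive implementation: the function name `windowed_cmn` is hypothetical, the window is counted in speech frames (one of several possible definitions), the VAD decisions are supplied by the caller as a list of booleans, and frames seen before any speech frame are passed through with a zero mean.

```python
from collections import deque

def windowed_cmn(frames, window, is_speech=None):
    """Causal mean normalization over the most recent `window` speech frames.

    frames:    list of equal-length feature vectors (lists of floats).
    window:    number of recent speech frames used for the mean estimate.
    is_speech: optional per-frame VAD decisions; if omitted, every frame
               is treated as speech (plain sliding-window CMN).
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    if is_speech is None:
        is_speech = [True] * len(frames)
    if len(is_speech) != len(frames):
        raise ValueError("is_speech must have one entry per frame")

    history = deque()    # speech frames currently contributing to the mean
    running_sum = None   # per-dimension sum over the frames in history
    normalized = []
    for frame, speech in zip(frames, is_speech):
        if running_sum is None:
            running_sum = [0.0] * len(frame)
        if speech:
            # Only frames flagged as speech update the mean estimate.
            history.append(frame)
            for d, value in enumerate(frame):
                running_sum[d] += value
            if len(history) > window:
                oldest = history.popleft()
                for d, value in enumerate(oldest):
                    running_sum[d] -= value
        if history:
            mean = [s / len(history) for s in running_sum]
        else:
            # Assumption of this sketch: no speech seen yet, so subtract nothing.
            mean = [0.0] * len(frame)
        normalized.append([value - m for value, m in zip(frame, mean)])
    return normalized
```

A small `window` lets the estimate follow channel changes quickly at the cost of a noisier mean, while a large one does the opposite. If the supplied `is_speech` flags are wrong, the estimated mean is computed from the wrong frames, which mirrors the dependence on VAD quality noted above.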