An automatic speech recognition (ASR) system determines a semantic meaning of a speech input. Typically, the input speech is processed into a sequence of digital speech feature frames. Each speech feature frame can be thought of as a multi-dimensional vector that represents various characteristics of the speech signal present during a short time window of the speech. For example, the multi-dimensional vector of each speech frame can be derived from cepstral features of the short-time Fourier transform spectrum of the speech signal (mel-frequency cepstral coefficients, MFCCs), that is, the short-time power or component of a given frequency band, as well as the corresponding first- and second-order derivatives (“deltas” and “delta-deltas”). In a continuous recognition system, variable numbers of speech frames are organized as “utterances,” each representing a period of speech followed by a pause, which in real life loosely corresponds to a spoken sentence or phrase.
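As a rough illustration of how static features are extended with deltas and delta-deltas, the following minimal sketch appends first- and second-order time derivatives to each frame. It assumes a simple central difference with repeated edge frames; practical front ends often use a regression over a wider window instead. The function names and the toy input values are hypothetical.

```python
def deltas(frames):
    """Approximate the time derivative of each feature dimension.

    Assumption: a two-point central difference, with the first and last
    frames repeated at the edges.
    """
    n = len(frames)
    out = []
    for t in range(n):
        prev = frames[max(t - 1, 0)]
        nxt = frames[min(t + 1, n - 1)]
        out.append([(b - a) / 2.0 for a, b in zip(prev, nxt)])
    return out


def stack_with_deltas(frames):
    """Concatenate static features, deltas, and delta-deltas per frame."""
    d1 = deltas(frames)   # first-order derivatives ("deltas")
    d2 = deltas(d1)       # second-order derivatives ("delta-deltas")
    return [f + a + b for f, a, b in zip(frames, d1, d2)]


# Toy input: four frames of two static features each (made-up values).
static = [[0.0, 1.0], [1.0, 1.0], [4.0, 1.0], [9.0, 1.0]]
features = stack_with_deltas(static)
```

Each resulting frame is three times as long as the static frame, since it carries the static values followed by their deltas and delta-deltas.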
The ASR system compares the input speech frames against statistical models to find those that best match the speech feature characteristics, and then determines a corresponding representative text or semantic meaning associated with the statistical models. Modern statistical models are state sequence models such as hidden Markov models (HMMs) that model speech sounds (usually phonemes) using mixtures of Gaussian distributions. Often these statistical models represent phonemes in specific contexts, referred to as PELs (Phonetic Elements), e.g., triphones or phonemes with known left and/or right contexts. State sequence models can be scaled up to represent words as connected sequences of acoustically modeled phonemes, and phrases or sentences as connected sequences of words. When the statistical models are organized together as words, phrases, and sentences, additional language-related information is also typically incorporated into the models in the form of language modeling.
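To make the idea of matching a frame against a mixture of Gaussian distributions concrete, the sketch below scores one feature frame under a single mixture, as might be attached to one HMM state. It assumes diagonal covariances, which is a common simplification but is not stated in the text, and all names and parameter values are hypothetical.

```python
import math


def gmm_log_likelihood(frame, weights, means, variances):
    """Log-likelihood of one feature frame under a Gaussian mixture.

    Assumption: each component has a diagonal covariance, so it is
    described by one mean and one variance per feature dimension.
    """
    component_logs = []
    for w, mu, var in zip(weights, means, variances):
        log_p = math.log(w)
        for x, m, v in zip(frame, mu, var):
            log_p += -0.5 * math.log(2.0 * math.pi * v) - 0.5 * (x - m) ** 2 / v
        component_logs.append(log_p)
    # Log-sum-exp over components for numerical stability.
    peak = max(component_logs)
    return peak + math.log(sum(math.exp(c - peak) for c in component_logs))


# Two hypothetical state models scored against the same frame.
frame = [0.1, -0.2]
state_a = ([0.5, 0.5], [[0.0, 0.0], [1.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]])
state_b = ([1.0], [[5.0, 5.0]], [[1.0, 1.0]])
score_a = gmm_log_likelihood(frame, *state_a)
score_b = gmm_log_likelihood(frame, *state_b)
```

The state whose mixture lies closer to the frame receives the higher score; a full recognizer would combine such per-frame scores along state sequences rather than compare them in isolation.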
The words or phrases associated with the best matching model structures are referred to as recognition candidates or hypotheses. A system may produce a single best recognition candidate—the recognition result—or a list of several hypotheses, referred to as an N-best list. Further details regarding continuous speech recognition are provided in U.S. Pat. No. 5,794,189, entitled “Continuous Speech Recognition,” and U.S. Pat. No. 6,167,377, entitled “Speech Recognition Language Models,” the contents of which are incorporated herein by reference.
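The relationship between a single recognition result and an N-best list can be sketched as follows. The example assumes each hypothesis is a pair of text and score, with higher scores indicating better matches; the hypotheses and scores are invented for illustration.

```python
import heapq


def n_best(hypotheses, n):
    """Return the n highest-scoring (text, score) hypotheses, best first."""
    return heapq.nlargest(n, hypotheses, key=lambda h: h[1])


# Hypothetical recognition candidates with made-up log scores.
candidates = [
    ("recognize speech", -12.3),
    ("wreck a nice beach", -14.1),
    ("recognise peach", -17.8),
    ("wreck an ice beach", -19.0),
]

top_three = n_best(candidates, 3)        # an N-best list with N = 3
recognition_result = n_best(candidates, 1)[0][0]  # the single best candidate
```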
In many ASR applications, for example cloud-based client-server ASR, the speech recognizer is exposed to speech data acquired from many different devices operating in various acoustic environments and from different applications such as messaging or voice search. Device type, microphone type (and its position on the device), and acoustic environment all influence the observed audio. To a somewhat lesser degree, application type also has this effect, since it affects speaking style and the way users generally hold and operate the device. Furthermore, the acoustic signal as observed at the ASR system has usually passed through a data channel of limited bandwidth, which requires an encoding that entails information loss (a lossy codec). Often, information on device type, microphone type, application, and codec setting is available to the server recognizer as meta-data that characterizes such aspects of the audio input channel.
Meta-data is typically categorical information that can take one of a finite set of values. There are many ways to represent this data as input for neural networks or other forms of linear and non-linear transformations. Categorical information can be represented as numerical values, for example, by representing each category as an integer or by using 1-of-N encoding.
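The two representations mentioned above can be sketched in a few lines. The example below assumes a fixed, known set of category values per meta-data field; the field names and values (device and codec labels) are hypothetical and chosen only for illustration.

```python
def build_index(categories):
    """Map each distinct category to an integer, in sorted order."""
    return {c: i for i, c in enumerate(sorted(set(categories)))}


def integer_encode(value, index):
    """Represent a category as a single integer."""
    return index[value]


def one_of_n(value, index):
    """Represent a category as a 1-of-N vector (a single 1, zeros elsewhere)."""
    vec = [0.0] * len(index)
    vec[index[value]] = 1.0
    return vec


def encode_metadata(record, indexes):
    """Concatenate the 1-of-N vectors of all fields, in sorted field order."""
    out = []
    for field in sorted(indexes):
        out.extend(one_of_n(record[field], indexes[field]))
    return out


# Hypothetical meta-data fields and their possible values.
indexes = {
    "codec": build_index(["amr", "opus", "speex"]),
    "device": build_index(["headset", "phone", "tablet"]),
}
record = {"codec": "opus", "device": "phone"}
vector = encode_metadata(record, indexes)
```

The integer form is compact but implies an ordering between categories that usually has no meaning, whereas the 1-of-N form treats every category as equally distinct at the cost of a longer vector.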