One specific example of a natural language processing application is an automatic speech recognition (ASR) system that tries to determine a representative meaning (e.g., text) corresponding to input speech. Typically, the input speech is processed into a sequence of digital frames. Each frame can be thought of as a multi-dimensional vector that represents various characteristics of the speech signal present during a short time window of the speech. In a continuous speech recognition system, variable numbers of frames are organized as “utterances” representing a period of speech followed by a pause which in real life loosely corresponds to a spoken sentence or phrase.
The ASR system compares the input utterances to find statistical acoustic models that best match the vector sequence characteristics and determines corresponding representative text associated with the acoustic models. When framing the problem in a statistical context, it is often discussed in terms of the well-known Bayes formula. That is, if given some input observations A, the probability that some string of words W were spoken is represented as P(W|A), where the ASR system attempts to determine the most likely word string:
      W    ^    =      arg    ⁢                  ⁢                  max        W            ⁢              P        ⁡                  (                      W            |            A                    )                    Given a system of statistical acoustic models, this formula can be re-expressed as:
      W    ^    =      arg    ⁢                  ⁢                  max        W            ⁢                        P          ⁡                      (            W            )                          ⁢                  P          ⁡                      (                          A              |              W                        )                              where P(A|W) corresponds to the acoustic models and P(W) represents the value of a statistical language model reflecting the probability of given word in the recognition vocabulary occurring.
Modern acoustic models typically use probabilistic state sequence models such as hidden Markov models (HMMs) that model speech sounds (usually phonemes) using mixtures of probability distribution functions, typically Gaussians. Phoneme models often represent phonemes in specific contexts, referred to as PELs (Phonetic Elements), e.g. triphones or phonemes with known left and/or right contexts. State sequence models can be scaled up to represent words as connected sequences of acoustically modeled phonemes, and phrases or sentences as connected sequences of words. When the models are organized together as words, phrases, and sentences, additional language-related information is also typically incorporated into the models in the form of a statistical language model.
The words or phrases associated with the best matching model structures are referred to as recognition candidates or hypotheses. A system may produce a single best recognition candidate—the recognition result—or multiple recognition hypotheses in various forms such as an N-best list, a recognition lattice, or a confusion network. Further details regarding continuous speech recognition are provided in U.S. Pat. No. 5,794,189, entitled “Continuous Speech Recognition,” and U.S. Pat. No. 6,167,377, entitled “Speech Recognition Language Models,” the contents of which are incorporated herein by reference.