Operation of a typical speech recognition engine according to the prior art is illustrated in FIG. 1. A speech signal 10 is directed to a pre-processor 11, where relevant parameters are extracted. A pattern matching recognizer 12 tries to find the best word sequence recognition result 15 based on acoustic models 13 and a language model 14. The language model 14 describes words and how they connect to form a sentence. It might be as simple as a list of words in the case of an isolated word recognizer, or a context-free grammar, or as complicated as a statistical language model for large vocabulary continuous speech recognition. The acoustic models 13 establish a link between the speech parameters from the pre-processor 11 and the symbols to be recognized. Further information on the design of a speech recognition system is provided, for example, in Rabiner and Juang, Fundamentals of Speech Recognition (hereinafter “Rabiner and Juang”), Prentice Hall 1993, which is hereby incorporated herein by reference.
More formally, speech recognition systems typically operate by determining the word sequence Ŵ that maximizes the product P(W)P(A|W):
$$\hat{W} = \arg\max_{W} P(W)\, P(A \mid W)$$

where A is the input acoustic signal, W is a given word string, P(W) is the probability that the word sequence W will be uttered, and P(A|W) is the probability of the acoustic signal A being observed when the word string W is uttered. The acoustic model characterizes P(A|W), and the language model characterizes P(W).
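As an illustration of this maximization, the following minimal Python sketch picks the best word string from a small candidate set. The probability tables and the function name `decode` are hypothetical and invented for this example. A practical recognizer searches a much larger hypothesis space and would not enumerate candidates this way. Log probabilities are summed, which is equivalent to maximizing the product and avoids numerical underflow.

```python
import math

# Hypothetical language model probabilities P(W), invented for illustration.
LM_PROB = {
    "recognize speech": 0.006,
    "wreck a nice beach": 0.0001,
}

# Hypothetical acoustic model probabilities P(A|W) for one fixed input A.
AM_PROB = {
    "recognize speech": 0.02,
    "wreck a nice beach": 0.05,
}


def decode(candidates):
    """Return the candidate W that maximizes log P(W) + log P(A|W)."""
    def score(w):
        return math.log(LM_PROB[w]) + math.log(AM_PROB[w])
    return max(candidates, key=score)


best = decode(list(LM_PROB))
print(best)
```

In this toy case the acoustic model alone prefers the second candidate, but the language model prior outweighs it, so the combined score selects the first.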
Rather than a single best recognition result, speech recognition applications may also give feedback to users by displaying or prompting a sorted list of some number of the best matching recognition hypotheses, referred to as an N-best list. This can be done for recognition of a spoken utterance as one or more words. This can also be done when the input is a spelled out sequence of letters forming one or more words, or a part of a word, in which case the best matching name may be identified by a spelling-matching module.
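Forming an N-best list from scored hypotheses can be sketched as follows. This is a minimal illustration assuming each hypothesis already carries a single combined score, where higher is better. The hypotheses, scores, and the function name `n_best` are hypothetical.

```python
import heapq


def n_best(scored_hypotheses, n):
    """Return the n highest-scoring (hypothesis, score) pairs, best first."""
    return heapq.nlargest(n, scored_hypotheses, key=lambda pair: pair[1])


# Hypothetical hypotheses with log-domain scores, invented for illustration.
hyps = [
    ("call john", -12.1),
    ("call joan", -12.8),
    ("cold june", -19.5),
    ("call jon", -12.4),
]

top3 = n_best(hyps, 3)
for hypothesis, score in top3:
    print(hypothesis, score)
```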
It is also known to rescore such N-best lists using additional information that was not available when the N-best list was initially constructed. Such extra information may come from various sources, such as a statistical language model (SLM) that contains information about the a priori likelihood of the different recognition hypotheses. Even if the language model applied during recognition is itself a statistical language model, the N-best list can still be rescored by means of another (typically more sophisticated) SLM. Rescoring of N-best lists based on a statistical language model is described, for example, as a “Dynamic Semantic Model” in U.S. Pat. No. 6,519,562, which is incorporated herein by reference.
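One simple way such rescoring might be carried out is to add a weighted SLM log probability to each first-pass score and then re-sort the list. The sketch below assumes that approach. It is not the method of the cited patent, and the SLM table, the `weight` parameter, and the function name `rescore` are hypothetical.

```python
def rescore(n_best_list, slm_logprob, weight=1.0):
    """Add a weighted SLM log probability to each score and re-sort, best first."""
    rescored = [
        (hyp, score + weight * slm_logprob(hyp))
        for hyp, score in n_best_list
    ]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


# Hypothetical SLM log probabilities, invented for illustration.
SLM_LOGPROB = {"call john": -4.0, "call jon": -1.5, "call joan": -3.0}

# Hypothetical first-pass N-best list with log-domain scores, best first.
first_pass = [("call john", -12.1), ("call jon", -12.4), ("call joan", -12.8)]

rescored = rescore(first_pass, lambda hyp: SLM_LOGPROB[hyp])
for hypothesis, score in rescored:
    print(hypothesis, round(score, 1))
```

With these invented numbers, the hypothesis ranked second in the first pass moves to the top after rescoring, because the SLM assigns it a much higher a priori likelihood.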