Speech recognition systems are specialized computer systems that are configured to process and recognize spoken human speech, and take action or carry out further processing according to the speech that is recognized. Such systems are now widely used in a variety of applications including airline reservations, auto attendants, order entry, etc. Generally the systems comprise either computer hardware or computer software, or a combination.
Speech recognition systems typically operate by receiving an acoustic signal, which is an electronic signal or set of data that represents the acoustic energy received at a transducer from a spoken utterance. The systems then try to find a sequence of text characters (“word string”) which maximizes the following probability: P(A|W)*P(W)
where A means the acoustic signal and W means a given word string. The P(A|W) component is called the acoustic model and P(W) is called the language model.
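The maximization described above can be illustrated with a minimal Python sketch. It assumes that a small set of candidate word strings has already been scored for one acoustic signal A. The function name best_word_string, the candidate strings, and all probability values are hypothetical and chosen only for illustration; a practical recognizer searches a far larger hypothesis space.

```python
import math

def best_word_string(candidates):
    """Return the word string W that maximizes P(A|W) * P(W).

    `candidates` maps each hypothesized word string to a pair
    (P(A|W), P(W)). Log probabilities are summed rather than
    multiplying raw probabilities, which avoids numeric underflow
    and leaves the winning word string unchanged.
    """
    def score(item):
        _, (p_acoustic, p_language) = item
        return math.log(p_acoustic) + math.log(p_language)

    return max(candidates.items(), key=score)[0]

# Hypothetical scores for a single acoustic signal A.
candidates = {
    "book a flight to boston": (0.020, 0.0100),
    "book a fright to boston": (0.025, 0.0001),
    "look a flight to austin": (0.005, 0.0002),
}
best = best_word_string(candidates)
```

In this illustration the second candidate has the highest acoustic model score, but its low language model score causes the first candidate to be selected, which shows how the two components interact.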
A speech recognizer may be improved by changing the acoustic model or the language model, or by changing both. The language model may be word-based or may have a “semantic model,” which is a particular way to derive P(W).
Typically, language models are trained by obtaining a large number of utterances from the particular application under development, and providing these utterances to a language model training program which produces a word-based language model that can estimate P(W) for any given word string. Examples include bigram language models, trigram language models, or, more generally, n-gram language models.
In a sequence of words in an utterance, W0–Wm, an n-gram language model estimates the probability of word Wj given the previous n−1 words. Thus, in a trigram model, P(Wj|utterance) is estimated by P(Wj|Wj−1, Wj−2). The n-gram type of language model may be viewed as relatively static with respect to the application environment. For example, static n-gram language models cannot change their behavior based upon the particular application in which the speech recognizer is being used or upon external factual information about the application. Thus, in this field there is an acute need for an improved speech recognizer that can adapt to the particular application in which it is used.
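As an illustration of the trigram estimate P(Wj|Wj−1, Wj−2), the following minimal Python sketch trains a trigram model by relative frequency over a small set of training utterances. The function names, the "<s>" start marker, and the sample utterances are hypothetical; the sketch omits the smoothing and back-off that a practical language model training program would be expected to apply.

```python
from collections import Counter

def train_trigram(utterances):
    """Count trigrams and their two-word histories in the training utterances."""
    trigram_counts = Counter()
    history_counts = Counter()
    for utterance in utterances:
        # Pad with start markers so the first words also have a two-word history.
        words = ["<s>", "<s>"] + utterance.split()
        for j in range(2, len(words)):
            history = (words[j - 2], words[j - 1])
            trigram_counts[history + (words[j],)] += 1
            history_counts[history] += 1
    return trigram_counts, history_counts

def trigram_prob(word, prev2, prev1, trigram_counts, history_counts):
    """Estimate P(Wj | Wj-1, Wj-2) by relative frequency, without smoothing."""
    history = (prev2, prev1)
    if history_counts[history] == 0:
        return 0.0  # history never seen in training
    return trigram_counts[history + (word,)] / history_counts[history]

# Hypothetical training utterances from a travel application.
utterances = [
    "i want to fly to boston",
    "i want to fly to denver",
    "i want a ticket to boston",
]
tri, hist = train_trigram(utterances)
p_to = trigram_prob("to", "i", "want", tri, hist)
p_boston = trigram_prob("boston", "fly", "to", tri, hist)
```

Once the counts are computed, the estimates depend only on the training utterances. Nothing in such a model responds to the application state at recognition time, which is the static behavior noted above.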
N-gram language models and other word-based language models work well in applications that have a large number of training utterances and in which the language model does not change over time. Thus, for applications in which large amounts of training data are not available, or in which the underlying language model does change over time, there is a need for an improved speech recognizer that can produce more accurate results by taking into account application-specific information.
Other needs and objects will become apparent from the following detailed description.