Referring to FIG. 3, in a typical speech recognition engine 1000, a signal 1002 corresponding to speech 1004 is fed into a front end module 1006. The front end module 1006 extracts feature data 1008 from the signal 1002. The feature data 1008 is input to a decoder 1010, which outputs recognized speech 1012. An application 1014 could, for example, take the recognized speech 1012 as an input to display to a user, or as a command that results in the performance of predetermined actions.
To facilitate speech recognition, an acoustic model 1018 and a language model 1020 also supply inputs to the decoder 1010. The decoder 1010 utilizes the acoustic model 1018 to segment the input speech into a set of speech elements and to identify the speech elements to which the received feature data 1008 most closely correlates.
The language model 1020 assists the operation of the decoder 1010 by supplying information about what a user is likely to be saying. There are two major formats for language models: the finite state grammar (FSG) and the statistical language model (SLM).
The FSG format typically includes a plurality of predetermined text element sequences. “Text element” as used herein can refer to words, phrases or any other subdivision of text, although words are the most common text elements. To apply an FSG format language model, the decoder 1010 compares the feature data 1008 (also utilizing input from the acoustic model 1018) to each of the text element sequences, looking for a best fit.
Provided the user actually is speaking one of the predetermined sequences, the FSG format offers relatively high accuracy. However, if the user does not speak one of the sequences, a decoder applying the FSG format will not yield the correct result. Additionally, compiling a suitable list of sequences for a given application can be time and labor intensive. Moreover, to yield acceptable results for complex applications, an FSG format language model must be extremely large, resulting in higher memory and processing demands.
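By way of illustration only, the best-fit comparison performed with an FSG format language model can be sketched as follows. This is a simplified sketch under stated assumptions: the function name `best_fit`, the example sequences, and the use of a word-level similarity ratio are hypothetical, and the similarity ratio stands in for the acoustic scoring of the feature data 1008 that an actual decoder would perform.

```python
from difflib import SequenceMatcher


def best_fit(hypothesis, grammar_sequences):
    """Return the predetermined sequence most similar to the hypothesis.

    Assumption: word-level similarity stands in for the acoustic scoring
    a real decoder would apply to feature data.
    """
    hyp_words = hypothesis.lower().split()
    best_seq, best_score = None, -1.0
    for seq in grammar_sequences:
        # ratio() is 2 * matches / total elements, in the range [0, 1].
        score = SequenceMatcher(None, hyp_words, seq.lower().split()).ratio()
        if score > best_score:
            best_seq, best_score = seq, score
    return best_seq, best_score


# Hypothetical predetermined text element sequences.
GRAMMAR = ["call home", "call the office", "play music"]

in_grammar = best_fit("call office", GRAMMAR)
out_of_grammar = best_fit("what time is it", GRAMMAR)
```

In this sketch, an utterance close to a predetermined sequence is resolved to that sequence, whereas an utterance outside the grammar is still forced onto one of the sequences with a similarity of zero, which mirrors the limitation noted above.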
An SLM format language model, sometimes referred to as an “n-gram” format, is built from a textual corpus by identifying, for each text element (e.g., each word), the probability that the element will be found in proximity with the other text elements. Typically, probabilities are determined for each group of two (bi-gram) or three (tri-gram) text elements, although other quantities can be used. A nominal probability is usually also assigned to text element groups that do not actually occur in the textual corpus.
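By way of illustration only, the construction of a bi-gram SLM from a textual corpus can be sketched as follows. The function names, the example corpus, and the nominal probability value of 1e-6 are assumptions made for this sketch; practical systems typically use more elaborate smoothing so that the probabilities remain normalized.

```python
from collections import Counter


def train_bigram(corpus, nominal=1e-6):
    """Build a bi-gram probability function from a list of sentences.

    Assumption: unseen word pairs receive a fixed nominal probability,
    so the resulting values are not strictly normalized.
    """
    first_counts = Counter()  # how often a word appears as the first of a pair
    pair_counts = Counter()   # how often each (previous, next) pair appears
    for sentence in corpus:
        words = sentence.lower().split()
        first_counts.update(words[:-1])
        pair_counts.update(zip(words, words[1:]))

    def prob(prev, word):
        count = pair_counts[(prev, word)]
        if count == 0:
            return nominal  # pair never observed in the corpus
        return count / first_counts[prev]

    return prob


def sequence_prob(prob, text):
    """Multiply bi-gram probabilities over consecutive words in text."""
    words = text.lower().split()
    result = 1.0
    for prev, word in zip(words, words[1:]):
        result *= prob(prev, word)
    return result


# Hypothetical textual corpus.
CORPUS = ["call home now", "call the office", "call home later"]
bigram = train_bigram(CORPUS)
```

With this corpus, `bigram("call", "home")` evaluates to 2/3 because "home" follows "call" in two of the three sentences, while a pair such as ("home", "office") that never occurs receives the nominal probability.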
The SLM format allows for the potential recognition of a larger range of user utterances with a relatively small language model. However, the accuracy of the SLM format typically compares unfavorably with the FSG format.
The concept of combining features of both language model formats, so as to mitigate the disadvantages and capitalize on the advantages of each, has been advanced. An example of efforts in this direction can be found in U.S. Pat. No. 7,286,978. However, there have been only limited practical attempts at such combinations, and further enhancements and improvements are possible.