Speech recognizers are systems for identifying a user utterance based on two sources of information: acoustic models and language models that can be either statistically-based grammars or closed grammars. Language models or grammars are used to provide phrase hypotheses, while acoustic models are used to provide acoustic descriptions of phoneme strings of phrase hypotheses in probabilistic terms.
Speech recognition is the process of searching all possible phrase hypotheses created from the language models or grammars for the best match to an input utterance. Consequently, more complex grammars produce more phrase hypotheses, which makes recognition processing take more time and consume more resources and, in many cases, makes the recognition less accurate.
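The search described above can be illustrated with a minimal sketch. It assumes that each phrase hypothesis carries a log-probability from the language model and another from the acoustic model, and that the two are simply added; the function name `recognize` and all score values are hypothetical and chosen only for illustration.

```python
def recognize(acoustic_scores, language_scores):
    """Return the phrase hypothesis with the highest combined log score.

    Both arguments map phrase -> log-probability. Summing the two scores
    is an illustrative assumption, not a description of any particular
    recognizer.
    """
    combined = {
        phrase: acoustic_scores[phrase] + language_scores[phrase]
        for phrase in language_scores
    }
    best = max(combined, key=combined.get)
    return best, combined[best]


# Hypothetical toy scores: the grammar supplies the phrase hypotheses,
# and the acoustic models score how well each one matches the audio.
language_scores = {"call home": -1.2, "call Rome": -3.5, "tall home": -6.0}
acoustic_scores = {"call home": -4.0, "call Rome": -3.8, "tall home": -4.1}

best_phrase, best_score = recognize(acoustic_scores, language_scores)
print(best_phrase)
```

Every phrase added to the grammar adds an entry that must be scored, which is why larger grammars cost more time and resources in this sketch as well.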
The speech recognizer typically allows a wide variety of speech recognition parameters to be set. These include parameters set at recognition time, such as confidence level, pruning, and noise reduction, and parameters set at compile time, such as whether to use crossword compilation and/or multiple acoustic models. Pruning values affect the number of hypotheses that the recognizer considers when attempting to recognize an utterance. There is a tradeoff between accuracy and processing load: increased pruning reduces processing load but decreases accuracy, and vice versa. Noise reduction is the process of removing background noise (for example, from a noisy restaurant or introduced by the method of transmission) from an audio stream. With less background noise, the recognition may be more accurate. Noise reduction, however, requires time and system resources, and parts of the speech signal may be inadvertently removed, reducing recognition accuracy. Using crossword compilation increases the processing load but can improve accuracy in recognizing series of short words. Whether to use multiple acoustic models is another recognition parameter that may be set. With multiple acoustic models, the recognizer attempts to recognize the same utterance with several recognition models, including generic models and models that are tuned for different criteria, e.g., environment, gender, or accent, and the best result from these attempts is used as the recognized result. Whether to use skip frames is still another recognition parameter. With skip frames, the recognizer ignores every other 10 ms (or other applicable time increment) spectral/cepstral-analyzed portion (i.e., frame) of an utterance. This reduces processor load and generally reduces accuracy, although accuracy may not suffer, depending upon the utterance and the grammar.
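Three of these parameters, pruning, multiple acoustic models, and skip frames, can be sketched in a few lines. The sketch assumes a beam-style pruning rule in which hypotheses scoring too far below the best one are dropped; that rule, the function names, and all values are hypothetical and are not taken from any specific recognizer.

```python
def prune(hypotheses, beam_width):
    """Keep hypotheses whose log score is within beam_width of the best.

    hypotheses is a list of (phrase, log_score) pairs. A smaller
    beam_width means more pruning: less work, but a higher risk of
    discarding the correct phrase.
    """
    best = max(score for _, score in hypotheses)
    return [(p, s) for p, s in hypotheses if s >= best - beam_width]


def best_of_models(results):
    """Pick the highest-confidence result across several acoustic models.

    results is a list of (model_name, phrase, confidence) tuples.
    """
    return max(results, key=lambda r: r[2])


def skip_frames(frames):
    """Ignore every other frame, keeping frames 0, 2, 4, ..."""
    return frames[::2]


# Hypothetical toy data.
hyps = [("call home", -5.0), ("call Rome", -7.0), ("tall home", -12.0)]
narrow = prune(hyps, beam_width=3.0)   # heavier pruning
wide = prune(hyps, beam_width=10.0)    # lighter pruning

attempts = [
    ("generic", "call home", 0.71),
    ("noisy_environment", "call home", 0.86),
    ("accent_tuned", "call Rome", 0.40),
]
chosen = best_of_models(attempts)

kept = skip_frames(list(range(6)))     # six 10 ms frames, indexed 0..5
```

With the narrow beam only two hypotheses survive, while the wide beam keeps all three; skipping frames halves the number of frames the recognizer has to score.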
These parameters may be hard-coded by an Interactive Voice Response (IVR) system using the recognizer or provided with an external grammar. If no parameters are provided, default recognition parameters may be used.
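The fallback to default parameters can be sketched as a simple merge. The text does not say which source wins when both the IVR system and the external grammar supply a value, so the precedence used here (IVR over grammar over defaults) is an assumption, as are the parameter names and default values.

```python
# Hypothetical defaults; real recognizers define their own names and values.
DEFAULTS = {
    "confidence_level": 0.5,
    "pruning": 1000,
    "noise_reduction": False,
    "skip_frames": False,
}


def resolve_parameters(ivr_params=None, grammar_params=None):
    """Merge parameter sources into one settings dict.

    Assumed precedence (not stated in the text): values hard-coded by the
    IVR system override values provided with the grammar, which in turn
    override the defaults.
    """
    resolved = dict(DEFAULTS)
    resolved.update(grammar_params or {})
    resolved.update(ivr_params or {})
    return resolved


no_overrides = resolve_parameters()
from_grammar = resolve_parameters(grammar_params={"pruning": 500})
from_both = resolve_parameters(
    ivr_params={"pruning": 200},
    grammar_params={"pruning": 500, "noise_reduction": True},
)
```

When neither source supplies a value, the result is just a copy of the defaults.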