Advances in processing power and software execution are making speech recognition systems more desirable. Interactive voice response (IVR) systems are used extensively in telephone systems to guide customers through a maze of options to the desired information. Voice recognition systems are also being offered as a means for interacting with computer systems or systems controlled by computers. Moreover, voice-controlled systems offer a way for physically handicapped users, for example, to benefit from computer technology by providing means for interacting through software programs that respond based on the quality of speech as converted and recognized by the underlying recognition system. However, voice recognition systems that use audio input remain underutilized due to reliability concerns.
In speech recognition, a word is unlikely to be pronounced exactly the same way twice, so the recognizer is equally unlikely to find an exact match. Moreover, for any given segment of sound, there are many things the speaker could potentially be saying. The quality of a recognizer is determined by how well it refines its search, eliminates the poor matches, and selects the more likely matches.
Voice recognition systems employ a list of words (or dictionary) that can be recognized by the recognizer engine. The grammar consists of a structured list of rules that identify words or phrases that can be used for speech recognition. These rules provide the guidelines that an application uses when collecting input terms or phrases voiced by a user. The range of speech that can be recognized is limited by the size of the dictionary (or grammar) on which the recognizer depends.
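To illustrate how a grammar bounds what can be recognized, the following is a minimal sketch in Python. The rule names, the vocabulary, and the helper functions (`expand`, `recognizable_phrases`, `in_grammar`) are hypothetical and chosen only for illustration; they do not correspond to any particular recognizer engine or grammar format.

```python
# Hypothetical toy grammar: each rule name maps to a list of alternatives,
# and each alternative is a sequence of literal words or <rule> references.
GRAMMAR = {
    "<command>": [["call", "<contact>"], ["play", "<media>"]],
    "<contact>": [["alice"], ["bob"]],
    "<media>": [["music"], ["voicemail"]],
}


def expand(symbol, grammar):
    """Yield every word sequence that `symbol` can produce.

    Assumes the grammar is non-recursive; a recursive rule would
    make this enumeration run forever.
    """
    if symbol not in grammar:
        # A literal word expands to itself.
        yield [symbol]
        return
    for alternative in grammar[symbol]:
        partials = [[]]
        for item in alternative:
            # Append every expansion of `item` to every partial sequence.
            partials = [p + tail for p in partials for tail in expand(item, grammar)]
        yield from partials


def recognizable_phrases(grammar, start="<command>"):
    """Return the full set of phrases the grammar allows."""
    return {" ".join(words) for words in expand(start, grammar)}


PHRASES = recognizable_phrases(GRAMMAR)


def in_grammar(utterance):
    """True if the (already transcribed) utterance is covered by the grammar."""
    return utterance.lower() in PHRASES


if __name__ == "__main__":
    print(sorted(PHRASES))
    print(in_grammar("Call Alice"), in_grammar("call carol"))
```

In this sketch, any phrase outside the enumerated set, such as "call carol", simply cannot be returned as a result, which is the sense in which the size of the grammar limits the recognizable speech.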
Application grammar writing can be complex, time-consuming, and error-prone without help from editing tools. Moreover, the grammar editor should be alerted if the grammar contains terms or phrases that have different semantic meanings but are easily confused by the speech recognition engine (e.g., “see” and “sea”). However, static methods that detect such confusability using phone distance matrices are computationally more expensive and do not reveal the confusability metrics from the view of the speech recognition engine.
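For context, a static check of the kind mentioned above might look roughly like the following Python sketch. It is only one plausible form of a phone-distance approach: the pronunciations in `LEXICON`, the costs in `PHONE_DISTANCE`, the threshold, and all function names are assumptions made for illustration, not values or interfaces from any real engine. The pairwise comparison over the whole lexicon also hints at why such methods grow costly as a grammar gets larger.

```python
from itertools import combinations

# Hypothetical pronunciations (ARPAbet-style symbols), for illustration only.
LEXICON = {
    "see": ["S", "IY"],
    "sea": ["S", "IY"],
    "fee": ["F", "IY"],
    "stop": ["S", "T", "AA", "P"],
}

# Hypothetical phone distance matrix, stored sparsely.
# Any pair of distinct phones not listed here costs 1.0.
PHONE_DISTANCE = {frozenset(("S", "F")): 0.4}


def substitution_cost(a, b):
    """Cost of confusing phone `a` with phone `b`."""
    if a == b:
        return 0.0
    return PHONE_DISTANCE.get(frozenset((a, b)), 1.0)


def phone_edit_distance(p, q, indel=1.0):
    """Weighted edit distance between two phone sequences."""
    prev = [j * indel for j in range(len(q) + 1)]
    for i, a in enumerate(p, start=1):
        curr = [i * indel]
        for j, b in enumerate(q, start=1):
            curr.append(min(
                prev[j] + indel,                        # delete a
                curr[j - 1] + indel,                    # insert b
                prev[j - 1] + substitution_cost(a, b),  # substitute a -> b
            ))
        prev = curr
    return prev[-1]


def confusable_pairs(lexicon, threshold=0.5):
    """Compare every pair of words and keep those within `threshold`."""
    pairs = []
    for w1, w2 in combinations(sorted(lexicon), 2):
        d = phone_edit_distance(lexicon[w1], lexicon[w2])
        if d <= threshold:
            pairs.append((w1, w2, d))
    return pairs


if __name__ == "__main__":
    for w1, w2, d in confusable_pairs(LEXICON):
        print(f"{w1!r} vs {w2!r}: distance {d:.1f}")
```

Under these assumed pronunciations, “see” and “sea” have identical phone sequences and are flagged with distance zero. The distances come entirely from a fixed table, so the sketch says nothing about how a particular recognition engine would actually score the two words.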