1. Technical Field
This invention relates to the field of speech recognition systems, and more particularly, to processing the same utterance with multiple grammar subsets.
2. Description of the Related Art
Speech recognition is the process by which an acoustic signal received by a microphone is converted to a set of text words by a computer. These recognized words can be used in a variety of software applications for purposes such as document preparation, data entry, and command and control. Speech recognition systems programmed or trained to the diction and inflection of a single person can successfully recognize the vast majority of words spoken by that person.
In operation, speech recognition systems can model and classify acoustic signals to form acoustic models, which are representations of basic linguistic units referred to as phonemes. Upon receipt of the acoustic signal, the speech recognition system can analyze the acoustic signal, identify a series of acoustic models within the acoustic signal and derive a list of potential word candidates for the given series of acoustic models. Subsequently, the speech recognition system can contextually analyze the potential word candidates using a language model as a guide.
The task of the language model is to express restrictions imposed on the manner in which words can be combined to form sentences. The language model can express the likelihood of a word appearing immediately adjacent to another word or words. Language models used within speech recognition systems typically are statistical models. Examples of well-known language models suitable for use in speech recognition systems include uniform language models, finite state language models, and n-gram language models.
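By way of illustration only, the likelihood of one word appearing immediately after another can be sketched as a bigram (n-gram with n = 2) model estimated from word-pair counts. The function names `train_bigram` and `bigram_probability`, the `<s>` sentence-start token, and the three-sentence corpus are hypothetical and chosen for this example; a practical system would typically be trained on far more text and would apply smoothing for unseen word pairs.

```python
from collections import Counter, defaultdict

def train_bigram(sentences):
    """Count adjacent word pairs. Returns counts[prev][word] (hypothetical helper)."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        # "<s>" is an assumed marker for the start of a sentence.
        words = ["<s>"] + sentence.lower().split()
        for prev, word in zip(words, words[1:]):
            counts[prev][word] += 1
    return counts

def bigram_probability(counts, prev, word):
    """Maximum-likelihood estimate of P(word | prev); 0.0 if prev was never seen."""
    following = counts.get(prev, Counter())
    total = sum(following.values())
    return following[word] / total if total else 0.0

# Illustrative toy corpus, not taken from any real system.
corpus = ["open the file", "open the window", "close the file"]
model = train_bigram(corpus)

p_file = bigram_probability(model, "the", "file")      # 2 of 3 occurrences of "the"
p_window = bigram_probability(model, "the", "window")  # 1 of 3 occurrences of "the"
```

In this sketch, a recognizer choosing between the word candidates "file" and "window" after "the" could prefer the candidate with the higher estimated probability.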
Acoustic models for a particular speaker can be refined during the operation of the speech recognition system to improve the system's accuracy. That is, the speech recognition system can observe speech dictation as it occurs and can modify the acoustic model accordingly. Typically, an acoustic model can be modified when a speech recognition training program analyzes both a known word and the recorded audio of a spoken version of that word. In this way, the training program can associate particular acoustic waveforms with the corresponding phonemes contained within the spoken word.
The accuracy of a speech recognition system can be further improved by the use of a grammar. The grammar is a model which can be used to designate syntactically proper sentences which can be derived from a vocabulary of words. During processing, the grammar can be used to select between syntactically correct and incorrect outputs based on syntactical associations that can be represented in a state transition network. While a grammar can be effective, its usefulness can be limited. For example, in a speech recognition session, the speech recognition process is limited to traversal of only the exact paths defined by the state transition network. Furthermore, if multiple grammars are employed to improve accuracy by adding more context selections, then duplication of items among the various grammars can result in ambiguities between the grammars. In particular, when the same utterance matches a path in more than one state transition network, the system cannot determine which path is appropriate.
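The two limitations described above can be illustrated with a minimal sketch, assuming each grammar is stored as a dictionary that maps a state to its outgoing word transitions. The grammar contents, the state names, and the helpers `accepts` and `matching_grammars` are hypothetical and serve only to show the behavior; they are not drawn from any particular recognizer.

```python
# Each hypothetical grammar is a state transition network:
# state -> {word: next_state}. "START" and "END" are assumed state names.
FILE_COMMANDS = {
    "START": {"open": "VERB", "close": "VERB"},
    "VERB": {"the": "DET"},
    "DET": {"file": "END", "window": "END"},
}
WINDOW_COMMANDS = {
    "START": {"close": "VERB", "minimize": "VERB"},
    "VERB": {"the": "DET"},
    "DET": {"window": "END"},
}

def accepts(network, words, start="START", final="END"):
    """Return True only if the words trace an exact path from start to final."""
    state = start
    for word in words:
        transitions = network.get(state, {})
        if word not in transitions:
            return False  # no matching arc: the utterance leaves the network
        state = transitions[word]
    return state == final

def matching_grammars(grammars, words):
    """Return the names of every grammar whose network accepts the words."""
    return [name for name, net in grammars.items() if accepts(net, words)]

grammars = {"file_commands": FILE_COMMANDS, "window_commands": WINDOW_COMMANDS}

exact = accepts(FILE_COMMANDS, ["open", "the", "file"])   # follows a defined path
inexact = accepts(FILE_COMMANDS, ["open", "file"])        # skips "the", so rejected
ambiguous = matching_grammars(grammars, ["close", "the", "window"])
unique = matching_grammars(grammars, ["minimize", "the", "window"])
```

Here "open file" is rejected because it does not follow an exact path, and "close the window" is accepted by both grammars because the item is duplicated, so the match alone does not indicate which grammar's path applies.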
Another problem associated with using grammars can be the size of the grammar. The larger the grammar, the larger the state transition network, and the greater the number of paths that have to be traversed in the network. The additional time required to traverse the larger state transition network corresponds to increased processing time, which in turn translates to increased delay. The delays can result in the system losing its “natural” feel to users, and in certain applications, for example, telephony, this increased delay can be unacceptable and costly.
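As a rough illustration of how the number of paths can grow with grammar size, the following sketch counts the complete paths through a small acyclic state transition network. The helpers `build_command_grammar` and `count_paths`, the state names, and the word lists are hypothetical, and the sketch assumes the network contains no cycles.

```python
from functools import lru_cache

def build_command_grammar(verbs, objects):
    """Build a hypothetical 'VERB the OBJECT' network: state -> {word: next_state}."""
    return {
        "START": {verb: "VERB" for verb in verbs},
        "VERB": {"the": "DET"},
        "DET": {obj: "END" for obj in objects},
    }

def count_paths(network, start="START", final="END"):
    """Count distinct word sequences from start to final (acyclic networks only)."""
    @lru_cache(maxsize=None)
    def paths_from(state):
        total = 1 if state == final else 0
        # Each outgoing arc (word) contributes the paths available from its target.
        for next_state in network.get(state, {}).values():
            total += paths_from(next_state)
        return total
    return paths_from(start)

small = build_command_grammar(["open", "close"], ["file", "window"])
large = build_command_grammar(
    ["open", "close", "save", "print"],
    ["file", "window", "folder", "document"],
)

small_paths = count_paths(small)  # 2 verbs x 2 objects
large_paths = count_paths(large)  # 4 verbs x 4 objects
```

In this toy case, doubling the number of verbs and objects quadruples the number of paths, which suggests why traversal time can grow quickly as a grammar is enlarged.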