All modern speech recognition technologies rely on matching user utterances, i.e., spoken words or speech, to internal representations of sounds and then comparing groupings of sounds to data files of words. The data files may be dictionary files or grammar files.
Dictionary files contain data of sound representations corresponding to individual words. Grammar files contain data of sound representations corresponding to syntactically correct sentence structures. The comparison of the groupings of sounds to the word data may rely on dictionary files, a method commonly referred to as “dictation”, or on grammar files, a method commonly referred to as “command and control”. Typically, either dictionary files or grammar files are used, but not both. In other words, a speech recognition engine tends to use either the dictation method or the command and control method and rarely mixes the two.
When dictionary files are used for pattern matching, groups of sounds are matched against individual words. Because individual words are to be matched, the comparison must be made against a large number of sound groupings. In order to identify a match from this large pool, the confidence threshold for the comparison tends to be set to a lower value, which generally leads to lower recognition accuracy.
To improve dictation recognition, a technology called language models may be used. Using this technology, a large number of relevant corpora are first analyzed to generate a sophisticated statistical representation of likely sentence construction. The statistical information may include correlation between words, frequency of certain phrases and word patterns, or the like. During dictation speech recognition, the statistical information from the language models may be used to weigh matches of groups of sounds to groups of words. The additional statistical information permits the confidence threshold to be set higher than would otherwise be practical for dictation recognition, thus improving the recognition accuracy.
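The weighting described above may be illustrated with a minimal Python sketch. It assumes a simple bigram model with add-one smoothing, which is only one of many possible statistical representations, and it assumes each candidate word sequence arrives with an acoustic log score. The function names, the toy corpus, the scores and the `lm_weight` parameter are all hypothetical and are not drawn from any particular speech recognition engine.

```python
import math
from collections import Counter


def train_bigram_model(sentences):
    """Collect bigram statistics from a corpus given as a list of sentences."""
    context_counts = Counter()  # how often each word appears as a left context
    bigram_counts = Counter()   # how often each (previous, current) pair appears
    vocabulary = set()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split()
        vocabulary.update(tokens[1:])
        context_counts.update(tokens[:-1])
        bigram_counts.update(zip(tokens, tokens[1:]))
    return context_counts, bigram_counts, len(vocabulary)


def language_model_log_prob(words, model):
    """Log probability of a word sequence under the add-one smoothed bigram model."""
    context_counts, bigram_counts, vocab_size = model
    tokens = ["<s>"] + [w.lower() for w in words]
    total = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        numerator = bigram_counts[(prev, cur)] + 1
        denominator = context_counts[prev] + vocab_size
        total += math.log(numerator / denominator)
    return total


def pick_best_candidate(candidates, model, lm_weight=1.0):
    """Choose the candidate with the best combined acoustic and language score.

    candidates: list of (words, acoustic_log_score) pairs (hypothetical input shape).
    """
    def combined(candidate):
        words, acoustic_log_score = candidate
        return acoustic_log_score + lm_weight * language_model_log_prob(words, model)

    return max(candidates, key=combined)[0]


# Toy corpus and made-up acoustic scores, for illustration only.
model = train_bigram_model(["i like blue", "i like red", "they prefer yellow"])
candidates = [
    (["i", "like", "blue"], -5.0),
    (["eye", "lake", "blue"], -4.8),  # slightly better acoustic score
]
best = pick_best_candidate(candidates, model)
```

In this sketch the acoustically preferred but implausible candidate loses once the language model statistics are taken into account, which is the sense in which the statistics weigh the matches.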
When creating a language model, a relevant corpus, i.e., a collection of written text relevant to a particular knowledge area, may be analyzed. Typically, corpora for creating or establishing language models consist of magazine articles, newspapers or other written material. Once a corpus is compiled, it is often fed to a language model tool or language model generator so that statistical information may be generated from the corpus. However, there tends to be a difference between written expressions and oral expressions. Additionally, there may be a difference between written material and live dialogues. Language models generated from written material therefore may not provide statistical information consistent with spoken language. The recognition accuracy of a conversation tends to suffer as a result.
When grammar files are used, groups of sounds are compared with exact constructions of utterances, here generally referred to as grammar rules. Each grammar rule usually contains a fairly limited vocabulary. The small number of words that have to be identified in a grammar rule generally leads to higher recognition accuracy.
Grammar rules are pattern matching rules that may parse grammatically correct sentences. Grammar rules themselves do not have to be grammatically correct sentences. For example, a grammar rule may have the form
        [I|we|you|he|she|they|it] [like|want|prefer|love] [red|blue|yellow|green]
Each pair of brackets represents a placeholder for a word at that position in a sentence. Words enclosed by each pair of brackets are option words that may be selected for that position. The grammar rule shown here may correctly parse sentences such as “I like blue” or “they prefer yellow”. Grammar rules permit the construction of a wide range of candidate sentences from a compact representation. Appropriate grammar rules, instead of a large pool of all possible individual candidate words, may be selected for each comparison. As noted, each grammar rule tends to have a far more limited number of candidate words. Thus, a relatively higher threshold may be set for a comparison, which generally leads to higher recognition accuracy.
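The bracketed notation above may be illustrated with a short Python sketch that parses the example rule into placeholders, tests whether a sentence fits it, and enumerates the sentences the rule can generate. The function names are hypothetical, and the sketch operates on text rather than on sound representations, so it shows only the structure of a grammar rule and not how a recognition engine would score it.

```python
import re
from itertools import product

RULE = "[I|we|you|he|she|they|it] [like|want|prefer|love] [red|blue|yellow|green]"


def parse_rule(rule):
    """Split a bracketed rule into a list of placeholders, each a list of option words."""
    return [slot.split("|") for slot in re.findall(r"\[([^\]]*)\]", rule)]


def rule_matches(placeholders, sentence):
    """Return True if the sentence has one allowed option word at every position."""
    words = sentence.lower().split()
    if len(words) != len(placeholders):
        return False
    return all(
        word in {option.lower() for option in options}
        for word, options in zip(words, placeholders)
    )


def expand_rule(placeholders):
    """List every sentence that can be built by picking one option per placeholder."""
    return [" ".join(choice) for choice in product(*placeholders)]


placeholders = parse_rule(RULE)
all_sentences = expand_rule(placeholders)  # 7 * 4 * 4 candidate sentences
```

The single rule covers 7 × 4 × 4 = 112 sentences while requiring a choice among at most seven option words at any one position, which illustrates the compact representation and the limited number of candidate words per comparison.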
While the use of grammar files may dramatically reduce the number of candidate words to be matched, i.e., recognized, the construction of grammar rules tends to be tedious and, when done manually, error-prone. For example, each list of option words may require careful consideration during the construction of each grammar rule. When creating grammar rules manually, people tend not to make the rules as complex and as comprehensive as possible, that is, they tend not to enter as many option words as desirable for each placeholder in every grammar rule. This may limit the range of utterances that may be recognized by a speech recognition engine utilizing these grammar rules. Any errors in the option words entered, or omissions of option words from grammar rules, may also lead to errors in the recognition result.
In addition, when grammar files are used, it is known to direct a speech recognition engine to load, i.e., to use, different grammar rules depending on the context of the speech to be recognized. This requires that similar but not identical grammar rules be created for each context that may be anticipated. This may dramatically multiply the task of creating grammar rules manually and tends to make the manual creation of grammar rules even more tedious and error-prone.
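Context-dependent loading of grammar rules may be sketched as follows. The context names, the rules and the function name are hypothetical, chosen only to show how similar but not identical rules must be maintained for each anticipated context; the sketch again matches text rather than sound representations.

```python
# Hypothetical grammar rules, keyed by a hypothetical conversation context.
# Each rule is a list of placeholders; each placeholder is a list of option words.
GRAMMAR_RULES_BY_CONTEXT = {
    "ordering": [
        [["i", "we"], ["want", "prefer"], ["red", "blue", "yellow", "green"]],
    ],
    "feedback": [
        [["i", "we"], ["like", "love"], ["red", "blue", "yellow", "green"]],
    ],
}


def recognize_in_context(sentence, context):
    """Return True if any rule loaded for the given context parses the sentence."""
    words = sentence.lower().split()
    for rule in GRAMMAR_RULES_BY_CONTEXT.get(context, []):
        if len(words) == len(rule) and all(
            word in options for word, options in zip(words, rule)
        ):
            return True
    return False
```

The two rules differ only in their middle placeholder, yet each has to be written and checked separately. The same utterance, for example "I want red", is recognized in the "ordering" context and rejected in the "feedback" context.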
While some speech recognition engines may be able to load several different grammar files and arrange them in a hierarchy, i.e., a pre-determined sequence in which the grammar files are searched for matches, the pre-determined hierarchy may not best suit each actual conversation to be recognized. Additionally, pre-created grammar rules may not be optimally tailored for use by a speech recognition engine in all conversation contexts. It is therefore an object of the present invention to obviate or mitigate the above disadvantages.