Automatic Speech Recognition (ASR) systems take a spoken utterance as input and return a recognized word sequence, and possibly a recognized semantic interpretation, as the result. The set of word sequences that an ASR system can recognize, together with the corresponding semantic interpretations, is typically specified using a grammar, which can be a context-free grammar, a statistical grammar, or a combination of both.
The following discussion concerns context-free grammars, some components of which may be references to external grammars of any type acceptable to the ASR.
There are a variety of formalisms used to specify context-free grammars. Various ASR platforms may use GSL (Grammar Specification Language), GRXML (Grammar XML), or ABNF (Augmented Backus-Naur Form). Entering grammars directly in one of these formats requires special expertise and is a time-consuming and error-prone process. Therefore, many tools for building speech applications provide means to specify grammars in other, much more convenient formats, which are then transformed into the final format understood by an ASR platform. Some tools may simply enumerate the full set of allowed phrases. Others may use GUI methods to specify data for grammar generation, e.g. using a table or spreadsheet (U.S. Pat. No. 5,995,918). Still others may use elements of GSL- or ABNF-like syntax and allow for the creation of grammar annotations on individual semantic concepts, which may then be combined to create final grammars. Finally, some systems avoid entering any portion of the grammar manually and, instead, generate grammars automatically from the names of the concepts to be recognized (US Patent Application 2006/0259294).
Practically all commercial ASR platforms also support some form of probabilistic annotation of context-free grammars, e.g. to specify the probability with which a sub-rule or an item may occur, or to specify the likelihood of skipping a particular sub-rule or item. Some of those annotations are, strictly speaking, weights used to quantitatively differentiate the importance of particular sub-rules.
Even if grammars do not have to be entered directly using GSL, GRXML, or ABNF, defining a grammar sometimes remains problematic. Two important factors that contribute to this are: (1) the multitude of utterances and their variations that a user may produce while referring to a single semantic interpretation; (2) the difficulty of directly reusing grammars or portions thereof in various contexts (grammars may have to be context specific).
Regarding the first factor (1), some reasons for the multitude of utterance variations are: (a) numerous word and phrase synonyms that may be applicable; (b) different grammatical variations of the same sentence or phrase; (c) omissions of words and phrases that are implied and/or superfluous given the known dialog context (this is one of the causes of the difficulty mentioned in point (2)).
Point (1c) can be illustrated with an example of a photo camera product recognition task. Suppose that individual grammar rules identify product models from a large manufacturer. As an example, the full name of a fictitious digital camera could be: ION Digital Ninja Zi DSLR. Very few users would utter this name in its entirety. A typical user would skip some of the words or sub-phrases; moreover, a user would be much more likely to skip words or phrases that are shared by many product names (and thus might not be considered distinguishing or worth mentioning), and much less likely to skip unique ones (those the user perceives as distinguishing the product). Therefore, to reflect this user behavior, the grammar for this product (semantic interpretation) has to allow certain words to be optional, accepting e.g. ION Ninja Zi or Ninja Zi DSLR. A trivial solution of making all components optional would generally not work well, as it might result in a higher recognition error rate: the grammar would accept too many unrealistic utterances with the same weight as realistic ones, regardless of the context.
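The observation that shared words are more likely to be skipped than unique ones can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name skip_weights and the rule "skip weight = fraction of product names containing the word" are hypothetical choices made for this sketch, not a method described above or a feature of any ASR platform. The product names are the fictitious ones used in this document.

```python
from collections import Counter

def skip_weights(product_names):
    """Assign each word of each product name a heuristic skip weight.

    Assumption for this sketch: a word's skip weight is the fraction of
    product names that contain it, so shared words score high and
    unique words score low.
    """
    tokenized = [name.split() for name in product_names]
    # Count in how many product names each word occurs (once per name).
    name_count = Counter(word for words in tokenized for word in set(words))
    total = len(tokenized)
    return {
        " ".join(words): {word: name_count[word] / total for word in words}
        for words in tokenized
    }

NAMES = [
    "ION Digital Ninja Zi DSLR",
    "ION Digital Ninja Fixa DSLR",
    "ION Digital Ninja Beta DSLR",
]

weights = skip_weights(NAMES)
for word, weight in weights["ION Digital Ninja Zi DSLR"].items():
    print(f"{word}: {weight:.2f}")
```

In this sketch the words common to all three names receive a higher skip weight than Zi, which mirrors the behavior described above; a real grammar would still need these raw weights mapped onto whatever probabilistic annotations the target platform supports.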
The second factor (2), the difficulty of reuse, can be illustrated with the same general example. Assume that a grammar corresponding to the semantic interpretation ION Digital Ninja Zi DSLR was entered manually. This grammar may have been tuned to a specific type of question with a specific set of possible products (semantic interpretations) answering the question. Reusing that grammar in a different context (for a different question), where the set of possible answers (products or semantic interpretations) is different, may result in undesired behavior such as a non-optimal recognition error rate. An example would be a grammar originally tuned to the full set of products, a portion of which is reused in a question about only a small subset of products including just ION Digital Ninja Zi DSLR, ION Digital Ninja Fixa DSLR and ION Digital Ninja Beta DSLR. In this new context, users are more likely to utter single words like “Zi” or “Fixa”, or short phrases including them, because they are meaningful given the context; therefore, the grammar would have to be modified to accept these words. However, in the larger context, it may be extremely unlikely that users would utter such short product descriptions; therefore, having them in the grammar may degrade recognition accuracy. Typically, to avoid this problem, one would have to maintain different grammars depending on the context in which they are used, which again increases the amount of work and the chance of error, especially if the system has to be modified in the future.
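The context dependence described above can be made concrete with a small Python sketch that derives the accepted phrases for one product from the set of candidate answers, instead of storing a fixed grammar. It rests on a deliberately crude assumption that is not taken from the text: a word is optional only if every candidate name in the context contains it, and required otherwise. The function name accepted_phrases and the extra product "ION Pocket Snap" are hypothetical and exist only to form a larger context.

```python
from itertools import combinations

def accepted_phrases(target, context):
    """Enumerate phrases accepted for `target` within `context`.

    Assumption for this sketch: words shared by every name in the
    context are optional; all other words of the target are required.
    """
    names = [name.split() for name in context]
    shared = set(names[0]).intersection(*names[1:])
    words = target.split()
    optional = [i for i, word in enumerate(words) if word in shared]
    phrases = set()
    # Drop every possible subset of the optional word positions.
    for k in range(len(optional) + 1):
        for dropped in combinations(optional, k):
            kept = [w for i, w in enumerate(words) if i not in dropped]
            if kept:
                phrases.add(" ".join(kept))
    return phrases

SMALL_CONTEXT = [
    "ION Digital Ninja Zi DSLR",
    "ION Digital Ninja Fixa DSLR",
    "ION Digital Ninja Beta DSLR",
]
# Hypothetical additional product, standing in for a larger catalog.
LARGE_CONTEXT = SMALL_CONTEXT + ["ION Pocket Snap"]

target = "ION Digital Ninja Zi DSLR"
print("Zi" in accepted_phrases(target, SMALL_CONTEXT))
print("Zi" in accepted_phrases(target, LARGE_CONTEXT))
```

Under this rule the single word "Zi" is accepted in the three-product context but not in the larger one, so one definition yields different phrase sets per context. The rule is stricter in the larger context than real user behavior would warrant; it is meant only to show why a single hand-tuned grammar does not transfer between contexts.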
The example from point (1c) has a special aspect that applies when there are relationships between the semantic concepts, in particular when some concepts are (perceived as) more generic while others are more specific. For example, suppose the ION Digital Ninja Zi DSLR product model is a member of a Digital Ninja product family that includes numerous product models. Suppose also that, in answering the most general product model question, a user tries to say the name of one of the products in the Digital Ninja family but omits one of the required product name components. In such a case, rather than misrecognizing or rejecting the utterance completely, the preferable behavior would be to recognize at least the more generic concept (product family) and then ask a follow-up question to get to the more specific concept (product model). This means that the grammar for the generic concept (product family) has to be defined so as to capture utterances that vaguely identify (some of) the specific concepts, in addition to capturing utterances that identify the generic concept directly.
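The fallback from a specific concept to a generic one can be sketched as a simple word-set lookup in Python. This is not a grammar and not how an ASR platform matches utterances; it only illustrates the desired outcome. The choice of required words per model, the family keyword set, and the function name interpret are all assumptions made for this sketch.

```python
FAMILY = "Digital Ninja"

# Assumption: each model is identified by one required, distinguishing word.
MODELS = {
    "ION Digital Ninja Zi DSLR": {"Zi"},
    "ION Digital Ninja Fixa DSLR": {"Fixa"},
    "ION Digital Ninja Beta DSLR": {"Beta"},
}

# Assumption: this word alone is enough to suggest the product family.
FAMILY_WORDS = {"Ninja"}

def interpret(utterance):
    """Return (level, concept) for a recognized word sequence."""
    words = set(utterance.split())
    # A model matches when all of its required words were uttered.
    matches = [m for m, required in MODELS.items() if required <= words]
    if len(matches) == 1:
        return ("model", matches[0])
    # Otherwise fall back to the generic concept if it is hinted at.
    if words & FAMILY_WORDS:
        return ("family", FAMILY)
    return ("reject", None)

print(interpret("ION Ninja Zi"))
print(interpret("ION Digital Ninja DSLR"))
```

In the second call the distinguishing word is missing, so the sketch returns the product family rather than rejecting the utterance; a dialog system could then ask a follow-up question to narrow the answer down to a product model.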