A. Related Applications
This application is related to U.S. Ser. No. 08/235,046, entitled "Speech Interpreter With A Unified Grammar Compiler," filed Apr. 29, 1994, now U.S. Pat. No. 5,642,519, and incorporated herein by reference.
B. Field of the Invention
This invention relates generally to speech comprehension and, more particularly, to voice-activated command and control systems. The present invention facilitates the process of recognizing speech by automatically eliminating many nonsensical word sequences from consideration.
C. Description of the Related Art
Speech recognition systems generally include two main elements: a speech recognizer and an interpreter. The speech recognizer converts sound input into sequences of words. The interpreter then tries to understand the input by determining the relevant meaning of the words.
To achieve useful recognition rates, conventional speech recognition systems impose constraints other than word lists, such as by specifying a grammar that delineates allowable or acceptable word sequences or by providing statistical likelihoods for word sequences.
Programmers build the grammars by hand or with rule compilers as context-free formalisms determining the acceptable word sequences. Hand-built grammars generally provide fine control over the word sequences recognized, but their construction is difficult and painstaking, even for the relative few who are initiated in the art.
The statistical models, on the other hand, use tables of the probabilities of each word (unigram), each word pair (bigram), and each word triple (trigram). Some researchers have experimented with extending the statistical systems to include n-grams where "n" is higher than 3, but such models generally express only the probabilities of adjacent words. Statistical grammars are built automatically by running an analysis program over an appropriate collection of the kinds of sentences that one wishes to recognize. A prime example of this technology is the ARPA-funded Wall Street Journal dictation project. In that project, researchers train their recognizers on the text of previously printed articles from the Wall Street Journal and test them on text read from a later edition of the Journal. Although the database of Wall Street Journal text used in these experiments contains approximately 44 million words, some of the researchers using this database have indicated that their speech recognition systems would work better with a more complete training set.
In the domain of voice-activated command and control systems, one use for a speech recognition system, the utterances to be recognized do not correspond directly to any existing body of text that could be used analogously to the Wall Street Journal text's role in training the dictation recognizers. Traditional statistical modeling requires a huge database of expected utterances. Statistical models do not have any abstraction of the words, so the actual co-occurrence of words is necessary to count the relative frequency of each. This means that any word pair that does not occur in the training data would be assigned the most unlikely bigram probability.
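By way of illustration only, the following Python sketch shows why a purely count-based bigram model cannot credit a word pair that is absent from its training data. The two training sentences, the function names, and the `floor` value are assumptions made for this example and are not taken from any system described above.

```python
from collections import Counter

def train_bigrams(sentences):
    """Count adjacent word pairs and how often each word starts a pair."""
    histories, bigrams = Counter(), Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        histories.update(words[:-1])           # every word that has a successor
        bigrams.update(zip(words, words[1:]))  # every adjacent pair
    return histories, bigrams

def bigram_probability(histories, bigrams, first, second, floor=1e-6):
    """Relative frequency of `second` after `first`; unseen pairs get `floor`."""
    pair_count = bigrams[(first, second)]
    if pair_count == 0:
        return floor  # hypothetical stand-in for "the most unlikely probability"
    return pair_count / histories[first]

# Hypothetical training data: two utterances only.
TRAINING = ["show me denim jeans", "show me canvas jackets"]
HISTORIES, BIGRAMS = train_bigrams(TRAINING)
```

In this sketch, `bigram_probability(HISTORIES, BIGRAMS, "denim", "jackets")` returns the floor value even though both words appear in the training sentences, because the pair itself never occurred.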
One application for a voice-activated command and control system allows speakers to query the contents of a computerized catalog of products. Such an application requires a grammar that recognizes action words and phrases, such as "can you show me <item>?" or "what <item> do you carry?" in a spoken query, with the speaker supplying a phrase that specifies the item of interest. One over-simplified grammar of such item specification phrases would allow any basic item such as "pants" to be modified by any combination of style family, pattern style, color, size, gender, age, fabric type, fabric style, maker's name, etc. A particular sweater could thus be called "the petite women's medium pink jewel-neck cashmere fine-knit 'drifter' sweater." Such an accepting grammar could perform at acceptable levels for extracting the meaning of an input word combination from a written form of the item description.
The perplexity of the grammar produced by the cross-product of all these choices is, however, so large that the word accuracy of the speech recognition becomes uselessly low when such a loosely constrained grammar is used. The speech recognition system would accept phrases that no user would ever utter, for example, "the casual cashmere diaper bag."
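The growth of such a cross-product grammar can be sketched as follows. This is a minimal illustration under stated assumptions: the three modifier slots, their vocabularies, and the basic items are hypothetical, each slot is treated as optional and in fixed order, and the number of accepted phrases is used only as a rough proxy for perplexity.

```python
from math import prod

# Hypothetical modifier vocabularies, listed in the order they may appear.
MODIFIER_SLOTS = {
    "style": ["casual", "dress"],
    "color": ["pink", "navy", "black"],
    "fabric": ["cashmere", "denim", "lace"],
}
BASIC_ITEMS = ["sweater", "jeans", "diaper bag"]

def phrase_count(slots, items):
    """Number of accepted phrases when every slot is optional."""
    return prod(len(words) + 1 for words in slots.values()) * len(items)

def accepts(phrase, slots, items):
    """True if the phrase is in-order optional modifiers plus a basic item."""
    slot_lists = list(slots.values())
    for item in items:
        if phrase == item or phrase.endswith(" " + item):
            modifiers = phrase[: len(phrase) - len(item)].split()
            slot = 0
            for word in modifiers:
                # Skip ahead to the first remaining slot that contains the word.
                while slot < len(slot_lists) and word not in slot_lists[slot]:
                    slot += 1
                if slot == len(slot_lists):
                    return False  # unknown word or modifiers out of order
                slot += 1
            return True
    return False
```

With these toy vocabularies, `phrase_count(MODIFIER_SLOTS, BASIC_ITEMS)` is 144, and `accepts("casual cashmere diaper bag", MODIFIER_SLOTS, BASIC_ITEMS)` is true. Each additional slot multiplies the total, which is why the nine or more modifier categories named above yield an unmanageably large set of word sequences.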
If a lexicon contained lexical or word entries for every modifier marked with a feature containing the set of things it could realistically modify, or a set of classes of things, then the grammar could be written to allow only the reasonable or acceptable combinations and to rule out the ridiculous ones that should be omitted. This would in turn reduce the perplexity created by loosely constrained grammars. With a grammar compiler that accepts such restrictions based on features in the lexicon, such a markup seems like a possible solution. The grammar could, for instance, record classes of basic items, noting that, for example, "chinos" and "jeans" are "tough clothing" and then only allow them to be associated with fabrics appropriate for "tough" clothes. This would block "lace chinos" but allow "silk blouse" and "denim jeans." The disadvantage of this approach is that it requires a grammar writer to figure out and record the features that determine allowable modifiers as well as a large amount of detailed work to make the annotations in the lexicon.
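A hand-annotated lexicon of this kind might look like the following sketch. The dictionary layout and the function name are assumptions made for illustration, and the "delicate" class is hypothetical; only the "tough" class and the three example phrases come from the discussion above.

```python
# Hand-assigned class for each basic item ("delicate" is a hypothetical class).
ITEM_CLASS = {
    "chinos": "tough",
    "jeans": "tough",
    "blouse": "delicate",
}

# Feature on each fabric modifier: the item classes it may realistically modify.
FABRIC_MODIFIES = {
    "denim": {"tough"},
    "lace": {"delicate"},
    "silk": {"delicate"},
}

def allowed(fabric, item):
    """True if the lexicon's features permit this fabric to modify this item."""
    item_class = ITEM_CLASS.get(item)
    if item_class is None:
        return False  # item not in the lexicon
    return item_class in FABRIC_MODIFIES.get(fabric, set())
```

Here `allowed("denim", "jeans")` and `allowed("silk", "blouse")` hold while `allowed("lace", "chinos")` does not. Every entry in both tables has to be chosen and typed in by a grammar writer, which is the manual burden noted above.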
Although a loosely constrained grammar permits recognition of unacceptable word sequences, tighter constraints based on exactly the items in the catalog would refuse certain acceptable word sequences. Such a system would not recognize combinations of modifiers and basic items that, while reasonable to the speaker, are not specified exactly in the catalog. For example, if the catalog had "canvas jackets" and "denim jeans" but no "denim jackets," then the speech recognition system with such a restricted grammar built from only catalog item descriptions could not understand the phrase "denim jacket." Presented with those sounds, the system might produce something like the "d'women jacket" pronunciation of "the women['s] jacket," but it could not understand what the user said, i.e., interpret what the user said into a hypothetical catalog item. This would be baffling to a naive user of the system, especially since rephrasing his request to include "a jacket made of denim" would also fail.
There is therefore a need for a speaker-independent speech recognition system for a multitude of command and control applications that uses a flexible grammar to limit acceptable word sequences in a manner that improves word accuracy in the speech recognition process.