People often wish to tell robots what they would like done. Being constrained to a specific set of commands is awkward. Therefore, a free-form interface that supports natural human-robot interaction is desirable.
A finite state transducer is a finite automaton whose state transitions are labeled with both input and output labels. A path through the transducer encodes a mapping from an input symbol sequence to an output symbol sequence. A grammar is a structure that defines a set of words or phrases that a person is expected to say and the ways in which those words can be, or are expected to be, combined.
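The transducer definition above can be illustrated with a minimal sketch. The transition table, the state numbers, and the output label `TASK_FETCH_COFFEE` are hypothetical and chosen only for illustration; the sketch assumes a deterministic transducer in which an empty output string stands for "no output on this transition".

```python
from typing import Dict, List, Optional, Set, Tuple

# (current state, input symbol) -> (next state, output symbol)
Transitions = Dict[Tuple[int, str], Tuple[int, str]]


def transduce(
    transitions: Transitions,
    start: int,
    finals: Set[int],
    inputs: List[str],
) -> Optional[List[str]]:
    """Follow one path through a deterministic transducer.

    Returns the output symbol sequence, or None if the input sequence
    is not accepted (no matching transition, or a non-final end state).
    """
    state = start
    outputs: List[str] = []
    for symbol in inputs:
        key = (state, symbol)
        if key not in transitions:
            return None  # the input falls outside the transducer
        state, out = transitions[key]
        if out:  # an empty string means the transition emits nothing
            outputs.append(out)
    return outputs if state in finals else None


# Hypothetical toy transducer: several phrasings map to one task label.
TOY_TRANSITIONS: Transitions = {
    (0, "bring"): (1, ""),
    (0, "fetch"): (1, ""),
    (1, "me"): (2, ""),
    (1, "coffee"): (3, "TASK_FETCH_COFFEE"),
    (2, "coffee"): (3, "TASK_FETCH_COFFEE"),
}
TOY_FINALS = {3}

if __name__ == "__main__":
    print(transduce(TOY_TRANSITIONS, 0, TOY_FINALS, "bring me coffee".split()))
    print(transduce(TOY_TRANSITIONS, 0, TOY_FINALS, "bring tea".split()))
```

In this sketch, both "bring me coffee" and "fetch coffee" follow accepting paths that emit the same label, while a phrase with no matching path is rejected, which corresponds to an out-of-grammar utterance.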
It is a challenge to develop a grammar that recognizes a large variety of phrases and achieves high recognition accuracy. Manual creation of a set of grammar rules can be very tedious. In many cases, the out-of-grammar rate obtained with hand-crafted grammar rules is high because of poor coverage.
In contrast to a grammar, a Statistical Language Model (SLM) assigns probabilities to sequences of words. SLMs are appropriate for recognizing free-style speech, especially when the out-of-grammar rate is high. They are trained from a set of examples, which are used to estimate the probabilities of combinations of words. Training sets for SLM grammars are often collected from users as they interact with the particular application. Over time, the SLM grammar is refined to recognize the statistically significant phrases.
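As a minimal sketch of how word-combination probabilities might be estimated from examples, the following assumes a bigram model with add-one smoothing. The text above does not specify a model order or a smoothing method, so both are assumptions, and the class name `BigramModel` and the training phrases are hypothetical.

```python
from collections import Counter
from typing import Iterable, List


class BigramModel:
    """Bigram language model with add-one (Laplace) smoothing."""

    def __init__(self, sentences: Iterable[str]) -> None:
        self.context_counts: Counter = Counter()
        self.bigram_counts: Counter = Counter()
        vocab = set()
        for sentence in sentences:
            tokens = self._tokenize(sentence)
            vocab.update(tokens[1:])  # every token that can be predicted
            for prev, word in zip(tokens, tokens[1:]):
                self.context_counts[prev] += 1
                self.bigram_counts[(prev, word)] += 1
        self.vocab_size = len(vocab)

    @staticmethod
    def _tokenize(sentence: str) -> List[str]:
        # <s> and </s> mark the sentence start and end.
        return ["<s>"] + sentence.lower().split() + ["</s>"]

    def prob(self, prev: str, word: str) -> float:
        """Smoothed estimate of P(word | prev)."""
        numerator = self.bigram_counts[(prev, word)] + 1
        denominator = self.context_counts[prev] + self.vocab_size
        return numerator / denominator

    def sentence_prob(self, sentence: str) -> float:
        """Probability of a sentence as a product of bigram estimates."""
        tokens = self._tokenize(sentence)
        p = 1.0
        for prev, word in zip(tokens, tokens[1:]):
            p *= self.prob(prev, word)
        return p


if __name__ == "__main__":
    # Hypothetical training phrases, standing in for collected user data.
    model = BigramModel(["bring me coffee", "bring me tea", "fetch the coffee"])
    print(model.sentence_prob("bring me coffee"))
    print(model.sentence_prob("coffee me bring"))
```

Unlike a grammar, this model never rejects a word sequence outright: an unseen ordering such as "coffee me bring" still receives a small non-zero probability, but a lower one than an ordering observed in training.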
Conventional systems typically define the grammar manually, which is costly. Other techniques use commercial speech recognition systems that rely on keyword extraction to recognize spoken utterances and can use semantic information to extract associations between the utterance and the knowledge representation. Such techniques can be used for both speech understanding and speech generation. One example is Steedman, M., Wide-coverage Semantic Representations from a CCG Parser, Proceedings of the 20th International Conference on Computational Linguistics (2004), which is incorporated by reference herein in its entirety.
Another conventional system uses transcribed parent-child speech to perform unsupervised learning of linguistic structures from the corpus. See Solan, Z., Horn, D., Ruppin, E., Edelman, S., Unsupervised Context Sensitive Language Acquisition from a Large Corpus, http://adios.tau.ac.il/papers/soletalb2003.pdf (2003), which is incorporated by reference herein in its entirety. In this example, significant patterns of words are extracted and represented in trees to generalize to variations in unseen text. In another system, a generative probabilistic model is used for unsupervised learning of natural language syntactic structure. However, such systems do not learn a context free grammar (CFG) but, rather, induce a distributional model based on constituent identity and linear context.
A conventional grammar generation system collects structured transcripts from Wizard of Oz based user tests. Participants spoke instructions, and a wizard (a real person using the Microsoft NetMeeting tool) captured each spoken phrase in a transcript and manually performed the requested email management task. The Context Free Grammar (CFG) was then generated by parsing the transcript text. This approach required labor-intensive transcript collection, and its scope was limited to only a few tasks. See Sinha, A. K., Landay, J. A., Towards Automatic Speech Input Grammar Generation for Natural Language Interfaces, CHI 2000 Workshop on Natural Language Interfaces, The Hague, The Netherlands (2000), which is incorporated by reference herein in its entirety.
Another system implemented an automated customer service agent to interact with users wanting to browse and order items from an online catalog. A small representative set of utterances (on the order of hundreds of sentences or phrases) was combined with an overly permissive grammar to generate a tighter grammar for the domain. This approach required tagging of lexical entries and manual writing of rules to enforce semantic restrictions among lexical entries. Such a system is described in greater detail in Martin, P., The Casual Cashmere Diaper Bag: Constraining speech recognition Using Examples, Proceedings of the Association of Computational Linguistics (1997), which is incorporated by reference herein in its entirety.
Among free-form recognition approaches that use a grammar, another system recognizes commonly used phrases, instead of individual words, in order to categorize responses to an open-ended prompt such as "How may I help you?" Phrases were evaluated and selected via perplexity minimization and clustered using a similarity metric. Such a system is described in Riccardi, G., Bangalore, S., Automatic Acquisition of Phrase Grammars for Stochastic Language Modeling, 6th Workshop on Very Large Corpora, Montreal (1998) 186-198, which is incorporated by reference herein in its entirety.
What is needed is a system and method to automate the process of creating a grammar, e.g., a Finite State Grammar Transducer (FSGT), that maps utterances to task labels from text data contributed by volunteers over the web. Since multiple users on the web contribute knowledge, the resulting grammar is likely to have better coverage than one built from contributions by a small number of people exhaustively thinking of ways to ask the robot to perform a particular task.