1. Field of the Invention
This invention relates to the creation of grammar networks that regulate, control, and define the content and scope of human-machine interaction in natural language voice user interfaces (NLVUI). More particularly, the invention relates to phrase-based modeling of generic structures of verbal interaction and use of these models for the purpose of automating part of the design of such grammar networks. Most particularly, the invention relates to the use of such grammar networks in providing a voice-controlled user interface to human readable text data that is also machine readable (such as a Web page, a word processing document, a PDF document, or a spreadsheet).
2. Related Art
Voice user interfaces enable control of devices via voice commands transmitted through a microphone or telephone handset and decoded by a speech recognizer. These interfaces supplement or replace conventional input modalities such as a keyboard or a telephone touch-tone pad, and are increasingly deployed in a wide range of situations where keyboard input is either inconvenient or impossible, e.g., to control home appliances, automotive devices, or applications accessed via the telephone. In recent years, a number of routine over-the-phone transactions such as voice dialing and collect call handling, as well as some commercial call center self-service applications, have been successfully automated with speech recognition technology. Such systems allow users to remotely access, for example, a banking application or ticket reservation system, and to retrieve information or complete simple transactions by using voice commands. Increasingly, voice control is being deployed to access the Internet by phone for the purpose of retrieving information or completing Internet-based commercial transactions such as making an on-line purchase.
a. Limitations and Unsolved Problems in Current Technology
Current technology limits the design of voice-controlled user interfaces in terms of both complexity and portability. Systems must be designed for a clearly defined task domain, and users are expected to respond to system prompts with short, fixed voice commands. Systems typically work well as long as vocabularies remain relatively small (200-500 words), choices at any point in the interaction remain limited, and users interact with the system in a constrained, disciplined manner.
There are two major technological barriers that need to be overcome in order to create systems that allow for more spontaneous user interaction: (1) systems must be able to handle more complex tasks, and (2) the speech interface must become more “natural” if systems are expected to perform sophisticated functions based on unrestrained, natural speech or language input.
A major bottleneck is the complexity of the recognition grammar that enables the system to recognize natural language voice commands, interpret their meaning correctly, and respond appropriately. As indicated above, this grammar must anticipate, and thus explicitly spell out, the entire virtual space of possible user requests and/or responses to any given system prompt. To keep choices limited, the underlying recognition grammars typically process requests in a strictly predetermined, menu-driven order.
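The growth of such a grammar can be illustrated with a minimal sketch. The slot alternatives below are hypothetical and are not taken from any deployed system; the example only assumes that a hand-written grammar lists alternatives for each position in a request and that every combination must be accepted by the recognizer.

```python
from itertools import product

# Hypothetical alternatives for each position in one "check balance" request.
SLOTS = [
    ["", "please", "could you"],
    ["show me", "tell me", "what is", "give me"],
    ["my", "the"],
    ["checking", "savings"],
    ["balance", "account balance"],
]


def enumerate_utterances(slots):
    """Expand every combination of slot alternatives into a flat list of utterances."""
    return [" ".join(word for word in combo if word) for combo in product(*slots)]


utterances = enumerate_utterances(SLOTS)
print(len(utterances))  # 3 * 4 * 2 * 2 * 2 = 96 variants for a single request
```

Even this single request, with only a handful of alternatives per position, expands to 96 distinct word sequences, which suggests why grammars covering whole dialogues are typically kept menu-driven.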
Another problem is portability. Current systems must be task-specific, that is, they must be designed for a particular domain. An automated banking application cannot process requests about the weather, and, conversely, a system designed to provide weather information cannot complete banking transactions. Because recognition grammars are designed by hand and model domain-specific rather than generic machine-human interaction, they cannot be easily modified or ported to another domain. Reusability is limited to certain routines that may be used in more than one system. Such routines consist of subgrammars for yes-no questions or personal user data collection required in many commercial transactions (e.g., for collecting names, addresses, credit card information, etc.). Usually, designing a system in a new domain means starting entirely from scratch.
Even though the need for generic dialogue models is widely recognized and a number of systems claim to be portable, no effective and commercially feasible technology for modeling generic aspects of conversational dialogue currently exists.
b. Current System Design and Implementation
The generated dialogue flow and the recognition grammar can be dauntingly complex for longer interactions. The reason is that users always manage to come up with new and unexpected ways to make even the simplest request, and all potential input variants must be anticipated in the recognition grammar. Designing such recognition grammars, a task usually performed by trained linguists, is extremely labor-intensive and costly. It typically starts with a designer's guess of what users might say and requires hours of refinement as field data is collected from real users interacting with a system simulation or a prototype.
c. Stochastic Versus Rule-Based Approaches to Natural Language Processing
Since its beginnings, speech technology has oscillated between rule-governed approaches based on human expert knowledge and those based on statistical analysis of vast amounts of data. In the realm of acoustic modeling for speech recognition, probabilistic approaches have far outperformed models based on expert knowledge. In natural language processing (NLP), on the other hand, the rule-governed, theory-driven approach continued to dominate the field throughout the 1970s and 1980s.
In recent years, the increasing availability of large electronic text corpora has led to a revival of quantitative, computational approaches to NLP in certain domains.
One such domain is large vocabulary dictation. Because dictation covers a much larger domain (typically a 30,000 to 50,000 word vocabulary) than interactive voice-command systems and does not require an interpretation of the input, dictation systems deploy a language model rather than a recognition grammar to constrain the recognition hypotheses generated by the signal analyzer. A language model is computationally derived from large text corpora in the target domain (e.g., news text). N-gram language models contain statistical information about recurrent word sequences (word pairs, or combinations of 3, 4, or n words). They estimate the likelihood that a given word is followed by another word, thus reducing the level of uncertainty in automatic speech recognition. For example, the word sequence “A bear attacked him” will have a higher probability in Standard English usage than the sequence “A bare attacked him.”
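The effect of an n-gram model on the “bear”/“bare” example can be sketched as follows. This is a minimal bigram model with add-one smoothing, trained on a four-sentence toy corpus invented for illustration; the corpus, the function names, and the choice of smoothing are assumptions, and a practical language model would be estimated from far larger text collections.

```python
from collections import Counter

# Toy corpus standing in for a large text corpus (hypothetical data).
CORPUS = [
    "a bear attacked him",
    "a bear chased him",
    "the bear attacked the camp",
    "he walked with bare feet",
]


def train_bigram(sentences):
    """Count unigrams and bigrams, padding each sentence with <s> and </s>."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def sentence_probability(sentence, unigrams, bigrams):
    """Multiply add-one smoothed bigram probabilities P(word | previous word)."""
    vocab_size = len(unigrams)
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        prob *= (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)
    return prob


unigrams, bigrams = train_bigram(CORPUS)
p_bear = sentence_probability("a bear attacked him", unigrams, bigrams)
p_bare = sentence_probability("a bare attacked him", unigrams, bigrams)
print(p_bear > p_bare)  # True: "a bear" and "bear attacked" occur in the corpus
```

Because the two candidate sentences are acoustically identical, the recognizer can use the difference in these scores to prefer the more plausible word sequence.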
Another domain where probabilistic models are beginning to be used is automated part-of-speech analysis. Part-of-speech analysis is necessary in interactive systems that require interpretation, that is, a conceptual representation of a given natural language input. Traditional part-of-speech analysis draws on explicit syntactical rules to parse natural language input by determining the parts of an utterance and the syntactic relationships among these parts. For example, the syntactical rule S→NP VP states that a sentence S consists of a noun phrase NP and a verb phrase VP.
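A rule such as S→NP VP can be applied mechanically, as the following minimal sketch suggests. The grammar, the lexicon, and the function names are hypothetical and cover only a handful of words; the sketch is a plain top-down recognizer for a toy context-free grammar, not a description of any particular parser.

```python
# Toy grammar: each nonterminal maps to a list of alternative right-hand sides.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["Det", "N"], ["Pron"]],
    "VP": [["V", "NP"], ["V"]],
}
# Toy lexicon mapping words to part-of-speech tags (hypothetical).
LEXICON = {
    "a": "Det", "the": "Det",
    "bear": "N", "camp": "N",
    "attacked": "V", "slept": "V",
    "him": "Pron",
}


def parse(symbol, tokens, pos):
    """Yield (tree, next_pos) for every way `symbol` can start at tokens[pos]."""
    if symbol not in GRAMMAR:  # part-of-speech tag: match a single word
        if pos < len(tokens) and LEXICON.get(tokens[pos]) == symbol:
            yield (symbol, tokens[pos]), pos + 1
        return
    for rhs in GRAMMAR[symbol]:
        yield from expand(symbol, rhs, tokens, pos, [])


def expand(symbol, rhs, tokens, pos, children):
    """Match the remaining right-hand-side symbols left to right."""
    if not rhs:
        yield (symbol, *children), pos
        return
    for child, next_pos in parse(rhs[0], tokens, pos):
        yield from expand(symbol, rhs[1:], tokens, next_pos, children + [child])


def parse_sentence(sentence):
    """Return all parse trees of category S that cover the whole sentence."""
    tokens = sentence.lower().split()
    return [tree for tree, end in parse("S", tokens, 0) if end == len(tokens)]


trees = parse_sentence("a bear attacked him")
print(trees[0])
```

For “a bear attacked him” the recognizer returns a single tree whose root S has an NP child (“a bear”) and a VP child (“attacked him”); an input that no rule sequence covers, such as “attacked a bear”, yields no tree at all.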
Rule-based parsing methods perform poorly when confronted with syntactically ambiguous input that allows for more than one possible syntactic representation. In such cases, linguistic preferences captured by probabilistic models have been found to resolve a significant portion of syntactic ambiguity.
Statistical methods have also been applied to modeling larger discourse units, such as fixed phrases and collocations (words that tend to occur next to each other, e.g., “eager to please”). Statistical phrase modeling involves techniques similar to the ones used in standard n-gram language modeling, namely, collecting frequency statistics about word sequences (n-grams) in large text corpora. However, not every n-gram is a valid phrase: for example, the sequence “the court went into” is a valid 4-gram in language modeling, but only “the court went into recess” is a phrase. A number of different methods have been used to derive valid phrases from n-grams, including syntactical filtering, mutual information, and entropy. In some cases, statistical modeling of phrase sequences has been found to reduce lexical ambiguity. Others have used a phrase-based statistical modeling technique to generate knowledge bases that can help lexicographers to determine relevant linguistic usage.
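Of the methods named above, mutual information lends itself to a short sketch. The example below computes pointwise mutual information (PMI) for adjacent word pairs in a toy text; the text, the `min_count` threshold, and the function name are assumptions made for illustration, and the sketch only ranks word pairs rather than deriving complete phrases such as “the court went into recess.”

```python
import math
from collections import Counter

# Toy text standing in for a large corpus (hypothetical data).
TEXT = (
    "the court went into recess . the judge went home . "
    "the court went into recess again . the clerk went into the office ."
)


def pmi_scores(tokens, min_count=2):
    """Pointwise mutual information for adjacent pairs seen at least min_count times."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_unigrams = len(tokens)
    n_bigrams = len(tokens) - 1
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs give unreliable estimates
        p_pair = count / n_bigrams
        p_w1 = unigrams[w1] / n_unigrams
        p_w2 = unigrams[w2] / n_unigrams
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores


scores = pmi_scores(TEXT.split())
best_pair = max(scores, key=scores.get)
print(best_pair)  # ('into', 'recess')
```

In this toy text the pair “into recess” scores highest because “recess” never appears without “into” before it, whereas a frequent word such as “the” combines with many different neighbors and therefore scores lower.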
Experiments in training probabilistic models of higher-level discourse units on conversational corpora have also been shown to significantly reduce the perplexity of a large-vocabulary continuous speech recognition task in the domain of spontaneous conversational speech. Others have modeled dialogue flow by using a hand-tagged corpus in which each utterance is labeled as an IFT (illocutionary force type). Probabilistic techniques have also been used to build predictive models of dialogue structures such as dialogue act sequences. The bottleneck in all of these experiments is the need for hand-tagging both training and testing corpora.
Another recent application of a probabilistic, phrase-based approach to NLP has been in the field of foreign language pedagogy, where it has been proposed as a new method of teaching foreign languages. Michael Lewis, in his book Implementing The Lexical Approach (Hove, Great Britain, 1997), challenges the conventional view that learning a language involves two separate cognitive tasks: first, learning the vocabulary of the language, and second, mastering the grammatical rules for combining words into sentences. The lexical approach proposes instead that mastering a language involves knowing how to use and combine phrases in the right way (which may or may not be grammatical). Phrases, in Lewis' sense, are fixed multi-word chunks of language whose likelihood of co-occurring in natural text is greater than chance. Mastering a language is the ability to use these chunks in a manner that produces coherent discourse without necessarily being rule-based.