Users of computer-mediated resources always have particular goals when accessing those resources. The goals may be sharp (learn the address of a company) or fuzzy (be entertained), may be temporary (find a restaurant) or persistent (achieve and maintain financial independence), and may consist of multiple related or independent sub-goals. Constructing accurate models of a user's goals is a critical prerequisite to providing intelligent interaction with that user. Unfortunately, there is no monolithic, domain-independent body of knowledge that can supply enough accurate information about likely user mental states to make a universal interface practical. In fact, every new capability that becomes available modifies the set of potential goals, plans, and tasks that is relevant to discourse. Consequently, a static set of models can never be satisfactory for long. User goals with respect to a given domain are tightly related to the tasks that may be accomplished in that domain and to the referents or objects of those tasks. Thus, an ideal system would use domain-specific (or sub-domain-specific) information to infer the user's mental state from his or her interaction, and would support easy addition of such information to an existing interface. Additionally, to be helpful, a user interface must consider the history of interaction, including previous user signals, goals, and their outcomes, as well as the information that was recently disclosed to the user and the various ways of referring to that information.
While the invention is applicable to all forms of human/computer communication, the main theoretical underpinnings are to be found in verbal discourse phenomena. Most of the following description refers to verbal discourse, but the invention contemplates applicability to virtually all non-verbal discourse as well, including mouse actions, gestures, winks, etc. Similarly, system outputs are shown as text, tables, and graphs, but can also include generated speech, audible tones, blinking lights, and arbitrary transducers that stimulate sensory organs of the user.
Few previous computer interface systems have attempted to deduce user goals and intent, as this is a very difficult task requiring a sophisticated representation of the domain of discourse, of the user, and of the way that language is used for the given domain. Additionally, most systems are forced to ignore the context of interactions, as they do not provide a full representation of the user's previous communications, and of the information that resulted from prior interaction. Another area that other systems have neglected is that of providing users with a visual depiction of the reasoning which takes place as their communications are analyzed and interpreted. Such a visual depiction provides useful feedback for users, while simultaneously giving them an opportunity to fine-tune the system's understanding by directly reinforcing or disputing a particular assumption. No other invention disclosed to date has applied the full capability of multilevel discourse modeling to multimodal inputs, or created multimedia responses consistent and appropriate to the full spectrum of user interests and system capabilities.
Note that in much of the following discussion, the terms value, parameter, attribute, variable, and binding are used as follows:
A value is some state that is typically of interest to a software process or a human user. A value may be either a scalar or a collection.
A variable is a label for a unit whose value may change. The label may be used to designate something which will be interpreted as a parameter, or something which will be interpreted as an attribute.
A binding is the (temporary, or limited in scope) assignment of a value to a variable.
A parameter is a variable that has a particular meaning to a software process or software system or to a human. Parameters are typically communicated by position, but they may be communicated by association with some name.
An attribute is a named variable that has a particular meaning to a software process or software system or to a human. Attributes are typically communicated by name, but may be represented by position.
It is possible for an attribute which is used as a parameter to be bound to a value.
Often, in referring to attributes, parameters, and variables, practitioners use the words metonymically, in that any of these terms may be used to refer to the value currently bound to the term. For instance, the sentence “The parameter p was 3.” means that the parameter p was bound to the value 3. Similarly, the sentence “Expertise-level was 7.3.” means that the attribute ‘Expertise-level’ was bound to the value 7.3.
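The terminology above can be illustrated with a minimal sketch. The class and function names (Variable, Binding, bind_parameters, bind_attributes, describe) are hypothetical and chosen only for illustration; they are not part of the disclosure. The sketch assumes that parameters are supplied by position and attributes by name, which the definitions describe as the typical, though not the only, case.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class Variable:
    """A label for a unit whose value may change."""
    label: str


@dataclass(frozen=True)
class Binding:
    """A temporary or scope-limited assignment of a value to a variable."""
    variable: Variable
    value: Any  # a scalar or a collection


def bind_parameters(parameters: List[Variable], values: List[Any]) -> List[Binding]:
    """Bind by position, the typical way parameters are communicated."""
    if len(parameters) != len(values):
        raise ValueError("expected one value per parameter")
    return [Binding(p, v) for p, v in zip(parameters, values)]


def bind_attributes(attributes: Dict[str, Variable], named: Dict[str, Any]) -> List[Binding]:
    """Bind by name, the typical way attributes are communicated."""
    return [Binding(attributes[name], value) for name, value in named.items()]


def describe(binding: Binding) -> str:
    """Expand the metonymic shorthand into its explicit reading."""
    return f"{binding.variable.label} was bound to the value {binding.value}"


# Hypothetical example data mirroring the two sentences in the text.
p = Variable("p")
expertise = Variable("Expertise-level")
positional = bind_parameters([p], [3])
by_name = bind_attributes({"Expertise-level": expertise}, {"Expertise-level": 7.3})
```

Here describe(positional[0]) spells out what the shorthand “The parameter p was 3.” leaves implicit: the variable and the value are distinct, and the binding is what connects them.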
Several patents have addressed the meaning of text in human-computer discourse. For example, U.S. Pat. No. 5,255,386 to Prager presents a method and apparatus for intelligent help that matches the semantic similarity of the inferred intent, and is one of the few systems that attempts to consider user intent. The system is directed to the single and limited arena of providing help for users of computer systems. U.S. Pat. No. 5,255,386 omits a facility for domain modeling, discloses no way of composing domain knowledge, and provides no means of capturing and generalizing previous or expert interactions. Prager's disclosure describes only a single, limited weighting scheme to infer best matches of potential meanings, while the invention we describe can exploit any appropriate combination of belief calculus methods to calculate the user's likely intent.
U.S. Pat. No. 6,009,459 to Belfiore, et al. describes intelligent automatic searching for resources in a distributed environment and mentions “determining the meaning of text” in several different areas. However, the specification discloses no mechanism to represent the potential goals and intentions of a user, and describes only a surface-level syntactic analysis of the user's text, rendering the system incapable of exhibiting intelligent behavior.
U.S. Pat. No. 6,178,398 to Peterson, et al. discloses a method, device and system for noise-tolerant language understanding. This reference also mentions determination of “meanings” from input text, but is directed at correction of ill-formed input via a match function induced by machine learning techniques. However, Peterson uses no explicit domain or user model.
U.S. Pat. No. 6,006,221 to Liddy, et al. provides a multilingual document retrieval system and method using semantic vector matching, but the representation of domain knowledge in this case is merely a correlation matrix which stores the relative frequency with which given pairs of terms or concepts are used together. Also, no attempt is made to understand the unique context of the user, beyond knowing which language (e.g. English v. French) he or she is using.
Several patents have also addressed the language used in human-computer discourse, which is another aspect of the present invention. For instance, U.S. Pat. No. 4,974,191 to Amirghodsi, et al. discloses an adaptive natural language computer interface system that uses cryptographic techniques, as well as heuristics, to map user input into the language used to control a computer program or robotic system. The system fails to achieve the requisite robustness because it attempts to match the surface aspects of input language to output language, with no attempt to represent the meaning of the communication or the intentions of the user.
U.S. Pat. No. 5,682,539 to Conrad, et al. provides an anticipated meaning natural language interface, which is used to add a natural language interface to a computer application. The system provides no mechanism for modeling the user or the domain (beyond that of the particular application) so it cannot be used for the broad range of tasks that users wish to accomplish.
U.S. Pat. No. 5,870,701 to Wachtel describes a control signal processing method and apparatus having natural language interfacing capabilities. However, Wachtel only describes the facility to represent the surface parse of natural language input; it does not represent or consider the meaning or intention of the user who communicated that input.
U.S. Pat. No. 5,987,404 to Della Pietra, et al. describes statistical natural language understanding using hidden clumpings, and uses any of a variety of statistical models to learn the likely meaning of language from examples. However, the Della Pietra system has no way of relating those mappings to a model of the user and his or her thoughts and intentions, to the communications peculiar to a given domain, or to the recent history of discourse.
U.S. Pat. No. 6,081,774 to de Hita, et al. discloses a natural language information retrieval system and method that consists mainly of a database to permit parsing of terms that are not easily recognized by simple morphological analysis and dictionary lookup. However, it includes no mechanism for representing domain knowledge, discourse plans and goals, or (conversational) problem-solving approaches, nor any way to compose multiple domain knowledge sources into a single repository. Thus, it does not enable or use prerequisite information to accurately assess the goals, intentions and meanings of users.
More recently, U.S. Pat. No. 6,138,100 to Dutton, et al. discloses a voice-activated connection which parses very limited verbal commands, but does not include a model of the user's possible goals in a domain, or mention any mechanism to create such an explicit representation. Without such a representation, and the capability of drawing inferences about user intentions, the system will never be capable of behaving as if it understands natural language queries and statements.
U.S. Pat. No. 6,192,338 to Haszto, et al. describes natural language knowledge servers as network resources, an invention which acts as an intermediary between the user and various web resources. This system supports some distribution of the knowledge used in interpreting the user's requests, but lacks a model of the user, his goals, or intentions. The system also lacks a model of the domain which is independent of the particular web servers with which it communicates. Because of this deficiency, the system is unable to understand requests that span multiple web servers, or to accomplish the tasks that will satisfy such requests.
An additional feature of the present invention is its multimodal capabilities. In the present context, multimodal refers to any means of conveying user input to the computer, and any means of informing the user of facts and results that ensue from his interaction. Several inventions have explored limited multimodal interactions with limited success compared with the present invention. For example, U.S. Pat. No. 5,748,841 to Morin, et al. describes a supervised contextual language acquisition system, which is aimed at teaching a user the application-specific language of a particular computer application, rather than generalized understanding and fulfillment of user requests in a broad domain. The system uses some historical model of the user and accepts a limited subset of natural language input, but lacks a model of the goals that a user might possess, the mapping of those goals to language, or to the concepts that can be referred to in a domain, beyond the strict limits of a single software application.
U.S. Pat. No. 5,781,179 to Nakajima, et al. presents a multimodal information inputting method and apparatus for embodying the same, and describes a scheme for correlating the actions of a user-directed cursor to language that is spoken concurrently. Nakajima does not, however, include any method for understanding the meaning and intentions of the user.
U.S. Pat. No. 5,748,974 to Johnson describes a multimodal natural language interface for cross-application tasks. However, this reference focuses primarily on spoken, typed or handwritten communications from users, and lacks any deep model of discourse and similarly lacks a domain model beyond the Application Programmer Interfaces (APIs) of various programs the user might want to control.
US Application 20040122653 to Mau, et al. describes a method for “linking a natural language input to an application” using a “semantic object” to resolve ambiguity. Architecturally, the Mau application includes one “application object model” for each application that is to be included, but fails to offer a mechanism for independent applications to provide linguistic or pragmatic items to the interface system. Mau also fails to support multimodal signals from the user, fails to generalize the many useful forms of output signals to the user, and fails to exploit a discourse model to infer user intentions.
US Application 20040044516 to Kennewick, et al. describes a system to answer natural language queries that exploits “domain agents” to “receive, process, and respond” to a command, and exploits a history of user statements. Kennewick's architecture, however, fails to offer a mechanism for independent applications to provide linguistic or pragmatic items to the interface system, fails to automatically compose those items associated with different applications, and fails to support arbitrary user signals.
US Application 20030144977 to Suda, et al. describes an “information processing system which understands information and acts accordingly”. Suda's system exploits a model of individual users, and helps them to accomplish computer-related tasks. Suda's system, unlike the instant invention, presumes a monolithic “understander”, which interprets user text in terms of a model of user intentions. This approach lacks the scalability and maintainability of our invention, as the system does not obtain task and language information incrementally from individual applications.
US Application 20020111786 to Sugeno et al. describes an “everyday language-based computing system and method”, which achieves user goals via a “network-oriented language operating system”. After Sugeno's system has interpreted a user input, it works by searching for an application and loading that application into the operating system. In contrast to the instant invention, Sugeno's applications are not responsible for describing the tasks and related linguistic and pragmatic elements which relate to the tasks that they can accomplish.
U.S. Pat. No. 6,604,090 to Tackett, et al. describes a “system and method for selecting responses to user input in an automated interface program”, which interprets user input with respect to a set of pre-defined categories, and uses an intermediate language, “gerbil script” to control “virtual robots” on the user's behalf. Tackett's system thus lacks the extensibility of the current invention, which supports composition of new linguistic and pragmatic items at any time, and also permits the users to directly create scripts in their original language or in paraphrase.
U.S. Pat. No. 6,578,019 to Suda, et al. describes an “information processing system which understands information and acts accordingly”. Suda's system exploits a model of individual users, and helps them to accomplish computer-related tasks. Suda's system, unlike the instant invention, presumes a monolithic “understander” that interprets user text in terms of a model of user intentions. This approach lacks the scalability and maintainability of our invention, as the system does not obtain task and language information incrementally from individual applications.
U.S. Pat. No. 6,772,190 to Hodjat, et al. describes a “distributed parser of natural language input”, which uses a multi-agent approach to parsing, wherein various specialized agents each attempt to interpret the input. Unlike the current invention, Hodjat offers no composition of the association among users, tasks, linguistic and pragmatic items. Lacking the ability to compose models, Hodjat's distributed parser will not scale well in situations where the set of tasks and applications is frequently changing. Additionally, Hodjat's system fails to support signals, other than text and speech, from the user, and also fails to generalize the many useful forms of output signals to the user.
U.S. Pat. No. 6,829,603 to Chai, et al. describes a “system, method and program product for interactive natural dialog” which allows more than one mode of input, and uses a correspondence between “customer taxonomies” and “business taxonomies” to accomplish tasks for a user. Unlike the instant invention, Chai's system does not support the automatic composition of new linguistic or pragmatic items from applications as they become available to the system, thus limiting its scope to “manually integrated” information systems.
In addition to the cited references, there has been research conducted in this area and several published works. For example, “An architecture for a generic dialogue shell”, by Allen, et al., proposes a “generic dialogue shell” which has design goals similar to those of the current invention. One weakness of Allen's shell is that the knowledge about a particular domain and the language, concepts, potential tasks, and constraints of that domain are separated from the modules that weigh particular interpretations of user utterances. This approach renders it impossible to maintain the requisite modularity among different facets of functionality and language. Additionally, Allen's shell offers no support for modalities other than speech, and lacks a model of the traits of the user with respect to particular domains or sub-domains. Another shortcoming of Allen's shell is that there is no provision to use a variety of belief-calculus techniques to determine the most appropriate interpretations or the style of reasoning about a given domain. The assessment of potential interpretations within a domain is thus not an independent quality that can be delegated to some generic parser or discourse manager. Another useful innovation that Allen's architecture lacks is the ability to determine the appropriateness of an interpretation by actually performing it. In many cases, this “trial by execution” approach can resolve ambiguity quickly and accurately.
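The “trial by execution” idea can be sketched in a few lines. This passage does not specify how such trials are carried out, so the following is only an illustrative sketch under stated assumptions: each candidate interpretation is assumed to be an executable action, each trial is assumed to run against a copy of a simple dictionary standing in for domain state, and a failed lookup is assumed to mean the interpretation is inappropriate. The names Interpretation and trial_by_execution, and the example data, are hypothetical.

```python
from typing import Any, Callable, Dict, List, NamedTuple, Tuple


class Interpretation(NamedTuple):
    """A candidate reading of a user input, paired with an executable action."""
    label: str
    action: Callable[[Dict[str, Any]], Any]  # raises LookupError if it cannot run


def trial_by_execution(
    candidates: List[Interpretation], world: Dict[str, Any]
) -> List[Tuple[str, Any]]:
    """Keep only the interpretations whose actions actually execute."""
    survivors = []
    for candidate in candidates:
        try:
            # Run against a copy so that a trial cannot alter the real state.
            result = candidate.action(dict(world))
        except LookupError:
            continue  # the reading refers to something that does not exist
        survivors.append((candidate.label, result))
    return survivors


# Hypothetical ambiguous request: "show savings".
world = {"account:savings": 1200.0}
candidates = [
    Interpretation("balance of account 'savings'", lambda w: w["account:savings"]),
    Interpretation("price of stock 'SAVINGS'", lambda w: w["stock:SAVINGS"]),
]
survivors = trial_by_execution(candidates, world)
```

In this example only the account reading survives, because no stock named SAVINGS exists in the assumed state. A practical system would also need to handle actions with irreversible side effects, which this sketch avoids by restricting trials to read-only lookups on a copy.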
“Cyc: A Large-Scale Investment in Knowledge Infrastructure”, a work by Lenat, takes a very different approach, and may, at some point, become a complementary technology. The Lenat work consists of the CYC knowledgebase, which is an effort to construct enough of the concepts and relations about commonly encountered domains to perform what is termed “commonsense reasoning”, or reasoning which is NOT confined to a particular domain or sub-domain. The CYC effort has been accumulating and encoding knowledge for about sixteen years, and may eventually offer a practical framework for accessing multi-domain functionality. It is clear that the availability and success of the CYC knowledgebase would ultimately broaden the area of applicability of the current invention, as portions of CYC could be accessed through the World Model Agency of the current invention, and that knowledge could help the discourse planner to reason about plausible user goals and intentions.
As has been described in recent papers, Doran, Loehr, and colleagues at MITRE have been constructing a portable dialog manager that uses an information state approach, as opposed to dialog management by recognizing plans and goals. Though this approach has some advantages, such as reduced model complexity, the MITRE approach does not support automatic construction of a model-based interpreter via composition of new linguistic or pragmatic items.
J. Glass, E. Weinstein, et al. describe a conversational interface constructed on top of the MIT open-source “GALAXY” architecture. This approach has been used successfully to provide question-answering for spoken inputs, but, so far, it has been limited to “hard coded” domains. That is, unlike the current invention, the GALAXY-based system cannot automatically construct an inference system from linguistic and pragmatic items collected from component applications.
Nederhof and Satta describe a new approach to probabilistic parsing which exploits probabilistic context free grammars, and constructs resulting probabilistic push-down automata to accomplish the parse. The approach they describe relies purely on information about “likely” productions, and, unlike the instant invention, offers no way to directly incorporate information derived from the domain of discourse, or from a history of transitions among domains.
Klein, D., and Manning, C., describe a different probabilistic parsing approach, which applies the A* algorithm to the extension of paths within the parse. While this approach appears to have achieved good performance, it still does not provide a mechanism to consider probabilistic domain information, or dynamic user profile information, which, in the instant invention, aids in the selection of appropriate parses.
Thus, while several attempts have been made at creating computer interface systems, few have attempted to deduce user goals and intent. There remains a need for a system that deduces user goals and intent while providing a full representation of the user's previous communications and of the information that resulted from prior interaction, as well as a visual depiction of the reasoning that takes place as those communications are analyzed and interpreted. None of the prior art has disclosed an invention that fully exploits discourse modeling and flexible inference of users' beliefs, intentions, and goals to achieve appropriate interpretations of multimodal inputs or to organize output signals in a way appropriate to a user's history and preferences.