Speech recognition and natural language understanding capabilities of mobile devices have grown rapidly in recent years. Automatic Speech Recognition (ASR) and Natural Language Processing (NLP) allow users of electronic devices to interact with computer systems using a subset of natural language, in spoken or written form. Users interact with a virtual assistant and present queries that typically ask for information or request an action. The queries are processed by an automated agent that attempts to recognize the structure and meaning of the user's query, and when successful, to create a response and to present it to the user. The term assistant is anthropomorphic: it refers to a human-like interface that receives user queries and responds in terms that users understand; the term agent refers instead to the computer-based implementation of the functionality that the assistant presents to users. These two terms are closely related, and they are often used interchangeably.
Various approaches to the understanding of natural language input are known in the art. One of them is called syntax-based semantics. This approach starts with the use of a context-free grammar (CfG) to recognize syntactically well-formed natural language sentences while excluding ill-formed ones. Context-free grammars are well known in the art. A CfG comprises an alphabet, which consists of terminal and non-terminal symbols, and a set of production rules. Every rule has a left-hand side, which is a non-terminal symbol, and a right-hand side, which is a sequence of terminal and non-terminal symbols. Analyzing the syntactic structure of a sentence (according to a grammar) is called parsing; numerous parsing techniques are known in the art. Many classic texts discuss CfG's and their properties. In this disclosure, the right-hand side of a production rule is called a grammar pattern.
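The rule structure just described can be sketched in a few lines of Python. This is an illustrative toy, not a prescribed implementation: the grammar, the symbol-naming convention (angle brackets for non-terminals), and the naive top-down recognizer are all assumptions made for the sake of example.

```python
# Minimal context-free grammar sketch: each production rule maps a
# non-terminal (left-hand side) to one or more grammar patterns
# (right-hand sides), i.e., sequences of terminal and non-terminal
# symbols. Non-terminals are written in angle brackets (a convention
# chosen here for illustration); everything else is a terminal.

GRAMMAR = {
    "<S>": [["<Adjective>", "<Noun>", "<Verb>", "<Adverb>"]],
    "<Adjective>": [["green"]],
    "<Noun>": [["ideas"]],
    "<Verb>": [["sleep"]],
    "<Adverb>": [["furiously"]],
}

def is_nonterminal(symbol):
    return symbol.startswith("<") and symbol.endswith(">")

def recognize(symbol, words, start):
    """Return the set of end positions reachable by deriving `symbol`
    starting at words[start] -- a naive top-down recognizer (no left
    recursion in the toy grammar, so plain recursion terminates)."""
    if not is_nonterminal(symbol):
        return {start + 1} if start < len(words) and words[start] == symbol else set()
    ends = set()
    for pattern in GRAMMAR.get(symbol, []):
        positions = {start}
        for element in pattern:
            positions = {e for p in positions for e in recognize(element, words, p)}
        ends |= positions
    return ends

def accepts(sentence):
    """A sentence is well-formed if <S> derives exactly all its words."""
    words = sentence.split()
    return len(words) in recognize("<S>", words, 0)
```

With this grammar, `accepts("green ideas sleep furiously")` returns `True`, while word orders that match no pattern are rejected.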
Context-free grammars focus on syntax, but they ignore semantics. A sentence can be valid according to a grammar, yet meaningless. The sample sentence ‘green ideas sleep furiously’ derives from ‘<Adjective><Noun><Verb><Adverb>’ and is syntactically correct, but it violates multiple semantic constraints. Semantic constraints can be added to a context-free grammar by associating with every production rule a procedure called a semantic augment; this procedure is designed to fail when semantic constraints are violated, but it does more than enforce constraints. The main purpose of a rule's semantic augment is to build an interpretation (a semantic representation) for an expression correctly recognized by the rule's pattern. In a syntax-based approach to semantics, the principle of compositionality states that the interpretation of the whole is obtained by combining the interpretations of the parts. In syntactic analysis, a constituent is defined as a word or group of words that functions as a single unit within a hierarchical structure (e.g., a context-free grammar). Constituents occur naturally in NLU systems; just like queries, they have interpretations, which are data structures that encode their meaning. In some embodiments, they have semantic types, or belong to a hierarchy of semantic types, or ontology. For example, ‘John's address’ and ‘my uncle's birthplace’ are constituents of semantic type Address, a sub-type of Location. The interpretation of a constituent, just like that of an entire query, is the internal data structure that represents (encodes) the constituent's intuitive meaning. This data structure is the output of the parsing and interpretation processes, which attempt to formally capture the actionable meaning of the constituent.
The approach broadly described above is called syntax-based semantics. At every step of application of a production rule, a rule-specific procedure is invoked, which applies semantic constraints and (if the constraints are met) creates an interpretation of the entire pattern instance from the interpretations of the individual pattern element instances. The repeated use of such bottom-up combination procedures, ‘all the way up’ to the entire sentence, creates an interpretation of the input by mapping it to an internal data structure that represents the input's meaning. Note, however, that the parsing and interpretation process does not guarantee a unique result; in general, a given natural language input may have multiple interpretations. The result of parsing and interpretation is, in a way, not much more than a restatement of the natural language input, but a valid input sentence is mapped to one or more internal representations suitable for further processing.
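The bottom-up combination of interpretations via semantic augments can be sketched as follows. Everything here is illustrative: the lexicon, the `animate` feature, the single <Noun> <Verb> pattern, and the augment's constraint are assumptions invented for the example, not part of the disclosed system.

```python
# Sketch of syntax-based semantics: a production rule carries a semantic
# augment -- a procedure that checks semantic constraints on the
# interpretations of the pattern-element instances and, if they hold,
# combines them into an interpretation of the whole pattern instance.

class SemanticFailure(Exception):
    """Raised by a semantic augment when a constraint is violated."""

# Toy lexicon: word -> (syntactic category, interpretation with features).
LEXICON = {
    "dogs":  ("Noun", {"concept": "dog", "animate": True}),
    "ideas": ("Noun", {"concept": "idea", "animate": False}),
    "sleep": ("Verb", {"action": "sleep", "needs_animate_subject": True}),
}

def augment_sentence(noun, verb):
    """Semantic augment for the <Noun> <Verb> rule."""
    # Constraint: an action like 'sleep' requires an animate subject.
    if verb["needs_animate_subject"] and not noun["animate"]:
        raise SemanticFailure("subject of %r must be animate" % verb["action"])
    # Compositionality: build the whole from the interpretations of the parts.
    return {"frame": "Event", "action": verb["action"], "actor": noun["concept"]}

def interpret(sentence):
    """Parse and interpret a two-word <Noun> <Verb> sentence, bottom-up."""
    words = sentence.split()
    cats, interps = zip(*(LEXICON[w] for w in words))
    if list(cats) != ["Noun", "Verb"]:
        raise SemanticFailure("input does not match the <Noun> <Verb> pattern")
    return augment_sentence(*interps)
```

Here `interpret("dogs sleep")` yields an internal data structure encoding the meaning, while `interpret("ideas sleep")` is syntactically well-formed but fails the augment's animacy constraint, illustrating how the same mechanism both builds interpretations and rejects meaningless input.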
The syntax-based semantics approach is only one of several approaches known in the field. Alternative ways to approach the analysis and interpretation of natural language input include Parts-Of-Speech tagging, pattern matching, statistical approaches, neural networks, and other techniques. A semantic parser, based on a semantic grammar, is able to reject a syntactically ill-formed input query; reject a meaningless query; recognize the structure of a well-formed, meaningful query; and, in the process of recognition, create the query's interpretation. The output of a semantic parser, the interpretation, is always a data structure built as an internal representation of a query's meaning.
Many ways have been used to represent knowledge and the associated data structures; one widely used representation is the frame, a data structure with a set of named, typed variables. The variables in a frame are arbitrary symbols (names) and usually have a type, such as an integer, a string, an array of elements of a given type, or a pointer to, e.g., another frame of a given type. The variables in a frame are also called slots, and the terms variable and slot are used interchangeably. The type of a frame specifies the set of possible slots and their types. Often, a slot represents a role that a constituent plays. Examples of slots that are roles occur in a pattern such as ‘WHO did WHAT to WHOM, WHEN, WHERE and WHY?’, where an Actor, an Action, a Recipient, a Time, a Location and a Reason may be recognized. Slots may be optional, that is, a frame instance may provide no value for a specific slot. Other slot values may be obligatory.
For a simple example of a frame definition, a Murder frame could have slots for Victim, Weapon, Place, Date and Time, and Suspects (as an array, or as multiple Suspect slots), among other slots. The Victim slot is required for the Murder frame, expressing that there is no murder without a victim.
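The Murder frame above can be rendered as a simple typed structure. The dataclass form is one possible encoding chosen for illustration; the slot names follow the example, and the obligatory-slot check is a minimal sketch of how a required slot might be enforced.

```python
# Sketch of the Murder frame: typed, named slots, with Victim obligatory
# and the remaining slots optional. Slot types here (strings, a list)
# stand in for whatever types a real system would use.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Murder:
    victim: str                   # obligatory slot: no murder without a victim
    weapon: Optional[str] = None  # optional slots may be left unfilled
    place: Optional[str] = None
    date: Optional[str] = None
    time: Optional[str] = None
    suspects: List[str] = field(default_factory=list)  # array-valued slot

    def __post_init__(self):
        if not self.victim:
            raise ValueError("the Victim slot is required in a Murder frame")
```

A frame instance such as `Murder(victim="Col. Mustard", weapon="candlestick")` fills only some slots, while constructing an instance with an empty Victim slot fails the obligatory-slot constraint.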
When attempting to understand queries, additional steps are often needed after the parsing and interpretation of a query, and before its execution. One such step addresses the co-reference resolution problem, which is generally concerned with determining that one reference to an entity (say, ‘Mr. Smith’) and another (‘the man with the felt hat’) point to the same entity. A number of approaches to co-reference resolution have been suggested in the computational linguistics and discourse analysis literature. See Jurafsky and Martin, Speech and Language Processing, 2nd Ed., Chapter 21, sections 21.7 to 21.9 (2009).
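A deliberately crude heuristic gives the flavor of the problem. The sketch below links pronouns to the most recent named mention and merges exact repeats; it is an assumption-laden toy, and notably it cannot link a definite description like ‘the man with the felt hat’ to ‘Mr. Smith’, which is precisely the harder case that the approaches in the literature address.

```python
# Toy co-reference heuristic: assign each mention in a discourse to a
# cluster id. Pronouns join the cluster of the most recent non-pronoun
# mention; identical non-pronoun mentions share a cluster. Real systems
# use far richer features (gender, number, syntax, world knowledge).

PRONOUNS = {"he", "she", "it", "him", "her", "they", "them"}

def resolve_coreferences(mentions):
    """Return a cluster id per mention, in order of appearance."""
    clusters = {}       # lowercased mention text -> cluster id
    assignment = []     # cluster id per mention
    last_entity = None  # cluster of the most recent non-pronoun mention
    for mention in mentions:
        key = mention.lower()
        if key in PRONOUNS:
            # Link the pronoun to the most recent entity, if any.
            assignment.append(last_entity if last_entity is not None else len(clusters))
            continue
        if key not in clusters:
            clusters[key] = len(clusters)
        assignment.append(clusters[key])
        last_entity = clusters[key]
    return assignment
```

For the mentions `["Mary", "she", "John", "he"]` the heuristic yields clusters `[0, 0, 1, 1]`; its failure on descriptions such as ‘the man with the felt hat’ shows why co-reference resolution generally requires the additional reasoning discussed above.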
Other issues may have to be addressed to develop a precise representation of the meaning of a sentence, sufficient to act on it. The steps that follow the parsing and interpretation of a sentence may involve deduction, common sense reasoning, world knowledge, pragmatics and more. The scope and complexity of such additional steps is quite variable from one system to another.
Today's virtual assistants, if they have conversational capabilities at all, are quite limited in their ability to handle conversation in a human-like manner. A new approach is needed to build a system for supporting virtual assistants that can understand conversation across multiple vertical domains of subject matter. Building such a system should not require natural language application developers to have extensive training in linguistics or artificial intelligence, and the approach should be applicable to systems with very large numbers of users and very large numbers of domains.