Field of Invention
The invention generally relates to conversational interaction techniques and, more specifically, to inferring user input intent based on resolving input ambiguities and/or inferring that a change in conversational session has occurred.
Description of Related Art
Conversational systems are poised to become a preferred mode of navigating large information repositories across a range of devices: smartphones, tablets, TVs/STBs, multi-modal devices such as wearable computing devices (e.g., “Goggles”, Google's sunglasses), hybrid gesture recognition/speech recognition systems like Xbox/Kinect, automobile information systems, and generic home entertainment systems. The era of touch based interfaces being center stage, as the primary mode of interaction, is perhaps slowly coming to an end; in many daily life use cases, a user would rather speak his intent and have the system understand and execute on that intent. This shift has also been triggered by significant hardware, software and algorithmic advances that have made speech to text significantly more effective than it was a few years ago.
While progress is being made towards pure conversation interfaces, existing simple request response style conversational systems suffice only to address specific task oriented or specific information retrieval problems in small sized information repositories. These systems fail to perform well on large corpus information repositories.
Current systems that are essentially request response systems at their core attempt to offer a conversational style interface, such as responding to a user's questions, as follows:
        User: What is my checking account balance?
        System: It is $2,459.34.
        User: And savings?
        System: It is $6,209.012.
        User: How about the money market?
        System: It is $14,599.33.
These are inherently goal oriented or task oriented request response systems that provide a notion of continuity of conversation, though each request response pair is independent of the others and the only context maintained is the simple context that it is the user's bank account. Other examples of current conversational systems are ones that walk the user through a sequence of well-defined and often predetermined decision tree paths to complete the user's intent (such as making a dinner reservation, booking a flight, etc.).
Applicants have discovered that understanding user intent within a domain such as digital entertainment (where user intent could span from pure information retrieval, to watching a show, or reserving a ticket for a show/movie), combined with understanding the semantics of the user utterance expressing the intent, so as to provide a clear and succinct response matching user intent, is a hard problem that present systems in the conversation space fall short in addressing. Barring simple sentences with a clear expression of intent, it is often hard to extract intent and the semantics of the sentence that expresses the intent, even in a single request/response exchange style interaction. Adding to this complexity are intents that are task oriented without having well defined steps (such as the traversal of a predetermined decision tree). Also problematic are interactions that require a series of user requests and system responses to get to the completion of a task (e.g., making a dinner reservation). Further still, rich information repositories can be especially challenging because user intent expression for an entity may take many valid and natural forms, and the same lexical tokens (words) may arise in relation to many different user intents.
When the corpus is large, lexical conflict or multiple semantic interpretations add to the complexity of satisfying user intent without a dialog to clarify these conflicts and ambiguities. Sometimes it may not even be possible to understand user intent, or the semantics of the sentence that expresses the intent—similar to what happens in real life conversations between humans. The ability of the system to ask the minimal number of questions (from the point of view of comprehending the other person in the conversation) to understand user intent, just like a human would do (on average where the participants are both aware of the domain being discussed), would define the closeness of the system to human conversations.
Systems that engage in a dialog or conversation that goes beyond simple multi-step travel/dinner reservation making (e.g., where the steps in the dialog are well defined request/response subsequences with not much ambiguity resolution in each step) also encounter the complexity of having to maintain the state of the conversation in order to be effective. For example, such systems would need to infer implicit references to intents and entities (e.g., references to people, objects or any noun) and attributes that qualify the intent in the user's sentences (e.g., “show me the latest movies of Tom Hanks and not the old ones”; “show me more action and less violence”). Further still, applicants have discovered that it is beneficial to track not only references made by the user to entities, attributes, etc. in previous entries, but also to entities, attributes, etc. of multi-modal responses of the system to the user.
Further still, applicants have found that maintaining pronoun to object/subject associations during user/system exchanges enhances the user experience. For example, a speech analyzer (or natural language processor) that relates the pronoun “it” to its object/subject “Led Zeppelin song” in a complex user entry, such as, “The Led Zeppelin song in the original sound track of the recent Daniel Craig movie. Who performed it?” assists the user by not requiring the user to always use a particular syntax. However, this simple pronoun to object/subject association is ineffective in processing the following exchange:
        Q1: Who acts as Obi-wan Kenobi in the new Star Wars?
        A1: Ewan McGregor.
        Q2: How about his movies with Scarlett Johansson?
Here the “his” in the second question refers to the person named in the response, rather than to anything in the user input. A more complicated example follows:
        Q1: Who played the lead roles in Kramer vs. Kramer?
        A1: Meryl Streep and Dustin Hoffman.
        Q2: How about more of his movies?
        A2: Here are some of Dustin Hoffman's movies . . . [list of Dustin Hoffman movies].
        Q3: What about more of her movies?
Here the “his” in Q2 and “her” in Q3 refer back to the response A1. A natural language processor in isolation is ineffective in understanding user intent in these cases. In several of the embodiments described below, the language processor works in conjunction with a conversation state engine and domain specific information indicating male and female attributes of the entities that can help resolve these pronoun references to prior conversation exchanges.
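By way of illustration only, the pronoun resolution described above can be sketched as a small conversation state that records entities from system responses as well as user inputs, together with a gender attribute for each entity. The names Entity, ConversationState, remember, resolve and PRONOUN_GENDER are hypothetical and are not taken from any embodiment; the most-recent-match rule is likewise an assumption made to keep the sketch short.

```python
from dataclasses import dataclass, field

# Hypothetical mapping from pronouns to the gender attribute they imply.
PRONOUN_GENDER = {
    "he": "male", "him": "male", "his": "male",
    "she": "female", "her": "female", "hers": "female",
}


@dataclass
class Entity:
    name: str
    gender: str  # assumed domain specific attribute: "male" or "female"


@dataclass
class ConversationState:
    # Entities seen so far, oldest first, from user inputs and system responses alike.
    entities: list = field(default_factory=list)

    def remember(self, *entities):
        """Record entities mentioned in the latest user input or system response."""
        self.entities.extend(entities)

    def resolve(self, pronoun):
        """Return the most recently mentioned entity whose gender matches the pronoun."""
        wanted = PRONOUN_GENDER.get(pronoun.lower())
        if wanted is None:
            return None  # pronoun not covered by this sketch (e.g., "it")
        for entity in reversed(self.entities):
            if entity.gender == wanted:
                return entity
        return None


# A1 of the Kramer vs. Kramer exchange introduces both entities.
state = ConversationState()
state.remember(Entity("Meryl Streep", "female"), Entity("Dustin Hoffman", "male"))

print(state.resolve("his").name)  # Dustin Hoffman (Q2)
print(state.resolve("her").name)  # Meryl Streep (Q3)
```

In this sketch the language processor would call resolve only after it has identified a pronoun; how the entities and their attributes are obtained from the domain specific information is left open.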
Another challenge facing systems that engage a user in conversation is the determination of the user's intent change, even if it is within the same domain. For example, a user may start off with the intent of finding an answer to a question, e.g., in the entertainment domain. While engaging in the conversation of exploring more about that question, the user may decide to pursue a completely different intent path. Current systems expect the user to offer a clear cue that a new conversation is being initiated. If the user fails to provide that important cue, the system responses would still be constrained to the narrow scope of the exploration path the user has gone down, and the system will constrain the user's input to that narrow context, typically resulting in undesirable, if not absurd, responses. The consequence of getting the context wrong is even more glaring (to the extent that the system looks comically inept) when the user chooses to switch domains in the middle of a conversation. For instance, a user may, while exploring content in the entertainment space, say, “I am hungry.” If the system does not recognize this as a switch to a new domain (the restaurant/food domain), it may treat “I am hungry” as a question posed in the entertainment space and offer responses in that domain, which in this case would be comically incorrect.
A human, on the other hand, naturally recognizes such a drastic domain switch by the very nature of the statement, and responds accordingly (e.g., “Shall we order pizza?”). Even in the remote scenario where the transition to new domain is not so evident, a human participant may falter, but quickly recover, upon feedback from the first speaker (“Oh no. I mean I am hungry—I would like to eat!”). These subtle, yet significant, elements of a conversation, that humans take for granted in conversations, are the ones that differentiate the richness of human-to-human conversations from that with automated systems.
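The domain switch discussed above can be illustrated with a deliberately simple sketch. The keyword sets, the DOMAIN_KEYWORDS table and the detect_domain function are all assumptions introduced for illustration; keyword overlap is a stand-in technique and is not the method of any embodiment described herein.

```python
# Hypothetical per-domain vocabularies; a real system would draw on far richer
# domain specific information.
DOMAIN_KEYWORDS = {
    "entertainment": {"movie", "movies", "show", "actor", "song", "watch"},
    "food": {"hungry", "eat", "pizza", "restaurant", "dinner"},
}


def detect_domain(utterance, current_domain):
    """Return the domain the utterance most likely belongs to.

    The conversation stays in current_domain unless another domain matches
    strictly more keywords than the current one does.
    """
    words = {w.strip(".,!?").lower() for w in utterance.split()}
    scores = {d: len(words & kw) for d, kw in DOMAIN_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    if scores[best] > scores.get(current_domain, 0):
        return best
    return current_domain


print(detect_domain("I am hungry", "entertainment"))                 # food
print(detect_domain("Show me more of his movies", "entertainment"))  # entertainment
```

An utterance that matches no vocabulary leaves the domain unchanged, which mirrors the default behavior of staying within the current context until there is evidence of a switch.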
In summary, embodiments of the techniques disclosed herein attempt to closely match the user's intent and engage the user in a conversation not unlike human interactions. Certain embodiments exhibit any one or more of the following non-exhaustive list of characteristics: a) resolve ambiguities in intent and/or description of the intent and, whenever applicable, leverage the user's preferences (some implementations use computing elements and logic that are based on domain specific vertical information); b) maintain state of active intents and/or entities/attributes describing the intent across exchanges with the user, so as to implicitly infer references made by the user indirectly to intents/entities/attributes mentioned earlier in a conversation; c) tailor responses to the user, whenever applicable, to match the user's preferences; d) implicitly determine conversation boundaries that start a new topic within and across domains and tailor a response accordingly; e) given a failure to understand the user's intent (e.g., either because the intent cannot be found or the confidence score of its best guess is below a threshold), engage in a minimal dialog to understand user intent (in a manner similar to that done by humans in conversations to understand intent). In some embodiments of the invention, the understanding of the intent may leverage the display capacity of the device (e.g., a tablet device) to graphically display intuitive renditions that the user could interact with to offer clues on user intent.
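Characteristic e) above can be sketched as follows, purely as an example. The IntentGuess type, the respond function, the threshold value of 0.6 and the wording of the clarifying question are all hypothetical choices made for this sketch and are not specified by the disclosure.

```python
from typing import NamedTuple


class IntentGuess(NamedTuple):
    intent: str        # hypothetical intent label, e.g. "play_movie"
    confidence: float  # assumed to lie in the range 0.0 to 1.0


CONFIDENCE_THRESHOLD = 0.6  # assumed value; the disclosure does not give one


def respond(guesses, threshold=CONFIDENCE_THRESHOLD):
    """Act on the best guess, or fall back to a minimal clarifying dialog."""
    if not guesses:
        # The intent cannot be found at all.
        return ("clarify", "Could you rephrase that?")
    ranked = sorted(guesses, key=lambda g: g.confidence, reverse=True)
    best = ranked[0]
    if best.confidence >= threshold:
        return ("execute", best.intent)
    # Best guess is below the threshold: ask one question about the top candidates.
    options = " or ".join(g.intent for g in ranked[:2])
    return ("clarify", f"Did you mean {options}?")


print(respond([IntentGuess("play_movie", 0.9), IntentGuess("buy_ticket", 0.3)]))
print(respond([IntentGuess("play_movie", 0.4), IntentGuess("buy_ticket", 0.35)]))
```

Limiting the question to the two highest ranked candidates is one way to keep the dialog minimal; other embodiments might instead present the candidates graphically on the device display.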