Field of Invention
The present invention relates to a method for disambiguating user intent in a conversational interaction system for information retrieval and, more specifically, to techniques that use structural information and user preferences.
Brief Description of Related Art and Context of the Invention
The present invention relates to a method for “optimally” and “intelligently” disambiguating user intent/input in a conversational interaction system for large corpus information retrieval, where the intent/input has one or more of the following ambiguities: (1) lexical ambiguity (multiple qualifying responses lexically matching user input) or (2) semantic ambiguity, where the ambiguity is in time (multiple qualifying responses based on a temporal attribute), in location (multiple qualifying responses based on a location attribute), in any content attribute or combination of content attributes (multiple qualifying responses based on the content attribute or attributes specified by the user), or simply arises out of the non-specificity inherent in the user's request (e.g. a broad intent request), which in turn results in multiple qualifying responses. Implementations of the “optimal” disambiguation method described in the present disclosure enable the system to ask the minimum number of clarifying questions (in the ideal case, no question at all) to understand user intent. Implementations of the “intelligent” disambiguation method described in the present disclosure enable the system to make a disambiguation exchange natural, similar to the way humans clarify an ambiguity in a conversation. The system makes use of domain specific structural knowledge, time, location of the user (if available), and a signature of user preferences (if available) to perform optimal and intelligent disambiguation. The methods described in the present disclosure are language independent and can be applied to multiple languages with the support of a language specific module.
Furthermore, the methods disclosed herein are especially suited for large corpus information repositories with high semantic ambiguity and conflict, caused by the fact that a given entity or concept of the repository may be referred to in many ways and the same term may appear in contexts with different meanings.
The key performance metric of a conversational system is not how well its response matches user intent when the user's intent/input is unambiguous, but how it responds when the user's intent/input has ambiguity. A good conversational system does not have, in its repertoire of possible response strategies, the luxury of spewing out a multitude of responses, like a search engine would spew results, even if there is ambiguity in user input/intent. For a search engine, showing all results for ambiguous input/intent, in some order of relevance, would be extolled as the hallmark of a good search engine. Adopting the same approach for ambiguous user input/intent in a conversational system would be equivalent to a bewildering encounter in Starbucks with an overzealous salesperson who reels off ten coffee choices just because the user absent-mindedly failed to qualify the particular kind of coffee (e.g. Caffé Latte) the user had in mind. Here, even though the salesperson clearly understood that the intent was coffee (and not tea), the salesperson was not mindful of the fact that there are many choices matching the intent. A more savvy salesperson would probably have said “Oh there are many choices of coffee, would you like me to give you a quick run-down of your choices?”
The present disclosure uses the term “ambiguity” in a broad sense to capture the scenario in which multiple qualifying responses (with one exception mentioned below) match user input. The meaning of the term “ambiguous” as used in this disclosure can be understood from the following examples. While a good conversational system would strive to understand user intent and generate the most succinct targeted response, which, depending on the question, may ideally be just one response (e.g. the question “is sox playing tonight?” could generate a response showing just the time and location of the Red Sox game, where Red Sox was inferred from the user's signature), this does not imply that every user question generates a single response. Nor does it imply that offering multiple choices in response to a question would be sub-optimal. For instance, if a user states “show me Starbucks nearby”, the best response would be a map plotted with all Starbucks results close to the user, so the user can pick any one effortlessly from the visual map. Even for a broader intent request such as “Show me restaurants nearby,” displaying multiple responses on a map is the best response a system can provide.
The intent is clear in both these cases, but the response is in a sense “ambiguous” because it consists of more than one result: the system does not know which particular restaurant the user may like. If there is a signature of user preferences, however, the system could generate a response with the most preferred Starbucks/restaurant highlighted among the other responses. The multiple responses in the cases mentioned above are not really ambiguous responses, but a palette of “choices” that all match user intent (granted, the user may still not choose a Starbucks or a restaurant, for subjective reasons). The word “choices” is used here to distinguish from “responses”, to show that the user intended multiple choices, not just one choice (even if the system had a signature of user preferences, it would still offer multiple “choices”). Another example is “show me movies of Meryl Streep.” In this case, the user wanted multiple movies of Meryl Streep to “choose” from.
The methods described in the present disclosure focus on the cases where the ambiguity (or multiple qualifying responses) stems from the inability to offer one clear “choice” or a palette of “choices” that can be known, with a good degree of confidence, to match user intent. Furthermore, when the user intended a particular choice or choices, the burden is on the system, despite lexical and/or semantic ambiguity, to pick that particular choice or choice set. This ambiguity is not due to a deficiency or “lack of intelligence” of the system, but due to the inherent ambiguity (lexical or semantic) in the very question posed by the user.
The methods described in the present disclosure focus on the disambiguation method for these ambiguous cases where it is not possible to offer a set of choices due to inherent ambiguity in user intent/input. The Starbucks/restaurant and the “Meryl Streep” responses are best case scenarios, with no need for ambiguity resolution. The system responses are just as good as the succinct response to the question “is there a sox game tonight” mentioned above—the multiple responses are “choices” and not ambiguity in response.
The word “ambiguity” is also used in the present disclosure to handle an exception case: when there are no responses at all matching user intent/input. In this boundary condition, the ambiguity could be due to a variety of reasons, ranging from the user not expressing intent correctly to there simply being no match in the information domain spaces. For instance, if the user asked “is there a sox game tonight”, and there isn't any sox game, then that is a case where there is nothing to match the user's intent of wanting to watch a game.
From a strict request/response standpoint, there is no ambiguity here. But in human interactions, when the user expresses an intent that cannot be satisfied, a reasonable question arises: “can I offer the user something that could come close to satisfying the original intent?” A response that offers a close alternative is often appreciated. In the case of “is there a sox game tonight”, such a response could be “there isn't one tonight, but there was a game last night that you missed” (this response can be created using the signature of the user's preferences and past history). Embodiments of the present invention treat this case of no responses as a “null response ambiguity” case, and generate responses that are a best effort to get closer to satisfying user intent. Another example is “Did X and Y act together in a play?” Assuming X and Y never acted together in a play, implementations of the present invention would make use of domain specific structural knowledge to generate “No, they didn't act together in a play, but they did star together in a movie Z, back in 1989”. Here the domain specific structural knowledge is used to generate a response to a “null response ambiguity” case.
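The “null response ambiguity” handling described above can be illustrated with a small sketch. It is not the disclosed implementation: the `Credit` record, the `answer_acted_together` function, and the in-memory list standing in for domain specific structural knowledge are all hypothetical, as is the placeholder work “W”. The sketch first looks for an exact match and, when there is none, relaxes the constraint on the kind of work to find a close alternative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Credit:
    """One work and its cast; a hypothetical stand-in for structural knowledge."""
    title: str
    kind: str        # e.g. "play" or "movie"
    year: int
    cast: frozenset  # names of the people credited in the work


def answer_acted_together(works, a, b, kind):
    """Answer 'did a and b act together in a <kind>?' with a fallback."""
    pair = {a, b}
    # Exact match: both people appear in a work of the requested kind.
    exact = [w for w in works if w.kind == kind and pair <= w.cast]
    if exact:
        return f"Yes, they acted together in the {kind} {exact[0].title}."
    # Null response: relax the constraint on the kind of work.
    related = [w for w in works if pair <= w.cast]
    if related:
        alt = related[0]
        return (f"No, they didn't act together in a {kind}, but they did "
                f"star together in the {alt.kind} {alt.title} in {alt.year}.")
    # Nothing close was found, so give a plain negative answer.
    return f"No, they didn't act together in a {kind}."


# Placeholder data: X, Y and Z come from the example above; W is invented.
works = [
    Credit("Z", "movie", 1989, frozenset({"X", "Y"})),
    Credit("W", "play", 1992, frozenset({"X"})),
]
print(answer_acted_together(works, "X", "Y", "play"))
```

Relaxing only one attribute at a time (here, the kind of work) keeps the alternative close to the original intent; a fuller system would presumably rank several possible relaxations.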
Most of the examples of ambiguity described in the present disclosure are drawn predominantly from the digital entertainment space. However, the methods described in the present disclosure can apply to any information vertical (entertainment, personal corpus such as email, contacts, etc.), and also across different information verticals.
The ambiguity in user intent/input could be of different kinds. One possibility is lexical ambiguity in user input where the user had a clear intent. For instance, assume the user says “I would like to watch the Beethoven movie”. Three movies qualify as the “Beethoven movie”: a 1936 film about the composer Beethoven, a 1992 film about a dog named Beethoven, and a famous 1994 movie about Beethoven, “Immortal Beloved”. The user's intent was clearly just one of these movies (based on the use of “the” in the request), but the user's input lexically matched three qualifying responses. A good conversational system would never offer, in this case, these three qualifying responses as three equally valid choices for the user to pick from. A system that did so would be a conversational system whose performance has degenerated to that of a search engine offering results; it would be apparent that the system has no internal understanding of the term Beethoven, other than perhaps some relevance metric.
A conversational system that strives to inch closer to conversations between humans would ask the user “Do you mean the movie about the composer or the movie about the dog?”, much like a human would respond in a conversation. The disambiguating question itself is an indicator that the conversational system understands the term Beethoven more like a human being does. For instance, the same disambiguating question could have been framed as “Do you mean Beethoven the composer or Beethoven the dog?” While this is still a good disambiguating response, the previous response is closer to normal speech, where the very term that is ambiguous, namely Beethoven, is dropped from the disambiguating response to the user. In summary, a good conversational system would be particularly sensitive in its response to lexical ambiguity and generate disambiguating responses that are more human like, since such a response is a key metric in deciding the caliber of the system, where the scale could range from “a search engine intelligence” to “a natural conversation intelligence”.
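One way to picture how such a question could be formed is sketched below. This is a hedged illustration under stated assumptions, not the method of the disclosure: the `Candidate` record, its `subject` attribute, and the `clarifying_question` function are hypothetical. The idea shown is to phrase the question in terms of the attribute that distinguishes the candidates, so the ambiguous term itself never appears.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A qualifying response; the field names are hypothetical."""
    label: str    # how the candidate is identified internally
    kind: str     # e.g. "movie"
    subject: str  # what the work is about, e.g. "the composer"


def clarifying_question(candidates):
    """Ask about the distinguishing attribute, omitting the ambiguous term."""
    # Keep one entry per distinct subject, preserving first-seen order.
    subjects = list(dict.fromkeys(c.subject for c in candidates))
    if len(subjects) < 2:
        return None  # this attribute does not separate the candidates
    kind = candidates[0].kind
    options = " or ".join(f"the {kind} about {s}" for s in subjects)
    return f"Do you mean {options}?"


beethoven_movies = [
    Candidate("1936 film", "movie", "the composer"),
    Candidate("1992 film", "movie", "the dog"),
    Candidate("Immortal Beloved", "movie", "the composer"),
]
print(clarifying_question(beethoven_movies))
```

Grouping by subject collapses the two composer films into one option, so a single question narrows three candidates to at most two; a further exchange would still be needed to choose between the composer films.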
Another form of ambiguity is semantic ambiguity in time or linear continuity. If the user says “I would like to watch Borghia” (a TV series), there is an ambiguity in the season the user wants to watch, though the current season would be considered a reasonable response in most cases. However, if the user had been watching the series from the first season, then the season following the one last watched would be ideal. This form of ambiguity can also arise when the user is in the process of watching a sequential series of content (like David Attenborough's nature series “Life on Earth”). The ambiguity in that case is likewise ideally resolved by starting from the episode the user last viewed. In either case (seasons or linear series), if the user had not been watching in temporal or linear sequence, then a disambiguating question is inevitable. However, if the user said “I would like to watch the next Borghia”, then the user intent could be interpreted to mean the episode following the one the user last watched.
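The season-selection reasoning above can be sketched as a small decision function. This is a minimal illustration under assumptions, not the disclosed implementation: the name `resolve_season`, the representation of viewing history as a list of season numbers, and the choice to cap the result at the latest available season are all hypothetical.

```python
def resolve_season(watched_seasons, latest_season):
    """Return ("play", season) or ("ask", None) for a bare series request."""
    if not watched_seasons:
        # No viewing history: the current season is a reasonable default.
        return ("play", latest_season)
    ordered = sorted(set(watched_seasons))
    # Sequential viewing means seasons 1..k were watched with no gaps.
    if ordered == list(range(1, len(ordered) + 1)):
        # Continue with the season after the last one watched
        # (assumption: never go past the latest available season).
        return ("play", min(ordered[-1] + 1, latest_season))
    # Non-sequential history: a disambiguating question is needed.
    return ("ask", None)


print(resolve_season([1, 2], latest_season=4))
```

The same shape would apply to episodes of a linear series, with episode numbers in place of season numbers.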
Another form of ambiguity is ambiguity in location resulting in multiple qualifying responses. For instance, the request “Show me the Spielberg movie shot in Hawaii” would match multiple movies: Jurassic Park and its sequels, Lost World and Jurassic Park III, all shot in locations in Hawaii. The user intended only one by asking for “the Spielberg movie”. A response that is closer to a human response would be “Jurassic Park was shot in Hawaii. Its sequels ‘Lost World’ and ‘Jurassic Park III’ were shot there too.”
In another example, if the user asks “is there a Tigers game tonight”, the user could have meant the Detroit Tigers baseball team or the Louisiana Tigers football team (the Louisiana Tigers football team is more popular than the baseball team with the same name). If the user's location is known to be in Louisiana, it is most likely that the user meant the Louisiana football team. If the user's location is known to be in Detroit, then the question could map to the Detroit baseball team. In the event the user is travelling and the location is not known, there is an ambiguity in the question that needs to be resolved, particularly when there is no prior information about the user's preference for either one of these teams. Furthermore, if the question was posed during the game season of one of the teams, that could be a disambiguating factor too, in addition to location. The examples above show ambiguity in location and time, but in general there could be ambiguity in any attribute specified by the user.
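The way preference, location, and season can each narrow the candidates in this example is sketched below. It is an illustration only: the `Team` record, the `resolve_team` function, the order in which the signals are consulted, and the `in_season` values are assumptions rather than details given in the disclosure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Team:
    """A candidate team; the fields are hypothetical."""
    name: str
    region: str
    in_season: bool  # whether the team's season is under way at query time


def resolve_team(teams, user_region=None, preferred_name=None):
    """Return the single team the user most likely meant, or None to ask."""
    # 1. A known user preference settles the question directly.
    for team in teams:
        if team.name == preferred_name:
            return team
    # 2. Otherwise use the user's location, if it is known.
    if user_region is not None:
        local = [t for t in teams if t.region == user_region]
        if len(local) == 1:
            return local[0]
    # 3. Otherwise see whether only one candidate is currently in season.
    active = [t for t in teams if t.in_season]
    if len(active) == 1:
        return active[0]
    # Still ambiguous: the system needs to ask a clarifying question.
    return None


# The in_season flags are invented for the illustration.
tigers = [
    Team("Detroit Tigers", "Detroit", in_season=True),
    Team("Louisiana Tigers", "Louisiana", in_season=True),
]
print(resolve_team(tigers, user_region="Louisiana"))
```

With no location, no preference, and both teams in season, the function returns `None`, which corresponds to the travelling-user case where the ambiguity has to be resolved in conversation.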
There could also be ambiguity in understanding user intent arising from the very broadness of the intent. For instance, if the user says “I would like to watch a movie tonight”, then even if the signature of user preferences is known, the user may be interested in either action or mystery movies, so there is still an ambiguity between these two genres that needs to be resolved. A disambiguation scheme used in some existing conversational systems is to walk the user down a multilevel decision tree, posing questions to the user to narrow down the choice. This “algorithmic tree walk approach” is never used by humans in a natural conversation, making that strategy unacceptable for a conversational system that strives to be close to natural conversations. Such a multilevel decision tree walk may be acceptable to some degree for some domains, such as an airline reservation process, but it would look comically silly when applied in certain domains such as the entertainment space.
Ambiguity could also arise from errors in inputting the user's intent, where the input could be speech or text input. Those errors are deemed, for the purposes of the methods described in this disclosure, lexical errors (though a lexical error may actually result in a semantic difference in some cases). Resolution of ambiguity described in the present disclosure leverages domain specific structural knowledge, the signature of user preferences (if available), the user's location (if available), and time. However, clearly not all ambiguities are resolvable from these sources alone, as seen in the examples above.
To summarize, the ambiguity in user input/intent may lead to qualifying responses (with the exception of the “null response” case) that are only loosely correlated with each other, as in the case of lexical ambiguity (e.g. Beethoven the movie may match the movie about the musician or the movie about a dog named Beethoven). At the other extreme, ambiguity in user input/intent may lead to qualifying responses that are so closely correlated with each other that the multiple responses are more like “choices”, all closely correlated and with a high probability of matching user intent (e.g. the responses to “show me Starbucks close by”). Furthermore, when the user intent is broad, the qualifying responses are potentially numerous, necessitating a disambiguating response to the user. Embodiments of the conversational system described in the present invention respond to the user in a conversation based on the nature of the ambiguity (lexical or semantic) and the degree of correlation of the qualifying responses with each other, by making use of domain specific structural knowledge, time, location of the user (if available), and the signature of user preferences (if available). The conversation exchange that ensues to disambiguate user intent strives to approach the ideal of the fluidity of human conversations, where disambiguation is woven seamlessly into the very fabric of the exchanges and doesn't interrupt the flow by standing out because of artifacts of its machine generated origin. Embodiments of the conversational system described in the present disclosure also address the “null response ambiguity” case, so the user is not left at a dead end with an unfulfilled intent.