1. Field of the Invention
The present invention relates to a method for accurately and automatically answering queries submitted for the purpose of exploring and extracting information from an online knowledge source, such as an enterprise website, an online catalogue, a database, or a computer-based help system.
2. Background Description
It is desirable to have a computer system that can respond to a user's requests for information, where those requests may be formulated either as full sentences or simply as phrases suggesting the user's interests, in addition to the usual means of eliciting feedback such as clicking on links, selecting from lists, sending keywords to search engines, etc. However, users wishing to obtain information from an online knowledge source frequently either
1. have more than one question they want answered and those questions are related to one another, or,
2. as they make inquiries, related questions come to mind.
Hence, one would like an effective system for answering questions that intelligently takes into account the context provided by earlier questions and the responses to them that have been provided.
It is therefore an object of the present invention to provide an interactive automated response system.
The invention is a computerized system that can respond, not just to a single query issued by a user, but instead to a query in the context of a dialog with the user. Such a system, which we term an interactive automated response system, consists of three principal components or subsystems:
1. a text categorizer whose purpose is to assign categories to text extracted from a dialog,
2. a search system whose purpose is to match text extracted from a dialog with answers, and
3. a dialog manager whose purpose is
(a) to maintain a user's session history,
(b) to decide what text should be sent to the text categorizer and to the search system,
(c) to make use of a partially ordered category scheme to categorize each stage of the dialog based on the results returned by the other components, and
(d) to use the results of dialog categorization, as well as the results returned by the other components, to create suitable responses to the user's query in the context of his or her earlier queries.
Before going further, we would like to clarify our use of terminology:
1. By a query we mean both the contents of a communication sent by a user to an interactive automated response system and, within the interactive automated response system, data sent by the dialog manager to the search system to be used in finding a matching answer. The context will always make clear which meaning is intended.
2. By a response we mean the contents of a communication sent by an interactive automated response system to a user in order to satisfy a query from the user. The response may include the best answer found by the search system (according to the scores of the matching answers) as well as a list of relevant categories (ranked by confidence level) found by the text categorizer, from which the user may make a subsequent selection if the displayed response does not fully meet the user's needs.
3. By an answer we mean data found by a search system that matches a query sent to it by a dialog manager. A set of answers, if any matches are found, can be used by the dialog manager in determining the best response to a query from the user.
4. By a category we mean a class of answers. Each answer normally belongs to one or more categories. A text categorizer assigns categories to text extracted from a dialog in order to determine the current subjects of the dialog. So, while a category normally is used to describe a class of possible answers, a category can also be viewed as a description of a current subject of the user's query. We refer to the set of categories as the category scheme, and its exact nature will depend on the answers available from the online knowledge source and the kinds of inquiries anticipated.
Thus, an interactive automated response system requires the following data as part of its set-up to be fully functional:
1. a category scheme,
2. data to be used by the text categorizer (e.g., a rule file for a text categorizer that works by applying symbolic rules),
3. data to be used by the search system for finding matches to queries, such as an index of which keywords are relevant to which answers, as well as a description of which answers belong to which categories,
4. data in the form of actual answers or pointers, such as Uniform Resource Locators (URLs), specifying where actual answers may be found, to be used by the dialog manager to create responses.
This data may be stored in various eXtensible Markup Language (XML) files, as just one possibility.
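As a rough illustration of the XML option, the following sketch loads a category scheme and answer descriptions from a single set-up file. The element and attribute names (`setup`, `category`, `subsumed-by`, `answer`, `keyword`, `in-category`), the example URL, and the function `load_setup` are all hypothetical; the text above does not prescribe any particular schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical set-up file; every element and attribute name is an assumption.
SETUP_XML = """
<setup>
  <categories>
    <category id="products"/>
    <category id="laptops"><subsumed-by ref="products"/></category>
  </categories>
  <answers>
    <answer id="a1" url="http://example.com/laptops/battery.html">
      <keyword>laptop</keyword>
      <keyword>battery</keyword>
      <in-category ref="laptops"/>
    </answer>
  </answers>
</setup>
"""

def load_setup(xml_text):
    """Return (covers, answers) parsed from the set-up XML.

    covers  maps each category to the set of categories that directly subsume it.
    answers maps each answer ID to its URL, keywords, and categories.
    """
    root = ET.fromstring(xml_text.strip())
    covers = {
        cat.get("id"): {s.get("ref") for s in cat.findall("subsumed-by")}
        for cat in root.iter("category")
    }
    answers = {}
    for ans in root.iter("answer"):
        answers[ans.get("id")] = {
            "url": ans.get("url"),
            "keywords": {k.text for k in ans.findall("keyword")},
            "categories": {c.get("ref") for c in ans.findall("in-category")},
        }
    return covers, answers

covers, answers = load_setup(SETUP_XML)
```

Allowing several `subsumed-by` children per category is one way to represent a general partial order rather than only a tree.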
Part of this invention is the use of a category scheme that is endowed with a partial order. We call the partial ordering subsumption. Thus, for categories X and Y, when we write X ≤ Y, we mean X is subsumed by Y, or, in other words, X is a more specific category than Y. The idea of a partial order is quite general and includes as a special case a hierarchy of categories given by a tree. Since the partial order can be used to determine which of two categories may be more specific than the other, using a partially ordered set of categories enables the system to simplify the set of categories assigned to a stage of a dialog by discarding all but the most specific categories. This is important if the system is to come up with a response that is both specific and appropriate to the query. Creating data structures, accessible by the dialog manager, that define both the category scheme and the partial order on the category scheme is an integral part of setting up the system.
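The simplification step described above, discarding all but the most specific categories, can be sketched as follows. This is a minimal illustration, assuming the partial order is stored as a mapping from each category to the categories that directly subsume it; the names `COVERS`, `is_subsumed`, and `most_specific`, and the example categories, are hypothetical.

```python
# Hypothetical category scheme: each category maps to the categories that
# directly subsume it. Subsumption is the reflexive, transitive closure.
COVERS = {
    "laptops": {"computers"},
    "computers": {"products"},
    "printers": {"products"},
    "products": set(),
}

def is_subsumed(x, y, covers=COVERS):
    """Return True if x is subsumed by y, i.e. x is at least as specific as y."""
    if x == y:
        return True
    stack = list(covers.get(x, ()))
    seen = set()
    while stack:
        cat = stack.pop()
        if cat == y:
            return True
        if cat not in seen:
            seen.add(cat)
            stack.extend(covers.get(cat, ()))
    return False

def most_specific(categories, covers=COVERS):
    """Keep only categories that do not subsume another category in the set."""
    cats = set(categories)
    return {
        c for c in cats
        if not any(o != c and is_subsumed(o, c, covers) for o in cats)
    }

simplified = most_specific({"products", "computers", "laptops", "printers"})
```

Here `products` and `computers` are dropped because `laptops` is more specific than both, while `printers` is kept because nothing else in the set is subsumed by it.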
The intended mode of operation of an interactive automated response system is as follows. At each stage of the dialog, i.e., after each user input is received, the dialog manager extracts one or more texts from its record of the dialog, sends those texts to the text categorizer, and then uses the results of categorizing the texts to assign one or more categories to the latest stage of the dialog. There are two uses made of the categories found at this point:
1. One use of the categories found at this point is to narrow provisionally the set of possible answers deemed relevant to the user's latest query.
2. Another use of the categories found is to determine what text is to be sent by the dialog manager to the search system. By comparing the currently assigned categories with those previously assigned, one can detect whether the user is drilling down (asking about a more specific subject) or switching topics. If the user is drilling down, then one way to search for an answer in context would be to base the search on a combination of the current query with an earlier query or queries. If the user is switching topics, then the current query, in isolation from the earlier queries, should be used as the basis for the search for answers.
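The second use can be sketched as below. The test for drilling down used here (every current category is subsumed by some category of the previous stage) is an assumption made for illustration, as are the names `SUBSUMPTION_PAIRS`, `subsumed`, `is_drilling_down`, and `build_search_text`; the text above leaves the exact comparison open.

```python
# Hypothetical subsumption relation, given as (more specific, more general)
# pairs that are already transitively closed.
SUBSUMPTION_PAIRS = {
    ("laptops", "computers"),
    ("laptops", "products"),
    ("computers", "products"),
    ("printers", "products"),
}

def subsumed(x, y):
    """Return True if category x is subsumed by category y."""
    return x == y or (x, y) in SUBSUMPTION_PAIRS

def is_drilling_down(current, previous):
    """Assumed test: each current category refines some previous category."""
    return bool(current) and bool(previous) and all(
        any(subsumed(c, p) for p in previous) for c in current
    )

def build_search_text(query, categories, history):
    """Choose the text to send to the search system.

    history is a list of (earlier query, categories) pairs, oldest first.
    """
    if history:
        prev_query, prev_categories = history[-1]
        if is_drilling_down(categories, prev_categories):
            # Drilling down: search on the earlier and current queries combined.
            return prev_query + " " + query
    # Topic switch (or first query): search on the current query alone.
    return query

history = [("which computers do you sell", {"computers"})]
drill = build_search_text("battery life", {"laptops"}, history)
switch = build_search_text("printer ink", {"printers"}, history)
```

Because the subsumption test is reflexive, a follow-up query that stays in the same category is treated like drilling down in this sketch; a real dialog manager might handle that case differently.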
At any rate, the dialog manager sends to the search system queries in the form of texts that are deemed to be the best evidence for the category assignments made so far. For each query sent to it, the search system returns any matches (answers or answer IDs) together with the categories to which each matching answer belongs and, for each match, a score indicating the relative degree of fit of the match to the query. The final categories assigned to a stage of a dialog may depend not only on the categories found by text categorization but also on the categories of answers found by the search system. The assignment of categories to a stage of a dialog, in the context of the entire dialog up to that point, is what we term dialog categorization. We address in more detail below how dialog categorization differs from the related problems of text categorization and topic detection in text, and how dialog categorization is carried out. Finally, based on the categories assigned and the answers found, the dialog manager sends a response to the user, which can involve several components, such as
1. a display of the answer deemed best, which may be a web page (displayed in a browser or otherwise), a video clip, audio, images, a text file, etc.
2. a listing of links to answers deemed related to the query from which the user can choose for display purposes,
3. a list of categories of answers deemed related to the query from which the user can choose to display subcategories and/or links to specific answers associated with a listed category,
4. an offer to the user of a chance to add more text, thereby refining his query or switching to a new topic,
5. an offer to the user of a chance to revisit an earlier stage of the dialog,
6. an offer to the user of help on how to use the system, and/or
7. an offer to the user of a chance to start a new session.
Because the dialog at any stage may be assigned multiple categories, there is a certain amount of complication that must be sorted through in order to implement the mode of operation described above. To handle the various complications, one capability that we deem essential for the dialog manager is the maintenance of a user's session history, by which we mean a history of the user's current dialog with the system. The session history should contain
1. each previous user input, which may be either a natural language query or a user-made choice elicited by an earlier response sent by the system to the user,
2. for each previous user input, the set of categories assigned to that input,
3. for each previous user input and for each category assigned to that user input, the textual evidence for that category assignment, and
4. for each previous user input, the set of answers or answer IDs used in determining the response sent to the user.
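One possible in-memory layout for such a session history is sketched below. The class and field names (`DialogStage`, `SessionHistory`, `input_kind`, and so on) are hypothetical and chosen only to mirror the four items listed above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set

@dataclass
class DialogStage:
    """One stage of the dialog, i.e. one user input and what was derived from it."""
    user_input: str                  # item 1: the text typed or the option chosen
    input_kind: str                  # item 1: "query" or "choice" (assumed labels)
    categories: Set[str] = field(default_factory=set)        # item 2
    evidence: Dict[str, str] = field(default_factory=dict)   # item 3: category -> text
    answer_ids: List[str] = field(default_factory=list)      # item 4

@dataclass
class SessionHistory:
    """History of the user's current dialog, oldest stage first."""
    stages: List[DialogStage] = field(default_factory=list)

    def record(self, stage: DialogStage) -> None:
        self.stages.append(stage)

    def latest_categories(self) -> Set[str]:
        """Categories assigned to the most recent stage, or an empty set."""
        return set(self.stages[-1].categories) if self.stages else set()

    def evidence_for(self, category: str) -> List[str]:
        """All textual evidence recorded so far for one category."""
        return [s.evidence[category] for s in self.stages if category in s.evidence]

session = SessionHistory()
session.record(DialogStage(
    user_input="which computers do you sell",
    input_kind="query",
    categories={"computers"},
    evidence={"computers": "which computers do you sell"},
    answer_ids=["a7"],
))
```

Keeping the evidence per category makes it straightforward for a dialog manager to reuse earlier text when composing a later search query.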
The text categorizer used by an interactive automated response system may be any system that assigns categories to data containing text, such as one that applies symbolic rules, one that uses decision trees, one that uses a linear separator, a Bayesian classifier, etc. The text categorizer may be developed using machine learning techniques applied to training data, or it may be constructed by hand. The principal requirement is that the text categorizer be able to assign predetermined categories reliably and efficiently to data objects consisting of text. In the preferred embodiment, the text categorizer should also return a confidence level with each category assigned to text data submitted to it, where the confidence level is a quantitative estimate of the degree of confidence in, or the degree of utility of, the assignment. The confidence level can be used by the dialog manager to plan the layout of the eventual response to the user, with more prominence given to parts of the response related to categories with higher confidence levels.
The search system used by an interactive automated response system may be any system that matches text to answers, returning either actual answers or answer IDs. The answers themselves might be text, web pages, video clips, audio files, etc. One component of the search system needs to be a file or database containing information about answers that could be used in composing responses to users. The information stored for a particular answer includes a representation of that answer as a list of keywords or features, the presence of which in a query would be regarded as partial evidence that the answer matches the query. The information stored for a particular answer also includes a description of the set of categories to which that answer belongs.
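To illustrate the text categorizer requirements stated above, here is a minimal sketch of a hand-built categorizer that applies simple symbolic rules and returns a confidence level with each category. The rule table `RULES`, the function `categorize`, and the confidence formula (fraction of a rule's trigger words found in the text) are assumptions for illustration only; any of the categorizer types mentioned above could take its place.

```python
import re

# Hypothetical rule table: a category is assigned when any trigger word appears.
RULES = {
    "laptops": {"laptop", "notebook", "battery"},
    "printers": {"printer", "ink", "toner"},
}

def categorize(text, rules=RULES):
    """Return (category, confidence) pairs, highest confidence first.

    Confidence is assumed here to be the fraction of a rule's trigger
    words that occur in the text.
    """
    words = set(re.findall(r"[a-z0-9]+", text.lower()))
    results = []
    for category, triggers in rules.items():
        hits = words & triggers
        if hits:
            results.append((category, len(hits) / len(triggers)))
    results.sort(key=lambda pair: pair[1], reverse=True)
    return results

ranked = categorize("My laptop battery drains quickly, but the printer is fine")
```

A dialog manager could use the ordering of `ranked` to decide which categories to show most prominently in its response.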
Normally, the search system would be a document matching system in which the text supplied to the search system would be treated as queries to be compared with keywords extracted from answers and/or from data related to the answers, such as example questions having a particular answer. For each text submitted to the search system, those answers that match would be returned, and with each matching answer the categories to which that answer belongs would be returned, as well as a score indicating the closeness of the match.
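The matching behavior just described can be sketched as a simple keyword-overlap search. The answer store `ANSWERS`, the function `search`, the optional `allowed_categories` filter (used to narrow the candidate answers by category), and the scoring formula are all hypothetical; a production search system would likely use a proper index and a more refined score.

```python
import re

# Hypothetical answer store: each answer ID has keywords and categories.
ANSWERS = {
    "a1": {"keywords": {"laptop", "battery", "charge"}, "categories": {"laptops"}},
    "a2": {"keywords": {"printer", "ink", "cartridge"}, "categories": {"printers"}},
    "a3": {"keywords": {"laptop", "screen"}, "categories": {"laptops"}},
}

def search(query_text, answers=ANSWERS, allowed_categories=None):
    """Return (answer_id, categories, score) tuples, best match first.

    The score is assumed to be the fraction of an answer's keywords that
    occur in the query. If allowed_categories is given, only answers in
    at least one of those categories are considered.
    """
    words = set(re.findall(r"[a-z0-9]+", query_text.lower()))
    matches = []
    for answer_id, info in answers.items():
        if allowed_categories is not None and not (info["categories"] & allowed_categories):
            continue
        overlap = words & info["keywords"]
        if overlap:
            score = len(overlap) / len(info["keywords"])
            matches.append((answer_id, sorted(info["categories"]), score))
    matches.sort(key=lambda match: match[2], reverse=True)
    return matches

results = search("How do I charge my laptop battery?")
filtered = search("laptop printer ink", allowed_categories={"printers"})
```

Each returned tuple carries the answer ID, the categories of that answer, and the score, which are the three pieces of information the dialog manager needs for dialog categorization and response layout.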