1. Field of the Invention
The present invention is directed toward the field of information retrieval systems, and more particularly towards ordering or ranking query feedback presented to a user.
2. Art Background
An information retrieval system attempts to match user queries (i.e., the user's statement of information needs) to information available to the system. In general, the effectiveness of information retrieval systems may be evaluated in terms of many different criteria including execution efficiency, storage efficiency, retrieval effectiveness, etc. Retrieval effectiveness is typically based on document relevance judgments. These relevance judgments are problematic since they are subjective and unreliable. For example, different judgment criteria assign different relevance values to information retrieved in response to a given query.
There are many ways to measure retrieval effectiveness in information retrieval systems. The most common measures used are "recall" and "precision." Recall is defined as the ratio of relevant documents retrieved for a given query over the number of relevant documents for that query available in the repository of information. Precision is defined as the ratio of the number of relevant documents retrieved over the total number of documents retrieved. Both recall and precision are measured with values ranging between zero and one. An ideal information retrieval system has both recall and precision values equal to one.
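The two ratios defined above can be illustrated with a minimal sketch. The function name `recall_precision` and the document identifiers are hypothetical and chosen only for illustration; returning zero when a set is empty is likewise an assumption, not something the definitions above specify.

```python
def recall_precision(retrieved, relevant):
    """Return (recall, precision) for a single query.

    retrieved: document ids returned by the system for the query
    relevant:  document ids in the repository judged relevant to the query
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = len(retrieved & relevant)  # relevant documents actually retrieved
    # Assumption: report 0.0 when a denominator would be zero.
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision


# Hypothetical data: 4 documents retrieved, 5 relevant in the repository,
# 2 documents in common.
retrieved_ids = ["d1", "d2", "d3", "d4"]
relevant_ids = ["d2", "d4", "d7", "d8", "d9"]
recall, precision = recall_precision(retrieved_ids, relevant_ids)
print(recall, precision)
```

In this example two of the five relevant documents are retrieved, and two of the four retrieved documents are relevant, so neither measure reaches the ideal value of one.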
One method of evaluating the effectiveness of information retrieval systems involves the use of recall-precision graphs. A recall-precision graph shows that recall and precision are inversely related. Thus, when precision goes up recall typically goes down and vice-versa. Although the goal of information retrieval systems is to maximize precision and recall, most existing information retrieval systems offer a trade-off between these two goals. For certain users, high recall is critical. These users seldom have means to retrieve more relevant information easily. Typically, as a first choice, a user seeking high recall may expand their search by broadening a narrow boolean query or by looking further down a ranked list of retrieved documents. However, this technique typically results in wasted effort because a broad boolean search retrieves too many unrelated documents, and the tail of a ranked list of documents contains documents least likely to be relevant to the query.
Another method to increase recall is for users to modify the original query. However, this process results in a random operation because a user typically has made his/her best effort at the statement of the problem in the original query, and thus is uncertain as to what modifications may be useful to obtain a better result.
For a user seeking high precision and recall, the query process is typically a random iterative process. A user starts the process by issuing the initial query. If the number of documents in the information retrieval system is large (e.g., a few thousand), the hit-list due to the initial query does not represent the exact information the user intended to obtain. Thus, it is not just the non-ideal behavior of information retrieval systems responsible for the poor initial hit-lists, but the user also contributes to degradation of the system by introducing error. User error manifests itself in several ways. One way user error manifests itself is when the user does not know exactly what he/she is looking for, or the user has some idea what he/she is looking for but doesn't have all the information to specify a precise query. An example of this type of error is one where the user is looking for information on a particular brand of computer but does not remember the brand name. For this example, the user may start by querying for "computers." A second way user error manifests itself is when the user is looking for some information generally interesting to the user but can only relate this interest via a high level concept. An on-line world wide web surfer is an example of such a user. For example, the user may wish to conduct research on recent issues related to "Middle East", but does not know the recent issues to search. For this example, if a user simply does a search on "Middle East", then some documents relevant to the user, which deal with current issues in the "petroleum industry", will not be retrieved.
Another problem in obtaining high recall and precision is that users often input queries that contain terms that do not match the terms used to index the majority of the relevant documents and almost always some of the unretrieved relevant documents (i.e., the unretrieved relevant documents are indexed by a different set of terms than those used in the input query). This problem has long been recognized as a major difficulty in information retrieval systems. See Lancaster, F. W. 1969. "MEDLARS: Reports on the Evaluation of its Operating Efficiency." American Documentation, 20(1), 8-36.
Prior art query feedback systems, used to supplement or replace terms in the original query, are an attempt to improve recall and/or precision in information retrieval systems. In these prior art systems, the feedback terms are often generated through statistical means (i.e., co-occurrence). Typically, in co-occurrence techniques, a set of documents in a database is examined to identify patterns that "co-occur." For example, if in a particular document set, the term "x" is frequently found near the term "y," then term "y" is provided as a feedback term for a query that contains term "x." Thus, co-occurrence techniques identify those terms having a physical proximity in a document set. Unfortunately, physical proximity in the document set does not always indicate that the terms connote similar concepts (i.e., physical proximity is a poor indicator of conceptual proximity).
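The co-occurrence approach described above can be sketched as follows. This is only an illustration of the general idea, not the method of any particular prior art system: the function name `cooccurrence_feedback`, the whitespace tokenization, the fixed token window used to decide what counts as "near," and the sample documents are all assumptions.

```python
from collections import Counter


def cooccurrence_feedback(documents, query_term, window=2, top_n=3):
    """Return up to top_n terms most often found near query_term.

    Assumption: "near" means within `window` tokens on either side,
    and documents are tokenized by lowercasing and splitting on whitespace.
    """
    counts = Counter()
    for text in documents:
        tokens = text.lower().split()
        for i, token in enumerate(tokens):
            if token != query_term:
                continue
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                # Skip the query term itself, wherever it appears in the window.
                if tokens[j] != query_term:
                    counts[tokens[j]] += 1
    return [term for term, _ in counts.most_common(top_n)]


# Hypothetical document set.
docs = ["oil prices rise", "oil prices fall", "oil exports grow"]
feedback = cooccurrence_feedback(docs, "oil", window=1, top_n=2)
print(feedback)
```

The sketch counts only physical adjacency, so a term that merely happens to appear beside the query term is suggested regardless of whether the two terms are conceptually related, which is the weakness noted above.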
Once the feedback terms are identified through statistical means, the terms are displayed, on an output display, to help direct the user to reformulate a new query. Typically, the feedback terms are ranked (i.e., listed in an order) using the same measure of physical proximity originally used to identify the feedback terms. For example, if term "y" appears physically closer to term "x" than term "z," then term "y" is ranked or listed before term "z." Since physical proximity is often a poor indicator of conceptual proximity, this technique of ranking query feedback terms is poor. Therefore, it is desirable to implement a query feedback technique in an information retrieval system that does not utilize statistical or co-occurrence methods. In addition, to the extent that the statistical ranking methods generate a useful order, these methods are only suitable when statistical methods are used to identify query feedback terms. Accordingly, it is also desirable to utilize query feedback ranking techniques with a more general methodology applicable to all types of systems that generate query feedback.
An information retrieval system processes user input queries, and identifies query feedback, including ranking the query feedback, to assist the user in reformulating a new query. The information retrieval system includes a knowledge base that comprises a plurality of nodes, depicting terminological concepts, arranged to reflect conceptual proximity among the nodes. The information retrieval system processes the queries to identify a document hit list related to the query, and to generate query feedback terms. Each document includes a plurality of themes or topics that describe the overall thematic content of the document. The topics or themes are then mapped or linked to corresponding nodes of the knowledge base. At least one focal node is selected from the knowledge base, wherein a focal node represents a concept, as defined by the relationships in the knowledge base, conceptually most representative of the topics or themes. The query feedback terms are also mapped or linked to nodes of the knowledge base. To identify a ranking for the query feedback terms, the information retrieval system determines a conceptual proximity between the focal nodes and the nodes that represent the query feedback terms, and ranks the query feedback terms from a first term closest in conceptual proximity to the focal nodes to a last term furthest in conceptual proximity from the focal nodes. The query feedback terms are then displayed to the user in the order ranked.
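The ranking step described above can be illustrated with a minimal sketch. The passage does not specify how conceptual proximity is computed, so the sketch assumes a tree-shaped knowledge base and uses the number of edges between two nodes as the proximity measure. The `PARENT` table, the node names, and the functions `path_to_root`, `distance`, and `rank_feedback` are all hypothetical.

```python
# Hypothetical knowledge base stored as child -> parent links (a tree).
PARENT = {
    "economics": None,
    "industry": "economics",
    "petroleum industry": "industry",
    "oil exports": "petroleum industry",
    "automotive industry": "industry",
    "finance": "economics",
    "banking": "finance",
}


def path_to_root(node):
    """List the nodes from `node` up to the root, inclusive."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def distance(a, b):
    """Assumed proximity measure: number of edges on the tree path a..b."""
    depth_from_a = {n: d for d, n in enumerate(path_to_root(a))}
    for depth_from_b, n in enumerate(path_to_root(b)):
        if n in depth_from_a:  # first shared ancestor
            return depth_from_a[n] + depth_from_b
    raise ValueError("nodes do not share a root")


def rank_feedback(feedback_terms, focal_nodes):
    """Order feedback terms from closest to furthest from any focal node."""
    return sorted(
        feedback_terms,
        key=lambda term: min(distance(term, focal) for focal in focal_nodes),
    )


focal_nodes = ["petroleum industry"]
feedback_terms = ["banking", "oil exports", "automotive industry"]
ranked = rank_feedback(feedback_terms, focal_nodes)
print(ranked)
```

Because the ordering depends only on the knowledge base and the focal nodes, the same ranking function applies no matter how the feedback terms themselves were generated.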
In one embodiment, to process the query, the information retrieval system selects a plurality of documents relevant to said query, and then selects one or more themes from said documents, wherein said themes define at least a portion of the thematic content of said documents. Thus, for this embodiment, the topics are the themes from the document hit list.
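As a rough illustration of this embodiment, the sketch below derives topics from the themes of the documents in a hit list. The selection criterion is an assumption made for the example (themes shared by the most documents are kept); the passage above does not state how themes are chosen. The function name `select_topics` and the sample themes are hypothetical.

```python
from collections import Counter


def select_topics(hit_list_themes, max_topics=3):
    """Pick the themes shared by the most documents in the hit list.

    hit_list_themes: one list of theme strings per retrieved document.
    Assumption: a theme counts once per document, and more widely shared
    themes better describe the hit list as a whole.
    """
    counts = Counter()
    for themes in hit_list_themes:
        # dict.fromkeys removes duplicates while keeping a stable order.
        for theme in dict.fromkeys(themes):
            counts[theme] += 1
    return [theme for theme, _ in counts.most_common(max_topics)]


# Hypothetical themes for three documents in a hit list.
hit_list_themes = [
    ["petroleum industry", "middle east"],
    ["petroleum industry", "oil exports"],
    ["middle east", "petroleum industry"],
]
topics = select_topics(hit_list_themes, max_topics=2)
print(topics)
```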