1. Field of the Invention
This invention relates in general to methods and data processing system readable media, and more particularly, to data processing system-implemented methods of formulating queries and searching for a plurality of information objects and data processing system readable media having software code for carrying out those methods.
2. Description of the Related Art
A goal of information retrieval systems is to allow efficient access to selected documents or other kinds of information objects from a repository. The user of such a system may be interested in knowing the existence and location of the available information objects that are relevant to a specific request or query.
A common approach used in information retrieval systems is to associate one or more keywords with each information object. The set of all known keywords comprises the “master set” of keywords. To form a query, the user provides one or more keywords, which may or may not be drawn from the master set. The information retrieval system then returns each information object for which one or more of its associated keywords match one or more of the keywords in the query. As a further step, a mathematical formula can be applied to the number of keyword matches to provide a scalar that is associated with each information object returned by the query. The scalar serves as a “relevance score” that indicates the degree to which the particular information object matches the query. This approach can be generally termed “keyword-matching” and there are many specific embodiments used in practice. Some difficulties with the keyword-matching approach are set forth in the following paragraphs.
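The keyword-matching approach described above can be sketched in a few lines of Python. This is only an illustration: the function name `keyword_match`, the sample repository, and the scoring formula (the fraction of query keywords that match) are assumptions, since no particular formula is prescribed here.

```python
from typing import Dict, List, Set, Tuple


def keyword_match(query: Set[str],
                  repository: Dict[str, Set[str]]) -> List[Tuple[str, float]]:
    """Return (object_id, relevance_score) for every information object
    that shares at least one keyword with the query."""
    results = []
    for object_id, keywords in repository.items():
        matches = len(query & keywords)  # number of keyword matches
        if matches > 0:
            # Assumed formula: fraction of the query keywords that matched.
            results.append((object_id, matches / len(query)))
    # Highest score first; ties are broken by object identifier.
    results.sort(key=lambda pair: (-pair[1], pair[0]))
    return results


# Hypothetical repository: each information object has associated keywords.
repository = {
    "apple_guide": {"apple", "guide"},
    "pear_guide": {"pear", "guide"},
    "citrus_survey": {"orange", "lemon", "survey"},
}

ranked = keyword_match({"apple", "guide"}, repository)
# ranked == [("apple_guide", 1.0), ("pear_guide", 0.5)]
```

Objects with no matching keyword are not returned at all, which is the behavior the difficulties discussed next depend on.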
First, the user of the system may not know or be able to grasp all of the possible keywords in the master set. In this case, the user may provide queries that contain keywords that are not used in the master set. This reduces the effectiveness of the system, particularly when the master set includes keywords that have closely related meanings in a particular application, and a simple match cannot make use of this information. For example, assume the repository contains documents describing fruits and vegetables, and a treatise on tomatoes has been assigned the keyword “nightshade” because it also includes discussions of eggplant and potatoes. The user desiring information on tomatoes might enter a query such as “tomatoes” and this query would fail to match the treatise on the nightshade family, even though that document is relevant to the user's purpose.
Second, the mathematical formulae that are widely described and used to compute relevance scores may not take advantage of the relationships among keywords that are inherent in any specific information repository. For example, given a repository that contains documents on fruits and vegetables, systems that compute a relevance score based only on the number of keyword matches have no way to incorporate the fact that a document tagged with keywords “nightshade” and “treatise” should more closely match the query pair “tomato” and “treatise” than the query pair “lamp” and “treatise.” Attempts to address these shortcomings have been proposed, but the methods fail to fully address the problems users may encounter. Some systems have been developed that organize the keywords into a hierarchical tree structure. This, by itself, is not a solution, as will become evident in some of the paragraphs that follow.
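Both difficulties can be reproduced with a score based only on the number of keyword matches. The sketch below uses the examples from the preceding paragraphs; the helper name `match_count` is hypothetical.

```python
def match_count(query, document_keywords):
    """Score a document by the number of keywords it shares with the query."""
    return len(set(query) & set(document_keywords))


# The treatise on the nightshade family from the examples above.
treatise = {"nightshade", "treatise"}

# First difficulty: a relevant document is missed because the query
# keyword is not among the document's keywords.
missed = match_count({"tomatoes"}, treatise)

# Second difficulty: a related query and an unrelated query receive the
# same score, because only the match on "treatise" is counted.
tomato_score = match_count({"tomato", "treatise"}, treatise)
lamp_score = match_count({"lamp", "treatise"}, treatise)
```

Here `missed` is 0, and `tomato_score` and `lamp_score` are both 1, so the count alone cannot express that tomatoes are related to nightshades while lamps are not.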
A system described in U.S. Pat. No. 6,094,652 (“Faisal”) places keywords into a hierarchical structure. The hierarchy expresses the associations among the keywords in the repository. When responding to a user query, the system suggests keywords from the hierarchy that broaden or narrow the scope. The system also suggests keywords that represent concepts that are neither broader nor narrower but are related by means of an explicit cross-link among the nodes in the keyword hierarchy. The user can refine his or her query in an interactive and iterative fashion.
A system described in U.S. Pat. No. 6,098,066 (“Snow”) arranges the information objects into a document hierarchy (a tree data structure). Each node of the hierarchy corresponds to a category and contains at least one document. The user of the system has the option of restricting a search to the documents branching from a specific category (which these authors term a “directed” search) or searching all documents in the repository (which these authors term an “undirected” search). Focusing on a particular category limits the number of documents returned by the system, while the option of searching the entire repository remains available if desired.
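The difference between a directed and an undirected search can be illustrated with a small category tree. The following is a sketch under stated assumptions, not the implementation described by Snow: the `Category` class, the helper functions, the sample documents, and the substring test used for matching are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Category:
    """A node of the document hierarchy: a category holding documents."""
    name: str
    documents: List[str] = field(default_factory=list)
    children: List["Category"] = field(default_factory=list)


def documents_under(node: Category) -> List[str]:
    """Collect the documents of a node and of every node branching from it."""
    found = list(node.documents)
    for child in node.children:
        found.extend(documents_under(child))
    return found


def find_category(node: Category, name: str) -> Optional[Category]:
    """Depth-first lookup of a category by name."""
    if node.name == name:
        return node
    for child in node.children:
        hit = find_category(child, name)
        if hit is not None:
            return hit
    return None


def search(root: Category, term: str, category: Optional[str] = None) -> List[str]:
    """Directed search when a category is given, undirected otherwise."""
    start = root if category is None else find_category(root, category)
    if start is None:
        return []
    # Assumed matching rule: the term appears in the document title.
    return [doc for doc in documents_under(start) if term in doc.lower()]


root = Category("produce", ["Produce handling overview"], [
    Category("fruits", ["Apple storage", "Tomato ripening"]),
    Category("vegetables", ["Potato storage", "Carrot washing"]),
])

undirected = search(root, "storage")              # whole repository
directed = search(root, "storage", "vegetables")  # one category only
```

With this data, the undirected search returns both storage documents, while the directed search returns only the one under the "vegetables" category.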
A system described in U.S. Pat. No. 5,991,756 (“Wu”) places documents into a hierarchical structure. The system retrieves documents that directly match one or more query keywords, as well as documents that match “indirectly” by being located as a child node of a document in the hierarchy that directly matches one or more of the query terms.
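A rough sketch of direct and indirect matching in a document hierarchy follows. It is not taken from Wu: the data layout, the function name `retrieve`, and the choice to treat only immediate child nodes as indirect matches are assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical hierarchy: keywords per document, and each document's child nodes.
keywords: Dict[str, Set[str]] = {
    "produce": {"produce"},
    "nightshades": {"nightshade"},
    "tomato_notes": {"ripening"},
    "potato_notes": {"storage"},
    "citrus": {"orange"},
}
children: Dict[str, List[str]] = {
    "produce": ["nightshades", "citrus"],
    "nightshades": ["tomato_notes", "potato_notes"],
}


def retrieve(query: Set[str],
             keywords: Dict[str, Set[str]],
             children: Dict[str, List[str]]) -> Tuple[Set[str], Set[str]]:
    """Return (direct, indirect) matches for the query."""
    # Direct match: the document shares at least one keyword with the query.
    direct = {doc for doc, kws in keywords.items() if query & kws}
    # Indirect match (assumed rule): a child node of a directly matching document.
    indirect = {child for doc in direct for child in children.get(doc, [])}
    return direct, indirect - direct


direct, indirect = retrieve({"nightshade"}, keywords, children)
```

For the query keyword "nightshade", `direct` contains only `nightshades`, and `indirect` contains `tomato_notes` and `potato_notes`, even though neither of those documents carries the query keyword itself.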
A system described in U.S. Pat. No. 5,630,125 (“Zellweger”) places documents into a hierarchical structure that has one or more paths leading to a given document. The system provides an interactive method that allows the user to formulate a final query by navigating the hierarchy structure to the desired documents. Multiple paths support synonyms and allow the user to clarify word meaning in a given context.
A system described in U.S. Pat. No. 5,787,417 (“Hargrove”) is highly similar to that described by Zellweger in that it provides an interface for allowing the user to interactively navigate the hierarchy of the repository to locate the desired information objects.
A textbook by C. J. Van Rijsbergen (Information Retrieval, 2nd Ed.) describes a general strategy for information retrieval by keyword matching. It also gives the mathematical formulae that can be used to transform the combination of a “query vector” and a “document vector” into a final “relevance score” that can be used to rank the documents returned by a retrieval system according to their degree of relevance to the query.
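As one illustration of turning a query vector and a document vector into a relevance score, the sketch below uses the cosine coefficient, a widely used choice in vector-based retrieval. The passage above does not say which formula is meant, so the choice of cosine, the binary keyword vectors, and all names and sample data here are assumptions.

```python
import math
from typing import Iterable, List, Sequence

# Hypothetical master set of keywords; each position is one vector component.
master_set = ["nightshade", "treatise", "tomato", "lamp"]


def to_vector(keywords: Iterable[str]) -> List[float]:
    """Binary vector: 1.0 where the keyword is present, 0.0 otherwise."""
    present = set(keywords)
    return [1.0 if k in present else 0.0 for k in master_set]


def cosine_relevance(query_vector: Sequence[float],
                     document_vector: Sequence[float]) -> float:
    """Cosine of the angle between the two vectors, or 0.0 if either is all zeros."""
    dot = sum(q * d for q, d in zip(query_vector, document_vector))
    q_norm = math.sqrt(sum(q * q for q in query_vector))
    d_norm = math.sqrt(sum(d * d for d in document_vector))
    if q_norm == 0.0 or d_norm == 0.0:
        return 0.0
    return dot / (q_norm * d_norm)


documents = {
    "nightshade_treatise": to_vector(["nightshade", "treatise"]),
    "lamp_catalog": to_vector(["lamp"]),
}
query = to_vector(["tomato", "treatise"])

# Rank the documents by descending relevance score.
ranking = sorted(documents,
                 key=lambda name: cosine_relevance(query, documents[name]),
                 reverse=True)
```

The treatise shares one of two keywords with the query and scores approximately 0.5, while the lamp catalog shares none and scores 0.0, so the treatise is ranked first.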
Each of the systems in those documents has at least one limitation or disadvantage in some applications.
Systems that require the user to interactively refine their query (such as those described by Faisal, Zellweger, and Hargrove) are inherently more time-consuming for the user than a system that returns results in response to a single query. Further, the time users spend interacting with a computer costs a company valuable human resources. In some applications (such as those described in the next section), the information retrieval is automated, and there is no opportunity to refine or otherwise change the query before searching begins.
Systems that restrict the retrieved documents to those with a particular ancestry in a hierarchical document structure (such as those described by Faisal, Snow, and Wu) can fail to return relevant documents outside their hierarchical search path unless many cross-links have been provided (such as in the system described by Faisal). Cross-links must be created and maintained manually, a time-consuming and error-prone process.
Several of the prior systems (such as those described by Zellweger and Hargrove) do not prescribe a method for assigning a relevance score between the query and the documents in the repository. It is often convenient for users to have a relevance score to help them estimate their level of interest in the returned documents. Furthermore, systems that restrict the search path to a particular set of child nodes in the hierarchy (such as that described by Wu) cannot provide relevance scores for documents that lie outside the restricted set of child nodes. In some applications, this means that not all documents can be assigned a relevance score in response to a given query.