1. Field of the Invention
The present invention is directed toward the field of search and retrieval systems, and more particularly to a knowledge base search and retrieval system.
2. Art Background
In general, search and retrieval systems permit a user to locate specific information from a repository of documents, such as articles, books, periodicals, etc. For example, a search and retrieval system may be utilized to locate specific medical journals from a large database that consists of a medical library. Typically, to locate the desired information, a user enters a “search string” or “search query.” The search query consists of one or more words, or terms, composed by the user. In response to the query, some prior art search and retrieval systems match words of the search query to words in the repository of information to locate information. Additionally, Boolean prior art search and retrieval systems permit a user to specify a logic function to connect the search terms, such as “stocks AND bonds”, or “stocks OR bonds.”
In response to a query, a word match based search and retrieval system parses the repository of information to locate a match by comparing the words of the query to words of documents in the repository. If there is an exact word match between the query and words of one or more documents, then the search and retrieval system identifies those documents. These types of prior art search and retrieval systems are thus extremely sensitive to the words selected for the query.
The terminology used in a query reflects each individual user's view of the topic for which information is sought. Thus, different users may select different query terms to search for the same information. For example, to locate information about financial securities, a first user may compose the query “stocks and bonds”, and a second user may compose the query “equity and debt.” For these two different queries, a word match based search and retrieval system would identify two different sets of documents (i.e., the first query would return all documents that contain the words stocks and bonds, and the second query would return all documents that contain the words equity and debt). Although both of these queries seek to locate the same information, with a word match based search and retrieval system, different terms in the query generate different responses. Thus, the contents of the query, and subsequently the response from word based search and retrieval systems, are highly dependent upon how the user expresses the query. Consequently, it is desirable to construct a search and retrieval system that is not highly dependent upon the exact words chosen for the query, but one that generates a similar response for different queries that have similar meanings.
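The sensitivity described above can be illustrated with a minimal sketch of word match retrieval. The function names, the two sample documents, and the tokenization rule are assumptions made for illustration only; they do not come from any particular prior art system.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))

def word_match(query_terms, documents, mode="AND"):
    """Return ids of documents whose words match the query terms exactly.

    mode="AND" requires every term to be present; mode="OR" requires any one.
    """
    combine = all if mode == "AND" else any
    hits = []
    for doc_id, text in documents.items():
        words = tokenize(text)
        if combine(term in words for term in query_terms):
            hits.append(doc_id)
    return sorted(hits)

# Hypothetical repository: both documents concern financial securities.
documents = {
    "doc1": "Stocks and bonds rallied after the rate announcement.",
    "doc2": "Equity and debt markets rallied after the rate announcement.",
}

first = word_match(["stocks", "bonds"], documents)
second = word_match(["equity", "debt"], documents)
print(first, second)
```

In this sketch the two queries return disjoint document sets even though both documents address the same subject, which is the behavior the preceding paragraph identifies as a shortcoming.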
Prior art search and retrieval systems do not draw inferences about the true content of documents available. If the search and retrieval system merely compares words in a document with words in a query, then the content of a document is not really being compared with the subject matter identified by the query term. For example, a restaurant review article may include words such as food quality, food presentation, service, etc., without expressly using the word restaurant because the topic, restaurant, may be inferred from the context of the article (e.g., the restaurant review article appeared in the dining section of a newspaper or travel magazine). For this example, a word comparison between a query term “restaurant” and the restaurant review article may not generate a match. Although the main topic of the restaurant review article is “restaurant”, the article would not be identified. Accordingly, it is desirable to infer topics from documents in a search and retrieval system in order to truly compare the content of documents with a query term.
Some words in the English language connote more than a single meaning. These words have different senses (i.e., different senses of the word connote different meanings). Typically, prior art search and retrieval systems do not differentiate between the different senses. For example, the query “stock” may refer to a type of financial security or to cattle. In prior art search and retrieval systems, a response to the query “stock” may include displaying a list of documents, some about financial securities and others about cattle. Without any further mechanism, if the query term has more than one sense, a user is forced to review the documents to determine the context of the response to the query term. Therefore, it is desirable to construct a search and retrieval system that displays the context of the response to the query.
Some prior art search and retrieval systems include a classification system to facilitate locating information. For these systems, information is classified into several pre-defined categories. For example, Yahoo!™, an Internet directory guide, includes a number of categories to help users locate information on the World Wide Web. To locate information in response to a search query, Yahoo!™ compares the words of the search query to the word strings of the pre-defined categories. If there is a match, the user is referred to web sites that have been classified for the matching category. However, similar to the word match search and retrieval systems, words of the search query must match words in the category names. Thus, it is desirable to construct a search and retrieval system that utilizes a classification system, but does not require matching words of the search query with words in the name strings of the categories.
Factual or document knowledge base query processing in a search and retrieval system identifies, in response to a query, a plurality of documents relevant to the query. The search and retrieval system stores content information that defines the overall content for a repository of available documents. To process a query, which includes at least one query term, the factual knowledge base query processing selects documents relevant to the query terms. In addition, the factual knowledge base query processing selects additional documents, through use of the content information, such that the additional documents have content in common with content in the original document set.
In one embodiment, the content information comprises a plurality of themes for a document, wherein the themes define the overall content for the document. For this embodiment, additional documents are potentially selected based on themes common to both the original document set and the additional documents selected.
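The theme-based selection of additional documents may be sketched as follows. This is an illustrative sketch only: the function name, the per-document word sets, and the theme assignments are hypothetical, and the method by which themes are identified for a document is not shown.

```python
def select_with_themes(query_terms, doc_words, doc_themes):
    """Select documents matching the query, then add documents sharing a theme.

    doc_words:  doc id -> set of words in the document (assumed given)
    doc_themes: doc id -> set of themes identified for the document (assumed given)
    """
    # Original document set: documents containing at least one query term.
    original = {
        doc_id for doc_id, words in doc_words.items()
        if any(term in words for term in query_terms)
    }
    # Pool the themes of every document in the original set.
    original_themes = set().union(*(doc_themes[d] for d in original))
    # Additional documents: not already selected, but sharing a pooled theme.
    additional = {
        doc_id for doc_id, themes in doc_themes.items()
        if doc_id not in original and themes & original_themes
    }
    return original, additional

# Hypothetical data: review2 never uses the word "restaurant".
doc_words = {
    "review1": {"restaurant", "food", "service"},
    "review2": {"food", "presentation", "service"},
    "ranch": {"stock", "cattle"},
}
doc_themes = {
    "review1": {"restaurants", "dining"},
    "review2": {"restaurants"},
    "ranch": {"cattle"},
}

original, additional = select_with_themes(["restaurant"], doc_words, doc_themes)
print(original, additional)
```

Here the second review is selected through the shared theme even though it contains no query word, while the unrelated document is left out.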
In one embodiment, the search and retrieval system utilizes a directed graph in a knowledge base. The directed graph links terminology having a lexical, semantic or usage association. The knowledge base is used to generate an expanded set of query terms, and the expanded query term set is used to select additional documents.
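One possible way to generate an expanded set of query terms from such a directed graph is a bounded breadth-first traversal, sketched below. The adjacency-list representation, the `max_hops` parameter, and the sample links are assumptions made for illustration; the knowledge base itself may be organized differently.

```python
from collections import deque

def expand_query(terms, graph, max_hops=1):
    """Return the query terms plus all terms reachable within max_hops links.

    graph: term -> list of associated terms (directed links, assumed given)
    """
    expanded = set(terms)
    queue = deque((term, 0) for term in terms)
    while queue:
        term, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not follow links beyond the hop limit
        for neighbor in graph.get(term, ()):
            if neighbor not in expanded:
                expanded.add(neighbor)
                queue.append((neighbor, hops + 1))
    return expanded

# Hypothetical links between associated terminology.
graph = {
    "stocks": ["equity", "financial securities"],
    "bonds": ["debt", "financial securities"],
    "equity": ["stocks"],
    "debt": ["bonds"],
}

from_stocks = expand_query(["stocks", "bonds"], graph)
from_equity = expand_query(["equity", "debt"], graph)
print(sorted(from_stocks))
print(sorted(from_equity))
```

With these sample links, the two differently worded queries expand to overlapping term sets, so a subsequent document selection would no longer depend solely on the exact words the user chose.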
In one embodiment, the knowledge base includes a plurality of categories. The search and retrieval system stores document theme vectors that classify the documents, as well as the themes identified for the documents, in categories of the knowledge base. The factual knowledge base query processing maps the expanded query term set to categories of the knowledge base, and selects documents, as well as the document themes, classified for those categories. Additional documents, which have themes common to the themes of the original document set, are selected. From all of the themes selected, theme groups, which include themes common to more than one document, are generated. Documents from theme groups are identified in response to the query. Also, the theme groups are ranked based on the relevance of the theme groups to the query. Furthermore, documents, within the theme groups, are ranked in order of relevance to the query.
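The generation and ranking of theme groups may be sketched as follows. The relevance measure is an assumption: each selected document is given a hypothetical integer weight per theme, a theme group is scored by the sum of its weights, and documents within a group are ordered by their own weight. The actual relevance computation is not specified in this summary.

```python
from collections import defaultdict

def build_theme_groups(doc_theme_weights):
    """Group documents by shared theme and rank groups and their documents.

    doc_theme_weights: doc id -> {theme: weight} (weights are assumed scores)
    Returns a list of (theme, group_score, ranked_doc_ids), best group first.
    """
    by_theme = defaultdict(dict)
    for doc_id, themes in doc_theme_weights.items():
        for theme, weight in themes.items():
            by_theme[theme][doc_id] = weight

    groups = []
    for theme, docs in by_theme.items():
        if len(docs) < 2:
            continue  # a theme group needs a theme common to more than one document
        ranked_docs = sorted(docs, key=lambda d: (-docs[d], d))
        groups.append((theme, sum(docs.values()), ranked_docs))

    # Rank the groups by score, breaking ties alphabetically by theme.
    groups.sort(key=lambda g: (-g[1], g[0]))
    return groups

# Hypothetical selected documents with per-theme weights.
selected = {
    "doc1": {"financial securities": 9, "interest rates": 4},
    "doc2": {"financial securities": 6, "banking": 5},
    "doc3": {"interest rates": 7, "banking": 2, "cattle": 8},
}

theme_groups = build_theme_groups(selected)
for theme, score, docs in theme_groups:
    print(theme, score, docs)
```

In this sketch the theme that appears in only one document forms no group, and the remaining groups are presented in descending order of their summed weights.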