1. Field of the Invention
The present invention is directed toward the field of knowledge base systems, and more particularly toward automatically extending cross references in a knowledge base based on a corpus of documents.
2. Copyright Notice
This application contains material subject to copyright protection. The copyright owner has no objection to the facsimile reproduction of this material as it appears in the United States Patent and Trademark Office files or records, but otherwise reserves all copyright rights whatsoever.
3. Art Background
An information retrieval system attempts to match user queries (i.e., the user's statement of information needs) against the information available to the system. In general, the effectiveness of information retrieval systems may be evaluated in terms of many different criteria, including execution efficiency, storage efficiency, and retrieval effectiveness. Retrieval effectiveness is typically based on document relevance judgments. These relevance judgments are problematic since they are subjective and unreliable. For example, different judgment criteria assign different relevance values to information retrieved in response to a given query.
There are many ways to measure retrieval effectiveness in information retrieval systems. The most common measures used are “recall” and “precision.” Recall is defined as the ratio of the number of relevant documents retrieved for a given query to the total number of relevant documents for that query available in the repository of information. Precision is defined as the ratio of the number of relevant documents retrieved to the total number of documents retrieved. Both recall and precision take values ranging between zero and one. An ideal information retrieval system has both recall and precision values equal to one.
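By way of illustration only, the two measures defined above may be computed for a single query as in the following minimal sketch. The function name, the document identifiers, and the convention of returning zero when a denominator is empty are assumptions made for this example and are not drawn from any particular retrieval system.

```python
def recall_and_precision(retrieved, relevant):
    """Return (recall, precision) for one query.

    retrieved: identifiers of the documents the system returned
    relevant:  identifiers of all documents in the repository judged relevant
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = retrieved & relevant  # relevant documents that were retrieved

    # Assumption for this sketch: an empty denominator yields 0.0.
    recall = len(hits) / len(relevant) if relevant else 0.0
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    return recall, precision


# Hypothetical query result: four documents returned, three relevant overall.
retrieved_docs = ["d1", "d2", "d3", "d4"]
relevant_docs = ["d2", "d4", "d5"]
recall, precision = recall_and_precision(retrieved_docs, relevant_docs)
```

In this hypothetical example, two of the three relevant documents are retrieved, so recall is 2/3, and two of the four retrieved documents are relevant, so precision is 1/2.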
Some information retrieval tools, such as Oracle® Corporation's interMedia Text, use a lexicon in order to improve precision and recall. The lexicon consists of a very large repository of language-specific words/phrases, their corresponding part-of-speech information, and their relationships to each other. These lexicons are mostly language dependent and are manually constructed. A typical lexicon contains about half a million words/phrases for the English language. The process of manually establishing relationships between such large numbers of words is time-consuming.
Typically, the entries in these lexicons are arranged in a tree-shaped hierarchy. Some relationships for a hierarchical lexicon include parent-child and child-parent relationships. In addition, another relationship establishes an association between any two words in the lexicon. For purposes of nomenclature, this relationship is referred to as a “cross reference relationship.” In general, cross references may be characterized as links between two different nodes or words within a hierarchical tree structure. In some manifestations, these cross reference relationships include an associated weight to indicate the strength with which the two nodes are related.
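For purposes of explanation, a hierarchical lexicon with weighted cross references of the kind described above might be represented as in the following minimal sketch. The class names, method names, example terms, and weight value are hypothetical, and storing each cross reference symmetrically on both nodes is an assumption of this example rather than a property of any particular lexicon.

```python
class LexiconNode:
    """One word or phrase in the hierarchy."""

    def __init__(self, term):
        self.term = term
        self.parent = None    # child-parent relationship
        self.children = []    # parent-child relationships
        self.cross_refs = {}  # related term -> weight (strength of relation)


class Lexicon:
    """Tree-shaped hierarchy of terms plus weighted cross references."""

    def __init__(self):
        self.nodes = {}

    def add_term(self, term, parent=None):
        node = LexiconNode(term)
        if parent is not None:
            parent_node = self.nodes[parent]
            node.parent = parent_node
            parent_node.children.append(node)
        self.nodes[term] = node
        return node

    def add_cross_reference(self, term_a, term_b, weight):
        # Assumption: the link is stored in both directions.
        self.nodes[term_a].cross_refs[term_b] = weight
        self.nodes[term_b].cross_refs[term_a] = weight


# Hypothetical terms; the cross reference links nodes in different branches.
lexicon = Lexicon()
lexicon.add_term("science")
lexicon.add_term("physics", parent="science")
lexicon.add_term("industry")
lexicon.add_term("energy", parent="industry")
lexicon.add_cross_reference("physics", "energy", 0.8)
```

Here the parent and children attributes carry the tree relationships, while the cross reference connects two nodes that have no ancestor in common.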
Since lexicons are manually constructed, the words and phrases within them cannot possibly span all areas of interest and knowledge. This is especially true when it comes to new knowledge and terminology. In addition, the cross reference relationships within one area of interest may be drastically different from those within another area of interest. Thus, there are two problems associated with generating cross references among words in a lexicon. First, a problem exists as to how to establish cross reference relationships with words not already present in the lexicon, even though these relationships are pertinent to a dataset (i.e., documents) under analysis. Second, a problem exists as to how to establish new cross references among existing words based on the specific usage of words in the dataset under analysis. A system that solves these problems leads to improved precision and recall in information retrieval systems.