The relevancy ranking and clustering method and system for document indexing and retrieval of the present invention is intended to provide mechanisms for an information retrieval system to rank documents based on relevance to a query and in accordance with user feedback. A user can make queries in the form of natural language, keywords or predicates. Queries are converted into ontology-based predicate structures and compared against documents, which have been previously parsed for their predicates, to obtain the best possible matching documents, which are then presented to the user. The present method and system is designed to automate judgments about which documents are the best possible matches to a query within a given index. The system is further designed to allow users to provide feedback in order to fine-tune the automated judgment procedure.
As the volume of information available on the Internet increases, the need for methods to search, filter, and manage such information is increasing. Text categorization has become an important component in many information search and retrieval systems. Conventional search and retrieval systems commonly classified information into several predefined categories. For example, Yahoo!'s topic hierarchy provides a complex tree of directories to help users locate information on the Internet. In many search engine companies, trained text editors manually categorize information. Such manual classification of information is not only a very time-consuming and costly process, but is also plagued by inconsistencies and oversight problems. To overcome such problems, automated methods for categorizing text are becoming more common.
U.S. Pat. No. 5,418,948 to Turtle discloses an information retrieval system, which stems all input words (as well as removing stopwords), and matches the resulting queries against a table of known phrases in order to convert phrasal inputs into a standardized format. A Bayesian inference network ranks the results, where each document is associated with a set of probabilities for all of the words within the document. These probabilities are calculated with respect to a hierarchy of document-organization categories. Retrieval may be accomplished through two techniques, which can result in different rankings for the same collection of documents.
The first technique used by Turtle is document-based scanning, where each document is evaluated according to the probabilities of each of its attributes in order to determine the probability that it answers the query. After a sufficient number of documents are retrieved, while scanning continues through the collection, documents are only evaluated through a subset of their attributes. This means that after some critical number of documents is reached, documents which are unlikely to rank higher than the lowest-ranking document in the set are not added to the list of results.
The second technique involves so-called concept-based scanning, where all documents containing a query concept (including its synonyms as defined in a thesaurus) are evaluated according to the probability of the query attribute within the document. This means that only a few attributes are examined for each document, but they are the same for all documents. As with document-based scanning, documents are no longer added to the result set when a critical number is reached and the probability of a new document outranking any documents currently within the set is extremely low. The stopping criteria are not identical, and the interpretation of the same attribute probabilities may lead to different rankings for the same documents, even when matched by the same query. In both cases, scoring is calculated by averaging the probabilities for all of the attributes in the query (adding the individual probabilities, then dividing by the number of concepts).
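The averaging step described above can be illustrated with a minimal sketch. The probability values and the names average_score, doc_a and doc_b are hypothetical and chosen only for illustration; the sketch is not taken from the Turtle patent.

```python
def average_score(attribute_probabilities):
    """Add the per-attribute probabilities, then divide by their count."""
    if not attribute_probabilities:
        return 0.0
    return sum(attribute_probabilities) / len(attribute_probabilities)


# Hypothetical probabilities of three query concepts within two documents.
doc_a = [0.9, 0.1, 0.2]  # one strong concept, two weak ones
doc_b = [0.4, 0.4, 0.5]  # uniformly moderate concepts

scores = {"doc_a": average_score(doc_a), "doc_b": average_score(doc_b)}

# Documents ordered from highest to lowest averaged score.
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

In this example the uniformly moderate document outranks the document with one strongly matching concept, because averaging lets the two weak concepts pull its score down.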
Turtle's system is deficient in several respects. First, by scoring documents according to the relative occurrences of terms within the index, highly relevant documents with low-probability concepts may be missed. Second, periodic recalculation of attribute probabilities is a necessary performance penalty, if new documents will change the probability distribution of the attribute set. Third, the thesaurus-based approach treats synonyms as equally valuable terms in a query. This may expand the result set such that the stopping criteria described above end up filtering out documents containing the exact match in favor of documents containing a higher proportion of synonymous words. It is not clear that this is a desirable result from the standpoint of an end-user who is particularly interested in the exact word used for the query. Turtle's system does not take grammatical structure into account; in fact, it does not take adjacency information into account, since each document is treated as a “bag of words,” with no preservation of order information.
U.S. Pat. No. 4,270,182 to Asija discloses a system for asking free-form, un-preprogrammed, narrative questions. The system of Asija accepts unstructured text from multiple sources and divides the text into logical information units. These logical information units may be sentences, paragraphs, or entire documents; each logical information unit is assigned a unique identification number, and is returned as a whole when it is selected for retrieval. The retrieval system of Asija uses standard keyword-based lookup techniques.
The ranking procedure of Asija applies only to those logical information units that are ranked as equally relevant at the end of a preceding stage. Both synonyms and searchonyms are considered as equivalent to query words found within the logical information units. The net effect of the ranking and filtering process of Asija is to order documents by maximizing the number of query words matched, followed by the number of instances of query words. Furthermore, the Asija system does not take grammatical structure into account. In addition, synonyms are not exact matches for queries, and thus should be ranked lower. The Asija system also only makes use of literal text strings, as all synonyms must be specified by dictionary files that list text strings as equivalent.
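The net ordering attributed to Asija above (number of distinct query words matched first, number of instances second) can be sketched as a two-part sort key. The function name asija_style_key, the sample units and the whitespace tokenization are assumptions made for illustration and are not drawn from the Asija patent.

```python
def asija_style_key(unit_text, query_words):
    """Return (distinct query words matched, total instances of query words)."""
    words = unit_text.lower().split()
    distinct = sum(1 for q in query_words if q in words)
    instances = sum(words.count(q) for q in query_words)
    return (distinct, instances)


# Hypothetical logical information units keyed by identification number.
units = {
    1: "the cat sat on the mat",
    2: "cat cat cat",
    3: "the cat chased the dog and the dog ran",
}
query = ["cat", "dog"]

# Tuples compare element by element, so distinct matches dominate instances.
ranked_units = sorted(
    units, key=lambda uid: asija_style_key(units[uid], query), reverse=True
)
print(ranked_units)
```

Note that unit 2 outranks unit 1 purely through repetition of a single query word, with no regard to grammatical structure.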
A key feature of the present invention is the unique and novel method of representing text in the form of numerical vectors. The vectorization techniques of the present invention offer several advantages over other attempts to represent text in terms of numerical vectors. First, the numbers used are ontologically generated concept representations, with meaningful numerical relationships such that closely related concepts have numerically similar representations while more independent concepts have numerically dissimilar representations. Second, the concepts are represented in the numerical form as part of complete predicate structures, ontological units that form meaningful conceptual units, rather than simple independent words. Third, the vectorization method and system described herein provides a way to represent both large portions of long documents and brief queries with vector representations that have the same dimensionality. This permits rapid, efficient relevancy ranking and clustering by comparing the query vectors with substantial portions of documents, on the order of a page or more at a time, with no loss of accuracy or precision. Furthermore, it permits comparisons of large-scale patterns of concepts across entire documents rather than the small moving windows used in prior systems. These advantages provide the present method and system with unique performance and accuracy improvements over conventional systems.
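Because queries and document portions share the same dimensionality, any standard vector comparison can be applied to them directly. The following minimal sketch uses cosine similarity purely as an assumed comparison measure; the present description does not specify this measure, and the function name and sample vectors are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numerical vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimensionality")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # a zero vector carries no direction to compare
    return dot / (norm_u * norm_v)


# Hypothetical fixed-length vectors for a brief query and two document pages.
query_vector = [1.0, 2.0, 0.0]
page_close = [2.0, 4.0, 0.0]   # same direction as the query
page_far = [0.0, 0.0, 3.0]     # orthogonal to the query

print(cosine_similarity(query_vector, page_close))
print(cosine_similarity(query_vector, page_far))
```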
The basic premise of relevancy ranking and clustering is that a set of documents is sorted or ranked according to certain criteria, and clustered to group similar documents together in a logical, autonomous manner.
The relevancy ranking and clustering method and system of the present invention scores documents by word meaning and logical form in order to determine their relevance to the user's query. It also compares patterns of concepts found within documents to the concepts within the query to determine the most relevant documents to that query.
As part of the relevancy ranking and clustering method and system, documents and user queries are first parsed into ontological predicate structure forms, and those predicate structures are used to produce a novel numerical vector representation of the original text sources. The resulting vector representations of documents and queries are used by the relevancy ranking unit and the document clustering component of the present invention to perform the ranking and clustering operations described herein. The unique and novel method of producing the vector representations of documents and queries provides efficiency, accuracy, and precision to the overall operation of the relevancy ranking and clustering method and system.
Input queries and documents are parsed into one or more predicate structures using an ontological parser. An ontological parser parses a set of known documents to generate one or more document predicate structures. Those predicate structures are then used to generate vector representations of the documents and queries for later use by the ranking and clustering method and system.
The ranking and clustering method and system performs a comparison of each query predicate structure with each document predicate structure, and of document vectors to query vectors, to determine a matching degree, represented by a real number. A multilevel modifier strategy is implemented to assign different relevance values to the different parts of each predicate structure match to calculate the predicate structure's matching degree.
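One way to picture a multilevel modifier strategy is as a weighted sum over the parts of a predicate structure. The sketch below is illustrative only: the PredicateStructure class, the two weights and the exact-string matching of arguments are assumptions, not values or data structures given in this description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PredicateStructure:
    """Hypothetical predicate structure: a predicate plus its arguments."""
    predicate: str
    arguments: tuple


# Assumed relevance values for the two parts of a match.
PREDICATE_WEIGHT = 0.6
ARGUMENT_WEIGHT = 0.4


def matching_degree(query, document):
    """Real-valued matching degree of one query and one document predicate."""
    score = 0.0
    if query.predicate == document.predicate:
        score += PREDICATE_WEIGHT
    if query.arguments:
        shared = sum(1 for arg in query.arguments if arg in document.arguments)
        score += ARGUMENT_WEIGHT * shared / len(query.arguments)
    return score


query = PredicateStructure("buy", ("customer", "car"))
full_match = PredicateStructure("buy", ("customer", "car"))
partial_match = PredicateStructure("buy", ("customer", "house"))
weak_match = PredicateStructure("sell", ("dealer", "car"))

for candidate in (full_match, partial_match, weak_match):
    print(candidate.predicate, candidate.arguments, matching_degree(query, candidate))
```

Under these assumed weights, a document predicate that shares the predicate and one argument scores higher than one that shares only an argument.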
When many documents have a high similarity coefficient, the clustering process of the relevancy ranking and clustering method provides a separate, autonomous process of identifying documents most likely to satisfy the user's original query by considering conceptual patterns throughout each document, as opposed to individual concepts on a one-by-one basis.
The relevancy ranking and clustering method and system of the present invention provides a fine-grained level of detail for semantic comparisons, due to the fact that conceptual distance can be measured and weighted absolutely for all terms within a query. In addition, the relevancy ranking and clustering method and system of the present invention provides a sophisticated system for ranking by syntactic similarity, because syntactic evaluation occurs on lists of predicate arguments. The manner in which the arguments are derived is irrelevant, and can be accomplished through any syntactic parsing technique. This provides a more general-purpose ranking system.
The relevancy ranking and clustering method and system of the present invention ranks according to grammatical structure, not mere word adjacency. Thus, a passive sentence with more words than an equivalent active sentence would not cause relevant documents to be weighted lower. The relevancy ranking and clustering method and system of the present invention makes use of actual word meaning, and allows the user to control ranking based on word similarity, not just presence or absence.
The relevancy ranking and clustering method and system of the present invention also ranks according to conceptual co-occurrence rather than simple word or synonym co-occurrence as is found in other systems. This provides the advantage of recognizing that related concepts are frequently found near each other within bodies of text. The relevancy ranking and clustering method and system furthermore considers overall patterns of concepts throughout large documents and determines matches to query concepts no matter where in the document the matches occur.
The relevancy ranking and clustering method and system additionally provides a simple and efficient means of recognizing that the frequency of occurrence of concepts within a document often corresponds to the relative importance of those concepts within the document. The present invention autonomously recognizes frequently occurring query concepts located within a document and judges such documents as more relevant than documents in which the query concepts occur only rarely. The autonomous and efficient method of accomplishing this is based both on the unique vectorization techniques described herein and on the operation of the ranking and clustering method and system.
The relevancy ranking and clustering method and system of the present invention allows the user to specify whether documents similar to a particular document should be ranked higher or lower, and automatically re-ranks such documents based on a neural network. The neural network provides a coordinate system for making such judgments in an autonomous and non-subjective fashion, which does not require trial-and-error efforts from the user. Finally, there is no code generation or recompilation involved in the present system, which only performs the needed document clustering once; requests for similar or different information return a different segment of the result set, but without recomputing the relations between all the documents, as is required in a spreadsheet-like or other statistical approach.
The relevancy ranking and clustering method and system of the present invention uses grammatical relationship information to adjust ranking relations. Although words do not need to be grammatically related to each other within a document to include the document in the result set, such relationships serve to adjust the rankings of otherwise similar documents into a non-random, logical hierarchy. Furthermore, each word within the present system is a meaningful entity with mathematically meaningful distances relative to other concepts. Thus, synonyms are not treated as probabilistically equal entities, but are assigned lower rankings depending on how far they are from the exact query word given by the user.
Documents containing sentences that logically relate query terms are ranked higher than documents which simply contain instances of those terms. Similarly, thesaurus-like query expansion is made possible through the use of ontologies, and the present ranking system enables similar concepts to be graded according to the degree of their similarity. This capability represents a significant innovation over other purely statistical techniques.
In addition to giving higher weights to documents where search terms occur in close proximity, the relevancy ranking method and system of the present invention is able to make further discrimination by whether or not the search terms are bound together in a single predicate within a document. Additionally, the relevancy ranking and clustering method and system of the present invention is capable of discriminating between documents based on conceptual similarity so that conceptually similar, but inexact, matches receive lower weights than exactly matched documents.
In the relevancy ranking and clustering method and system of the present invention, the vector representations of individual documents and user queries are based not on individual words but on patterns of conceptual predicate structures. Dynamic alteration can be made to the content of the document sets, thus allowing the relevancy ranking and clustering method and system to begin its processing even before the search for potential matching documents is complete.
As a result, the relevancy ranking and clustering method and system of the present invention provides an automatic process to cluster documents according to conceptual meanings. The present system is designed to make fine discriminations in result ranking based on the degree of conceptual similarity between words, i.e., exactly matched words result in higher rankings than synonyms, which in turn result in higher rankings than parent concepts, which in turn result in higher rankings than unrelated concepts.
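The graded ordering described above (exact match, then synonym, then parent concept, then unrelated concept) can be sketched with a toy ontology. The dictionaries, the numerical weights and the function names below are hypothetical and serve only to illustrate the ordering; they do not represent the ontology or the weights of the present system.

```python
# Toy ontology fragments (assumed for illustration only).
SYNONYMS = {"car": {"automobile"}}
PARENTS = {"car": "vehicle"}

# Assumed weights that decrease as the conceptual relation weakens.
RELATION_WEIGHTS = {"exact": 1.0, "synonym": 0.75, "parent": 0.5, "unrelated": 0.0}


def relation(query_word, doc_word):
    """Classify how a document word relates to a query word."""
    if doc_word == query_word:
        return "exact"
    if doc_word in SYNONYMS.get(query_word, set()):
        return "synonym"
    if PARENTS.get(query_word) == doc_word:
        return "parent"
    return "unrelated"


def concept_score(query_word, doc_word):
    """Weight of a document word with respect to a query word."""
    return RELATION_WEIGHTS[relation(query_word, doc_word)]


candidates = ["vehicle", "banana", "automobile", "car"]
ranked_words = sorted(
    candidates, key=lambda word: concept_score("car", word), reverse=True
)
print(ranked_words)
```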