As the volume of information available on the Internet increases, so does the need for methods to search, filter, and manage that information. Text categorization has become an important component in many information search and retrieval systems. Conventional search and retrieval systems commonly classify information into several predefined categories. For example, Yahoo!'s topic hierarchy provides a complex tree of directories to help users locate information on the Internet. In many search engine companies, trained text editors manually categorize information. Such manual classification of information is not only a very time-consuming and costly process, but is also plagued by inconsistencies and oversight problems. To overcome such problems, automated methods for categorizing text are becoming more common.
U.S. Pat. No. 5,418,948 to Turtle discloses an information retrieval system that stems all input words, removes stopwords, and matches the resulting queries against a table of known phrases in order to convert phrasal inputs into a standardized format. A Bayesian inference network ranks the results, where each document is associated with a set of probabilities for all of the words within the document. These probabilities are calculated with respect to a hierarchy of document-organization categories. Retrieval may be accomplished through two techniques, which can result in different rankings for the same collection of documents.
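The query normalization described above can be illustrated with a minimal sketch. The stopword list, the suffix-stripping rule, and the phrase table shown here are hypothetical stand-ins chosen for illustration; they are not taken from the Turtle patent, and a practical system would use a full stemming algorithm and a much larger phrase table.

```python
# Hypothetical stopword list and phrase table (illustrative only).
STOPWORDS = {"a", "an", "the", "of", "in", "for", "and", "to"}
PHRASES = {("inform", "retriev"): "information_retrieval"}


def naive_stem(word):
    """Strip one common suffix; a crude stand-in for a real stemmer."""
    for suffix in ("ation", "ing", "al", "s"):
        # Keep at least three characters of the stem.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize_query(text):
    """Remove stopwords, stem the rest, and replace known stem pairs
    with a standardized phrase token."""
    stems = [naive_stem(w) for w in text.lower().split() if w not in STOPWORDS]
    result = []
    i = 0
    while i < len(stems):
        pair = tuple(stems[i:i + 2])
        if pair in PHRASES:
            result.append(PHRASES[pair])  # two stems collapse into one phrase
            i += 2
        else:
            result.append(stems[i])
            i += 1
    return result


print(normalize_query("the information retrieval systems"))
```

In this sketch the phrase lookup operates on stemmed word pairs only, so inflected forms of the same phrase map to a single standardized token.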
The first technique used by Turtle is document-based scanning, where each document is evaluated according to the probabilities of each of its attributes in order to determine the probability that it answers the query. Once a sufficient number of documents has been retrieved, scanning continues through the collection, but the remaining documents are evaluated using only a subset of their attributes. This means that after some critical number of documents is reached, documents which are unlikely to rank higher than the lowest-ranking document in the set are not added to the list of results.
The second technique involves so-called concept-based scanning, where all documents containing a query concept (including its synonyms as defined in a thesaurus) are evaluated according to the probability of the query attribute within the document. This means that only a few attributes are examined for each document, but they are the same for all documents. As with document-based scanning, documents are no longer added to the result set when a critical number is reached and the probability of a new document outranking any documents currently within the set is extremely low. The stopping criteria are not identical, and the interpretation of the same attribute probabilities may lead to different rankings for the same documents, even when matched by the same query. In both cases, scoring is calculated by averaging the probabilities for all of the attributes in the query (adding the individual probabilities, then dividing by the number of concepts).
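The averaging score and the bounded result set described above can be sketched as follows. This is a simplified illustration, not Turtle's implementation: the function names are hypothetical, a concept missing from a document is assumed to have probability 0.0, and the stopping criterion is reduced to a fixed-size result set in which a new document is admitted only if it outranks the current lowest-ranking entry.

```python
import heapq


def average_score(doc_probs, query_concepts):
    """Add the per-concept probabilities, then divide by the number of concepts.
    Assumption: a concept absent from the document contributes 0.0."""
    total = sum(doc_probs.get(c, 0.0) for c in query_concepts)
    return total / len(query_concepts)


def top_k(docs, query_concepts, k):
    """Keep at most k results; once the set is full, a document is added
    only if it scores above the lowest-ranking document in the set."""
    heap = []  # min-heap of (score, doc_id); heap[0] is the lowest-ranking entry
    for doc_id, probs in docs.items():
        score = average_score(probs, query_concepts)
        if len(heap) < k:
            heapq.heappush(heap, (score, doc_id))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, doc_id))
    return sorted(heap, reverse=True)  # highest score first


# Hypothetical per-document concept probabilities.
docs = {
    "d1": {"cat": 0.8, "dog": 0.4},
    "d2": {"cat": 0.2},
    "d3": {"cat": 0.5, "dog": 0.5},
}
ranked = top_k(docs, ["cat", "dog"], k=2)
print([doc_id for _, doc_id in ranked])
```

Because the score is a plain average, a document that matches one concept strongly and another not at all can rank below a document that matches both concepts moderately.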
Turtle's system is deficient in several respects. First, by scoring documents according to the relative occurrences of terms within the index, highly relevant documents with low-probability concepts may be missed. Second, attribute probabilities must be recalculated periodically, at a cost in performance, whenever new documents change the probability distribution of the attribute set. Third, the thesaurus-based approach treats synonyms as equally valuable terms in a query. This may expand the result set such that the stopping criteria described above end up filtering out documents containing the exact match in favor of documents containing a higher proportion of synonymous words. It is not clear that this is a desirable result from the standpoint of an end-user who is particularly interested in the exact word used for the query. Finally, Turtle's system does not take grammatical structure into account; in fact, it does not even take adjacency information into account, since each document is treated as a “bag of words,” with no preservation of order information.
U.S. Pat. No. 4,270,182 to Asija discloses a system for asking free-form, un-preprogrammed, narrative questions. The system of Asija accepts unstructured text from multiple sources and divides the text into logical information units. These logical information units may be sentences, paragraphs, or entire documents; each logical information unit is assigned a unique identification number, and is returned as a whole when it is selected for retrieval. The retrieval system of Asija uses standard keyword-based lookup techniques.
The procedure of Asija applies only to the logical information units, which are ranked as equally relevant at the end of a preceding stage. Both synonyms and searchonyms are considered equivalent to query words found within the logical information units. The net effect of the ranking and filtering process of Asija is to order documents first by the number of query words matched, and then by the number of instances of query words. Furthermore, the Asija system does not take grammatical structure into account. In addition, synonyms are not exact matches for queries, and thus should be ranked lower. The Asija system also makes use only of literal text strings, as all synonyms must be specified by dictionary files that list text strings as equivalent.
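The net ordering effect described above, namely the number of distinct query words matched followed by the number of instances of query words, can be sketched as a two-part sort key. The function name, the whitespace tokenization, and the sample units below are assumptions made for illustration and do not reproduce the Asija system; synonym and searchonym expansion is omitted.

```python
def rank_units(units, query_words):
    """Order logical information units by (1) distinct query words matched
    and (2) total instances of query words, both descending."""
    query = {w.lower() for w in query_words}

    def sort_key(item):
        unit_id, text = item
        tokens = text.lower().split()
        distinct = len(query & set(tokens))
        instances = sum(1 for t in tokens if t in query)
        # Negate so that larger counts sort first; unit_id breaks ties.
        return (-distinct, -instances, unit_id)

    return [unit_id for unit_id, _ in sorted(units.items(), key=sort_key)]


# Hypothetical logical information units keyed by identification number.
units = {1: "cat sat", 2: "cat cat cat", 3: "cat dog"}
print(rank_units(units, ["cat", "dog"]))
```

Unit 3 ranks first because it matches both query words, even though unit 2 contains more instances of a single query word.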
A key feature of the present invention is the unique and novel method of representing text in the form of numerical vectors. The vectorization techniques of the present invention offer several advantages over other attempts to represent text in terms of numerical vectors. First, the numbers used are ontologically generated concept representations, with meaningful numerical relationships such that closely related concepts have numerically similar representations while more independent concepts have numerically dissimilar representations. Second, the concepts are represented in numerical form as part of complete predicate structures, ontological units that form meaningful conceptual units, rather than as simple independent words. Third, the vectorization method and system described herein provide a way to represent both large portions of long documents and brief queries with vector representations that have the same dimensionality. This permits rapid, efficient relevancy ranking and clustering by comparing the query vectors with substantial portions of documents, on the order of a page or more at a time, with no loss of accuracy or precision. Furthermore, it permits comparisons of large-scale patterns of concepts across entire documents rather than the small moving windows used in prior systems. These advantages provide the present method and system with unique performance and accuracy improvements over conventional systems.
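The benefit of giving queries and document portions the same dimensionality can be illustrated with a minimal sketch. The text above does not specify how vectors are compared; cosine similarity is used here purely as an assumed, illustrative measure, and the vectors and names are hypothetical rather than actual ontologically generated representations.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors of equal dimensionality."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimensionality")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat an all-zero vector as unrelated to everything
    return dot / (norm_u * norm_v)


def rank_by_similarity(query_vec, doc_vecs):
    """Return document-portion ids ordered from most to least similar."""
    return sorted(doc_vecs, key=lambda d: cosine(query_vec, doc_vecs[d]), reverse=True)


# Hypothetical fixed-length vectors for a brief query and three page-sized portions.
query_vec = [1.0, 0.0, 1.0]
doc_vecs = {
    "page_a": [2.0, 0.0, 2.0],
    "page_b": [0.0, 1.0, 0.0],
    "page_c": [1.0, 1.0, 1.0],
}
print(rank_by_similarity(query_vec, doc_vecs))
```

Because every vector has the same length, a short query can be compared directly with a page-sized portion of a document without padding or windowing.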