1. Field of the Invention
This invention relates generally to computer-based methods and apparatus for processing data and, more specifically, to an automated document categorization system for filtering, retrieving, and performing similarity judgments among a sample pool of documents.
2. Brief Description of the Background Art
Known document retrieval and filtering systems generally hinge upon the ability of the system to gauge accurately how relevant and useful a selected document is to, for example, a previous document or an established category. A simple example of such a document retrieval and filtering system is one that is based on a technique that looks for keywords, i.e., words designated on a list as relevant, in a document to determine the document's relevancy. Under such a system, any document containing one or more keywords is presented as possibly relevant. This technique has many deficiencies, well known to those persons skilled in the art, both in its coverage of synonyms and in its contextual accuracy. In addition, other than counting the number of keywords found in a selected document and possibly ranking the documents for prototypicality, it is difficult, using such a keyword-based system, to distinguish between documents of the pool that all contain the keywords.
For example, with respect to coverage, documents may contain synonyms of the keywords (e.g., bike/bicycle). With regard to accuracy, many documents may contain the correct word, but with the wrong context or meaning (e.g., suit-of-clothes vs. suit-at-law). Numerous attempts have been made to correct or compensate for the weaknesses of keyword-based document retrieval systems, such as U.S. Pat. No. 5,020,019 (Ogawa); U.S. Pat. No. 4,985,863 (Fujisawa, et al.); U.S. Pat. No. 4,823,306 (Barbie); and U.S. Pat. No. 4,775,956 (Kaji et al.). However, in contrast to the above-mentioned background art, the present invention is not keyword based and does not rely on the use of keywords at all.
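The coverage and accuracy problems described above can be illustrated with a minimal sketch of a keyword filter. The function name `keyword_filter` and the sample sentences are hypothetical and are not taken from any of the cited patents; the sketch assumes simple case-insensitive, whitespace-delimited word matching.

```python
def keyword_filter(documents, keywords):
    """Return the documents that contain at least one keyword as a whole word."""
    wanted = {keyword.lower() for keyword in keywords}
    # A document is kept if any of its whitespace-delimited words is a keyword.
    return [doc for doc in documents if wanted & set(doc.lower().split())]


docs = [
    "I rode my bike to work",
    "She bought a new bicycle",
    "He wore a grey suit",
    "They filed a suit in court",
]

# Coverage failure: the synonym "bike" is not matched by the keyword "bicycle".
bicycle_hits = keyword_filter(docs, ["bicycle"])

# Accuracy failure: both senses of "suit" are returned and cannot be told apart.
suit_hits = keyword_filter(docs, ["suit"])
```

In this sketch the query for "bicycle" returns only the second sentence, and the query for "suit" returns both the clothing sentence and the legal sentence.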
Other document retrieval and filtering systems eschew keywords and instead are based on an incorporation of properties of a previously defined classification scheme, such as U.S. Pat. No. 5,568,640 (Nishiyama, et al.); U.S. Pat. No. 5,463,773 (Sakakibara, et al.); and U.S. Pat. No. 5,204,812 (Kasirj, et al.). Still other document retrieval and filtering systems known in the art are concerned with expected document properties, such as reference lists, for example U.S. Pat. No. 5,794,236 (Mehrle). In contrast, the present invention does not rely on or use previously defined classification schemes or expected document properties.
A fundamental weakness of keyword-based searches is that they are sensitive to document topic only and not to other aspects of a document, such as style, sublanguage, and register. This weakness persists in non-keyword-based techniques, such as those relying on incorporation of properties of a previously defined classification scheme or expected document properties for retrieving documents about the same topic. An advantage of the present invention is that it retrieves documents by the same author or in the same sublanguage (such as newspaper editorials) without necessarily having regard to topic.
Previous techniques for classifying documents by style or sublanguage, such as described in Biber's book Dimensions of Register Variation: A Cross-Linguistic Comparison (Cambridge: Cambridge University Press, 1995), and Somers' An Attempt to Use Weighted Cusums to Identify Sublanguage, Proceedings of NemlaP-2, 1997, have required the identification of specific characteristics of style or sublanguage from a variety of candidate characteristics. In contrast, the present invention uses a single candidate characteristic that encompasses all possible aspects of variation.
U.S. Pat. No. 5,418,951 (Damashek) discloses a method of identifying, retrieving, or sorting documents by language or topic involving the steps of creating an n-gram array for each document in a database, parsing an unidentified document or query into n-grams, assigning a weight to each n-gram, removing the commonality from the n-grams, comparing each unidentified document or query to each database document, scoring the unidentified document or query against each database document for similarity, and, based on the similarity score, identifying, retrieving, or sorting the document or query with respect to language or topic. In contrast, the present invention provides a direct distance measurement between two documents employing cross-entropy and KL-distance. It is important to note that the fixed n-gram statistics set forth in Damashek require normalization and thus are not sensitive to statistical variation. The method stated in Damashek requires approximately ten documents each having approximately one thousand characters to obtain a statistically significant sample size on which to perform language identification, and requires approximately fifty documents each having approximately one thousand characters to obtain a statistically significant sample size on which to perform topic identification. In contrast, the method and computerized data processing system of the present invention may be applied not only to determine document topic or language, but also authorship, register, style and similarity judgments, and requires a sample based on a single document wherein the single document has a small number of characters, for example but not limited to, from about one to one hundred or more characters, for estimating the similarity between the two documents.
The candidate function (i.e., entropy, KL-distance, or other information theoretic formalisms known by those skilled in the art) has been used in areas other than text classification such as set forth in U.S. Pat. No. 5,761,248 (Hagenauer, et al.); U.S. Pat. No. 5,023,611 (Chamzas, et al.); U.S. Pat. No. 4,964,099 (Carron); and U.S. Pat. No. 4,075,622 (Lawrence, et al.). The present invention applies these candidate functions to text classification and categorization. Because KL-distance is a metric function, using a candidate function allows one to meaningfully measure sublanguage or style divergence. Further, applicant's invention provides for the determination that one author is twice as far from a baseline as a second author, or for the determination of a probability that a given set of documents was or was not written by one author. Applicant's invention further provides for the filtering and retrieval of documents to allow for the discard of unsolicited commercial electronic mail received at an address on the internet.
Patrick Juola, the present applicant, is the author of "What Can We Do With Small Corpora? Document Categorization via Cross-Entropy", Proceedings of SimCat 1997: An Interdisciplinary Workshop on Similarity and Categorization, pages 132-142, Nov. 28-30, 1997, that discloses an information system for estimating entropy that produces accurate judgments of language or authorship based on a sample of one or more documents and a document in question, and a method of performing document categorization and similarity judgments. Applicant claims the benefit of priority to U.S. Provisional Patent Application Serial No. 60/109,82, filed Nov. 24, 1998, entitled, "Document Categorization And Evaluation Via Cross Entropy".
In spite of this background art, there remains a very real and substantial need for computer-based methods and apparatus as provided by the instant invention for categorizing documents by applying candidate functions to text classification for filtering, retrieving and performing similarity judgments among a sample pool of documents.
The present invention has met the above-described need. The computerized data processing system of the present invention provides for efficient and economical document categorization and evaluation by applying candidate functions to data classification. The computerized data processing system for document categorization of the present invention comprises computer processor means for processing data, storage means for storing data on a storage medium, first means for creating a first fixed-size sample of data from a first document, second means for creating a second fixed-size sample of data from a second document, third means for determining a match length within the first document, fourth means for determining the match length of the second fixed-size sample of data, fifth means for determining a mean match length of the second fixed-size sample of data, and sixth means for determining a cross-entropy between the first and second documents. The computer-based processing system of the present invention as described herein, further comprises a seventh means for determining the KL-distance from the first document to the second document, and eighth means for retrieving documents in a document retrieval system employing at least one of the following selected from the group consisting of the total of the sum of the individual match lengths, the mean match length, the cross-entropy and the KL-distance.
Another embodiment of the present invention, as described herein, further comprises categorization means for categorizing documents wherein the cross-entropy is determined between a plurality of the first documents, wherein the first documents are reference documents, and the second document, and wherein the second document is a novel document, and wherein one document selected from the first documents having a value of the cross-entropy closest to zero shall be categorized as the closest document to the second document, and wherein the document categorized as the closest document to the second document shall have its category assigned to the second document.
Another embodiment of the present invention, as described herein, further comprises filtering means for filtering documents wherein the cross-entropy is determined between a plurality of the first documents and the second document, and wherein the second document is a reference document, and wherein one document selected from the first documents having a cross-entropy value higher than a threshold value shall be filtered out.
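The categorization and filtering embodiments described above can be sketched as follows. This is an illustrative sketch only, not the claimed implementation: the function names, the base-2 logarithm, and the sample strings are assumptions, and the cross-entropy is estimated as the logarithm of the length of the first sample divided by the mean match length of the second sample against the first. The reference strings are chosen to have equal length so that their scores are comparable.

```python
import math


def cross_entropy(first: str, second: str) -> float:
    """Estimate cross-entropy as log2(len(first)) / mean match length (assumed base 2)."""
    if not first or not second:
        raise ValueError("both samples must be non-empty")
    total = 0
    for i in range(len(second)):
        n = 0
        # Grow the substring of `second` starting at i while it still occurs in `first`.
        while i + n < len(second) and second[i:i + n + 1] in first:
            n += 1
        total += n
    if total == 0:
        return math.inf  # no character of `second` occurs in `first`
    mean_match_length = total / len(second)
    return math.log2(len(first)) / mean_match_length


def categorize(references: dict[str, str], novel: str) -> str:
    """Assign the novel document the category of the reference with the lowest cross-entropy."""
    return min(references, key=lambda label: cross_entropy(references[label], novel))


def filter_documents(candidates: list[str], reference: str, threshold: float) -> list[str]:
    """Keep the candidates whose cross-entropy with the reference does not exceed the threshold."""
    return [doc for doc in candidates if cross_entropy(doc, reference) <= threshold]


# Hypothetical equal-length reference samples (34 characters each).
references = {
    "english": "the cat sat on the mat and the dog",
    "spanish": "el gato se sienta en la alfombra y",
}
novel = "the dog sat on the cat"

label = categorize(references, novel)
kept = filter_documents(list(references.values()), novel, threshold=2.0)
```

Because the estimate is non-negative, the value closest to zero is simply the minimum. The threshold of 2.0 is arbitrary and would in practice be tuned for the document pool at hand.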
In yet another embodiment of the present invention, a computerized method for categorizing documents by applying candidate functions to data classification comprises (a) providing a computer processor means for processing data; (b) providing a storage means for storing data on a storage medium; (c) providing a first means for creating a first fixed-size sample of data from a first document; (d) providing a second means for creating a second fixed-size sample of data from a second document; (e) providing a third means for determining a match length within the first document wherein the match length comprises the longest string of consecutive characters of the second fixed-size sample of data that also appears as a string of consecutive characters in the first fixed-size sample of data; (f) providing a fourth means for determining the match length at every successive character of the second fixed-size sample of data; (g) providing a fifth means for determining a mean match length wherein the mean match length comprises the total sum of the match lengths of the second fixed-size sample of data divided by the number of the characters in the second fixed-size sample of data; (h) providing a sixth means for determining a cross-entropy between the first document and the second document, wherein the cross-entropy comprises the logarithm of the number of the characters in the first fixed-size sample of data divided by the mean match length, and wherein the number of the characters in the first fixed-size sample of data is equal to the number of the characters in the second fixed-size sample of data; (i) providing a seventh means for determining the KL-distance from the first document to the second document, wherein the KL-distance comprises the difference between the cross-entropy of the first document and an entropy of the first document, and wherein the entropy is the mean match length within the first document; and (j) providing an eighth means for retrieving documents in a document retrieval system using at least one of the following selected from a group of the total sum of the match lengths, the mean match length, the cross-entropy, and the KL-distance.
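Steps (c) through (i) above can be sketched as follows. This is a minimal sketch under stated assumptions rather than a definitive implementation: the function names are hypothetical, the fixed-size sample is assumed to be the leading characters of a document, the logarithm is assumed to be base 2, and the entropy estimate of the first document is supplied by the caller rather than computed here.

```python
import math


def fixed_size_sample(document: str, size: int) -> str:
    """Steps (c)/(d): take the first `size` characters (assumed sampling strategy)."""
    return document[:size]


def match_length(first: str, second: str, start: int) -> int:
    """Step (e): longest run of characters of `second` beginning at `start` that occurs in `first`."""
    length = 0
    while start + length < len(second) and second[start:start + length + 1] in first:
        length += 1
    return length


def mean_match_length(first: str, second: str) -> float:
    """Steps (f)/(g): sum of match lengths at every character of `second`, divided by its length."""
    if not second:
        raise ValueError("second sample must be non-empty")
    total = sum(match_length(first, second, i) for i in range(len(second)))
    return total / len(second)


def cross_entropy(first: str, second: str) -> float:
    """Step (h): log of the number of characters in `first` divided by the mean match length."""
    mean = mean_match_length(first, second)
    if mean == 0:
        return math.inf  # the samples share no characters at all
    return math.log2(len(first)) / mean  # base 2 is an assumption


def kl_distance(cross_entropy_value: float, entropy_of_first: float) -> float:
    """Step (i): cross-entropy minus a caller-supplied entropy estimate of the first document."""
    return cross_entropy_value - entropy_of_first


first = fixed_size_sample("abcabc and more text", 6)    # "abcabc"
second = fixed_size_sample("abcxyz and more text", 6)   # "abcxyz"
mean = mean_match_length(first, second)                 # match lengths 3, 2, 1, 0, 0, 0
estimate = cross_entropy(first, second)
```

For these toy samples the match lengths sum to 6 over 6 characters, so the mean match length is 1.0 and the cross-entropy estimate is log2(6). A sample compared against itself yields longer matches and therefore a lower estimate.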
Another embodiment of the method of the present invention, as described herein, further comprises providing categorization means for categorizing documents and filtering means for filtering documents.
Yet other embodiments of the present invention, as described herein, further employ a plurality of second reference documents in performing document categorization.
The computerized data processing system for categorizing documents by applying candidate functions to data classification and the computerized method for document categorization of the present invention will be more fully understood from the following descriptions of the invention, the drawings and the claims appended thereto.