The present invention relates generally to categorization of content and more particularly to the automatic categorization of documents based upon textual content.
A number of different techniques have been developed to help automatically classify documents into categories based upon their textual appearance. The techniques can be largely grouped into five categories: techniques based on phoneme characteristics, techniques based on statistics for n-grams, techniques based on statistics for keywords, rule-based techniques, and techniques based on semantic topics. These different categories of techniques will be discussed in more detail below.
These classification techniques are generally based on measures of relationships between objects, such as “similarity” or “association.” The measure of the relationship is designed to quantify the likeness among objects. Any given object in a group is more like the other members of the group than it is like an object outside the group.
Techniques based on phoneme characteristics examine phonemes in the document. A “phoneme” is the smallest significant phonetic unit in a language that can be used to distinguish one word from another. For example, the “p” in the word “pit” may be a phoneme. Certain prior art systems have used hidden Markov models (HMMs) for phonemes to model language. One example of a system that used HMMs is U.S. Pat. No. 5,805,771, issued Sep. 8, 1998, to Muthusany, et al. This patent describes a language identification system that models phonemes using HMMs. The patent proposes several enhancements to traditional HMM-based language identification systems, including a language-independent acoustic model. This technique appears limited to the processing of phonemes.
U.S. Pat. No. 5,625,748, issued Apr. 29, 1997 to McDonough et al., concerns a method for discriminating topics from speech events. The method is based on a word/phrase spotter and a topic classifier that is trained on topic-dependent event frequencies.
N-gram based techniques examine n-grams to categorize documents. An “n-gram” is a letter sequence of length n. Such techniques have been shown to be useful for identifying the native language in which a document is written. A paper by Gregory Grefenstette, “Comparing Two Language Identification Schemes,” Proceedings of the 3rd International Conference on the Statistical Analysis of Textual Data, JADT’95, December 1995, describes an n-gram based technique. With this technique, statistics are gathered on three-letter sequences, known as trigrams, as the basic signatures for a language. The paper compares n-gram techniques and a keyword-based technique and, as a result of the comparison, favors the n-gram technique.
U.S. Pat. No. 5,418,951, issued May 23, 1995, to Damashek et al., also discloses an n-gram technique. The Damashek patent describes a method of retrieving documents that concern similar semantic topics or documents that are written in similar languages. The method relies upon the creation of a letter n-gram array for each document.
U.S. Pat. No. 5,062,143 issued Oct. 29, 1991 to Schmitt also concerns an n-gram technique. The Schmitt patent describes a technique for identifying the language in which a document is written based upon statistical analysis of letter trigrams appearing in the document. The technique uses a simple threshold based on the number of matches to determine the identity of the language in which the document is written. The matches are made against a library of precompiled key sets for a number of different languages.
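The general idea of trigram-based language identification with a match threshold can be illustrated with a minimal Python sketch. This is not the method of any of the cited patents or papers; the function names, the way key sets are built from sample text, and the 0.3 threshold are all assumptions made for illustration.

```python
from collections import Counter


def letter_trigrams(text):
    """Return every three-character sequence in the text (lowercased,
    non-letters replaced by single spaces)."""
    cleaned = "".join(ch.lower() if ch.isalpha() else " " for ch in text)
    cleaned = " ".join(cleaned.split())
    return [cleaned[i:i + 3] for i in range(len(cleaned) - 2)]


def build_key_set(sample_text, size=50):
    """Hypothetical key set: the `size` most frequent trigrams in a sample."""
    counts = Counter(letter_trigrams(sample_text))
    return {gram for gram, _ in counts.most_common(size)}


def identify_language(text, key_sets, threshold=0.3):
    """Return the language whose key set matches the largest share of the
    text's trigrams, or None if no share reaches the assumed threshold."""
    grams = letter_trigrams(text)
    if not grams:
        return None
    best_language, best_share = None, 0.0
    for language, keys in key_sets.items():
        share = sum(1 for gram in grams if gram in keys) / len(grams)
        if share > best_share:
            best_language, best_share = language, share
    return best_language if best_share >= threshold else None
```

In this sketch a library of key sets would be compiled once from sample text for each language and then reused for every document to be identified.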
Techniques based on statistics of words gather statistical information regarding keywords within documents. These statistics are then used to categorize a document. For example, word frequencies may be utilized to categorize a document. The presence of one or more frequently occurring words in a document may serve as a fairly reliable indicator that the document is of a specified document category.
U.S. Pat. No. 5,182,708, issued Jan. 26, 1993 to Ejiri, discloses a method for classifying a document into one of several predefined categories on the basis of word frequency. The method pivots around statistical characteristics of constrained languages and unconstrained languages. These statistical characteristics enable the method to distinguish, for example, English spoken by a native speaker from English spoken by a speaker for whom English is a second language and from programming language code.
Rule-based techniques employ rules to implement categorization. For example, a rule may indicate that if a document contains the word “byte,” the document relates to computers. This approach requires human experts to encode intelligence in the form of rules into the system. The rules are then used to perform the categorization.
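A rule of this kind might be sketched as follows in Python. Only the byte/computers rule comes from the text above; the second rule, the names RULES and categorize_by_rules, and the whole-word matching are hypothetical choices made for illustration.

```python
import re

# Hand-written rules: each pairs a trigger word with a category.
# The "touchdown" rule is an invented example, not taken from any source.
RULES = [
    ("byte", "computers"),
    ("touchdown", "sports"),
]


def categorize_by_rules(text, rules=RULES):
    """Return the sorted categories whose trigger word appears in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return sorted({category for trigger, category in rules if trigger in words})
```

Because the sketch matches whole words only, the plural “bytes” would not fire the first rule. A human expert would have to add a further rule or a stemming step, which illustrates the manual effort this approach requires.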
Techniques based on semantic topics exploit relationships between words and semantic topics in documents. The relationships are used to assign topics to documents based on the words that are used within the documents. The assignment of the topics is a form of categorization.
U.S. Pat. No. 5,687,364, issued Nov. 1, 1997 to Saund, et al, describes a method of learning relationships between words and semantic topics in documents. The method assigns topics to documents based upon the words used in the documents. The method learns topics from the training data and associates a list of words with each topic. At run time, based on the words used in the document, the system identifies the matching topics.
U.S. Pat. No. 5,873,056, issued Feb. 16, 1999 to Liddy, et al, describes methods for predicting a semantic vector that represents a document based on words contained within the document. The method relies upon a lexical database and subject codes. Unambiguous words are assigned subject codes. The remaining words in the document are disambiguated based on an examination of the frequency with which other subject codes appear. Topic assignment is also based on the frequency with which other subject codes appear.
The present invention provides an approach to categorizing documents that is highly efficient. In addition, the approach adopted by the present invention may be empirically adjusted to produce improved results in view of feedback. In one embodiment of the present invention, a neutral category is utilized that represents documents that are not encompassed by the other categories. In determining whether to place a document in a given category, a comparison is made to determine whether the document better fits into the neutral category or the given category. The present invention is generalizable enough to perform many different types of categorization. Moreover, the present invention is extensible and requires little or no manual work to add a category. Still further, the present invention can handle multiple unrelated categories.
In accordance with one aspect of the present invention, a method of categorizing a selected document based upon textual content is performed. In this method, document categories are provided into which documents may be categorized. A lexicon of tokens is provided for training. The tokens are divided into partitions based on the frequency of occurrence of the tokens in respective subsets of training materials for each document category. For each token in the selected document, a metric of the frequency of occurrence of the token in the selected document is calculated per document category. For each of the partitions of each of the categories, a deviation factor is calculated using the calculated metric of frequency of occurrence of the token in the selected document per document category. Each deviation factor identifies the extent of deviation of the calculated metric in the partition. For each category, the deviation factors for the partitions of the category are used to determine whether the document is to be categorized in the document category. The tokens may take many forms but may be words in some embodiments. The selected document may be an electronic mail message, a word processing document, a document that contains computing instructions or any of a number of other types of documents.
In accordance with another aspect of the present invention, a method of categorizing an input document is performed on an electronic device. In this method, a neutral category is provided for documents that do not fit into any of the other document categories. For each word in the input document and for each document category, a difference is determined between a frequency with which the word occurs in the input document and an average frequency with which the word occurs in training documents for the document category. For each document category other than the neutral category, frequency z-scores of words in the input document are compared with frequency z-scores of the words in the training documents for the category and are also compared with frequency z-scores of the words in the training documents for the neutral category to determine whether the input document is to be added to the document category.
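The comparison against a neutral category can be illustrated with a rough Python sketch. The text above does not fix the details, so several choices here are assumptions: word frequencies are relative frequencies, the z-score of a word is its difference from the training mean divided by the training standard deviation, a floor (MIN_STD) stands in for a zero deviation, and the per-category score is the mean absolute z-score. All function names are hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev

MIN_STD = 0.01  # assumed floor so that a zero deviation does not divide by zero


def relative_frequencies(text):
    """Map each word to its share of the words in the text."""
    words = text.lower().split()
    counts = Counter(words)
    return {word: count / len(words) for word, count in counts.items()}


def train(category_docs):
    """Per word: mean and standard deviation of its relative frequency
    across the training documents of one category."""
    freqs = [relative_frequencies(doc) for doc in category_docs]
    vocabulary = set().union(*freqs)
    stats = {}
    for word in vocabulary:
        values = [f.get(word, 0.0) for f in freqs]
        stats[word] = (mean(values), pstdev(values))
    return stats


def mean_abs_z(doc_freqs, stats):
    """Average absolute z-score of the document's words for one category."""
    scores = []
    for word, freq in doc_freqs.items():
        mu, sigma = stats.get(word, (0.0, 0.0))  # unseen word: mean 0
        scores.append(abs(freq - mu) / max(sigma, MIN_STD))
    return mean(scores) if scores else 0.0


def categorize(text, models, neutral="neutral"):
    """Return the categories the document deviates from less than it
    deviates from the neutral category."""
    doc = relative_frequencies(text)
    neutral_deviation = mean_abs_z(doc, models[neutral])
    return sorted(
        name for name, stats in models.items()
        if name != neutral and mean_abs_z(doc, stats) < neutral_deviation
    )
```

In this sketch a document may land in several categories or in none. A document that is no closer to any category than to the neutral one is simply left uncategorized.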