1. Field of the Invention
The present invention generally relates to supervised learning as applied to text categorization, and, more particularly, to a method for categorizing messages or documents containing text.
2. Background Description
The text categorization problem is to determine predefined categories for an incoming unlabeled message or document containing text based on information extracted from a training set of labeled messages or documents. Text categorization is an important practical problem for companies that wish to use computers to categorize incoming email, thereby either enabling an automatic machine response to the email or simply ensuring that the email reaches the correct human recipient. Beyond email, text items to be categorized may come from many sources, including the output of voice recognition software, collections of documents (e.g., news stories, patents, or case summaries), and the contents of web pages.
For the purposes of the following description, any data item containing text is referred to as a document, and the term herein is to be taken in this most general sense.
Previous text categorization methods have used decision trees, naive Bayes classifiers, nearest neighbor methods, neural nets, support vector machines and various kinds of symbolic rule induction.
The present invention relates to symbolic rule induction systems, so such systems will now be described at a general level that is known in the art. In such a system, data is represented as vectors in which the components are numerical values associated with certain features of the data. The system induces rules from the training data, and the generated rules can then be used to categorize arbitrary data that is similar to the training data. Each rule ultimately produced by such a system states that a condition, which is usually a conjunction of simpler conditions, implies membership in a particular category. The condition forms the antecedent of the rule, and the conclusion posited as true when the condition is satisfied is the consequent of the rule. Usually, a data item is represented as a vector of numerical components, with each component corresponding to a possible feature of the data, and the antecedent of a rule is a combination of tests to be done on various components. Under a scenario in which features are words that may appear in a document and the corresponding numerical values in vectors representing documents are word counts, an example of a rule is
share > 3 and year <= 1 and acquire > 2 → acq
which may be read as "if the word 'share' occurs more than three times in the document and the word 'year' occurs at most one time in the document and the word 'acquire' occurs more than twice in the document, then classify the document in the category 'acq'." Here the antecedent is
share > 3 and year <= 1 and acquire > 2
and the consequent is acq. Alternatively, the rule above could be read as "if words equivalent to 'share' occur more than three times in the document and words equivalent to 'year' occur at most one time in the document and words equivalent to 'acquire' occur more than twice in the document, then classify the document in the category 'acq'." This latter reading of the rule reflects an assumption that stemming was done. Stemming is the replacement of words by corresponding canonical forms (or stems). Existing symbolic rule induction systems do not categorize documents accurately enough for many commercial applications, or their training time is excessive, or both.
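The rule format described above can be illustrated with a minimal Python sketch. The data layout (a list of word, operator, threshold triples), the names OPS, RULE, word_counts, and fires, and the sample document are all assumptions made for illustration; stemming is omitted, so this corresponds to the first reading of the rule.

```python
import operator
from collections import Counter

# Comparison operators permitted in a test (an illustrative choice).
OPS = {">": operator.gt, "<=": operator.le, "=": operator.eq}

# The example rule: share > 3 and year <= 1 and acquire > 2 -> acq
RULE = {
    "antecedent": [("share", ">", 3), ("year", "<=", 1), ("acquire", ">", 2)],
    "consequent": "acq",
}

def word_counts(text):
    """Represent a document as a sparse vector of word counts (no stemming)."""
    return Counter(text.lower().split())

def fires(rule, counts):
    """True when every test in the antecedent holds; absent words count as 0."""
    return all(OPS[op](counts[word], threshold)
               for word, op, threshold in rule["antecedent"])

# Hypothetical document: 'share' x4, 'acquire' x3, 'year' x1.
doc = "share share share share acquire acquire acquire year"
counts = word_counts(doc)
category = RULE["consequent"] if fires(RULE, counts) else None
print(category)  # acq
```

Because the antecedent is a conjunction, a single failing test is enough to keep the rule from firing.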
It is therefore an object of the present invention to provide a method to automatically categorize messages or documents containing text. The hitherto unsolved practical problem in the field of text categorization is to provide a general text categorization system that in turn provides superior performance in six different ways. These six aspects, which will be explained in more detail below, are:
1. precision,
2. recall,
3. provision for multiple categorization,
4. provision of confidence levels,
5. training speed, and
6. insight and control.
Previous systems fall short on one or more of these desired features. The present invention solves this problem by delivering high performance or providing required functionality in each way.
Precision and recall (1 and 2) are basic measures of the performance of a categorizer. Precision is the proportion of the decisions to place documents in specific categories made by a text categorization system that are correct. Recall is the proportion of the actual category assignments that are identified correctly by a text categorization system. Precision and recall are much more useful measures of performance in the area of text categorization than the error rate, which is commonly used in most other areas of machine learning. This is because, in text categorization, one typically has many small categories, and so one could obtain a categorizer with a low error rate by simply using a categorizer that placed no document in any category, but such a categorizer would have very little practical utility. Of course, there is a connection between a categorizer's error rate, on one hand, and a categorizer's recall and precision, on the other, because one cannot simultaneously have excellent recall and precision along with a poor error rate.
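The contrast between error rate and precision/recall can be made concrete with a small Python sketch. It assumes, for illustration only, that category assignments are stored as sets of (document, category) pairs and that the error rate is measured over every yes/no (document, category) decision; the data are hypothetical.

```python
def precision_recall(predicted, actual):
    """predicted and actual are sets of (document_id, category) pairs."""
    correct = len(predicted & actual)
    # Precision is undefined (None) when no assignment is made at all.
    precision = correct / len(predicted) if predicted else None
    recall = correct / len(actual) if actual else None
    return precision, recall

def error_rate(predicted, actual, n_docs, n_categories):
    """Fraction of all (document, category) yes/no decisions that are wrong."""
    wrong = len(predicted ^ actual)  # missed assignments plus false assignments
    return wrong / (n_docs * n_categories)

# Hypothetical data: 100 documents, 10 categories, 20 true assignments.
N_DOCS, N_CATS = 100, 10
actual = {(d, d % 10) for d in range(20)}

# A categorizer that places no document in any category.
empty = set()
empty_pr = precision_recall(empty, actual)              # (None, 0.0)
empty_err = error_rate(empty, actual, N_DOCS, N_CATS)   # 20 / 1000

# A categorizer that finds 15 true assignments and makes 5 false ones.
useful = {(d, d % 10) for d in range(15)} | {(d, (d + 1) % 10) for d in range(50, 55)}
useful_pr = precision_recall(useful, actual)            # (0.75, 0.75)
useful_err = error_rate(useful, actual, N_DOCS, N_CATS) # 10 / 1000
```

Under these assumptions the empty categorizer has an error rate of only 2 percent yet a recall of zero, which is why the error rate alone says little when categories are small.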
Multiple categorization (3) is the possibility for a single document to be assigned to more than one category. This is an essential kind of flexibility needed in many applications. However, a text categorization system that provides for multiple categorization is well-served by a method for assessing the significance of more than one category being assigned to a document. Such a method is the provision of confidence levels (4).
Confidence levels are quantified relative indicators of the level of confidence that may be placed in a categorizer's recommendations. Confidence levels are real numbers typically ranging from 0.0 to 1.0 inclusive, with 0.0 indicating lowest confidence and 1.0 indicating greatest confidence. Confidence levels are particularly important in practical applications of text categorization such as routing email or sending automatic responses to email. Applications of this method should make significant use of confidence levels in evaluating possible alternatives related to a categorizer's assignment of categories to a document. However, previous symbolic rule induction systems for text categorization have not provided confidence levels as part of the rules.
Training speed (5) refers to the time it takes for a computer to generate a categorizer from training data.
Finally, insight and control (6) refers to the ability of people to understand and manually modify a text categorizer. This is extremely important in real commercial applications in which enterprises frequently have gaps in the coverage of their training data. Inability to compensate for a gap in data coverage could doom a text-categorization-dependent application, such as routing or automatically responding to email. Approaches used in the prior art for text categorization preclude manual intervention. One corollary to the desire for insight and control is that the justifications for a text categorization system's recommendations should be as simple as possible.
According to the invention, a method of solution fits in the general framework of supervised learning, in which a rule or rules for categorizing data is automatically constructed by a computer on the basis of training data that has been labeled with a predefined set of categories beforehand. More specifically, the method for rule induction involves the novel combination of:
1. inducing from the training data a decision tree for each category;
2. the automated construction from each decision tree of a simplified symbolic rule set that is logically equivalent overall to the decision tree, and which is to be used for categorization instead of the decision tree; and
3. the determination of a confidence level for each rule.
The method covers both decision-tree-based symbolic rule induction and the use for the purpose of document categorization of rules in the logical format of those generated by the rule induction procedure described herein.
A simplified rule set, in the present context, is a relative concept. In other words, a simplified rule set is one that is simpler than the rule set that would be read directly from a decision tree by traversing all the paths from the root to the leaves. Of course, in some cases, an algorithm for computing a simplified rule set may fail to do any actual simplification, particularly if the rule set comes from a very simple decision tree. However, most decision trees induced to categorize text are not so simple and contain dozens or hundreds of tests. It should be noted that an individual rule in a rule set logically equivalent to a decision tree need not necessarily be logically equivalent to a single branch of the decision tree that was the basis for the rule set. The present method for simplifying a rule set takes advantage of this fact, while still producing a rule set equivalent to a decision tree in overall effect.
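For reference, the unsimplified baseline mentioned above, a rule set read directly from a decision tree by traversing its root-to-leaf paths, can be sketched in Python as follows. This is not the simplification procedure of the present method. The tree layout (nested dictionaries with a boolean membership label at each leaf) and the names TREE and paths_to_rules are assumptions made for illustration.

```python
# A node is either a leaf {"label": bool} (membership in the category) or a
# test {"word": str, "threshold": int, "low": node, "high": node}, where "low"
# is followed when count <= threshold and "high" when count > threshold.
TREE = {
    "word": "share", "threshold": 3,
    "low": {"label": False},
    "high": {
        "word": "year", "threshold": 1,
        "low": {
            "word": "acquire", "threshold": 2,
            "low": {"label": False},
            "high": {"label": True},
        },
        "high": {"label": False},
    },
}

def paths_to_rules(node, category, tests=()):
    """Emit one (antecedent, consequent) rule per path ending in a positive leaf."""
    if "label" in node:
        return [(list(tests), category)] if node["label"] else []
    word, threshold = node["word"], node["threshold"]
    return (paths_to_rules(node["low"], category, tests + ((word, "<=", threshold),))
            + paths_to_rules(node["high"], category, tests + ((word, ">", threshold),)))

rules = paths_to_rules(TREE, "acq")
# One rule results: share > 3 and year <= 1 and acquire > 2 -> acq
```

A tree with many positive leaves yields one rule per such leaf, and every rule repeats the tests on the shared upper part of the tree; that redundancy is what a simplification step would aim to reduce.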
The logical format of the rules produced and used by this method is more general than that of other methods, in that the rules may include confidence levels. Thus, the rules produced by the rule induction part of this method are in the format of an antecedent, a consequent, and a confidence level. An example of a rule that can be produced by this method is
share > 3 and year <= 1 and acquire > 2 → acq @0.75
which, under the assumption that stemming is done, may be read as, "if words equivalent to 'share' occur more than three times in the document and words equivalent to 'year' occur at most one time in the document and words equivalent to 'acquire' occur more than two times in the document, then classify the document in the category 'acq', with a confidence level of 0.75."
Certainly for text categorization, the use of this logical format for rules, suitably understood, is novel. To see this, confidence levels are distinguished, as they are used here, from related concepts that have preceded them. First, the present method calls for confidence levels to be computed individually for each rule, and so, since individual rules are not necessarily equivalent to branches of the decision tree from which they were derived, confidence levels will differ from conventional estimates of the probability of category membership corresponding to branches of the decision tree. Second, a confidence level for an individual rule is more fine-grained than, and should not be confused with, an overall estimate of the probability that a particular document belongs to a particular category. The latter concept is connected with all the rules that may apply to a document, and there may be many of them. Although the two concepts are not unconnected, confidence levels for individual rules are more useful because practical actions based on categorization decisions could well make use of the nature of the specific rule that gave rise to a categorization decision and its specific confidence level.
While the rule induction technique of this method will always produce rules with confidence levels, in the course of categorizing documents it could be possible to encounter rules from which the confidence level is missing. This could happen because an automatically generated rule set might have been modified, augmented, or replaced by hand-edited rules. Handwritten supplementary rules, as well as hand-edited replacements for machine-generated rules, might well be missing confidence levels. If rules missing confidence levels are encountered in the course of categorizing documents, those rules will be treated as though they possess some default confidence level. Normally the default confidence level is taken to be 1.0, assuming the range of confidence levels is from 0.0 to 1.0 inclusive.
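The handling of confidence levels described above, including the default for rules that lack one, might be sketched in Python as follows. The rule encoding, the hand-written 'merger' rule, the 'trade' rule, and the names RULES, DEFAULT_CONFIDENCE, and categorize are hypothetical. The sketch reports each fired rule with its own confidence level rather than merging them into one overall estimate, and a document may receive more than one category.

```python
import operator
from collections import Counter

OPS = {">": operator.gt, "<=": operator.le, "=": operator.eq}
DEFAULT_CONFIDENCE = 1.0  # used when a rule carries no confidence level

RULES = [
    {"tests": [("share", ">", 3), ("year", "<=", 1), ("acquire", ">", 2)],
     "category": "acq", "confidence": 0.75},
    # Hypothetical hand-written rule with no confidence level attached.
    {"tests": [("merger", ">", 0)], "category": "acq"},
    {"tests": [("trade", ">", 3)], "category": "trade", "confidence": 0.87},
]

def categorize(counts, rules):
    """Return a (category, confidence) pair for every rule that fires."""
    fired = []
    for rule in rules:
        if all(OPS[op](counts[word], t) for word, op, t in rule["tests"]):
            fired.append((rule["category"],
                          rule.get("confidence", DEFAULT_CONFIDENCE)))
    return fired

counts = Counter("merger trade trade trade trade".split())
assignments = categorize(counts, RULES)
print(assignments)  # [('acq', 1.0), ('trade', 0.87)]
```

An application such as email routing could then compare the reported confidence levels against its own thresholds before acting on an assignment.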
Moreover, the rules produced by this method may involve more complicated features than simply single occurrences of words anywhere in a document. In particular, if the sections of a document in which a feature may occur are deemed significant, then an example of a rule might be
body|trade > 3 and title|trade = 0 → trade @0.87
Under the assumption that stemming is being done, a reading of the last rule is as follows: "If words equivalent to 'trade' occur more than three times in the body section and if no word equivalent to 'trade' occurs in the title section, then the document can be assigned to the category trade with a confidence level of 0.87."
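A section-qualified feature of this kind can be computed as in the following Python sketch, in which each feature name is the section name and the word joined by a vertical bar. The function name section_features and the sample document are assumptions, and stemming is again omitted.

```python
from collections import Counter

def section_features(sections):
    """Count words per section, keyed as 'section|word'."""
    feats = Counter()
    for name, text in sections.items():
        for word in text.lower().split():
            feats[f"{name}|{word}"] += 1
    return feats

# Hypothetical document with a title section and a body section.
doc = {"title": "tariff dispute", "body": "trade trade trade trade partners"}
feats = section_features(doc)

# Antecedent of the example rule: body|trade > 3 and title|trade = 0
rule_fires = feats["body|trade"] > 3 and feats["title|trade"] == 0
print(rule_fires)  # True
```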
Alternatively, if sections of a document are deemed to be significant, but one wishes to have only one feature for a given word, then the numerical value for that feature may be taken to be a weighted combination of the word counts from the different sections. When such a rule is used in document categorization, the same weighted combination of word counts must be used in determining if the rule fires as was used in training when the rule was induced.
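The weighted-combination alternative can be sketched in Python as follows. The weights shown are hypothetical; the source does not specify any values. Keeping a single weight table reflects the requirement that the same combination be used in training and in categorization.

```python
from collections import Counter

# Hypothetical section weights; the same table must be used when the rules
# are induced and when they are later applied.
SECTION_WEIGHTS = {"title": 3.0, "body": 1.0}

def weighted_counts(sections, weights=SECTION_WEIGHTS):
    """One feature per word: the weighted sum of its counts across sections."""
    combined = Counter()
    for name, text in sections.items():
        for word, n in Counter(text.lower().split()).items():
            combined[word] += weights[name] * n
    return combined

doc = {"title": "trade talks", "body": "trade deficit widens as trade slows"}
features = weighted_counts(doc)
print(features["trade"])  # 5.0, i.e. 3.0 * 1 + 1.0 * 2
```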
The unique technique of this method that, for the purpose of text categorization, integrates decision trees, simplified logically equivalent symbolic rule sets, and confidence levels is a mark of novelty.
In an alternative embodiment, the present invention uses a novel algorithm for decision tree induction that is very fast and effective for training on text data as a component of a text categorization system. There is a novel combination of three major innovations in the algorithm for decision tree induction:
1. In growing the tree, advantage is taken of the sparse structure of the data, as is the case generally for text data.
2. Also in growing the tree, modified entropy is used in the definition of the cost function that measures the impurity of a split (a key computation that guides decision tree induction).
3. Tree smoothing is used to prune the decision tree.