1. Field of the Invention
The present invention is related to the field of data mining, more particularly to classifier models and supervised methods of learning same from previous observations, for the purpose of predicting a class or category of newly encountered observations.
2. Discussion of the Prior Art
One problem in data mining is deviation detection. Deviation detection includes modeling rarely occurring phenomena or examples in data. Current work in deviation detection focuses on building classifier models for normal examples and then detecting deviations from the model.
There are many situations wherein rare deviant events may occur, for example, in network intrusion detection or fraud detection. In network intrusion detection, for attacks of type remote-to-local (r2l) on a computer made remotely by guessing a password or opening a ftp data connection, the successful incidents of attack would be rare as compared to the total number of r2l connections.
Classification is a supervised technique in data mining that learns models from labeled data (known data). Input to the classification problem is a set of observations from the real world that are recorded as a set of records, each characterized by multiple attributes. Associated with each record is a categorical attribute called a class. Given a training set of records with known class labels, one problem is to learn a model for the class in terms of other attributes.
In prior art systems, the objective has been to maximize accuracy, thus, minimizing the number of misclassifications. However, for rare classes, a mere accuracy metric is not sufficient. Consider evaluation data in which 1% of the examples belong to a rare target class; if the classifier predicts that every example belongs to a class other than the target class, then it achieves an accuracy of 99%, which is acceptable for many existing techniques. However, from the perspective of predicting rare target class examples, its performance is unacceptable because it may not have predicted any rare target examples correctly. Existing methods of learning rule-based models may fail to learn complete and precise models for rare target classes.
Disjunctive Normal Form (DNF) is the most general form of a rule-based model. The model for a given class C, is of the form R: D1 OR D2 . . . OR Dnxe2x86x92C, where each Di is called a disjunct, and is formed by conjunction of multiple conditions on attributes of the record, such as D1:(a1=P) AND (a2 less than 10), where a1 and a2 are attributes of unordered and ordered types, respectively, and P is one of the specific values taken by a1. There are two broad categories of methods of learning a DNF model: specific-to-general and general-to-specific. Specific-to-general techniques start with each record as the most specific rule, and progressively generalize the rule-set by merging or removing individual conditions and rules. However, these techniques are not usually suitable for problems with large high-dimensional data-sets because their time complexity scales poorly (e.g., quadratic in training set size and cubic in number of attributes).
General-to-specific techniques begin with the most general rule, an empty rule, and progressively add specific conditions to it. General-to-specific techniques found widespread use because of their competitive performance and computational tractability in practice. General-to-specific techniques use a method of sequential covering that iteratively discovers multiple disjuncts (D1 then D2 then D3, and so on). Every time a rule is learned, the examples supported by it are removed before the next iteration. Thus, after D1 is discovered, records where its antecedent is true are removed.
Existing techniques strive for the accuracy of each disjunct with respect to the predicted target class. Each disjunct is expected to cover a disjoint signature of the target class. If all signatures of the target class are pure, such that each signature covers very few negative examples (i.e. examples not of the target class), then this approach may produce satisfactory results.
However, there are several situations in which a sequential covering method of discovering high accuracy disjuncts may fail. For example, when the target class signature is composed of two components, presence of the target class and absence of the non-target-class, and the later component is not correctly or completely learned. This can happen, especially for rare classes, when a signature for the presence of the class is inherently impure by itself. For example, in intrusion detection, for a rare attack type r2l , a signature for the presence of an example might be connection_type=ftp. However, this will also cover ftp connections made to flood the computer in a denial-of-service (dos) attack. Thus, the rule for r2l has to be refined by detecting signatures for the absence of dos attacks.
In many existing techniques, accuracy constraints cause each rule to be refined immediately by adding more conjunctive conditions that detect a second component. However, individual rules may not cover a sufficient number of negative examples needed for learning desirable signatures, leading to splintered false positives.
Another problem in existing sequential-covering methods is small disjuncts, in which are rules that cover a small number of target class examples, typically less than about 1 or 2 percent of the target class, are more prone to generalization error than rules covering a larger number of such examples. For prevalent classes, such rules usually arise in the later iterations of the method, because the remainder dataset includes a small number of target class examples to learn from. However, for rare classes, such rules may appear very early in the method because the total number of target class examples is relatively small.
Small disjuncts are manifested in existing techniques, because of the constraints on the accuracy of each rule. The constraints result in low overall support for the rule. From statistical point of view, decisions made with small evidential support are often unreliable.
Therefore a need exists for a system and method to overcome splintered false positives and small disjuncts, having the ability to learn complete and precise classifier models for target classes including rare examples.
According to an embodiment of the present invention, a program storage device is provided readable by machine, tangibly embodying a program of instructions executable by the machine to perform method steps for predicting a target class within a dataset. The method includes predicting the presence of a plurality of examples of the target class, and predicting the absence of the target class among the plurality of examples predicted to have presence of the target class.
The method further includes weighing an effect of each absence prediction on each presence prediction.
The target class includes less than three percent of the examples in the dataset. The target class includes less than two percent of the examples in the dataset.
The method includes evaluating the presence prediction and the absence prediction according to the equation:
F=2*recall*precision/(recall+precision)
wherein F is a balance between recall and precision. The recall is a predefined fraction of the examples among all target class examples. The precision is a predefined fraction of correctly predicted target class examples among all predicted examples.
According to an embodiment of the present invention, a method is provided for learning a classifier model which determines examples of a target class in a dataset. The method includes learning a plurality of positive rules supporting a plurality of examples of the target class, learning a plurality of negative rules removing a plurality of false positive examples among the examples supported by the positive rules, and weighing an effect of each negative rule on each positive rule.
The presence prediction achieves at least a predefined recall and the absence prediction achieves at least a predefined precision. Achieving the desired precision includes collecting the examples predicted by the presence prediction, andpredicting a false positive example among the examples predicted by the presence prediction.
The positive rules are learned iteratively, the positive rules satisfying a predefined criterion for a parameter of the classifier model. The parameter is one of MinSupFractionP, MinCoverageP, and MinAccuraryP.
Learning the negative rules further includes determining a contribution of each negative rule, and comparing each contribution to a predefined description length, upon determining that the contribution is greater than the description length, ending the learning of negative rules.
A weight of each effect corresponds to a probability of a given supported example belonging to the target class. A negative rule/positive rule combination having a low weight is ignored by the classifier model.
According to an embodiment of the present invention, a method for learning a classifier model is provided which predicts the presence of a target class in a dataset. The method includes learning a plurality of P-Rules supporting a plurality of examples of the target class, learning a plurality of N-Rules removing a plurality of false positive examples among the examples supported by the P-Rules, and assigning a probabilistic score to each N-Rule/P-Rule combination.
Learning P-Rules further includes learning a plurality of P-Rules individual supporting at least a first predefined number of examples and collectively supporting at least a second predefined number of examples, learning a plurality of P-Rules having at least a predefined accuracy upon determining that the second predefined number of examples are supported, and learning the N-Rules upon determining that a subsequent P-Rule has an accuracy less than a predefined accuracy.
Learning the negative rules includes determining a cost for an N-Rule, and comparing the cost to a predefined description length, upon determining that the cost is greater than the description length, ending the learning of N-Rules.
Learning the P-Rules is via sequential covering. Learning the N-Rules is via sequential covering.