Automated pattern recognition is well known in the art, in a variety of applications. Common applications of pattern recognition include image analysis, speech recognition, and predicting unknown fields for records in a database. Typically, a template or a collection of rules is determined, which are believed to constitute a pattern that is characteristic of a certain class. Items in the set are then evaluated to determine how closely they fit the pattern. A close fit indicates a high probability that the item being evaluated falls within the class. Thus, a face may be found to belong to a certain individual; or a spoken sound may be found to correspond to a certain word; or a bank customer may be predicted to be a good or bad credit risk.
In order to build the template or rules, "data mining" of a training database is frequently used. The training database is selected and is assumed to be a real and representative sample of the overall population. The training database generally contains variables (fields, representing various attributes of the items) of different types, among which a field is selected as the "Field to Predict" (also referred to in the database art as the "output", or "result", or "dependent" variable). The training database may be represented as a temporary file of codes of attribute values, given by the matrix: EQU A={x.sub.1, . . . , x.sub.n,y}.sub.1.sup.N (1)
consisting of N records, where a record is a vector of previously encoded values of n input (predicting) fields x.sub.1, . . . , x.sub.n and of the output (predicted) field y. The purpose is to discover all essential regularities within the investigated file A, thus enabling one to predict the unknown value of output field y based on the known values of variables x.sub.1, . . . , x.sub.n. The same encoding methods may be applied when the rules derived from the training database are applied to the rest of the population.
Typically, the Field to Predict is Boolean, i.e., the output field can be specified as y .epsilon.{0,1}. It will be understood, however, that similar methods may be applied, mutatis mutandis, when the output field has a wider range of possible values.
The accuracy of prediction of the value of y may be verified by testing on the training database. However, such testing may not be sufficient, since there exists the problem of "overfitting," wherein regularities discovered on the training database turn out to be found by chance, and are therefore not valid on other samples from the overall population. Thus, the purpose of data mining is to discover regularities within the training database which possess the property of likely stability of their validity on the whole population.
Regularities of this sort are sought within the training database in the form of association rules. Methods of deriving such association rules are described, for example, by Agrawal, et al., in "Fast Discovery of Association Rules," in Advances in Kowledge Discovery and Data Mining (AAAI Press/MIT Press, 1996), pages 307-328; and by Zaki, et al., in "New Algorithms for Fast Discovery of Association Rules," in Proc. 3rd Int. Conf. KDD (California), pages 283-286. These publications are incorporated herein by reference.
A formal definition of an association rule is as follows: Let I.sub.i ={a.sub.i1, . . . , a.sub.im.sub..sub.i } be the set of codes of values of a variable x.sub.i. A single condition (referred to herein as a 1-condition) is a condition of the type: x.sub.i =a.sub.ij, j .epsilon.{1, . . . , m.sub.i }. A composite condition is a conjunction of q single conditions, wherein q=2, . . . , n, referred to herein as a q-condition. Thus, a q-condition is a condition of the type: EQU (x.sub.i.sub..sub.1 =a.sub.i.sub..sub.1 .sub.j.sub.1) (x.sub.i.sub..sub.2 =a.sub.i.sub..sub.2 .sub.j.sub..sub.2 ) . . . (x.sub.i.sub..sub.q =a.sub.i.sub..sub.q .sub.j.sub..sub.q ) (2)
Association rules are "if-then" rules, wherein the condition is the single or conjunctive expression, as in equation (2). Such rules are referred to in the context of the present patent application and in the claims as "simple association rules." Thus, a simple association rule is a statement of the following type:
if (q-condition) then y=y.sub.1 (3)
The possible values of y.sub.1 are y.sub.1 =1 or y.sub.1 =0. The number s of records at which both the condition of the rule (the q-condition) and the rule's conclusion (y=y.sub.1) are fulfilled is referred to as the support, s, of the rule. The probability p of the rule is the probability that for a random record satisfying the rule's condition, the rule's conclusion is also fulfilled. Hence ##EQU1##
wherein m is the number of records satisfying the rule's condition.
Let p.sub.a be the a priori probability that the Field to Predict y possesses the predicted value 1, that is, ##EQU2##
where r is the number of records in the training database at which y=1. As mentioned above, the training database is assumed to be a real and representative sample of the investigated population. Hence, it can be assumed that the same a priori probability p.sub.a is valid on the overall population.
Which rules are interesting? In other words, about which rules can it be said that they were discovered not by chance, and are likely to be valid on the overall population? It can be assumed that these are rules fulfilled at a sufficiently large number of records and whose probability significantly deviates from p.sub.a. A formal statement for this intuitive notion is as follows: A user specifies a minimum support S.sub.min, and minimum admissible probabilities for a rule with y=1 (denoted by p.sub.1) and with y=0 (denoted by p.sub.0), 1-p.sub.0 &lt;p.sub.a &lt;p.sub.1. The objective of the data mining is then to determine association rules (rules of the type of equation (2)) for which s.gtoreq.S.sub.min and p.gtoreq.p.sub.1 (if y=1), or p.gtoreq.p.sub.0 (if y=0). Methods of data mining known in the art do not necessarily find all such rules exhaustively on the training database, and furthermore tend to require very substantial computing resources.
The rule defined by equation (3) can be expressed as a statement of conditional probability: EQU P(y=y.sub.1.vertline.(the q-condition))=p (4)
However, unlike equation (3), the number of records at which this statement is fulfilled (support s) is absent in equation (4). Therefore, association rules in a sense contain more information than conditional probabilities, which are applied in Bayes methods. Such methods are described, for example, by Friedman, in "On Bias, Variance, 0/1--Loss, and the Curse-of-Dimensionality," in Data Mining and Knowledge Discovery, 1(1), pages 54-77; and by Heckerman, in "Bayesian Networks for Data Mining," in Data Mining and Knowledge Discovery, 1(1), pages 79-119. These publications are incorporated herein by reference.
As mentioned above, the object of deriving association rules from the training database is to predict accurately the Field to Predict value for data outside the training database. To predict the value of output field y for a given record, all relevant association rules are selected, and using the respective probabilities of these rules, the conclusive probability that the Field to Predict possesses the predicted value is calculated. Such prediction gives especially accurate results when a great number of "strong" association rules (rules with a high probability and support) are found, and many of them are independent or nearly independent. Methods of finding strong rules are described, for example, by Piatetsky-Shapiro, in "Discovery, Analysis, and Presentation of Strong Rules," in Knowledge Discovery in Databases (Menlo Park, Calif.: AAAI Press), pages 229-248, which is incorporated herein by reference. These methods apply to finding simple association rules, as defined by expression (2), above.
The number of simple association rules determined is usually large and may even exceed (sometimes considerably) the number of records in the training database. Therefore, the average number of relevant rules for each record may also be large. Among the relevant rules there are often a large number of "weak" rules caused by "noise" --random data fluctuations--which complicate prediction. On the other hand, if only "strong" rules are allowed, there will be many records to which no relevant rules apply.