1. Field of the Invention
The present invention generally relates to mining database association rules, and more particularly to an interactive process of mining the most important database association rules.
2. Description of the Related Art
A data-set is a finite set of records. In the present invention, a record is simply an element on which boolean predicates (e.g., conditions) are applied. A rule consists of two conditions (e.g., antecedent and consequent) and is denoted as A→C where A is the antecedent and C is the consequent. A rule constraint is a boolean predicate on a rule. Given a set of constraints N, it is said that a rule r satisfies the constraints in N if every constraint in N evaluates to true given r. Some common examples of constraints are item constraints and minimums on support and confidence (e.g., see Srikant et al., 1997, Mining Association Rules With Item Constraints, In Proceedings of the Third International Conference On Knowledge Discovery in Databases and Data Mining, 67-73; Agrawal et al. 1993, Mining Associations between Sets of Items in Massive Databases, In Proceedings of the 1993 ACM-SIGMOD International Conference on Management of Data, 207-216, all incorporated herein by reference). The input to the problem of mining optimized rules is a 5-tuple ⟨U, D, ≤, C, N⟩, where U is a finite set of conditions, D is a data-set, ≤ is a total order on rules, C is a condition specifying the rule consequent, and N is a set of constraints on rules.
For example, as discussed in Agrawal et al. 1993, supra, using a supermarket with a large collection of items as an example, typical business decisions that the management of the supermarket has to make include what to put on sale, how to design coupons, how to place merchandise on shelves in order to maximize the profit, etc. Analysis of past transaction data is a commonly used approach in order to improve the quality of such decisions. Until recently, however, only global data about the cumulative sales during some time period (a day, a week, a month, etc.) was available on the computer. Progress in bar-code technology has made it possible to store the so-called basket data that stores items purchased on a per-transaction basis. Basket data type transactions do not necessarily consist of items bought together at the same point of time. They may consist of items bought by a customer over a period of time. Examples include monthly purchases by members of a book club or a music club.
Several organizations have collected massive amounts of such data. These data sets are usually stored on tertiary storage and are very slowly migrating to database systems. One of the main reasons for the limited success of database systems in this area is that conventional database systems do not provide the necessary functionality for a user interested in taking advantage of this information.
Thus, the large collection of basket data type transactions is mined for association rules between sets of items with some minimum specified confidence. An example of such an association rule is the statement that 90% of transactions that purchase bread and butter also purchase milk. The antecedent of this rule consists of bread and butter and the consequent consists of milk alone. The number 90% is the confidence factor of the rule.
Rule mining enhances databases with functionalities to process queries such as finding all rules that have a specific product as consequent (these rules may help plan what the store should do to boost the sale of the specific product, or may help shelf planning by determining if the sale of items on shelf A is related to the sale of items on shelf B), and finding the best rule that has "bagels" in the consequent ("best" can be formulated in terms of the confidence factors of the rules, or in terms of their support, i.e., the fraction of transactions satisfying the rule).
For example, let I = {I1, I2, ..., Im} be a set of binary attributes, called items. Let T be a database of transactions. Each transaction t is represented as a binary vector, with t[k]=1 if t bought the item Ik, and t[k]=0 otherwise. There is one tuple in the database for each transaction. Let X be a set of some items in I. We say that a transaction t satisfies X if for all items Ik in X, t[k]=1.
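The binary-vector representation above can be illustrated with a minimal Python sketch. The item indices (0 for bread, 1 for butter, 2 for milk), the sample database `T`, and the function name `satisfies` are assumptions made for illustration only and do not appear in the cited references.

```python
from typing import Sequence, Set


def satisfies(t: Sequence[int], X: Set[int]) -> bool:
    """Return True if transaction t has t[k] == 1 for every item index k in X."""
    return all(t[k] == 1 for k in X)


# Hypothetical items: index 0 = bread, 1 = butter, 2 = milk.
T = [
    [1, 1, 1],  # bread, butter, and milk
    [1, 1, 0],  # bread and butter only
    [0, 1, 1],  # butter and milk only
]

# Transactions that satisfy the itemset X = {bread, butter}.
matching = [t for t in T if satisfies(t, {0, 1})]
```

In this sketch an empty itemset is satisfied by every transaction, since there is no item whose entry could be 0.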
An association rule is an implication of the form X→Ij, where X is a set of some items in I, and Ij is a single item in I that is not present in X. The rule X→Ij is satisfied in the set of transactions T with the confidence factor 0≤c≤1 if at least c% of transactions in T that satisfy X also satisfy Ij. The notation X→Ij | c is used to specify that the rule X→Ij has a confidence factor of c.
Given the set of transactions T, the conventional process is interested in generating all rules that satisfy certain additional constraints of two different forms. The first form is syntactic constraints. These constraints involve restrictions on items that can appear in a rule. For example, the only interesting rules may be the rules that have a specific item Ix appearing in the consequent. Combinations of the above constraints are also possible, and all rules that have items from some predefined itemset X appearing in the consequent, and items from some other itemset Y appearing in the antecedent may be requested.
The second form is support constraints. These constraints concern the number of transactions in T that support a rule. The support for a rule is defined to be the fraction of transactions in T that satisfy the union of items in the consequent and antecedent of the rule.
Support should not be confused with confidence. The confidence of a rule is the probability with which the consequent evaluates to true given that the antecedent evaluates to true in the input data-set, which is computed as follows:

$$\mathrm{conf}(A \rightarrow C) = \frac{\mathrm{sup}(A \rightarrow C)}{\mathrm{sup}(A)} \tag{1}$$

While confidence is a measure of the rule's strength, support corresponds to statistical significance. Besides statistical significance, another motivation for support constraints comes from the fact that usually the only interesting rules are the ones with support above some minimum threshold for business reasons. If the support is not large enough, it means that the rule is not worth consideration or that it is simply less preferred (it may be considered later).
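As a worked illustration of the distinction between support and confidence, the following minimal Python sketch computes both for the bread-and-butter rule over a small hypothetical set of transactions. The data and the function names `support` and `confidence` are assumptions for illustration. Returning 0.0 when the antecedent never occurs is a convention chosen here, not something the cited references prescribe.

```python
from typing import FrozenSet, List

Transaction = FrozenSet[str]


def support(transactions: List[Transaction], items: FrozenSet[str]) -> float:
    """Fraction of transactions that contain every item in `items`."""
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def confidence(transactions: List[Transaction],
               antecedent: FrozenSet[str],
               consequent: FrozenSet[str]) -> float:
    """conf(A -> C) = sup(A -> C) / sup(A); 0.0 if A never occurs (assumed convention)."""
    sup_a = support(transactions, antecedent)
    if sup_a == 0:
        return 0.0
    return support(transactions, antecedent | consequent) / sup_a


# Hypothetical basket data.
baskets = [
    frozenset({"bread", "butter", "milk"}),
    frozenset({"bread", "butter", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"milk"}),
]

A = frozenset({"bread", "butter"})
C = frozenset({"milk"})

rule_support = support(baskets, A | C)     # 2 of 4 transactions
rule_confidence = confidence(baskets, A, C)  # 2 of the 3 transactions containing A
```

Here the rule is supported by half of all transactions, while two of the three transactions that contain bread and butter also contain milk.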
When mining an optimal conjunction, a set of conditions A⊂U is treated as a condition itself that evaluates to true if and only if all of the conditions within A evaluate to true on the given record. When mining an optimal disjunction, a set of conditions A⊂U is treated as a condition itself that evaluates to true if and only if one or more of the conditions within A evaluates to true on the given record. For both cases, if A is empty then the set of conditions A⊂U always evaluates to true. Algorithms for mining optimal conjunctions and disjunctions differ significantly in their details, but the problem can be formally stated in an identical manner. More specifically, the optimized rule mining problem may be formally stated as finding a set A1⊂U such that A1 satisfies the input constraints, and there exists no set A2⊂U such that A2 satisfies the input constraints and A1 < A2. Algorithms for mining optimal disjunctions typically allow a single fixed conjunctive condition without complications (see Rastogi et al. 1998, Mining Optimized Association Rules with Categorical and Numeric Attributes, In Proceedings of the 14th International Conference on Data Engineering, 503-512, incorporated herein by reference).
Any rule A→C whose antecedent is a solution to an instance I of the optimized rule mining problem is said to be I-optimal (or just optimal if the instance is clear from the context). For simplicity, rule antecedents (denoted with A and possibly some subscript) and rules (denoted with r and possibly some subscript) are sometimes treated interchangeably since the consequent is always fixed and clear from the context.
The support and confidence values of rules are often used to define rule constraints by requiring them to be at or above pre-specified values known as minsup and minconf, respectively (see Agrawal et al., 1996, Fast Discovery of Association Rules, In Advances in Knowledge Discovery and Data Mining, AAAI Press, 307-328), and also to define total orders for optimization (see Fukuda et al. 1996, Data Mining Using Two-Dimensional Optimized Association Rules: Scheme, Algorithms, and Visualization, In Proceedings of the 1996 ACM-SIGMOD International Conference on the Management of Data, 13-23, incorporated herein by reference, and Srikant et al., supra). The support of a condition A is equal to the number of records in the data-set for which A evaluates to true, and this value is denoted as sup(A). The support of a rule A→C, denoted similarly as sup(A→C), is equal to the number of records in the data-set for which both A and C evaluate to true. The antecedent support of a rule is the support of its antecedent alone. The confidence of a rule is the probability with which the consequent evaluates to true given that the antecedent evaluates to true in the input data-set, which is computed as follows:

$$\mathrm{conf}(A \rightarrow C) = \frac{\mathrm{sup}(A \rightarrow C)}{\mathrm{sup}(A)}$$
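To make the problem statement concrete, the following Python sketch exhaustively searches for an optimal conjunction over a tiny hypothetical instance, assuming that rules are ordered by confidence and that the only constraint is a minimum rule support counted in records. The data-set, the condition names, and the function `best_conjunction` are all assumptions for illustration. Exhaustive enumeration visits every subset of U and is not one of the algorithms in the cited references; it is shown only as a definition-level baseline.

```python
from itertools import combinations
from typing import Callable, Dict, List, Optional, Tuple

Record = Dict[str, object]
Condition = Callable[[Record], bool]


def best_conjunction(U: Dict[str, Condition],
                     D: List[Record],
                     C: Condition,
                     minsup: int) -> Tuple[Optional[Tuple[str, ...]], float]:
    """Return the antecedent (as condition names) with the highest confidence
    among those whose rule support, counted in records, is at least minsup."""
    best_names: Optional[Tuple[str, ...]] = None
    best_conf = -1.0
    names = sorted(U)
    for size in range(len(names) + 1):
        for subset in combinations(names, size):
            # A conjunction holds when all its conditions hold; the empty set always holds.
            covered = [r for r in D if all(U[n](r) for n in subset)]
            sup_a = len(covered)                        # sup(A)
            sup_rule = sum(1 for r in covered if C(r))  # sup(A -> C)
            if sup_a == 0 or sup_rule < minsup:
                continue                                # constraint not satisfied
            conf = sup_rule / sup_a
            if conf > best_conf:                        # ties keep the first antecedent found
                best_names, best_conf = subset, conf
    return best_names, best_conf


# Hypothetical data-set and conditions.
D = [
    {"age": "young", "income": "high", "buys": True},
    {"age": "young", "income": "high", "buys": True},
    {"age": "young", "income": "low", "buys": False},
    {"age": "old", "income": "high", "buys": True},
    {"age": "old", "income": "high", "buys": False},
    {"age": "old", "income": "low", "buys": False},
]
U = {
    "age=young": lambda r: r["age"] == "young",
    "income=high": lambda r: r["income"] == "high",
}
C = lambda r: bool(r["buys"])

loose = best_conjunction(U, D, C, minsup=2)
strict = best_conjunction(U, D, C, minsup=3)
```

With a minimum support of 2 the sketch selects the two-condition antecedent, which has confidence 1.0; raising the minimum to 3 excludes it and leaves the single condition on income, with confidence 0.75.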
Many previously proposed algorithms for optimized rule mining solve specific restrictions of the optimized rule mining problem. For example, Webb, 1995, OPUS: An Efficient Admissible Algorithm for Unordered Search, Journal of Artificial Intelligence Research, 3:431-465 (incorporated herein by reference), provides an algorithm for mining an optimized conjunction under some restrictions. The restrictions are that U contains an existence test for each attribute/value pair appearing in a categorical data-set outside a designated class column, ≤ orders rules according to their Laplace value, and N is empty.
Fukuda et al., supra, provides algorithms for mining an optimized disjunction. The algorithms are such that U contains a membership test for each square of a grid formed by discretizing two pre-specified numerical attributes of a data-set (a record is a member of a square if its attribute values fall within the respective ranges), ≤ orders rules according to either confidence, antecedent support, or a notion called gain, and N includes minimums on support or confidence, and includes one of several possible "geometry constraints" that restrict the allowed shape formed by the represented set of grid squares.
Rastogi et al., supra, look at the problem of mining an optimized disjunction where U includes a membership test for every possible hypercube defined by a pre-specified set of record attributes with either ordered or categorical domains, ≤ orders rules according to antecedent support or confidence, and N includes minimums on antecedent support or confidence, a maximum k on the number of conditions allowed in the antecedent of a rule, and a requirement that the hypercubes corresponding to the condition of a rule are non-overlapping.
In general, the optimized rule mining problem, whether conjunctive or disjunctive, is NP-hard (see Morishita, S. 1998, On Classification and Regression, In Proceedings of the First International Conference on Discovery Science, Lecture Notes in Artificial Intelligence 1532:40-57, incorporated herein by reference). However, features of a specific instance of this problem can often be exploited to achieve tractability. For example, in Fukuda et al., supra, the geometry constraints are used to develop low-order polynomial time algorithms. Even in cases where tractability is not guaranteed, efficient mining in practice has been demonstrated (see Nakaya et al. 1999, Fast Parallel Search for Correlated Association Rules, unpublished manuscript, incorporated herein by reference, Rastogi et al., supra, and Webb 1995, supra). The theoretical contributions in this invention are conjunction/disjunction neutral. However, the conjunctive case is the focus in validating the practicality of these results through empirical evaluation.
It is, therefore, an object of the present invention to provide a structure and method for identifying database association rules which includes mining first database association rules, the first database association rules having ratings with respect to a plurality of metrics, selecting second database association rules from the first database association rules, each of the second database association rules having a highest rating with respect to a different metric of the metrics, and interactively changing the metrics and repeating the selecting to identify most important ones of the database association rules for a given set of metrics. The mining produces a partial order and identifies upper and lower support-confidence borders of the database association rules. All of the database association rules fall within the upper and lower support-confidence borders and only rules at the upper and lower support-confidence borders are optimal rules.
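The idea of an upper support-confidence border can be sketched in Python under a stated assumption: one rule is taken to rank above another when it is at least as good on both support and confidence and strictly better on at least one. This dominance test, the `Rule` tuple, the sample values, and the function names are all hypothetical; the summary above does not define the partial order, so this is an illustration of the general idea rather than the claimed method.

```python
from typing import List, NamedTuple


class Rule(NamedTuple):
    name: str
    sup: int      # rule support, counted in records
    conf: float   # rule confidence


def dominates(a: Rule, b: Rule) -> bool:
    """Assumed ordering: a is no worse than b on both measures and better on one."""
    return (a.sup >= b.sup and a.conf >= b.conf
            and (a.sup > b.sup or a.conf > b.conf))


def upper_border(rules: List[Rule]) -> List[Rule]:
    """Rules that no other rule dominates under the assumed ordering."""
    return [r for r in rules if not any(dominates(o, r) for o in rules)]


# Hypothetical mined rules with a fixed consequent.
mined = [
    Rule("r1", sup=10, conf=0.50),
    Rule("r2", sup=6, conf=0.80),
    Rule("r3", sup=3, conf=0.95),
    Rule("r4", sup=5, conf=0.70),  # dominated by r2
    Rule("r5", sup=2, conf=0.90),  # dominated by r3
]

border_names = [r.name for r in upper_border(mined)]
```

Under this assumption, r4 and r5 fall below the border because another rule beats each of them on both support and confidence, while r1, r2, and r3 each trade one measure for the other.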
An embodiment of the invention is a method of mining most interesting rules from a database which includes generating a set of maximally general and maximally predictive rules from the database, specifying metrics and population constraints to a query engine, and selecting most interesting rules from the set of maximally general and maximally predictive rules based on the metrics. The generating produces a partial order of the database association rules and identifies upper and lower support-confidence borders of the database association rules. The set of maximally general and maximally predictive rules falls within the upper and lower support-confidence borders and only rules at the upper and lower support-confidence borders are optimal rules. The rules are optimal rules from different equivalence classes, and comply with a fixed consequent. The metrics include at least one of confidence, support gain, variance, chi-squared value, entropy gain, gini, laplace, lift, and conviction.
Another embodiment of the invention is a process for identifying database association rules which includes mining first database association rules (the first database association rules have ratings with respect to a plurality of metrics and population constraints), selecting second database association rules from the first database association rules (each of the second database association rules having a highest rating with respect to a different metric of the metrics), interactively changing the metrics and the population constraints, and repeating the selecting to identify most important ones of the database association rules for a given set of metrics. The mining produces a partial order of the database association rules and identifies upper and lower support-confidence borders of the database association rules. The database association rules fall within the upper and lower support-confidence borders and only rules at the upper and lower support-confidence borders are optimal rules. The second database association rules include maximally general and maximally predictive rules from the database and a plurality of optimal rules from different equivalence classes. The first database association rules comply with at least one consequent. The metrics include at least one of confidence, support gain, variance, chi-squared value, entropy gain, gini, laplace, lift, and conviction.
Yet another embodiment of the invention is a system for mining optimal association rules which includes an engine mining first database association rules (the first database association rules have ratings with respect to a plurality of metrics and population constraints) and a query engine selecting second database association rules from the first database association rules. Each of the second database association rules has a highest rating with respect to a different metric of the metrics. The query engine interactively changes the metrics and population constraints and identifies the most important ones of the database association rules for a given set of metrics. The query engine produces a partial order of the database association rules and identifies upper and lower support-confidence borders of the database association rules. The database association rules fall within the upper and lower support-confidence borders and only rules at the upper and lower support-confidence borders are optimal rules. The second database association rules include maximally general and maximally predictive rules from the database and a plurality of optimal rules from different equivalence classes. The first database association rules comply with at least one consequent. The metrics include at least one of confidence, support gain, variance, chi-squared value, entropy gain, gini, laplace, lift, and conviction.