This invention relates to data analytics and modeling and, in particular, to the mining of association rules for items in a database.
In data mining, the association rules model is a popular and important technique for discovering interesting relationships between items in large databases. One application of association rules is discovering patterns of co-occurrence of products in large-scale transaction data recorded by point-of-sale systems in supermarkets or online stores in order to increase sales. For example, the rule {bread, potatoes}=>{butter} found in the sales data of a supermarket would indicate that if a customer buys bread and potatoes together, he or she is likely to also buy butter. Such information can be used as the basis for decisions about marketing activities such as promotional pricing or product placement. Other applications of association rules analysis are the extraction of important patterns in web usage or bioinformatics.
Generally, association rule mining has two main parts: (1) finding frequent itemsets with support at or above a minimum support; and (2) creating association rules from the frequent itemsets, using a minimum confidence. Association rule mining is defined as follows. Let I={i1, i2, . . . , in} be a set of items. A subset of I is called an itemset. A rule is defined as an implication of the form X⇒Y where X,Y⊂I and X∩Y=∅. X and Y are called the “antecedent” (left hand side) and the “consequent” (right hand side) of the rule, respectively. The “support” sup(X) of an itemset X is defined as the proportion of transactions in the data set which contain the itemset. The “confidence” of a rule is defined as conf(X⇒Y)=sup(X∪Y)/sup(X), where sup(X∪Y) means “support for occurrences of transactions where X and Y both appear”. Typically, “minimum support” and “minimum confidence” are the main criteria specified for building association rules.
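By way of illustration only, the support and confidence definitions above may be sketched in Python as follows. The function names `support` and `confidence` and the sample `TRANSACTIONS` data are assumptions introduced for this example and are not taken from any particular implementation.

```python
from typing import FrozenSet, Iterable, List


def support(itemset: Iterable[str], transactions: List[FrozenSet[str]]) -> float:
    """Proportion of transactions that contain every item of `itemset`."""
    items = frozenset(itemset)
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def confidence(antecedent: Iterable[str], consequent: Iterable[str],
               transactions: List[FrozenSet[str]]) -> float:
    """conf(X => Y) = sup(X u Y) / sup(X); returns 0.0 when X never occurs."""
    x = frozenset(antecedent)
    y = frozenset(consequent)
    sup_x = support(x, transactions)
    if sup_x == 0:
        return 0.0
    return support(x | y, transactions) / sup_x


# Hypothetical point-of-sale data: one frozenset of items per transaction.
TRANSACTIONS = [
    frozenset({"bread", "potatoes", "butter"}),
    frozenset({"bread", "potatoes", "butter", "milk"}),
    frozenset({"bread", "potatoes"}),
    frozenset({"bread", "milk"}),
    frozenset({"milk", "butter"}),
]

sup_x = support({"bread", "potatoes"}, TRANSACTIONS)
conf_rule = confidence({"bread", "potatoes"}, {"butter"}, TRANSACTIONS)
```

In this sample data, {bread, potatoes} appears in three of five transactions and {bread, potatoes, butter} in two of five, so the rule {bread, potatoes}=>{butter} has a confidence of two thirds.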
The Apriori algorithm is a well-known algorithm for finding frequent itemsets. This algorithm uses the fact that all subsets of a frequent itemset are also frequent. It is an iterative method that generates candidate (k+1)-itemsets from the frequent k-itemsets and then counts those candidate itemsets to find their support values and select the frequent ones. Every “layer search” at level k scans the transaction table once to count the absolute support of the candidate k-itemsets. The infrequent k-itemsets (i.e., those having supports lower than the specified threshold) are then removed, and the remaining itemsets are the frequent k-itemsets. The candidate (k+1)-itemsets are then created from the frequent k-itemsets, and the search at level k+1 starts. The algorithm stops when no candidate itemsets for the next level can be created or when a maximum rule size is reached. This algorithm, however, requires many data passes, one for each level. A map-reduce framework may be used with the Apriori algorithm to improve its implementation. Map-reduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster. Iterative map-reduce jobs may be performed to find frequent itemsets.
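The level-wise search described above may be sketched, for illustration only, as a single-process Python function. The name `apriori`, its parameters, and the sample data are assumptions made for this example; the sketch counts candidates in memory and does not use a map-reduce framework.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Optional


def apriori(transactions: List[FrozenSet[str]], min_support: float,
            max_size: Optional[int] = None) -> Dict[FrozenSet[str], float]:
    """Return every frequent itemset mapped to its (relative) support."""
    n = len(transactions)
    # Level 1 candidates: every distinct item as a 1-itemset.
    candidates = {frozenset({item}) for t in transactions for item in t}
    frequent: Dict[FrozenSet[str], float] = {}
    k = 1
    while candidates and (max_size is None or k <= max_size):
        # One pass over the transactions counts all level-k candidates.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        # Remove infrequent k-itemsets; the rest are the frequent k-itemsets.
        level = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(level)
        # Join frequent k-itemsets into (k+1)-candidates and prune any
        # candidate that has an infrequent k-subset.
        candidates = set()
        for a, b in combinations(level, 2):
            union = a | b
            if len(union) == k + 1 and all(
                frozenset(sub) in level for sub in combinations(union, k)
            ):
                candidates.add(union)
        k += 1
    return frequent


# Hypothetical transaction data for demonstration.
SAMPLE = [
    frozenset({"bread", "potatoes", "butter"}),
    frozenset({"bread", "potatoes", "butter", "milk"}),
    frozenset({"bread", "potatoes"}),
    frozenset({"bread", "milk"}),
    frozenset({"milk", "butter"}),
]

frequent_itemsets = apriori(SAMPLE, min_support=0.3)
```

Each iteration of the `while` loop corresponds to one layer search and therefore to one full scan of the transactions, which is the source of the repeated data passes noted above.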
The Apriori/map-reduce approach addresses the efficiency of the first part of association rule mining, where frequent itemsets are found by scanning the transaction dataset. The second part of association rule mining, the creation of association rules based on the frequent itemsets, can also be time consuming when there is a large number of long frequent itemsets. This is because, for each frequent k-itemset, there will be 2^k−1 potential rules to be checked against the minimum confidence. However, the Apriori/map-reduce approach does not address the efficiency of the second part.
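To illustrate why the second part can be costly, the following Python sketch shows a straightforward, unoptimized form of rule creation. It tests each way of splitting a frequent itemset into a non-empty antecedent and a non-empty consequent, so the number of checks grows exponentially with the itemset size. The function name `generate_rules` and the support values in `FREQUENT` are hypothetical and chosen only for this example.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Tuple

Rule = Tuple[FrozenSet[str], FrozenSet[str], float]


def generate_rules(frequent: Dict[FrozenSet[str], float],
                   min_confidence: float) -> List[Rule]:
    """Brute-force rule creation from a map of frequent itemsets to supports."""
    rules: List[Rule] = []
    for itemset, sup_xy in frequent.items():
        if len(itemset) < 2:
            continue  # a rule needs a non-empty antecedent and consequent
        # Try every non-empty proper subset of the itemset as the antecedent.
        for size in range(1, len(itemset)):
            for subset in combinations(sorted(itemset), size):
                x = frozenset(subset)
                # Every subset of a frequent itemset is itself frequent,
                # so its support is assumed to be present in `frequent`.
                conf = sup_xy / frequent[x]
                if conf >= min_confidence:
                    rules.append((x, itemset - x, conf))
    return rules


# Hypothetical frequent itemsets with their supports.
FREQUENT = {
    frozenset({"bread"}): 0.8,
    frozenset({"potatoes"}): 0.6,
    frozenset({"butter"}): 0.6,
    frozenset({"bread", "potatoes"}): 0.6,
    frozenset({"bread", "butter"}): 0.4,
    frozenset({"potatoes", "butter"}): 0.4,
    frozenset({"bread", "potatoes", "butter"}): 0.4,
}

rules = generate_rules(FREQUENT, min_confidence=0.6)
rule_index = {(x, y): conf for x, y, conf in rules}
```

With these values, the rule {bread, potatoes}=>{butter} is retained with a confidence of two thirds, while {bread}=>{butter}, with a confidence of one half, is discarded.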