Association rule mining (“ARM”) is a data mining technique that extracts associations between sets of products, services, or any other variables present in transactions. Systems often apply ARM to extract relationships and trends in domains such as retail, telecommunications, banking, bioinformatics, healthcare, and catering, to name a few. The resulting association rules help those in an industry make informed decisions based on the relationships between products and/or services. For example, association rules may help to identify cross-selling and up-selling opportunities in the retail industry.
Association rules may be mined from a set of transactions in a dataset, for example transactions collected at a point of sale (“PoS”). In other words, D={t1, t2, . . . , tn}, where D is a dataset of transactions tx for x=1 to n, and n is the total number of transactions. Each transaction includes a set of one or more items (an “itemset”) drawn from a set of all items. Items may be, for example, products or services sold, viewed, downloaded, streamed, and the like. In other words, I={i1, i2, . . . , ik}, where I is the set of all items ix for x=1 to k (i.e., I is a k-itemset), and k is the total number of items. A rule is defined as an implication of the form X→Y, where X, Y⊂I and X∩Y=∅, and where X and Y are the antecedent and consequent itemsets of the rule, respectively.
To select useful rules from the set of all possible rules, constraints on various measures of significance and interest may be applied. An important property of an itemset is its “support count”, which refers to the total number of transactions in a dataset that contain that itemset. Additionally, the “support” of a rule indicates how widely the rule applies in a given dataset. A rule that has very low support may occur simply by chance. The formal definition of support may be given by equation 1:
$$s(X \rightarrow Y) = \frac{\sigma(X \cup Y)}{n} \tag{1}$$

where X is the antecedent itemset, Y is the consequent itemset, σ is a support function, and n is the number of transactions in the dataset.
The “confidence” of a rule X→Y indicates how often itemset Y appears in transactions that contain itemset X. The formal definition of confidence may be given by equation 2:
$$c(X \rightarrow Y) = \frac{\sigma(X \cup Y)}{\sigma(X)} \tag{2}$$

where X is the antecedent itemset, Y is the consequent itemset, and σ is a support function. Confidence is a measure of the accuracy or reliability of the inference made by the rule, that is, the number of instances that the rule predicts correctly out of all instances to which it applies.
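By way of a non-limiting illustration, the support and confidence definitions above may be sketched in Python as follows. The five-transaction dataset, the item names, and the function names `support_count`, `support`, and `confidence` are hypothetical and chosen only for this example; σ is taken here to be the support count described earlier.

```python
from typing import FrozenSet, Iterable, List

# Hypothetical toy dataset D of five transactions (illustrative only).
TRANSACTIONS: List[FrozenSet[str]] = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "diapers", "beer", "eggs"}),
    frozenset({"milk", "diapers", "beer", "cola"}),
    frozenset({"bread", "milk", "diapers", "beer"}),
    frozenset({"bread", "milk", "diapers", "cola"}),
]


def support_count(itemset: Iterable[str], transactions: List[FrozenSet[str]]) -> int:
    """sigma(itemset): number of transactions that contain every item in itemset."""
    items = frozenset(itemset)
    return sum(1 for t in transactions if items <= t)


def support(x: Iterable[str], y: Iterable[str], transactions: List[FrozenSet[str]]) -> float:
    """s(X -> Y) = sigma(X u Y) / n, following equation 1."""
    return support_count(set(x) | set(y), transactions) / len(transactions)


def confidence(x: Iterable[str], y: Iterable[str], transactions: List[FrozenSet[str]]) -> float:
    """c(X -> Y) = sigma(X u Y) / sigma(X), following equation 2."""
    antecedent_count = support_count(x, transactions)
    if antecedent_count == 0:
        raise ValueError("antecedent itemset does not occur in any transaction")
    return support_count(set(x) | set(y), transactions) / antecedent_count


if __name__ == "__main__":
    x, y = {"milk", "diapers"}, {"beer"}
    print(support(x, y, TRANSACTIONS))     # 2 of 5 transactions contain X u Y
    print(confidence(x, y, TRANSACTIONS))  # 2 of the 3 transactions containing X
```

In this toy dataset, the rule {milk, diapers}→{beer} has a support of 2/5 and a confidence of 2/3, because three transactions contain both milk and diapers and two of those also contain beer.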
Association rules generally are required to satisfy user-specified minimum support and user-specified minimum confidence thresholds at the same time to be considered useful. However, there is no defined approach for a user to set the values of minimum support and minimum confidence. In each instance, a user must have specific knowledge about the working mechanism of the algorithm being used as well as knowledge about the data in order to determine useful values for minimum support and minimum confidence.
Typically, the task of association rule mining is carried out in two steps. First, a minimum support constraint is applied to the itemsets in a dataset to determine all frequent itemsets (i.e., all itemsets having at least a threshold support). Next, a minimum confidence constraint is applied to the frequent itemsets to form rules. However, finding all frequent itemsets in a dataset may be computationally intensive because the number of candidate itemsets grows exponentially with the number of items: a set of k items has 2^k−1 non-empty subsets. To cope with the exponential growth, various algorithms such as the Apriori algorithm, the frequent pattern growth (“FP-growth”) algorithm, and others have been used to mine association rules more efficiently. However, known algorithms all have drawbacks. For example, the Apriori algorithm requires significant computational effort on a large dataset. Additionally, the FP-growth algorithm is memory intensive if many or all transactions in the transaction database are unique. Additionally, the number of association rules extracted by the FP-growth algorithm is smaller than the number of association rules extracted by the Apriori algorithm. Other known algorithms have similar deficiencies.
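For illustration only, the two-step procedure described above may be sketched as a brute-force enumeration in Python. This sketch is neither the Apriori algorithm nor the FP-growth algorithm; it exhaustively tests every candidate itemset and is therefore practical only for very small item sets. The function name `mine_rules` and its parameters are hypothetical.

```python
from itertools import combinations
from typing import FrozenSet, List, Tuple

Rule = Tuple[FrozenSet[str], FrozenSet[str], float, float]  # (X, Y, support, confidence)


def mine_rules(
    transactions: List[FrozenSet[str]],
    min_support: float,
    min_confidence: float,
) -> List[Rule]:
    """Brute-force two-step mining (illustrative sketch, exponential in the number of items)."""
    n = len(transactions)
    items = sorted(set().union(*transactions))

    def sigma(itemset: FrozenSet[str]) -> int:
        # Support count: transactions that contain every item of the itemset.
        return sum(1 for t in transactions if itemset <= t)

    # Step 1: keep every itemset whose support meets the minimum support threshold.
    frequent = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            candidate = frozenset(combo)
            count = sigma(candidate)
            if count / n >= min_support:
                frequent[candidate] = count

    # Step 2: split each frequent itemset into antecedent X and consequent Y,
    # keeping rules whose confidence meets the minimum confidence threshold.
    rules: List[Rule] = []
    for itemset, count in frequent.items():
        for size in range(1, len(itemset)):
            for combo in combinations(sorted(itemset), size):
                x = frozenset(combo)
                y = itemset - x
                # Every subset of a frequent itemset is itself frequent,
                # so frequent[x] is always present here.
                conf = count / frequent[x]
                if conf >= min_confidence:
                    rules.append((x, y, count / n, conf))
    return rules


if __name__ == "__main__":
    toy = [frozenset("ab"), frozenset("abc"), frozenset("ac"), frozenset("b")]
    for x, y, s, c in mine_rules(toy, min_support=0.5, min_confidence=0.6):
        print(sorted(x), "->", sorted(y), f"support={s:.2f}", f"confidence={c:.2f}")
```

The outer loop of the first step visits all 2^k−1 non-empty subsets of the k items, which is the exponential cost that algorithms such as Apriori and FP-growth are intended to reduce.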
Sampling has also been employed to speed up frequent itemset mining. Methods have been disclosed for taking random samples of transactions or heuristic samples of transactions to mine frequent itemsets. However, sampled transactions resulting from random sampling may not accurately represent the actual population. Additionally, while heuristic sampling may be more accurate than random sampling, heuristic sampling methods are computationally intensive and may require excessive computing resources. Improved sampling methods are desired.
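As a minimal sketch of the simple random sampling mentioned above, and not of any improved sampling method, the following Python code draws a uniform sample of transactions without replacement and estimates the support of an itemset from the sample alone. The function names `sample_transactions` and `estimated_support`, the `fraction` parameter, and the `seed` parameter are assumptions made for this example.

```python
import random
from typing import FrozenSet, Iterable, List, Optional


def sample_transactions(
    transactions: List[FrozenSet[str]],
    fraction: float,
    seed: Optional[int] = None,
) -> List[FrozenSet[str]]:
    """Draw a simple random sample (without replacement) of the transactions."""
    if not transactions:
        raise ValueError("transactions must not be empty")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in the interval (0, 1]")
    rng = random.Random(seed)  # a fixed seed makes the sample reproducible
    k = max(1, round(fraction * len(transactions)))
    return rng.sample(transactions, k)


def estimated_support(itemset: Iterable[str], sample: List[FrozenSet[str]]) -> float:
    """Estimate the support of an itemset from the sampled transactions only."""
    items = frozenset(itemset)
    return sum(1 for t in sample if items <= t) / len(sample)


if __name__ == "__main__":
    # Hypothetical population: 30% of transactions contain both "a" and "b".
    population = [frozenset({"a", "b"})] * 30 + [frozenset({"a"})] * 70
    sample = sample_transactions(population, fraction=0.2, seed=7)
    print(len(sample), estimated_support({"a", "b"}, sample))
```

Because the estimate depends on which transactions happen to be drawn, the support computed from a small sample may differ from the support in the full dataset, which is the representativeness concern noted above.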
While systems and methods are described herein by way of examples and embodiments, those skilled in the art recognize that systems and methods for mining association rules are not limited to the embodiments or drawings described. It should be understood that the drawings and description are not intended to be limiting to the particular form disclosed. Rather, the intention is to cover all modifications, equivalents and alternatives falling within the spirit and scope of the appended claims. Any headings used herein are for organizational purposes only and are not meant to limit the scope of the description or the claims. As used herein, the word “may” is used in a permissive sense (i.e., meaning having the potential to), rather than the mandatory sense (i.e., meaning must). Similarly, the words “include”, “including”, and “includes” mean including, but not limited to.