Association Rule Mining (ARM) in large transactional databases is a central problem in the field of knowledge discovery. The input to ARM is a database in which objects are grouped by context. An example of such a grouping would be a list of items grouped by the customer who bought them. ARM then finds sets of objects which tend to associate with one another. Given two distinct sets of objects, X and Y, we say Y is associated with X if the appearance of X in a certain context usually implies that Y will appear in that context as well. If X usually implies Y, we then say that the rule X⇒Y is confident in the database.
Typically, an association rule is of practical interest only if it appears in more than a certain number of contexts. If it does, we say that the rule is frequent, i.e., that it has a large support. The thresholds of support (MinSup) and confidence (MinConf) are parameters that are used to define which association rules are of interest. These parameters are usually supplied by the user according to his needs and resources. The solution to the ARM problem is a list of all association rules that are both frequent and confident in that database. Such lists of rules have many applications in the context of understanding, describing and acting upon the database.
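The two measures defined above can be illustrated with a minimal sketch. It assumes that each context is a set of items and that support is expressed as a fraction of all contexts; the function names `support` and `confidence` and the sample transactions are hypothetical and chosen only for illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions (contexts) that contain every item in itemset."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def confidence(transactions, x, y):
    """Fraction of the transactions containing X that also contain Y."""
    x, y = frozenset(x), frozenset(y)
    support_x = support(transactions, x)
    if support_x == 0:
        return 0.0  # X never appears, so the rule has no evidence
    return support(transactions, x | y) / support_x


# Hypothetical sample data: four customer baskets.
transactions = [
    frozenset({"pasta sauce", "parmesan", "pasta"}),
    frozenset({"pasta sauce", "parmesan"}),
    frozenset({"pasta sauce", "bread"}),
    frozenset({"bread", "milk"}),
]

rule_support = support(transactions, {"pasta sauce", "parmesan"})
rule_confidence = confidence(transactions, {"pasta sauce"}, {"parmesan"})
```

A rule X⇒Y would then be reported only if `rule_support` meets MinSup and `rule_confidence` meets MinConf.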
A variety of algorithms have been developed for ARM. Such algorithms are described, for example, by Agrawal and Srikant in “Fast Algorithms for Mining Association Rules,” Proceedings of the 20th International Conference on Very Large Databases (VLDB94, Santiago, Chile, 1994), pages 487-499, which is incorporated herein by reference. It has been shown that the major computational task in ARM is the identification of all the frequent itemsets, i.e., those sets of items which appear in a fraction greater than MinSup of the transactions. Association rules can then be produced from these frequent itemsets in a straightforward manner. For example, once it is known that both {Pasta Sauce} and {Pasta Sauce, Parmesan} are frequent itemsets, the association rule {Pasta Sauce}⇒{Parmesan} is obviously frequent, and all that remains is to check whether the association is confident. Because databases are often very large and are typically stored in secondary memory (disk), ARM algorithms known in the art are mainly concerned with reducing the number of database scans required to arrive at the desired collection of frequent itemsets, and hence to determine the confident association rules.
In the above-mentioned paper, Agrawal and Srikant describe an ARM algorithm that they call “Apriori.” The algorithm begins by assuming that any item is a candidate to be a frequent itemset of size k=1. Apriori then performs several rounds of a two-phased computation. In the first phase of the kth round, the database is scanned, and support counts are calculated for all k-size candidate itemsets. Those candidate itemsets that have support above the user-supplied MinSup threshold are considered frequent itemsets. In the second phase, candidate k+1-size itemsets are generated from the set of frequent k-size itemsets if and only if all their k-size subsets are frequent. The rounds terminate when the set of frequent k-size itemsets is empty.
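The two-phased rounds described above can be sketched as follows. This is a simplified in-memory illustration, not the implementation given by Agrawal and Srikant: it assumes that MinSup is given as an absolute transaction count (`min_sup_count`, a hypothetical parameter name) and that an itemset is frequent when its count is greater than or equal to that threshold.

```python
from itertools import combinations


def apriori(transactions, min_sup_count):
    """Return {frequent itemset: support count} for in-memory transactions."""
    transactions = [frozenset(t) for t in transactions]
    # Round k=1: every individual item is a candidate itemset.
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while candidates:
        # Phase 1: one scan of the database counts all k-size candidates.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        frequent_k = {c for c, n in counts.items() if n >= min_sup_count}
        if not frequent_k:
            break  # no frequent k-size itemsets: the rounds terminate
        frequent.update({c: counts[c] for c in frequent_k})
        # Phase 2: a (k+1)-size itemset becomes a candidate only if
        # all of its k-size subsets are frequent.
        candidates = set()
        for a, b in combinations(frequent_k, 2):
            union = a | b
            if len(union) == k + 1 and all(
                frozenset(s) in frequent_k for s in combinations(union, k)
            ):
                candidates.add(union)
        k += 1
    return frequent


# Hypothetical sample data: five transactions over items a, b, c.
sample = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
result = apriori(sample, min_sup_count=3)
```

Each pass through the loop corresponds to one database scan, which is the cost that ARM algorithms of this kind seek to minimize.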
In Distributed Association Rule Mining (D-ARM), the ARM problem is restated in the context of distributed computing. In D-ARM, the database is partitioned among several nodes that can perform independent parallel computations, as well as communicate with one another. A number of algorithms have been proposed to solve the D-ARM problem, particularly for share-nothing machines (i.e., distributed computing systems in which each node uses its own separate memory). An exemplary D-ARM algorithm is described by Agrawal and Shafer in “Parallel Mining of Association Rules,” IEEE Transactions on Knowledge and Data Engineering 8:6 (1996), pages 962-969, which is incorporated herein by reference. D-ARM has a major advantage over conventional ARM, in that it parallelizes disk I/O operations. The main difficulty for D-ARM algorithms is communication complexity among the nodes. The most important factors in the communication complexity of D-ARM algorithms are the number of partitions (or computing nodes), n, and the number of itemsets, |C|, considered by the algorithm.
Agrawal and Shafer present two major approaches to D-ARM: data distribution (DD) and count distribution (CD). DD focuses on the optimal partitioning of the database in order to maximize parallelism. CD, on the other hand, considers a setting in which the data are arbitrarily partitioned horizontally among the parties to begin with, and focuses on parallelizing the computation. (Horizontal partitioning means that each partition includes whole transactions, in contrast with vertical partitioning, in which the same transaction is split among several parties.) The DD approach is not always applicable, since at the time the data are generated, they are often already partitioned. In many cases, the data cannot be gathered and repartitioned for reasons of security and secrecy, cost of transmission, or just efficiency. DD is thus more applicable to systems that are dedicated to performing D-ARM. CD, on the other hand, is typically a more appealing solution for systems that are naturally distributed over large expanses, such as stock exchange and credit card systems.
The CD algorithm presented by Agrawal and Shafer is a parallelization of the Apriori algorithm described above. In the first phase of CD, each of the nodes performs a database scan independently on its own partition. Then the nodes exchange their scan results, and a global sum reduction is performed on the support counts of each candidate itemset. Those itemsets whose global support is larger than MinSup are considered frequent. The second phase, calculating the candidate k+1-size itemsets, can be carried out without any communication, because the calculation depends only on the identity of the frequent k-size itemsets, which is known to all parties by this time. Thus, CD fully parallelizes the disk I/O complexity of Apriori and performs roughly the same computations. CD also requires one synchronization point on each round and carries an O(|C|·n) communication complexity penalty. Since typical values for |C| are tens or hundreds of thousands, CD is not scalable to large numbers of partitions.
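The structure of CD can be sketched by simulating the nodes within a single process. This is an illustrative sketch only: real nodes would exchange counts over a network, and the names `count_distribution`, `generate_candidates` and `counts_exchanged` are hypothetical. MinSup is assumed to be an absolute count over the whole database, met when the global count is greater than or equal to it.

```python
from itertools import combinations


def generate_candidates(frequent_k, k):
    """Build (k+1)-size candidates whose k-size subsets are all frequent."""
    out = set()
    for a, b in combinations(frequent_k, 2):
        union = a | b
        if len(union) == k + 1 and all(
            frozenset(s) in frequent_k for s in combinations(union, k)
        ):
            out.add(union)
    return out


def count_distribution(partitions, min_sup_count):
    """Simulate CD; return (frequent itemsets, number of counts exchanged)."""
    partitions = [[frozenset(t) for t in p] for p in partitions]
    candidates = {frozenset([i]) for p in partitions for t in p for i in t}
    frequent, counts_exchanged, k = {}, 0, 1
    while candidates:
        # Phase 1a: each node scans only its own partition.
        per_node = [
            {c: sum(1 for t in p if c <= t) for c in candidates}
            for p in partitions
        ]
        # Phase 1b: global sum reduction, the one synchronization point
        # of the round. Every node contributes one count per candidate.
        global_counts = {c: sum(node[c] for node in per_node) for c in candidates}
        counts_exchanged += len(candidates) * len(partitions)
        frequent_k = {c for c in candidates if global_counts[c] >= min_sup_count}
        frequent.update({c: global_counts[c] for c in frequent_k})
        # Phase 2: needs no communication, since every node knows frequent_k.
        candidates = generate_candidates(frequent_k, k)
        k += 1
    return frequent, counts_exchanged


# Hypothetical sample data: five transactions split over two nodes.
parts = [
    [{"a", "b"}, {"a", "b", "c"}],
    [{"a", "c"}, {"b", "c"}, {"a", "b", "c"}],
]
frequent, exchanged = count_distribution(parts, min_sup_count=3)
```

The `counts_exchanged` tally grows by |C|·n in every round, which reflects the O(|C|·n) communication penalty noted above.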
In order to reduce this communication load, Cheung et al. introduced the FDM algorithm, in “A Fast Distributed Algorithm for Mining Association Rules,” Proceedings of the 1996 International Conference on Parallel and Distributed Information Systems (Miami Beach, Fla., 1996), pages 31-44, which is incorporated herein by reference. FDM takes advantage of the fact that ARM algorithms look only for rules that are globally frequent. FDM is based on the inference that in order for an itemset to appear among all the transactions in the database with a given frequency, there must be at least one partition of the database in which the itemset appears at the given frequency or greater. Therefore, in FDM, the first stage of CD is divided into two rounds of communication: In the first round, every party names those candidate itemsets that are locally frequent in its partition (because they appear in the partition with a frequency greater than or equal to MinSup/|database|). In the second round, counts are globally summed only for those candidate itemsets that were named by at least one party. If the probability that an itemset will have the potential of being frequent is Pr_potential, then FDM only communicates Pr_potential·|C| of the itemsets. It thus improves the communication complexity to O(Pr_potential·|C|·n).
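The two communication rounds of FDM can be sketched for a single set of candidates as follows. This is a simplified single-process illustration rather than the algorithm as published by Cheung et al.; the function name `fdm_round` and the sample data are hypothetical, and MinSup is assumed to be an absolute count over the whole database.

```python
def fdm_round(partitions, candidates, min_sup_count):
    """Return (globally frequent itemsets with counts, itemsets named locally)."""
    partitions = [[frozenset(t) for t in p] for p in partitions]
    db_size = sum(len(p) for p in partitions)
    per_node = [
        {c: sum(1 for t in p if c <= t) for c in candidates}
        for p in partitions
    ]
    # Round 1: each node names the candidates that are locally frequent, i.e.
    # count / |partition| >= MinSup / |database| (compared without division).
    named = set()
    for p, counts in zip(partitions, per_node):
        named |= {
            c for c in candidates
            if counts[c] * db_size >= min_sup_count * len(p)
        }
    # Round 2: counts are summed globally only for the named candidates.
    global_counts = {c: sum(counts[c] for counts in per_node) for c in named}
    frequent = {c: n for c, n in global_counts.items() if n >= min_sup_count}
    return frequent, named


# Hypothetical sample data: two nodes with three transactions each.
parts = [
    [{"a", "b"}, {"a", "b"}, {"a"}],
    [{"c"}, {"c"}, {"a", "c"}],
]
cands = {frozenset([x]) for x in "abcd"}
frequent, named = fdm_round(parts, cands, min_sup_count=3)
```

The ratio of `len(named)` to `len(cands)` plays the role of Pr_potential in this sketch: candidates that no node names are never counted globally.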
FDM is problematic when large numbers of nodes are involved in the computation, because Pr_potential is not scalable in n, and quickly increases to 1 as n increases, particularly in inhomogeneous databases. This problem was pointed out by Cheung and Xiao in “Effect of Data Skewness in Parallel Mining of Association Rules,” Second Pacific-Asia Conference of Knowledge Discovery and Data Mining (1998), pages 48-60, which is incorporated herein by reference. The authors show that as the inhomogeneity of the database increases, FDM pruning techniques become ineffective.
Over the past few years, distributed information systems have become a mainstream computing paradigm, and the wealth of information available in these systems is constantly expanding. Examples of distributed information resources of this sort include a company's Virtual Private Network, a multi-server billing center, a network of independent stockbrokers, and a peer-to-peer MP3 library, such as Napster™. There is a growing need for tools that can assist in understanding and describing such information. These new databases differ from distributed databases of the past, in that the partitioning of the data is usually skewed, the connections between partitions are sparse and often unreliable, and variable throughputs and latencies may apply to different nodes. These characteristics accentuate the inadequacies of D-ARM methods known in the art.