Given modern computing capabilities, it is relatively easy to collect and store vast amounts of data, such as facts, numbers, text, etc. The issue then becomes how to analyze the vast amount of data to distinguish important data from less important data. The process of filtering the data to determine important data is often referred to as data mining. Data mining refers to a process of collecting data, analyzing the collected data from various perspectives, and summarizing any relevant findings. Locating frequent itemsets in a transaction database has become an important consideration when mining data. For example, frequent itemset mining has been used to locate useful patterns in a database of customer transactions.
Frequent Itemset Mining (FIM) is the basis of Association Rule Mining (ARM), and has been widely applied to marketing data analysis, protein sequences, web logs, text, music, stock market data, etc. One popular algorithm for mining frequent itemsets in a transaction database is the frequent pattern growth (FP-growth) algorithm. The FP-growth algorithm uses a prefix tree (termed the “FP-tree”) representation of the transaction database, and is generally faster than other mining algorithms, such as the Apriori mining algorithm. The FP-growth algorithm is often described as a recursive elimination scheme.
As part of a preprocessing step, the FP-growth algorithm deletes all items from the transactions that are not individually frequent according to a defined threshold. That is, the FP-growth algorithm deletes all items that do not appear in a user-specified minimum number of transactions. After preprocessing, an FP-tree is built. The FP-growth algorithm then constructs a “conditional pattern base” for each frequent item and uses it to construct a conditional FP-tree. The FP-growth algorithm then recursively mines the conditional FP-tree. The pattern growth is achieved by the concatenation of the suffix pattern with the frequent patterns generated from the conditional FP-tree.
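The steps described above can be illustrated with a minimal, single-threaded sketch. It is not the implementation of any particular system; the names `Node`, `build_tree`, `fp_growth`, and `mine` are hypothetical, and the choice to break ties between equally frequent items by item name is an assumption made only to keep the ordering deterministic.

```python
from collections import Counter, defaultdict


class Node:
    """One FP-tree node: an item, a count, a parent link, and children by item."""

    def __init__(self, item, parent):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}


def build_tree(weighted_transactions, min_support):
    """Build an FP-tree from (items, weight) pairs.

    Returns the header table (item -> list of nodes) and the support of
    each individually frequent item.
    """
    support = Counter()
    for items, weight in weighted_transactions:
        for item in set(items):
            support[item] += weight
    # Preprocessing: keep only items that are individually frequent.
    frequent = {i: c for i, c in support.items() if c >= min_support}

    root = Node(None, None)
    header = defaultdict(list)
    for items, weight in weighted_transactions:
        # Sort by descending support (ties by item name, an assumption)
        # so that transactions share prefixes in the tree.
        kept = sorted((i for i in set(items) if i in frequent),
                      key=lambda i: (-frequent[i], i))
        node = root
        for item in kept:
            child = node.children.get(item)
            if child is None:
                child = Node(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += weight
            node = child
    return header, frequent


def fp_growth(weighted_transactions, min_support, suffix=frozenset()):
    """Return {frozenset(itemset): support} for all frequent itemsets."""
    header, frequent = build_tree(weighted_transactions, min_support)
    results = {}
    for item, nodes in header.items():
        # Pattern growth: extend the suffix pattern with this item.
        pattern = suffix | {item}
        results[pattern] = frequent[item]
        # Conditional pattern base: the prefix path of every node holding `item`,
        # weighted by that node's count.
        base = []
        for node in nodes:
            path = []
            parent = node.parent
            while parent is not None and parent.item is not None:
                path.append(parent.item)
                parent = parent.parent
            if path:
                base.append((path, node.count))
        # Recursively mine the conditional FP-tree built from that base.
        results.update(fp_growth(base, min_support, pattern))
    return results


def mine(transactions, min_support):
    """Convenience wrapper: every input transaction has weight 1."""
    return fp_growth([(t, 1) for t in transactions], min_support)
```

For example, calling `mine([["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "c"], ["a", "b", "c", "d"]], 3)` drops item `d` during preprocessing and reports the three single items and the three pairs as frequent, while the triple `{a, b, c}` falls below the threshold.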
Since the FP-growth algorithm has been recognized as a powerful tool for frequent itemset mining, there has been a large amount of research into implementing the FP-growth algorithm on parallel processing computers. There have been two main approaches to implementing FP-growth in parallel: the multiple tree approach and the single tree approach. The multiple tree approach builds multiple FP-trees separately, which results in the introduction of many redundant nodes. FIG. 6 illustrates the number of tree nodes generated by the conventional multiple tree approach with 1, 4, 8, 16, 32 and 64 threads (trees). The example database used to generate FIG. 6 is a benchmark dataset “accidents”, which can be found at the link “http://fimi.cs.helsinki.fi/data/” (the minimum support threshold is 200,000). As shown, the multiple tree approach will generate two (2) times as many tree nodes on four (4) threads, and about nine (9) times as many tree nodes on sixty-four (64) threads, as compared to only one thread. Building redundant nodes in multiple trees results in great memory demand, and sometimes the memory is not large enough to hold the multiple trees. The previous single tree approach builds only a single FP-tree in memory, but it requires a lock to be associated with each of the tree nodes, thereby limiting scalability.
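The node redundancy of the multiple tree approach can be reproduced on a toy scale with the sketch below. It only counts prefix-tree nodes and does not mine anything. The function names are hypothetical, and the round-robin split of transactions across trees is an assumption made for illustration; the partitioning scheme behind FIG. 6 is not specified here.

```python
from collections import Counter


def count_tree_nodes(transactions, order):
    """Insert transactions into a prefix tree (nested dicts) and count its nodes."""
    root = {}
    nodes = 0
    for t in transactions:
        node = root
        # Keep only globally frequent items, in the shared global order.
        for item in sorted((i for i in set(t) if i in order), key=order.get):
            if item not in node:
                node[item] = {}
                nodes += 1
            node = node[item]
    return nodes


def multiple_tree_nodes(transactions, min_support, num_trees):
    """Total nodes when transactions are split round-robin across num_trees trees."""
    support = Counter(i for t in transactions for i in set(t))
    frequent = sorted((i for i in support if support[i] >= min_support),
                      key=lambda i: (-support[i], i))
    order = {item: rank for rank, item in enumerate(frequent)}
    # Assumed partitioning: tree k receives transactions k, k + num_trees, ...
    parts = [transactions[k::num_trees] for k in range(num_trees)]
    return sum(count_tree_nodes(p, order) for p in parts)
```

With the eight transactions `[["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "c"]] * 2` and a minimum support of 2, a single tree needs 6 nodes, two trees need 8 in total, and four trees need 9, because a prefix that is shared within one tree must be stored again in every tree that receives a transaction with that prefix.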