Data mining is the extraction of previously unknown and potentially useful information from a large database. Association rule mining is one of the most important techniques in data mining. Association rules were first proposed in the context of supermarket sales. A large supermarket collects a large number of transaction records, and a supermarket manager hopes to find useful information in these records that can help decision makers draw up sales plans.
A transaction record contains a set of items; in the supermarket example, an item is a product. Let I be the set of all items, and let X and Y each represent a set of some items. An association rule then takes the form X→Y, where X is the antecedent of the association rule, Y is the consequent of the association rule, X, Y ⊂ I, and X ∩ Y = Ø.
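The form of an association rule described above can be illustrated with a minimal Python sketch. The class name `AssociationRule`, the method `is_valid`, and the sample items are hypothetical and chosen only for illustration; the check simply encodes the conditions X, Y ⊂ I and X ∩ Y = Ø.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AssociationRule:
    """Hypothetical container for a rule X -> Y."""
    antecedent: frozenset  # X
    consequent: frozenset  # Y

    def is_valid(self, all_items: frozenset) -> bool:
        x, y = self.antecedent, self.consequent
        # X and Y must be proper subsets of I, and X and Y must not share items.
        return x < all_items and y < all_items and x.isdisjoint(y)


# Illustrative item set I (sample products, not taken from any real data).
I = frozenset({"bread", "milk", "butter", "beer"})

rule = AssociationRule(frozenset({"bread", "butter"}), frozenset({"milk"}))
overlapping = AssociationRule(frozenset({"bread"}), frozenset({"bread", "milk"}))
```

Here `rule.is_valid(I)` evaluates to `True`, while `overlapping.is_valid(I)` evaluates to `False` because "bread" appears on both sides of the rule.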
Association rule mining is generally divided into two steps: 1. finding all frequent itemsets; 2. producing all association rules based on the frequent itemsets found in the first step. The overall performance of association rule mining depends mainly on the first step, since the second step is comparatively easy.
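To illustrate why the second step is comparatively easy, the following Python sketch derives rules from frequent itemsets whose supports are already known. It assumes the commonly used confidence measure, confidence(X→Y) = support(X∪Y) / support(X), and a minimum confidence threshold; neither is specified in the text above. The function name `generate_rules` and the sample supports are hypothetical.

```python
from itertools import combinations


def generate_rules(supports, min_confidence):
    """Derive rules X -> Y from frequent itemsets.

    supports: dict mapping each frequent itemset (frozenset) to its support
    count, i.e. the assumed output of the first step. Every non-empty subset
    of a frequent itemset is itself frequent, so its support is expected to
    be present in the dict.
    """
    rules = []
    for itemset, supp in supports.items():
        if len(itemset) < 2:
            continue  # a rule needs a non-empty X and a non-empty Y
        for size in range(1, len(itemset)):
            for x in combinations(sorted(itemset), size):
                antecedent = frozenset(x)
                consequent = itemset - antecedent
                confidence = supp / supports[antecedent]
                if confidence >= min_confidence:
                    rules.append((antecedent, consequent, confidence))
    return rules


# Illustrative supports (made-up counts, not real data).
supports = {
    frozenset({"bread"}): 4,
    frozenset({"milk"}): 3,
    frozenset({"bread", "milk"}): 3,
}
rules = generate_rules(supports, min_confidence=0.8)
```

With these sample counts, only milk→bread is kept (confidence 3/3 = 1.0); bread→milk has confidence 3/4 = 0.75 and falls below the threshold. No database scan is needed in this step, which is why the cost is dominated by the first step.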
The Apriori algorithm based on a prefix tree is one of the most well-known and widely accepted methods for computing frequent itemsets. It goes through two steps to compute the frequent itemsets having k items: 1. generating k-itemset candidates based on the frequent (k-1)-itemsets; 2. scanning the database to obtain the support of each k-itemset candidate, and thereby the frequent k-itemsets. The algorithm uses a prefix tree to represent frequent itemsets, and each node in the k-th level represents a set of frequent k-itemsets.
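The two steps of each cycle can be sketched in Python as follows. This is a simplified illustration, not the prefix-tree implementation described above: it stores itemsets in plain dictionaries instead of a prefix tree, and it treats support as an absolute transaction count compared against a hypothetical threshold `min_support`. The function name `apriori` and the sample transactions are also assumptions made for illustration.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return a dict mapping each frequent itemset to its support count."""
    transactions = [frozenset(t) for t in transactions]

    # Level 1: one database scan to count the individual items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    all_frequent = dict(frequent)

    k = 2
    while frequent:
        # Step 1: generate k-itemset candidates from frequent (k-1)-itemsets,
        # keeping only candidates whose (k-1)-subsets are all frequent.
        candidates = set()
        for a, b in combinations(list(frequent), 2):
            union = a | b
            if len(union) == k and all(
                frozenset(sub) in frequent for sub in combinations(union, k - 1)
            ):
                candidates.add(union)

        # Step 2: scan the database to obtain the support of each candidate.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        all_frequent.update(frequent)
        k += 1
    return all_frequent


# Illustrative transactions (made-up data).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
result = apriori(transactions, min_support=2)
```

For this sample data the result contains five frequent itemsets: the three single items, {bread, milk} with support 2, and {bread, butter} with support 2. The candidate {milk, butter} is counted in the level-2 scan but has support 1, and the only possible 3-itemset is pruned in step 1 because {milk, butter} is not frequent. Note that every level that has candidates requires another full pass over the transactions.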
However, as the algorithm proceeds and the depth of the prefix tree increases, the algorithm consumes too much time in the second step of each cycle: scanning the database to obtain the support of the k-itemset candidate. This is because the database is used to recursively traverse the prefix tree, so an increase in the depth of the prefix tree unavoidably increases the number of times the database is traversed, which results in considerable time consumption.
Therefore, it is necessary to improve the method for computing the support of the itemset candidate and the corresponding method for determining the frequent itemset.