Useful and previously unknown information can be discovered by mining databases. This new information can aid critical analysis and decision making. However, because extensive computation is often performed during the data mining process, the time complexity of data mining algorithms has been a key research issue. In the past several years, researchers have focused extensively on developing new algorithms that run faster and perform better.
An important application of data mining is market-basket analysis. For instance, online stores analyze data on their customers' online behavior and buying patterns to offer and market other products and goods that the customers would likely be interested in. The stream of clicks through the store (“clickstream”) is logged into a data warehouse and analyzed to discover information that can be used to target marketing to the customer.
For example, a hypothetical online bookseller may wish to analyze which of its books are popular, which books correlate better with which category, and which books are often bought with what other items. To illustrate this scenario, let us suppose that a user has bought several different books within a particular time period, including “Object-Oriented Data Warehouse Design,” “Tools for statistical inference,” “Bayesian methods, Bayes, and Empirical Bayes: Methods for Data Analysis,” “A joke book on Windows,” “Indian cooking,” “Data mining in Java,” and “The Art of Japanese prints.”
In this example, the bookstore finds, from the pattern of purchasing two books on data analysis and a book related to data warehouses, that the books “Tools for statistical inference” and “Bayesian methods, Bayes, and Empirical Bayes: Methods for Data Analysis” are closely related from the perspective of data mining. The book “Object-Oriented Data Warehouse Design” is also related to data mining. Thus, the topic of data mining is an important aspect of the user's previous transactions. Therefore, the next time the user visits the online bookseller, a personalized web page can be displayed to recommend more books to the user.
One way to determine these recommendations is to use association rules to identify a combination of items that occur together with greater frequency than might be expected if the items were independent of one another. In this example, if the topics of “data mining” and “clustering” co-occur relatively more frequently, the following association rules may be in effect: Data Mining → Clustering and Clustering → Data Mining. In the illustrative scenario, since the user purchased books related to data mining and since the Data Mining → Clustering association rule has been discovered for the online bookseller's system, the online bookseller can generate recommendations relating to clustering (e.g., “Clustering Algorithms”, “Clustering of Large Data Sets”) for the user. Similarly, other topics related to data mining can be discovered to generate new profitability opportunities.
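The comparison against independence can be made concrete with a short sketch. The ratio computed below is commonly called "lift"; that name, the function name, and the topic counts are assumptions introduced here for illustration and do not come from the bookseller example.

```python
def cooccurrence_ratio(n_total, n_x, n_y, n_both):
    """Ratio of the observed co-occurrence count of X and Y to the count
    expected if X and Y were independent (often called "lift").

    n_total: number of transactions
    n_x, n_y: number of transactions containing X, respectively Y
    n_both: number of transactions containing both X and Y
    """
    if n_x == 0 or n_y == 0:
        raise ValueError("both items must occur at least once")
    # Under independence the expected joint count is n_x * n_y / n_total.
    return (n_both * n_total) / (n_x * n_y)


# Hypothetical counts: 10 transactions, "data mining" in 4, "clustering" in 4,
# both together in 3.
ratio = cooccurrence_ratio(n_total=10, n_x=4, n_y=4, n_both=3)
co_occur_more_than_expected = ratio > 1.0
```

A ratio above 1 suggests that the two topics appear together more often than independence would predict, which is the situation in which rules such as Data Mining → Clustering might be worth examining.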
As used herein, an “itemset” refers to a particular combination of items or attributes that co-occur in a database over a given time period. For example, if a user purchased the following books over the past year: “Object-Oriented Data Warehouse Design,” “Tools for statistical inference,” “Bayesian methods, Bayes, and Empirical Bayes: Methods for Data Analysis,” “A joke book on Windows,” “Indian cooking,” “Data mining in Java,” and “The Art of Japanese prints,” then every combination of these books would constitute a respective itemset. The “cardinality” of an itemset is the number of items in the itemset.
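As a minimal sketch of this definition, the following enumerates the itemsets of a given cardinality from a set of purchased items. The function name and the shortened book labels are hypothetical and used only for illustration.

```python
from itertools import combinations


def itemsets_of_cardinality(items, k):
    """Return every itemset of cardinality k that can be formed from `items`."""
    return [frozenset(combo) for combo in combinations(sorted(items), k)]


# Shortened, hypothetical labels standing in for three of the purchased books.
purchased = {"warehouse design", "statistical inference", "bayes methods"}

pairs = itemsets_of_cardinality(purchased, 2)

# Every non-empty combination, grouped by cardinality.
all_itemsets = [
    s
    for k in range(1, len(purchased) + 1)
    for s in itemsets_of_cardinality(purchased, k)
]
```

For n distinct items there are 2^n − 1 non-empty itemsets, so the seven books in the example above would yield 127 itemsets.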
An association rule is typically denoted by X → Y, where X and Y are itemsets. Association rules are not logical implications or actual rules but an encoding of information about associations within itemsets. In other words, association rules indicate the presence and strength of coupling of items within the itemsets, and how itemsets correlate among themselves, for a given set of transactions. From a statistical perspective, association rules locate the groups of itemsets that occur frequently. The co-occurrence of any particular itemset intuitively suggests that there may be some type (or form) of association or binding that holds these items and itemsets together. For market-basket analysis, the core idea is not to discover what type of relationship the transaction database encodes but to find those frequent associations among itemsets that may inherently exist within it.
Apriori algorithms are popular association rule mining techniques. Apriori algorithms use a “downward closure” property, which means that any subset of a frequent itemset is also frequent. For example, if an itemset ABC, which consists of A, B, and C, is found to be frequent, then all subsets of {A, B, C}, such as {A, B} for itemset AB and {B, C} for itemset BC, are also frequent. However, apriori algorithms are very computationally expensive.
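The downward closure property follows from the fact that every transaction containing an itemset also contains each of its subsets. A small sketch, using hypothetical single-letter items, a hypothetical transaction list, and an assumed "at least" threshold, illustrates this:

```python
from itertools import combinations


def support_count(itemset, transactions):
    """Count the transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= frozenset(t))


def all_subsets_frequent(itemset, transactions, min_count):
    """Check that every non-empty proper subset meets the support threshold."""
    items = sorted(frozenset(itemset))
    return all(
        support_count(sub, transactions) >= min_count
        for k in range(1, len(items))
        for sub in combinations(items, k)
    )


# Hypothetical transactions; each string is a set of single-letter items.
transactions = ["ABC", "ABC", "AB", "BC", "AD"]

count_abc = support_count("ABC", transactions)
count_ab = support_count("AB", transactions)
count_bc = support_count("BC", transactions)

# If ABC meets the threshold, each of its subsets necessarily does too.
closure_holds = all_subsets_frequent("ABC", transactions, min_count=count_abc)
```

The subset counts can never be smaller than the count of the full itemset, which is what allows candidates with an infrequent subset to be discarded without counting them.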
One example of an apriori algorithm for building association rules from transactional databases, to mine all association rules from a given database with respect to a set of minimal threshold measures, is disclosed in R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A. I. Verkamo, “Fast discovery of association rules”, Advances in Knowledge Discovery and Data Mining, U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, R. Uthurusamy (Eds.), AAAI Press, Menlo Park, Calif., 1996. Since the original proposal of association rules almost ten years ago, the fundamental process of deriving these rules has remained more or less the same, with much emphasis placed on finding better performing algorithms. In most of these algorithms, the construction of association rules follows two distinct steps: (1) extraction of frequent itemsets, to find out which itemsets dominate or influence the data; and (2) generation of association rules from the set of extracted frequent itemsets.
In this approach, the minimal threshold measures are called support and confidence. The support measure indicates the frequency of itemsets throughout the database. For example, if an itemset AB occurs in twenty (20) of fifty (50) transactions, then the support of itemset AB, or “Supp(AB)”, is computed as 20/50 or 0.40 in frequency-based terms. The confidence measure, on the other hand, indicates the weight of an association rule as the quotient of the support of the itemset that comprises all the components of the association rule and the support of a subset of that itemset. For example, the confidence of a rule AB → C is equal to Supp(ABC)/Supp(AB). These support (Supp) and confidence (Conf) measurements are used in most of the algorithms that extract frequent itemsets and generate association rules.
There is a long-felt need for improving the computational complexity of these algorithms. More specifically, a substantial amount of research has been directed to the efficient extraction of frequent itemsets. FIG. 9 illustrates one approach for extracting frequent itemsets from a database of transactions. This approach involves a number of passes, in which frequent itemsets are identified, first starting at a cardinality of 1 (i.e. one-item itemsets) and increasing the cardinality until no more frequent itemsets can be extracted. This approach concludes by taking the union of all the frequent itemsets found for each level of cardinality.
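The two measures can be sketched in a few lines. The function names and the fifty-transaction data set below are hypothetical; the data set is constructed so that itemset AB occurs in 20 of 50 transactions, matching the support example above, while the number of transactions that also contain C is an assumption.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Conf(X -> Y) = Supp(X union Y) / Supp(X)."""
    antecedent = frozenset(antecedent)
    whole = antecedent | frozenset(consequent)
    base = support(antecedent, transactions)
    if base == 0:
        raise ValueError("the antecedent never occurs in the transactions")
    return support(whole, transactions) / base


# Hypothetical data: 10 transactions {A, B, C}, 10 transactions {A, B},
# and 30 transactions {C}, for 50 transactions in total.
transactions = (
    [frozenset("ABC")] * 10 + [frozenset("AB")] * 10 + [frozenset("C")] * 30
)

supp_ab = support("AB", transactions)            # 20 of 50 transactions
conf_ab_c = confidence("AB", "C", transactions)  # Supp(ABC) / Supp(AB)
```

With this data, half of the transactions that contain AB also contain C, so the confidence of AB → C is one half of what it would be if C always accompanied AB.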
Specifically, at step 901, the first set of frequent itemsets L(1) is initialized to include all the single items that are adequately supported, e.g. those which exceed a predetermined support value. Step 901 constitutes the first pass, and subsequent passes are controlled by block 903, in which the cardinality parameter K is increased as long as there are frequent itemsets at the previous cardinality.
In each pass, starting at step 905, a set of candidate itemsets C(K) of cardinality K is generated based on the frequent itemsets of the previous cardinality L(K−1). One way to generate the candidate itemsets is to perform a union of two of the frequent itemsets, ignoring any duplicates and any results that have a cardinality larger than the current cardinality K. These candidate itemsets are then pruned by removing any itemset that has an infrequent subset.
Block 907 controls a loop for scanning the database of transactions. Each transaction T is fetched from the database, and each subset S of cardinality K of the items in the transaction T is processed in a loop controlled by block 909. If the subset S is among the candidates C(K), then the support count for the member S in candidates C(K) is incremented (step 911). To facilitate the lookup of S in candidates C(K), a hash tree or hash set data structure may be employed. After each of the subsets S has been processed, execution loops back to block 907, where another transaction T is fetched from the database and processed in steps 909 and 911.
After all the transactions T in the database have been processed, execution proceeds to step 913, where the frequent members of C(K), e.g. those which exceed a predetermined support value, are added to L(K). Execution then returns to block 903, where the cardinality is increased for another pass. If none of the members of C(K) is frequent enough (e.g. none exceeds the predetermined support value), then the loop controlled by block 903 terminates. The result is the union of the frequent itemsets L(K) for all the cardinalities that are processed.
The main shortcoming of this approach is performance. In particular, this approach requires multiple scans over all the transactions in the database. Scanning transactions in the database can be very slow because the database is often too large to be held in memory at one time. Even if the database is partitioned horizontally so that each partition can fit in memory, multiple scans over the transactions in the database are still needed. Sampling techniques such as Dynamic Itemset Counting (DIC) have been proposed to find itemsets using only a few passes over the database.
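The candidate generation and pruning of step 905 might be sketched as follows. The function name and the sample itemsets are hypothetical, and the pairwise-union join shown here is only one of several possible ways to form candidates.

```python
from itertools import combinations


def generate_candidates(prev_frequent, k):
    """Build C(K) from L(K-1): join pairs by union, then prune.

    prev_frequent: iterable of frozensets, each of cardinality k - 1.
    """
    prev = set(prev_frequent)
    candidates = set()  # a set silently ignores duplicate unions
    for a, b in combinations(prev, 2):
        union = a | b
        if len(union) != k:
            continue  # ignore results larger than the current cardinality
        # Prune: every subset of cardinality k - 1 must itself be frequent.
        if all(frozenset(sub) in prev for sub in combinations(union, k - 1)):
            candidates.add(union)
    return candidates


# Hypothetical L(2): AB, AC, BC and BD are frequent.
l2 = {frozenset("AB"), frozenset("AC"), frozenset("BC"), frozenset("BD")}
c3 = generate_candidates(l2, 3)
```

In this example ABD and BCD are produced by the join but pruned, because AD and CD are not among the frequent itemsets of cardinality 2; only ABC survives as a candidate.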
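Putting the passes together, the level-wise approach described above can be sketched as below. This is a minimal illustration, not the implementation shown in FIG. 9: the function name and sample data are hypothetical, the threshold is an absolute count, and "adequately supported" is assumed to mean a count of at least `min_count`.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_count):
    """Return {itemset: support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]

    # First pass: count single items and keep the supported ones as L(1).
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    level = {s for s, c in counts.items() if c >= min_count}
    result = {s: counts[s] for s in level}

    k = 2
    while level:  # continue while the previous cardinality had frequent itemsets
        # Generate C(K) from L(K-1) by pairwise union, then prune.
        candidates = set()
        for a, b in combinations(level, 2):
            union = a | b
            if len(union) == k and all(
                frozenset(sub) in level for sub in combinations(union, k - 1)
            ):
                candidates.add(union)

        # Scan every transaction and count the candidates it contains.
        counts = dict.fromkeys(candidates, 0)
        for t in transactions:
            for sub in combinations(t, k):
                sub = frozenset(sub)
                if sub in counts:  # hash lookup of S in C(K)
                    counts[sub] += 1

        # Keep the frequent candidates as L(K) and add them to the result.
        level = {s for s, c in counts.items() if c >= min_count}
        result.update((s, counts[s]) for s in level)
        k += 1

    return result


# Hypothetical transactions; each string is a set of single-letter items.
found = frequent_itemsets(["ABC", "ABC", "AB", "AC", "BD"], min_count=2)
```

Note that the transaction list is scanned once per cardinality, which is the source of the repeated database passes discussed next.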
Therefore, there is a crucial need for improving the performance of determining which sets of items are frequent enough to be used in generating association rules.