1.1 Field of the Invention
The present invention relates generally to a method, system and program product for uncovering relationships or association rules between items in large databases.
1.2 Description and Disadvantages of Prior Art
Data mining is an emerging technical area whose goal is to extract significant patterns or interesting rules from large databases; in general, the area of data mining comprises all methods that are applicable to extracting “knowledge” from large amounts of existing data. The whole process is known as knowledge discovery in databases. Finding association rules is one task for which data mining methods have been developed.
Association rule mining was introduced by Agrawal et al. (refer for instance to R. Agrawal and R. Srikant, Fast algorithms for mining association rules, in Proc. 20th VLDB Conf., September 1994) and was motivated by shopping basket analysis. The rules were generated to find out which articles or items in a shop are bought together. More generally, association rules can be used to discover dependencies among attribute values of records in a database. More specifically, basket data usually consists of a record per customer with a transaction date, along with the items bought by the customer. An example of an association rule over such a database could be that 80% of the customers that bought bread and milk also bought eggs. The data mining task for association rules can be broken into two steps. The first step consists of finding all the sets of items, called itemsets, that occur in the database with at least a certain user-specified frequency, called the minimum support. Such itemsets are called large itemsets. An itemset of k items is called a k-itemset. The second step consists of forming implication rules among the large itemsets found in the first step.
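The two quantities used above, the frequency of an itemset and the strength of a rule such as "bread and milk implies eggs", can be illustrated with a minimal Python sketch. The basket data and the function names `support` and `confidence` are hypothetical and chosen only for illustration; they are not taken from the cited document.

```python
from typing import FrozenSet, Iterable, List

# Hypothetical toy basket data: one set of items per customer transaction.
TRANSACTIONS: List[FrozenSet[str]] = [
    frozenset({"bread", "milk", "eggs"}),
    frozenset({"bread", "milk", "eggs"}),
    frozenset({"bread", "milk", "eggs"}),
    frozenset({"bread", "milk", "eggs"}),
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"milk", "eggs"}),
    frozenset({"beer", "chips"}),
    frozenset({"eggs"}),
    frozenset({"bread"}),
]


def support(itemset: Iterable[str], transactions: List[FrozenSet[str]]) -> float:
    """Relative frequency of transactions that contain every item of `itemset`."""
    items = frozenset(itemset)
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def confidence(antecedent: Iterable[str], consequent: Iterable[str],
               transactions: List[FrozenSet[str]]) -> float:
    """Fraction of transactions containing `antecedent` that also contain `consequent`."""
    base = support(antecedent, transactions)
    if base == 0:
        return 0.0  # the rule is undefined when the antecedent never occurs
    both = support(frozenset(antecedent) | frozenset(consequent), transactions)
    return both / base


if __name__ == "__main__":
    print(support({"bread", "milk"}, TRANSACTIONS))
    print(confidence({"bread", "milk"}, {"eggs"}, TRANSACTIONS))
```

In this toy data, five of the ten baskets contain bread and milk, and four of those five also contain eggs, which corresponds to the 80% rule mentioned in the text.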
Several algorithms have been developed to generate association rules efficiently. The well-known and very successful APRIORI algorithm has been disclosed by Agrawal et al., for instance in the above-mentioned document. The most important value by which association rules are measured is the support value, which is the relative frequency of occurrence of one item, or of several items together, in one rule.
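The level-wise idea behind APRIORI can be sketched as follows. This is a simplified illustration and not the optimized algorithm of the cited document: candidate k-itemsets are built from the large (k-1)-itemsets, candidates with a non-large subset are pruned, and the remaining candidates are counted in one pass over the data. The function name `apriori_frequent_itemsets` and the sample data are hypothetical.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Set


def apriori_frequent_itemsets(transactions: List[FrozenSet[str]],
                              min_support: float) -> Dict[FrozenSet[str], float]:
    """Return every itemset whose support is at least `min_support`."""
    n = len(transactions)

    # Level 1: count single items in one pass over the data.
    counts: Dict[FrozenSet[str], int] = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c / n for s, c in counts.items() if c / n >= min_support}
    frequent = dict(current)

    k = 2
    while current:
        prev: Set[FrozenSet[str]] = set(current)
        # Join step: unions of two large (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be large.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in prev for sub in combinations(c, k - 1))
        }
        # Counting step: one further pass over the data per level.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        current = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        frequent.update(current)
        k += 1
    return frequent


if __name__ == "__main__":
    db = [
        frozenset({"bread", "milk", "eggs"}),
        frozenset({"bread", "milk", "eggs"}),
        frozenset({"bread", "milk"}),
        frozenset({"milk", "eggs"}),
        frozenset({"bread"}),
    ]
    for itemset, sup in sorted(apriori_frequent_itemsets(db, 0.5).items(),
                               key=lambda kv: (len(kv[0]), sorted(kv[0]))):
        print(sorted(itemset), sup)
```

Note that the sketch makes one pass over the transactions per level, which is the repeated scanning that becomes expensive for very large databases.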
Today, generating association rules for very large databases (several million records and above) can be extremely time consuming. Many algorithms proposed for data mining of association rules make repeated passes over the database to determine the commonly occurring itemsets (or sets of items). For large databases, the I/O overhead of scanning the database can be extremely high. Processing time is required not only for executing the mining algorithms themselves; a lot of time is also spent in the preprocessing steps. This includes the processing time for importing the data and for transforming the data into the form required by the algorithm. This preparation can take several hours of expensive CPU time even on large MVS systems.
To improve performance, it has been suggested not to use the whole database for the generation of association rules, but to draw a sample and generate the association rules on that basis. This teaching has been introduced by H. Toivonen, Sampling Large Databases for Association Rules, Proceedings of the 22nd VLDB Conference, Mumbai (Bombay), India, 1996, as well as by Zaki, M. J., Parthasarathy, S., Li, W., Ogihara, M., Evaluation of Sampling for Data Mining of Association Rules, Computer Science Department, Technical Report 617, University of Rochester (1996).
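The basic sampling idea can be illustrated with a short Python sketch that estimates the support of an itemset from a random sample instead of scanning all records. The synthetic database, the function names, and the choice of simple random sampling without replacement are assumptions made for illustration; the cited publications may use a different sampling scheme.

```python
import random
from typing import FrozenSet, Iterable, List


def support(itemset: Iterable[str], transactions: List[FrozenSet[str]]) -> float:
    """Relative frequency of transactions that contain every item of `itemset`."""
    items = frozenset(itemset)
    return sum(1 for t in transactions if items <= t) / len(transactions)


def estimate_support(itemset: Iterable[str], transactions: List[FrozenSet[str]],
                     sample_size: int, seed: int = 0) -> float:
    """Estimate the support of `itemset` from a simple random sample."""
    rng = random.Random(seed)  # fixed seed so that the sketch is reproducible
    sample = rng.sample(transactions, sample_size)  # sampling without replacement
    return support(itemset, sample)


if __name__ == "__main__":
    # Hypothetical database: every second transaction contains bread and milk.
    database = [
        frozenset({"bread", "milk"}) if i % 2 == 0 else frozenset({"eggs"})
        for i in range(100_000)
    ]
    exact = support({"bread", "milk"}, database)
    estimated = estimate_support({"bread", "milk"}, database, sample_size=2_000)
    print(f"exact={exact:.4f} estimated={estimated:.4f}")
```

The estimate is computed from 2,000 records rather than 100,000, at the cost of a sampling error whose size depends on the sample size.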
Toivonen described an algorithm for detecting “exact” association rules (that is, rules not based on a sample). Within this teaching, sampling is used only for the precalculation of the support values of the rules as one step of the algorithm; Toivonen is completely silent on the idea of data mining for “estimated” (approximate) association rules based on a sample. Toivonen also disclosed necessary bounds for sample sizes. Using a univariate approach, the support value of an arbitrary association rule is estimated. Toivonen calculated the probability that the error between the true support value and the estimated support value exceeds a given threshold by using the binomial distribution and applying Chernoff bounds. From this, a formula for a sufficient sample size was derived.
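A univariate bound of this kind can be sketched as follows. The exact formula is not reproduced in this text, so the sketch assumes the commonly quoted additive Chernoff (Hoeffding-type) form n >= ln(2/delta) / (2 * epsilon^2), where epsilon is the tolerated absolute error of the estimated support and delta is the allowed probability of exceeding it. The function name `sufficient_sample_size` is hypothetical.

```python
import math


def sufficient_sample_size(epsilon: float, delta: float) -> int:
    """Sample size n with P(|estimated support - true support| > epsilon) <= delta.

    Assumes the additive Chernoff (Hoeffding-type) bound for one itemset:
        n >= ln(2 / delta) / (2 * epsilon**2)
    """
    if not (0.0 < epsilon < 1.0 and 0.0 < delta < 1.0):
        raise ValueError("epsilon and delta must lie strictly between 0 and 1")
    return math.ceil(math.log(2.0 / delta) / (2.0 * epsilon ** 2))


if __name__ == "__main__":
    for eps in (0.01, 0.005, 0.001):
        print(eps, sufficient_sample_size(eps, delta=0.01))
```

Under this assumed form, epsilon = 0.01 and delta = 0.01 give 26,492 records, while epsilon = 0.001 requires more than 2.6 million records. The bound grows with 1/epsilon^2 and does not depend on the size of the database.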
Zaki et al. took up this idea and published these bounds for approximate association rules generated under sampling. These bounds were also calculated using the univariate approach suggested by Toivonen, including Chernoff bounds. These investigations showed that the bounds are not very efficient, since the required sample size can be very large. As shown by Zaki et al., the required sample size can even become greater than the original database (!). Thus the current state-of-the-art teaching is completely unsatisfactory and cannot actually be applied to real-world problems.
Therefore, in principle, data mining for association rules based on samples would make it possible to save processing time in the preprocessing step as well as in the analysis phase. The fundamental problem that arises, however, is the accuracy of the generated association rules. If the sample is suitably chosen, it is possible to estimate the error introduced by this approach. This error can be controlled by calculating sufficiently large sample sizes. Currently, however, it is completely unclear how to determine reasonable sample sizes.
1.3 Objective of the Invention
The invention is based on the objective of improving the performance of the technologies for data mining of association rules.