1. Field of the Invention
The present invention relates to an association rule generating method and a data mining system, and more particularly, to a method of generating association rules from a data stream, which is an unbounded data set composed of continuously generated transactions, and to a data mining system for generating association rules from a data stream.
2. Description of the Related Art
In general, in a data set to be subjected to data mining, all the unit information items appearing in an application domain are defined as unit items, and a set of unit information items having semantic synchrony in the application domain (that is, semantically generated at the same time) is defined as a transaction. The transaction has information of unit items having semantic synchrony, and a data set to be analyzed by data mining is defined by a set of transactions generated in a corresponding application domain.
When a set I of items is given, an association rule is represented, for example, in the form X→Y (X⊂I and Y⊂I). The association rule indicates the semantic relationship between the items of a data set. That is, when an itemset X appears in a transaction, the association rule predicts that another itemset Y also appears in the transaction with high probability. For a set of transactions, the support of an association rule X→Y is the fraction of transactions that contain both X and Y. The confidence of the association rule X→Y is the ratio of the number of transactions containing both X and Y to the number of transactions containing X. The strength of an association rule is judged against two thresholds, a minimum support Smin and a minimum confidence Cmin, which are disclosed in <R. Agrawal, T. Imielinski, and A. Swami, “Mining Association Rules between Sets of Items in Very Large Databases” Proc. ACM SIGMOD Conf. Management of Data, pp. 207-216, May 1993>.
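The two measures defined above can be illustrated with a minimal sketch. The function names and the four sample transactions are illustrative assumptions, not part of the cited work.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(transactions, x, y):
    """Transactions containing both X and Y, divided by those containing X."""
    x, y = frozenset(x), frozenset(y)
    count_x = sum(1 for t in transactions if x <= t)
    count_xy = sum(1 for t in transactions if (x | y) <= t)
    return count_xy / count_x if count_x else 0.0


# Hypothetical sample data: four transactions over the items a, b, c.
transactions = [frozenset(t) for t in ("abc", "ab", "ac", "bc")]

# Support of the rule a -> b: the fraction of transactions containing {a, b}.
rule_support = support(transactions, {"a", "b"})
# Confidence of a -> b: two of the three transactions containing a also contain b.
rule_confidence = confidence(transactions, {"a"}, {"b"})
```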
In general, when the minimum support Smin and the minimum confidence Cmin are given, association rules are generated through the following two steps. In the first step, all the itemsets whose supports are greater than or equal to Smin are found. These itemsets are called frequent itemsets. Subsequently, in the second step, it is examined whether every non-empty proper subset of each frequent itemset can be the antecedent of an association rule. That is, for a frequent itemset e and one of its non-empty proper subsets q, an association rule q→e−q is generated only when S(e)/S(q)≧Cmin, where S(·) denotes the support of an itemset. The major bottleneck of association rule mining is the first step. Therefore, most research concentrates on devising an efficient method of finding frequent itemsets in a data set.
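The two steps can be sketched as follows. This is a brute-force illustration under assumed names (`frequent_itemsets`, `generate_rules`) and assumed sample data. It enumerates every candidate itemset, which is only practical for tiny examples and is not how an efficient miner performs the first step.

```python
from itertools import combinations


def frequent_itemsets(transactions, s_min):
    """Step 1 (brute force): supports of all itemsets with support >= s_min."""
    items = sorted(set().union(*transactions))
    n = len(transactions)
    result = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            e = frozenset(candidate)
            s = sum(1 for t in transactions if e <= t) / n
            if s >= s_min:
                result[e] = s
    return result


def generate_rules(supports, c_min):
    """Step 2: emit q -> e - q whenever S(e) / S(q) >= c_min."""
    rules = []
    for e, s_e in supports.items():
        for size in range(1, len(e)):          # non-empty proper subsets only
            for q in combinations(sorted(e), size):
                q = frozenset(q)
                # Every subset of a frequent itemset is frequent, so q is present.
                conf = s_e / supports[q]
                if conf >= c_min:
                    rules.append((q, e - q, conf))
    return rules


# Hypothetical sample data.
transactions = [frozenset(t) for t in ("abc", "ab", "ac", "bc")]
freq = frequent_itemsets(transactions, s_min=0.5)
rules = generate_rules(freq, c_min=0.6)
```

With this sample, the three single items and the three pairs are frequent, and each pair yields two rules with confidence 2/3.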
Apriori is a well-known algorithm for finding frequent itemsets from a finite set of transactions, which has been proposed in <R. Agrawal, T. Imielinski, and A. Swami, “Mining Association Rules between Sets of Items in Very Large Databases” Proc. ACM SIGMOD Conf. Management of Data, pp. 207-216, May 1993>. The Apriori algorithm is a multi-pass algorithm, so it needs up to n+1 scans on a data set when the maximal cardinality of a frequent itemset is n. To reduce the number of scans over the transaction information, algorithms such as the following have been proposed: DIC <S. Brin, R. Motwani, J. D. Ullman, and S. Tsur. Dynamic Itemset Counting and Implication Rules for Market Basket Data. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 255-264, 1997>; and Partition <A. Savasere, E. Omiecinski, and S. Navathe. An Efficient Algorithm for Mining Association Rules in Large Databases. In Proceedings of the 21st International Conference on Very Large Data Bases, pp. 432-444, 1995>. In an environment in which data sets grow gradually, it is more efficient to use one of the incremental algorithms, such as BORDERS <Y. Aumann, R. Feldman, O. Lipshtat, and H. Mannila. Borders: An efficient algorithm for association generation in dynamic databases. In Journal of Intelligent Information System, Vol. 12, No. 1, pages 61-73, 1999> and DEMON <V. Ganti, J. Gehrke, and R. Ramakrishnan. DEMON: Mining and monitoring evolving data. In Proc. of the 16th Int'l Conference on Data Engineering, pages 439-448, San Diego, Calif., February 2000>. These incremental algorithms focus on efficiently reusing the previous mining result of a data set when finding the up-to-date mining result. However, since the above algorithms need to scan a large volume of data and to manage each transaction information item, they are not suitable for finding the frequent itemsets of a data stream.
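To illustrate why a multi-pass, level-wise search such as Apriori may need up to n+1 scans, the following minimal sketch counts the passes it makes over the data. The function name, the scan counter, and the sample data are assumptions made for illustration. It is a simplified level-wise search, not a reproduction of the cited algorithm.

```python
from itertools import combinations


def apriori(transactions, s_min):
    """Level-wise search; returns (frequent itemset counts, number of scans)."""
    min_count = s_min * len(transactions)

    # Scan 1: count the single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    scans = 1
    level = {c for c, v in counts.items() if v >= min_count}
    frequent = {c: counts[c] for c in level}

    k = 2
    while level:
        # Join frequent (k-1)-itemsets, then keep only candidates whose
        # (k-1)-subsets are all frequent.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        candidates = {c for c in candidates
                      if all(frozenset(s) in level
                             for s in combinations(c, k - 1))}
        if not candidates:
            break
        # One further scan of the data set per candidate size.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        scans += 1
        level = {c for c, v in counts.items() if v >= min_count}
        frequent.update({c: counts[c] for c in level})
        k += 1
    return frequent, scans


# Hypothetical sample data: the largest frequent itemset has 2 items.
transactions = [frozenset(t) for t in ("abc", "ab", "ac", "bc")]
frequent, scans = apriori(transactions, s_min=0.5)
```

In this sample the third scan is spent rejecting the candidate {a, b, c}, so the search uses n+1 = 3 scans for a maximal frequent cardinality of n = 2.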
For the second step of association rule mining, an online mining algorithm is proposed in <Charu C. Aggarwal, Philip S. Yu: A New Approach to Online Generation of Association Rules. IEEE Trans. Knowl. Data Eng. 13(4): 527-540, 2001>. Typically, a user is interested in only a few association rules and needs to run a query multiple times in order to find appropriate levels of Smin and Cmin. A directed acyclic graph, called an adjacency lattice, is composed of a set of all frequent itemsets in order to avoid redundancy. An approach similar to OLAP (online analytical processing) is employed for the on-line mining of association rules. However, these approaches for a finite set of transactions need to manage each transaction information item and to scan the data sets multiple times. Therefore, they are not suitable for finding frequent itemsets of a data stream.
A data stream is defined as an infinite set of data that is continuously generated at a rapid rate. Therefore, it is difficult to store all of its elements in a limited storage space. Considering this characteristic, the following requirements should be satisfied in order to extract knowledge from a data stream. First, the mining result should be generated with only one read of each transaction information item of the data stream. Second, the memory space used for data stream analysis should remain bounded even though new data elements are continuously generated in the data stream. Third, newly generated data elements should be processed as fast as possible. Finally, the up-to-date analysis result of a data stream should be provided instantly upon request. To satisfy these requirements, data stream mining methods generally sacrifice the correctness of their analysis by allowing some errors.
Recently, various algorithms have been actively proposed to find semantic knowledge from a data stream. Among these algorithms, the sticky sampling method and the Lossy Counting algorithm (see G. S. Manku and R. Motwani. Approximate Frequency Counts over Data Streams. In Proc. of the 28th VLDB, pp. 346-357, 2002) and the estDec method (J. H. Chang and W. S. Lee. Finding recent frequent itemsets adaptively over online data streams. In Proc. of the 9th ACM SIGKDD, pp. 487-492, 2003) focus on finding frequent itemsets in a data stream. The Lossy Counting algorithm is a representative deterministic algorithm, and finds a set of frequent itemsets generated from a data set when a minimum support and a maximum allowable error condition are given. The Lossy Counting algorithm manages, in memory, the counts of the possible frequent itemsets generated in each transaction forming the data stream together with their errors, and stores newly generated transactions in a fixed-size buffer in the main memory. The stored transactions are batch-processed together. For the transactions stored in the buffer, the count of each unit item is updated, all the possible candidate items are generated from the transactions stored in the buffer, and the counts of the items are updated. For new possible frequent items, a maximum error that can be included in the corresponding item is estimated in consideration of the number of transactions generated previously, and the frequent items are managed together.
In this algorithm, the number of transactions that can be batch-processed is proportional to the size of the buffer. Therefore, as the size of the buffer increases, processing efficiency increases. However, the memory space required to search the frequent itemsets also increases as a consequence. In addition, in order to update the count of an itemset or obtain the mining result, all the itemsets managed in a secondary storage unit should be searched, which may result in a long mining time. Therefore, this algorithm is not suitable for mining in an on-line data stream environment in which the mining result may be requested frequently and at any time.
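The bookkeeping described above (a count and a maximum error per monitored itemset, plus a fixed-size buffer that is batch-processed) can be sketched as follows. This is a simplified illustration under stated assumptions: one buffer holds exactly one bucket of ⌈1/ε⌉ transactions, candidate itemsets are limited to `max_size` items, everything is kept in memory, and transactions still waiting in the buffer are not reflected in the result. The class and parameter names are hypothetical.

```python
from itertools import combinations
from math import ceil


class LossyCounter:
    def __init__(self, epsilon, max_size=2):
        self.epsilon = epsilon              # maximum allowable error
        self.width = ceil(1 / epsilon)      # transactions per bucket (buffer size)
        self.max_size = max_size            # assumed cap on candidate itemset size
        self.n = 0                          # transactions processed so far
        self.entries = {}                   # itemset -> [count, max_error]
        self.buffer = []

    def add(self, transaction):
        self.buffer.append(frozenset(transaction))
        if len(self.buffer) == self.width:
            self._process_buffer()

    def _process_buffer(self):
        self.n += len(self.buffer)
        bucket = self.n // self.width       # id of the bucket just completed
        # Count every candidate itemset occurring in the buffered batch.
        batch = {}
        for t in self.buffer:
            for size in range(1, self.max_size + 1):
                for combo in combinations(sorted(t), size):
                    key = frozenset(combo)
                    batch[key] = batch.get(key, 0) + 1
        self.buffer.clear()
        for key, c in batch.items():
            if key in self.entries:
                self.entries[key][0] += c
            else:
                # A new entry may have been missed in every earlier bucket.
                self.entries[key] = [c, bucket - 1]
        # Drop entries whose count plus maximum error cannot exceed the bucket id.
        self.entries = {k: v for k, v in self.entries.items()
                        if v[0] + v[1] > bucket}

    def frequent(self, s_min):
        threshold = (s_min - self.epsilon) * self.n
        return {k: f for k, (f, _) in self.entries.items() if f >= threshold}


# Hypothetical stream of eight transactions, processed in buffers of four.
counter = LossyCounter(epsilon=0.25)
for t in ("ab", "ab", "ab", "c", "ab", "ab", "ab", "d"):
    counter.add(t)
result = counter.frequent(s_min=0.5)
```

A larger buffer amortizes candidate generation over more transactions but holds more of them in memory at once, which is the trade-off noted above.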
The estDec method has been proposed to minimize the number of itemsets that must be monitored while finding frequent itemsets over an online data stream.
In the estDec method, an itemset is regarded as a significant itemset if its current support is greater than or equal to a predetermined threshold value Ssig (Ssig<Smin). A prefix tree structure is employed to trace the current count of every significant itemset in the memory. Each significant itemset is represented by a node of the prefix tree. The total number of itemsets monitored in the memory is minimized by two major operations: delayed-insertion and pruning. Delayed-insertion postpones the insertion of a new itemset appearing in new transactions until the itemset becomes significant enough to be monitored. Pruning removes a monitored itemset when the itemset turns out to be insignificant.
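The two operations can be illustrated with a heavily simplified sketch. It is not the estDec method itself: it uses a flat dictionary instead of a prefix tree, applies no decay, always keeps single items, and estimates the count of a new itemset as the smallest count among its monitored subsets. These choices, and all names below, are assumptions made for illustration.

```python
from itertools import combinations


class SignificantItemsetMonitor:
    def __init__(self, s_sig, max_size=3):
        self.s_sig = s_sig          # significance threshold Ssig
        self.max_size = max_size    # assumed cap on monitored itemset size
        self.n = 0                  # transactions seen so far
        self.counts = {}            # monitored itemset -> (estimated) count

    def add(self, transaction):
        t = frozenset(transaction)
        self.n += 1
        # Update the counts of monitored itemsets contained in the transaction.
        for itemset in [k for k in self.counts if k <= t]:
            self.counts[itemset] += 1
        # Pruning: drop multi-item itemsets that are insignificant or that
        # have lost a subset (smaller itemsets are checked first).
        for itemset in sorted(self.counts, key=len):
            if len(itemset) == 1:
                continue
            insignificant = self.counts[itemset] / self.n < self.s_sig
            orphaned = any(itemset - {i} not in self.counts for i in itemset)
            if insignificant or orphaned:
                del self.counts[itemset]
        # Single items are monitored from their first occurrence (assumption).
        for item in t:
            self.counts.setdefault(frozenset([item]), 1)
        # Delayed insertion: start monitoring a larger itemset only when its
        # estimated support reaches Ssig.
        for size in range(2, self.max_size + 1):
            for combo in combinations(sorted(t), size):
                key = frozenset(combo)
                if key in self.counts:
                    continue
                subsets = [key - {i} for i in key]
                if all(s in self.counts for s in subsets):
                    estimate = min(self.counts[s] for s in subsets)
                    if estimate / self.n >= self.s_sig:
                        self.counts[key] = estimate


# Hypothetical stream: {a, b} is inserted late and pruned again afterwards.
monitor = SignificantItemsetMonitor(s_sig=0.55)
ab = frozenset("ab")
for t in ("a", "b", "c", "ab"):
    monitor.add(t)
delayed = ab not in monitor.counts        # estimated support 2/4 is below Ssig
monitor.add("ab")
inserted_count = monitor.counts.get(ab)   # estimated support 3/5 reaches Ssig
monitor.add("c")
pruned = ab not in monitor.counts         # support 3/6 falls below Ssig
```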
As disclosed in Ahmed Metwally, Divyakant Agrawal, Amr El Abbadi. Using Association Rules for Fraud Detection in Web Advertising Network. In Proc. of the 31st international conference on Very large data bases, August 2005, a simplified association rule between two items over a data stream is introduced for fraud detection in web advertising networks. To define an association rule x→y between two items x and y, their conditional frequency is continuously monitored over a data stream. The conditional frequency is the occurrence count of a pair (x, y), that is, the number of times the item x is followed by the item y within a predetermined maximum span δ. A unique-count technique has been proposed to count the conditional frequencies of all the distinct pairs of items efficiently over a data stream. To the best of our knowledge, a general algorithm for generating association rules over a data stream has not been addressed before. A conventional two-step approach has been applied to an online data stream so that all the association rules can be generated at any time after all the up-to-date frequent itemsets are extracted. This approach requires additional memory space for temporarily storing information on the supports of all the frequent itemsets. Furthermore, it is not efficient for tracing the on-going changes of association rules over an on-line data stream.
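The conditional frequency of a pair (x, y) can be illustrated with a naive sliding-window sketch. It is not the unique-count technique of the cited paper. It assumes a stream of single items, measures the span δ as a number of preceding stream elements, counts each arrival of y once per distinct preceding item x, and ignores pairs with x equal to y. The function name and sample stream are hypothetical.

```python
from collections import Counter, deque


def conditional_frequencies(stream, max_span):
    """Count how often item x is followed by item y within `max_span` elements."""
    window = deque(maxlen=max_span)     # the last `max_span` items seen
    pair_counts = Counter()
    for y in stream:
        for x in set(window):           # each distinct predecessor counts once
            if x != y:
                pair_counts[(x, y)] += 1
        window.append(y)
    return pair_counts


# Hypothetical stream of five single-item events.
pair_counts = conditional_frequencies("abacb", max_span=2)
```

Here the pair (a, b) is counted twice, once for the second element and once for the last, while (c, a) never occurs within the span.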