1. Field of the Invention
The present invention relates to a method for finding specific information by analyzing a large volume of data and to a method for finding frequent itemsets in a data mining system realized using the same. More particularly, it relates to a method for finding frequent itemsets in real time in an indefinite data set that continuously accumulates newly generated data as time goes by (hereinafter referred to as “a data stream”). Here, the support of an itemset is defined as the ratio of the number of transactions in which the itemset appears to the total number of transactions constituting the data set, and a frequent itemset is an itemset whose support is larger than a previously defined support threshold.
2. Description of Related Art
In a data set that is the object of data mining, a unit of information that appears in an application is generally defined as an item, and a group of items that have significant concurrency in the application (i.e., that meaningfully appear at the same time) is defined as a transaction. A transaction thus holds the items that have such concurrency, and the data set for data mining is defined as the set of transactions that appear in the corresponding application.
Conventional methods for finding frequent itemsets fix the data set at the point of time of the data mining analysis and find frequent itemsets in that fixed data set. Since the data set is fixed, only the information accumulated up to a specific point of time is the object of the analysis. However, the information contained in transactions newly generated as time goes by may change, and if new data are generated continuously, a mining result obtained earlier soon loses its usefulness. To obtain an accurate result covering the newly generated data, the mining operation must therefore be carried out again over the total data set, including both the previously mined data and the transactions generated subsequently. In general, since the conventional mining methods require longer operation time and more computing capability as the data sets become larger, they cannot provide the mining result in real time.
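The cost of re-mining a fixed data set can be illustrated with a minimal sketch. The code below is an illustration only and is not one of the cited methods: the function names `support` and `frequent_itemsets`, the `max_len` cap, and the sample transactions are all assumptions introduced here. It computes the support of every candidate itemset by scanning all transactions, so each newly arrived transaction forces a complete rescan.

```python
from itertools import combinations


def support(itemset, transactions):
    """Ratio of transactions containing every item of `itemset`."""
    target = set(itemset)
    hits = sum(1 for t in transactions if target <= set(t))
    return hits / len(transactions)


def frequent_itemsets(transactions, min_support, max_len=3):
    """Brute-force search over a fixed data set (hypothetical helper).

    Every candidate itemset up to `max_len` items is checked against
    all transactions, which is why the cost grows with the data set.
    """
    items = sorted({i for t in transactions for i in t})
    result = {}
    for n in range(1, max_len + 1):
        for combo in combinations(items, n):
            s = support(combo, transactions)
            if s > min_support:  # "larger than" the support threshold
                result[combo] = s
    return result


# Illustrative data set of four transactions.
data = [["a", "b"], ["a", "c"], ["a", "b", "c"], ["b"]]
before = frequent_itemsets(data, min_support=0.45)

# One new transaction arrives: the whole data set must be mined again,
# and the set of frequent itemsets may change.
data.append(["b", "c"])
after = frequent_itemsets(data, min_support=0.45)
```

In this sketch the pairs that were frequent over the first four transactions are no longer frequent once the fifth transaction is added, which is the kind of change a fixed-data-set method can only detect by repeating the full scan.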
Finding frequent itemsets means finding all itemsets whose support, i.e., the ratio of the number of transactions in which the itemset appears to the total number of transactions in the data set, is larger than a specific support threshold. When the data sets grow intermittently, incremental mining methods are required. Among the various data structures proposed for finding frequent itemsets, the itemset tree structure disclosed in “A. Hafez, J. Deogun, and V. V. Raghavan, The Item-Set Tree: A data Structure for Data Mining, In proc. Of 1st int'l conf on datawarehousing and knowledge discovery, pages 183˜192, August 1999” reduces the memory usage of incremental mining and basically stores and manages all transactions in its nodes. When a new transaction is generated, the itemset tree is updated through the following two steps. The first step is to generate a node for the new itemset, and the second step is to update the frequency counts of the nodes affected by the new itemset. When the tree is searched to generate a node for a new itemset, if a common itemset is found by comparing the itemsets held in the nodes of the itemset tree with the itemsets of the newly generated transaction, the common itemset is shared as an upstream node and the remaining itemsets are generated as downstream nodes. Since the respective nodes in the itemset tree manage exact frequency counts, the frequency counts are updated in the second step by searching the whole tree. The itemset tree can reduce the memory usage effectively by sharing nodes when processing a large amount of data. However, since the whole tree must be searched to update the frequency counts of the respective nodes, the process takes a long time. Moreover, since the information on all generated transactions must be accumulated in memory, the structure provides no function for dynamically adjusting the size of the itemset tree. Due to such drawbacks, the itemset tree structure is not suitable for finding frequent itemsets over an online data stream, which requires mining results in real time.
Methods for finding frequent itemsets over data streams include the Count Sketch algorithm proposed in “M. Charikar, K. Chen, and M. Farach-Colton, Finding Frequent Items in Data Streams, Proc. 29th Int'l Colloq. Automata, Language and Programming, 2002” and the Lossy Counting algorithm proposed in “G. S. Manku and R. Motwani, Approximate Frequency Counts over Data Streams, Proc. 28th Int'l Conf. Very Large Data Bases (VLDB 02), 2002”. However, since it is impossible to maintain the information of all previously generated transactions, both algorithms include some errors in the frequent itemsets or the frequency counts acquired as mining results. The Count Sketch algorithm focuses on finding frequent items over data streams: it generates a set of items that meet or exceed the threshold by estimating the frequency counts of the items in the transactions generated so far. By contrast, the Lossy Counting algorithm finds the frequent itemsets generated in the data stream, given a minimum support and a maximum allowable error. The transactions generated in the data stream are filled into fixed-size buffers maintained in main memory and are batch processed buffer by buffer, while the structure managing the frequency counts of the respective itemsets is maintained in an auxiliary storage device. The frequency counts are updated for the transactions filled in the buffers, and each new possibly frequent itemset is managed together with an estimate of the maximum error that its count may contain, derived from the number of transactions generated previously. The Lossy Counting algorithm is influenced by the size of the buffer. If the buffer is large, a large number of transactions can be batch processed at once, thus reducing the number of data operations. However, since a large buffer requires a large amount of memory, the size of the buffer must be adjusted appropriately.
Although the Lossy Counting algorithm includes some errors in its mining results, it can reduce the memory usage and find results with a single scan during the mining process, which is useful for data stream mining. However, since the Lossy Counting algorithm processes transactions in units of the buffer, it is inefficient over an online data stream when mining results must be obtained promptly at a certain point of time.
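The counting and error-bound idea described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the algorithm as published or as implemented in any product: it counts single items rather than itemsets, it keeps the count structure in an in-memory dictionary rather than in an auxiliary device, and the class name `LossyCounter` and its method names are hypothetical. The stream is divided into buckets of width ceil(1/epsilon); each entry stores its count together with the maximum error it may have missed, and low-count entries are dropped at each bucket boundary.

```python
import math


class LossyCounter:
    """Simplified single-item lossy counting (illustrative sketch)."""

    def __init__(self, epsilon):
        self.epsilon = epsilon                 # maximum allowable error
        self.width = math.ceil(1 / epsilon)    # bucket width
        self.n = 0                             # elements seen so far
        self.entries = {}                      # item -> [count, max_error]

    def add(self, item):
        self.n += 1
        bucket = math.ceil(self.n / self.width)  # current bucket id
        if item in self.entries:
            self.entries[item][0] += 1
        else:
            # The item may have occurred up to (bucket - 1) times before
            # it started being tracked; record that as its maximum error.
            self.entries[item] = [1, bucket - 1]
        if self.n % self.width == 0:
            # Bucket boundary: drop entries that cannot be frequent.
            self.entries = {
                k: v for k, v in self.entries.items()
                if v[0] + v[1] > bucket
            }

    def frequent(self, min_support):
        """Items whose stored count is at least (s - epsilon) * n."""
        threshold = (min_support - self.epsilon) * self.n
        return {k: v[0] for k, v in self.entries.items()
                if v[0] >= threshold}


counter = LossyCounter(epsilon=0.25)
for element in ["a", "b", "a", "c", "a", "d", "a", "e"]:
    counter.add(element)
result = counter.frequent(min_support=0.5)
```

Note that entries are removed only at bucket boundaries, which mirrors the batch-oriented behavior that makes the approach less suited to producing results at an arbitrary point of time.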
The estDec method for finding frequent itemsets over online data streams, disclosed in “J. H. Chang, W. S. Lee, Finding recent frequent itemsets adaptively over online data streams, In Proc. Of the 9th ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining, Washington, D.C., August 2003. (CIKM 01), pp. 263-270, 2001”, differs from the Lossy Counting algorithm in that the transactions constituting a data stream are processed as soon as they are generated. The estDec method uses a prefix lattice tree, proposed in “S. Brin, R. Motwani, J. D. Ullman, and S. Tsur, Dynamic Itemset Counting and Implication Rules for Market Basket Data, Proc. ACM SIGMOD Int'l Conf. Management of Data (SIGMOD 97), pp. 255-264, 1997” and “M. J. Zaki, Generating Non-Redundant Association Rules, Proc. 6th ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD 00), pp. 34-43, 2000”, and manages in memory only those itemsets that are likely to be frequent itemsets, through delayed insertion and pruning operations on the prefix lattice tree. In the estDec method, an itemset that appears in the data stream is managed in the prefix tree in memory in the following two cases. First, an itemset of length 1 is inserted into the prefix tree unconditionally and managed. Second, if a new itemset of length n(n≧2) appears and it is a significant itemset, i.e., its support is large enough that it may become a frequent itemset in the near future, the itemset is inserted into the prefix tree. That is, the support of an itemset that has not yet been managed is estimated from the subitemsets of the corresponding itemset and, if the estimated value is greater than a predefined delayed insertion threshold, the itemset is inserted into the prefix tree (delayed insertion).
Meanwhile, if the support of an itemset that is already managed in the prefix tree decreases below a pruning threshold at a certain point of time, the corresponding itemset is determined to be a minor itemset that is unlikely to become a frequent itemset and is removed from the prefix tree in memory (pruning). Through the two operations (delayed insertion and pruning), the size of the prefix tree used for managing the frequency counts of the itemsets is reduced. An additional characteristic of the estDec method is that newly generated transactions are reflected sufficiently in the mining results by giving different weights to the transactions of the indefinitely growing data stream based on their generation times.
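The interplay of delayed insertion, pruning, and time-based weighting described above can be sketched as follows. This is a heavily simplified illustration and not the estDec method itself: it uses a flat dictionary instead of a prefix lattice tree, it estimates the count of an unmanaged itemset as the minimum count of its (n-1)-item subsets, it prunes an itemset only when that itemset reappears in a transaction, and it caps the itemset length at `max_len`. The class name `DecayedItemsetMonitor`, its parameters, and the single decay factor applied per transaction are all assumptions made for this sketch.

```python
from itertools import combinations


class DecayedItemsetMonitor:
    """Illustrative monitor with delayed insertion, pruning, and decay."""

    def __init__(self, decay, ins_thr, prune_thr, max_len=3):
        self.decay = decay          # weight applied per elapsed transaction
        self.ins_thr = ins_thr      # delayed insertion threshold (support)
        self.prune_thr = prune_thr  # pruning threshold (support)
        self.max_len = max_len      # longest itemset considered
        self.total = 0.0            # decayed number of transactions
        self.tid = 0                # id of the latest transaction
        self.counts = {}            # frozenset -> (count, tid of last update)

    def _current(self, key):
        """Stored count decayed up to the latest transaction."""
        count, last = self.counts[key]
        return count * self.decay ** (self.tid - last)

    def add(self, transaction):
        items = sorted(set(transaction))
        self.tid += 1
        self.total = self.total * self.decay + 1.0
        for n in range(1, self.max_len + 1):
            for combo in combinations(items, n):
                key = frozenset(combo)
                if key in self.counts:
                    count = self._current(key) + 1.0
                    if n >= 2 and count / self.total < self.prune_thr:
                        del self.counts[key]              # pruning
                    else:
                        self.counts[key] = (count, self.tid)
                elif n == 1:
                    # Single items are always managed.
                    self.counts[key] = (1.0, self.tid)
                else:
                    subsets = [frozenset(s)
                               for s in combinations(combo, n - 1)]
                    if all(s in self.counts for s in subsets):
                        # Upper-bound estimate from the managed subsets.
                        estimate = min(self._current(s) for s in subsets)
                        if estimate / self.total >= self.ins_thr:
                            # Delayed insertion.
                            self.counts[key] = (estimate, self.tid)

    def support(self, itemset):
        key = frozenset(itemset)
        if key not in self.counts:
            return 0.0
        return self._current(key) / self.total
```

With a decay factor of 1.0 every transaction carries equal weight; a value below 1.0 makes older transactions count progressively less, so that recent transactions dominate the reported support.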
Conventional research has introduced the above methods for finding frequent itemsets; however, they have the following technical limitations.
Limitations in the Basic Mining Method
The conventional methods have been designed to acquire mining results efficiently by predefining the data sets to be mined prior to the data mining process, in the case where a basic statistical pre-processing analysis of the data sets is available. However, when the items that constitute a data set may change and the data sets grow continuously, it is impossible to definitively define the items constituting the data set and the transactions thereof and, accordingly, it is impossible to carry out the basic statistical pre-processing analysis of the data sets.
The conventional mining systems aim at providing analyzed information for fixed data sets. Accordingly, they cannot promptly provide users with the changes caused by the addition of new data to the data sets.
Limitations in Decreasing Time for Mining Process and the Real Time Process
The conventional methods require long processing times to obtain analysis results that include newly generated information over continuously growing data sets. That is, if the data sets expand because new transactions are generated continuously, the previous analysis results become past information and lose their value as recent information covering all the data generated so far. Accordingly, to acquire a recent analysis result that includes the newly generated data, the mining process must be carried out again for a portion or the whole of the previous data sets and for all newly generated transactions. That is, the mining process must be performed repeatedly on ever larger data sets, thus prolonging the processing time.
The conventional methods are limited in obtaining mining results in real time. The real time processing capability denotes the capability of acquiring an analysis result promptly within a given time period. The conventional methods consider only the accurate analysis of the data sets to be analyzed and are therefore limited in ensuring a prompt processing time. In particular, when the data sets grow continuously, the conventional methods must accumulate all previous transactions separately in order to read the respective transactions constituting the data set repeatedly. Moreover, since the processing time for obtaining a mining result that includes the information of newly generated data increases, they are limited in obtaining analysis results in real time. That is, since the conventional methods have been designed to reflect newly generated transactions through an analysis of the whole data set, they cannot provide mining results reflecting the new transactions in real time.
Limitations in the Process Using a Limited Memory Space
The conventional methods can predict the memory usage required for the mining process in order to find frequent itemsets within a limited memory space, on the assumption that the data sets to be mined are predefined. However, in real time mining over data streams, the data sets are not predefined and, further, it is impossible to predict the memory usage since the data sets grow continuously.
The conventional mining methods over data streams store summary information on the data sets in memory using various data structures, such as trees, and utilize the summary information for finding frequent itemsets. The prefix tree of the estDec method and the structure with which the Lossy Counting algorithm manages the frequency counts of itemsets are examples of data structures that store and manage such summary information. Accordingly, if the amount of summary information stored during the mining process exceeds the given memory space, it is impossible to proceed with the mining operation in memory.