1. Technical Field
The present invention relates to data mining, which analyzes large data sets to find specific information, and more particularly, to a method and apparatus for finding maximal frequent itemsets over data streams composed of continuously generated transactions.
2. Related Art
In recent years, research on data stream processing technology has progressed actively, in contrast to existing database processing technology, which handles finite data sets. A data stream is defined as an unbounded set of continuously generated data. It is therefore impossible to store every generated data object separately in a limited space. Given these characteristics, several conditions should be satisfied in order to extract knowledge from a data stream. First, the mining results should be generated by reading each transaction only once. Second, however much new data is continuously generated, it should be possible to process that data within a physically limited memory space. Third, newly generated data objects should be processed as rapidly as possible so that results can be provided promptly. To satisfy these requirements, the mining results produced by data stream mining methods inevitably include minor errors.
Generally, frequent items are found by selecting all the items whose support is larger than a specific support threshold in a finite data set. Because a method for finding frequent items in the data stream environment cannot maintain all the previously generated transaction information, the frequent itemsets or the appearance frequencies obtained from the mining results may include minor errors. In the data stream environment, the count sketch algorithm, which is one of the methods for finding frequent items, focuses on the support of unit items: it estimates the appearance frequency of unit items in the transactions generated up to a predetermined time and generates the set of unit items whose estimated frequency meets the threshold. The Lossy counting algorithm, on the other hand, finds the set of frequent items whose support is equal to or more than the minimum support, within an allowable error, when the minimum support and the maximum allowable error are given. Transactions newly generated in the data stream are filled into a buffer of a predetermined size in memory and processed as a batch, and the algorithm manages the frequency of the items that are likely to be frequent together with the error of each item. As the buffer becomes larger, the Lossy counting algorithm can process more newly generated transactions at a time, which reduces the number of batch operations but increases memory usage.
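The bookkeeping described for the Lossy counting algorithm can be illustrated with a minimal sketch. It makes several simplifying assumptions that are not taken from the related art itself: the stream consists of single items rather than transactions, the buffer is modeled as a fixed bucket of width 1 / max_error, and the names lossy_count, min_support, and max_error are hypothetical.

```python
def lossy_count(stream, min_support, max_error):
    """Simplified single-item Lossy counting (illustrative sketch only)."""
    width = int(round(1 / max_error))  # bucket width, assumed to play the buffer's role
    counts = {}                        # item -> [observed frequency, maximum undercount]
    n = 0                              # number of items read so far
    for item in stream:
        n += 1
        bucket = (n + width - 1) // width  # index of the current bucket, starting at 1
        if item in counts:
            counts[item][0] += 1
        else:
            # The item may have been seen and pruned in any earlier bucket,
            # so it can be undercounted by at most bucket - 1.
            counts[item] = [1, bucket - 1]
        if n % width == 0:
            # Bucket boundary: drop items that cannot be frequent.
            stale = [k for k, (f, d) in counts.items() if f + d <= bucket]
            for key in stale:
                del counts[key]
    # Report items whose observed frequency reaches the relaxed threshold.
    threshold = (min_support - max_error) * n
    return {k: f for k, (f, d) in counts.items() if f >= threshold}


result = lossy_count(list("abacabadaaca"), min_support=0.5, max_error=0.25)
print(result)
```

In this sketch a larger bucket width means fewer pruning passes but more entries held between passes, which mirrors the trade-off between batch operations and memory usage noted above.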
There are also specialized algorithms for finding the closed frequent itemsets and the maximal frequent itemsets among the frequent itemsets. Chi et al. (U.S. Publication No. 2006/0174024 A1) proposed the MOMENT algorithm for mining closed frequent itemsets in data streams. The MOMENT algorithm uses an in-memory tree-based structure, called a Closed Enumeration Tree (CET), to validate the transactions in a sliding window and to manage the closed frequent itemsets appearing in the stream. The closed frequent itemset (CFI)-stream algorithm, which mines closed frequent itemsets in a manner similar to the MOMENT algorithm, slightly improves on the memory usage and processing time of the MOMENT algorithm by using a simplified Direct Update (DIU) tree structure.
The maximal frequent itemsets are the frequent itemsets having the longest length among the frequent itemsets. In other words, an itemset is called “maximal frequent” if its support is equal to or more than a user-specified minimum support and it has no frequent superset. In the data stream environment, methods for finding maximal frequent itemsets include the data stream mining for maximal frequent itemsets (DSM-MFI) algorithm. The DSM-MFI method proposes using an SFI-forest (Summary Frequent Itemset forest) that extends a prefix tree structure. The SFI-forest manages a list of the frequent itemsets, organized as a suffix tree structure, in a landmark window that covers data from a specific point in time to the current point in time. Another method for finding the maximal frequent itemsets is the INSTANT algorithm. The INSTANT algorithm uses an array structure to manage the itemsets, unlike the related-art methods that construct and update a tree. The INSTANT algorithm stores all the itemsets that have appeared so far with frequency i in an array U[i] and, when the frequency of an itemset is updated, shifts that itemset to the next array, that is, U[i+1]. At this time, if other itemsets that are subsets of the shifted itemset exist in that array, they are removed. As a result, the itemsets having the longest length are stored in the array for each frequency. Among these arrays, the itemsets whose frequency is equal to or more than the minimum support are output as the maximal frequent itemsets and are removed from the array. Because the INSTANT algorithm individually compares all of the itemsets and manages both the longest itemsets and their frequencies, accurate results are assured at any time. However, since all the itemsets must be maintained during the comparison process, execution time and memory usage become excessive.
As a result, the INSTANT algorithm may be inappropriate for the data stream environment.
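The definition of a maximal frequent itemset given above can be made concrete with a small brute-force sketch over a finite list of transactions. It is intended only to illustrate the definition and is not any of the stream algorithms discussed here; the function name maximal_frequent_itemsets and the parameter min_count (the minimum support expressed as an absolute count) are hypothetical. Because it enumerates every candidate itemset, it also shows why exhaustive comparison does not scale to streams.

```python
from itertools import combinations


def maximal_frequent_itemsets(transactions, min_count):
    """Return the frequent itemsets that have no frequent proper superset."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted(set().union(*transactions))
    frequent = []
    # Enumerate every non-empty candidate itemset (exponential in the number of items).
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            candidate = frozenset(combo)
            support = sum(1 for t in transactions if candidate <= t)
            if support >= min_count:
                frequent.append(candidate)
    # Keep only the itemsets that are not a proper subset of another frequent itemset.
    return [s for s in frequent if not any(s < other for other in frequent)]


example = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "d"}]
maximal = maximal_frequent_itemsets(example, min_count=2)
print(sorted(sorted(s) for s in maximal))
```

With a minimum count of 2, the itemsets {a}, {b}, {c}, {a, b}, and {a, c} are frequent in this example, and only {a, b} and {a, c} are maximal because each of the others has a frequent superset.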