Recently, a variety of high-performance sensors of different types and functions have become closely integrated with the living environment and widely distributed, such that the amount of information that can be obtained from the sensors has rapidly increased. Therefore, the applicability and necessity of existing mining technologies for huge amounts of real-time information (data) are considered important issues, and various real-time data stream mining technologies have been proposed.
As a representative algorithm for finding frequent itemsets from a finite set of transactions, the Apriori algorithm has been proposed in <R. Agrawal, R. Srikant, “Fast Algorithms for Mining Association Rules,” In Proceedings of the 20th International Conference on Very Large Data Bases, pp. 487-499, 1994.>. The Apriori algorithm generates a candidate set n times in order to find frequent itemsets having a length of n and scans the transaction information n+1 times, such that memory usage is very large and the search time becomes long. Further, the Carma algorithm has been proposed in <C. Hidber, “Online Association Rule Mining,” In Proceedings of the 21st International Conference on Very Large Data Bases, pp. 432-444, 1995.>. The Carma algorithm finds the frequent itemsets by searching the transactions in the data set through a two-stage process. Such algorithms, which find frequent itemsets in a fixed data set, are inappropriate as mining methods in a real-time data stream environment, because the object of analysis must be defined before mining begins and the data set must be scanned at least once.
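The level-wise behavior described above can be illustrated with a minimal Python sketch. It is a simplified illustration of the general Apriori idea, not the implementation from the cited paper; the function name `apriori`, the absolute-count parameter `min_support`, and the sample data are assumptions made for this example. Note that every level requires a new candidate set and another full pass over the stored transactions, which is why the approach does not suit an unbounded stream.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: count} for itemsets whose count >= min_support.

    Hypothetical helper for illustration; min_support is an absolute count.
    """
    transactions = [frozenset(t) for t in transactions]

    # Level 1: one scan to count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(current)

    k = 2
    while current:
        prev = set(current)
        # Candidate generation: join frequent (k-1)-itemsets and keep only
        # those whose (k-1)-subsets are all frequent.
        candidates = set()
        for a in prev:
            for b in prev:
                union = a | b
                if len(union) == k and all(
                    frozenset(sub) in prev for sub in combinations(union, k - 1)
                ):
                    candidates.add(union)
        # One more full scan of the stored transactions for this level.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        current = {s: c for s, c in counts.items() if c >= min_support}
        frequent.update(current)
        k += 1
    return frequent


sample = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
result = apriori(sample, min_support=3)
```

With the sample data, every single item and every pair reaches the threshold of 3, while the triple {a, b, c} appears only twice and is discarded.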
In an environment where the data set grows incrementally, combined mining results for a newly updated data set can be obtained using an incremental frequent itemset mining algorithm such as the FUP-based algorithms disclosed in <D. Cheung, J. Han, V. Ng, and C. Y. Wong, “Maintenance of Discovered Association Rules in Large Databases: An Incremental Updating Technique,” In Proceedings of the 12th International Conference on Data Engineering, pp. 106-114, 1996> and <D. Cheung, S. D. Lee, and B. Kao, “A General Incremental Technique for Maintaining Discovered Association Rules,” In Proceedings of the 5th International Conference on Database Systems for Advanced Applications, pp. 185-194, 1997.>. Although the incremental mining algorithm can use previous transaction information to obtain the latest results, it must store all of the information on each transaction and revisit the previous transactions in order to calculate support accurately, so it is inappropriate as a method for data streams. In the Lossy Counting algorithm proposed in <G. S. Manku and R. Motwani, “Approximate Frequency Counts over Data Streams,” In Proceedings of the 28th International Conference on Very Large Data Bases, pp. 346-357, 2002>, the frequent itemsets are found while the memory usage is limited to a predetermined range. However, in order to obtain high efficiency with the Lossy Counting algorithm, the memory space used must grow in proportion to that efficiency, which in turn increases the mining run time.
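The way Lossy Counting bounds memory can be sketched as follows. This is a minimal sketch that counts single items rather than itemsets, which is a simplification of the cited algorithm; the class name `LossyCounter`, its method names, and the sample stream are assumptions made for this example. The stream is divided into buckets of width ceil(1/epsilon), and at each bucket boundary entries whose count plus maximum possible undercount does not exceed the current bucket number are dropped. A smaller epsilon yields wider buckets and less frequent pruning, so more entries are retained.

```python
import math


class LossyCounter:
    """Item-level Lossy Counting sketch (hypothetical class for illustration)."""

    def __init__(self, epsilon):
        self.epsilon = epsilon
        self.width = math.ceil(1 / epsilon)  # bucket width
        self.n = 0                           # number of elements seen so far
        self.entries = {}                    # item -> [count, max_undercount]

    def add(self, item):
        self.n += 1
        bucket = math.ceil(self.n / self.width)  # id of the current bucket
        if item in self.entries:
            self.entries[item][0] += 1
        else:
            # The item may have been seen and dropped in earlier buckets.
            self.entries[item] = [1, bucket - 1]
        # At a bucket boundary, drop entries that cannot be frequent.
        if self.n % self.width == 0:
            self.entries = {
                k: v for k, v in self.entries.items() if v[0] + v[1] > bucket
            }

    def frequent(self, support):
        """Items whose stored count is at least (support - epsilon) * n."""
        threshold = (support - self.epsilon) * self.n
        return {k: v[0] for k, v in self.entries.items() if v[0] >= threshold}


counter = LossyCounter(epsilon=0.25)
for element in ["a", "b", "a", "c", "a", "d", "a", "e"]:
    counter.add(element)
```

In this run the bucket width is 4, so pruning happens after the fourth and eighth elements, and only the entry for "a" survives.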
In order to efficiently find the frequent itemsets in the real-time data stream environment, the estDec algorithm has been proposed in <Joong Hyuk Chang, Won Suk Lee, “Finding recent frequent itemsets adaptively over online data streams,” In Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 487-492, 2003.>. The estDec algorithm processes the transactions constituting the data streams as soon as the transactions are generated, and manages the appearance frequency of the itemsets appearing in the transactions by using a monitoring tree having a prefix tree structure, without generating a candidate set for finding the frequent itemsets. The estDec algorithm maintains high efficiency by managing, through delayed insertion and pruning, only the significant itemsets that are likely to become frequent itemsets.
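The three ideas named above (processing each transaction on arrival, delayed insertion, and pruning) can be sketched as follows. This is a loose illustration under stated assumptions and not the published estDec algorithm: a plain dictionary stands in for the prefix-tree monitoring structure, a newly inserted itemset starts with a count of 1 instead of a count estimated from its subsets, single items are never pruned, and the class name `DecayedMonitor`, its parameters, and its threshold values are hypothetical.

```python
from itertools import combinations


class DecayedMonitor:
    """Decayed itemset counting with delayed insertion and pruning (sketch)."""

    def __init__(self, decay=0.5, insert_threshold=0.6,
                 prune_threshold=0.25, max_len=2):
        self.decay = decay
        self.insert_threshold = insert_threshold
        self.prune_threshold = prune_threshold
        self.max_len = max_len
        self.tick = 0      # number of transactions processed
        self.total = 0.0   # decayed number of transactions
        self.nodes = {}    # sorted item tuple -> [decayed_count, last_tick]

    def support(self, key):
        count, last = self.nodes[key]
        return count * self.decay ** (self.tick - last) / self.total

    def _subsets_significant(self, key):
        return all(
            sub in self.nodes and self.support(sub) >= self.insert_threshold
            for sub in combinations(key, len(key) - 1)
        )

    def update(self, transaction):
        self.tick += 1
        self.total = self.total * self.decay + 1.0
        items = sorted(set(transaction))
        for size in range(1, min(self.max_len, len(items)) + 1):
            for key in combinations(items, size):
                if key in self.nodes:
                    count, last = self.nodes[key]
                    aged = count * self.decay ** (self.tick - last)
                    self.nodes[key] = [aged + 1.0, self.tick]
                elif size == 1 or self._subsets_significant(key):
                    # Delayed insertion: only once all subsets are significant.
                    self.nodes[key] = [1.0, self.tick]
        # Pruning: drop itemsets that lost support or lost a monitored subset.
        for key in sorted(self.nodes, key=len):
            if len(key) == 1:
                continue
            orphaned = any(sub not in self.nodes
                           for sub in combinations(key, len(key) - 1))
            if orphaned or self.support(key) < self.prune_threshold:
                del self.nodes[key]

    def frequent(self, min_support):
        return {k: self.support(k) for k in self.nodes
                if self.support(k) >= min_support}


monitor = DecayedMonitor()
for tx in [["a", "b"], ["a", "b"], ["c"], ["c"], ["c"]]:
    monitor.update(tx)
```

In this run the pair (a, b) is monitored while it keeps appearing, is pruned once the stream shifts to "c", and only "c" remains frequent at a minimum support of 0.5. No stored transaction is ever rescanned, which is the property that makes this style of counting suitable for streams.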
The above-mentioned mining methods may easily derive meaningful items included in given information, but it is not actually easy to detect ready-to-use semantic information included in the given information with them. Therefore, various methods have been proposed in order to derive the semantic information. Among those, as a method using a data stream based clustering mechanism, an MC-Stream method that derives meanings through an abstraction step based on event information previously defined by a user has been proposed in <YongChul Kwon, Wing Yee Lee, Magdalena Balazinska, “Clustering Events on Streams using Complex Context Information,” In Proceedings of the IEEE International Conference on Data Mining Workshops, pp. 238-247, 2008>. The method is based on a representative stream based clustering algorithm but replaces its distance measuring method with a semantic-based measuring method, and generates clusters by measuring the semantic distance using predefined similarities such as time, belongings, etc.
Various real-time data stream mining technologies other than the above-mentioned algorithms have also been proposed. However, algorithms that can actively detect and use the semantic information included in the real-time data streams, that is, the semantic information on the real-time state of the corresponding domain, remain insufficient. Most of the semantic approaches merely use context information predefined by a user or predict the current context from previous information. Therefore, there is no method for solving the problem that arises when such basic information is not present.