1. Technical Field
The present invention relates to a data mining technique that analyzes a massive data set to find desired information, and in particular, to a method and apparatus for finding clusters in a data stream, that is, an infinite data set whose data elements are continuously generated.
2. Related Art
Recently, with the growth of application environments that process massive real-time data, such as Web surfing logs, communication logs, real-time movie data, and real-time stock trading, the need for data stream analysis has been increasing. For this reason, much research has been conducted on data mining for data stream analysis.
Clustering is a data mining analysis technique in which data elements are classified into groups of similar elements according to a given similarity measure. Clustering has been used effectively in various applications, such as satellite image analysis, customer analysis for target marketing, and massive log analysis for unauthorized access detection.
For a finite data set, most existing clustering algorithms aim to minimize processing time and memory usage while maintaining the accuracy of the analysis result. Although these clustering algorithms have been used in various applications, few clustering algorithms are used to process a massive data set.
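As a minimal illustration of classifying data elements into similar groups by a given similarity measure, the following sketch groups two-dimensional points by Euclidean distance. It is not one of the known techniques discussed here, nor the method of the present invention; the function name `group_by_similarity`, the `threshold` parameter, and the sample points are hypothetical and chosen only for brevity.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def group_by_similarity(points, threshold):
    """Group points using Euclidean distance as the similarity measure.

    Each point joins the first group whose first member lies within
    `threshold`; otherwise it starts a new group. This is a deliberately
    simple, illustrative rule (an assumption, not a method from the text).
    """
    groups = []
    for p in points:
        for g in groups:
            if dist(p, g[0]) <= threshold:
                g.append(p)
                break
        else:
            # No existing group is similar enough, so start a new one.
            groups.append([p])
    return groups


# Hypothetical sample data: two visibly separate groups of points.
points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.0), (5.1, 4.9), (0.2, 0.1)]
groups = group_by_similarity(points, threshold=1.0)
```

With this sample data the points near the origin fall into one group and the points near (5, 5) into another; a different similarity measure or threshold would generally yield a different grouping.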
The known clustering methods have the following technical limitations.
The known techniques require the target data set to be defined before data mining begins, so that the analysis result can be obtained efficiently only when basic statistical preprocessing can be performed on the target data set. However, the knowledge contained in a data set changes over time, and in an environment where the data set grows continuously, the data set cannot be defined as a finite set. A known data mining system provides analysis information for a fixed data set. Accordingly, when new data is added and the data set changes, it cannot rapidly provide the user with the resulting change in the analysis information.
In the known techniques, it takes a long time to obtain an analysis result that reflects the latest information in a continuously growing data set. That is, in an environment where new data elements are continuously generated, the previous analysis result becomes outdated as the data set expands, and it is no longer useful as up-to-date information covering the entire data set, including the most recently generated data elements. In addition, to obtain an analysis result that covers the newly generated data elements, part or all of the previous data set must be clustered again together with the new data elements. Accordingly, mining is inevitably performed repeatedly while the data set keeps growing, which lengthens the processing time.
In addition, the known techniques are limited in their ability to obtain clusters in real time. In clustering, real-time processing capability refers to the capability to obtain the analysis result rapidly, within a limited time. The known techniques give priority to accurate analysis of the data set to be analyzed, and are therefore limited in how fast a processing time they can support. In particular, since the data set needs to be read repeatedly, in an environment where the data set grows continuously the previous data elements must be stored separately, and it takes a long time to obtain an analysis result that includes newly generated data elements. For this reason, the analysis result cannot readily be obtained in real time. That is, the known techniques are designed such that, when a new data element is generated, the clustering result can be obtained only by analyzing the entire data set again. Therefore, the clustering result reflecting a newly added data element cannot be provided in real time.
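The cost of re-analyzing the entire data set whenever a new data element arrives can be made concrete with a small counting sketch. It assumes, purely for illustration, that each re-analysis reads every stored element a fixed number of times; the function names and the `passes_per_run` parameter are hypothetical and do not come from any particular known technique.

```python
def reads_with_full_reanalysis(n_arrivals, passes_per_run=1):
    """Count element reads when every arrival triggers re-analysis of
    all elements stored so far (assumed cost model, for illustration)."""
    stored = 0
    reads = 0
    for _ in range(n_arrivals):
        stored += 1                        # the new element must be kept
        reads += stored * passes_per_run   # the whole stored set is re-read
    return reads


def reads_with_single_pass(n_arrivals):
    """Count element reads when each element is read once, on arrival."""
    return n_arrivals


full = reads_with_full_reanalysis(1000)     # 1 + 2 + ... + 1000 reads
multi = reads_with_full_reanalysis(1000, passes_per_run=3)
once = reads_with_single_pass(1000)
```

Under this assumed cost model, the number of reads grows with the square of the number of arrivals when the whole data set is re-analyzed, whereas it grows only linearly when each element is read once, which is why repeated analysis of a growing data set is difficult to reconcile with real-time operation.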
In the known techniques, in order to find clusters within a limited memory space, the memory usage required for clustering is predicted in advance on the assumption that the data set is predefined. However, in real-time data mining in a data stream environment, the data set is not predefined and data increases continuously, so the required memory usage cannot be predicted in advance.