1. Field of the Invention
The present invention relates to data mining for finding specific information by analyzing a large data set, and in particular, to a clustering method and system for finding subclusters in a multi-dimensional data stream.
2. Description of the Related Art
Recently, application environments that process massive amounts of data in real time, such as web surfing logs, communication logs, real-time moving picture data, and real-time stock trades, have become more common. Demand for data stream analysis has increased correspondingly, and research on data mining for analyzing data streams has progressed.
Clustering, which is one of the data mining analysis methods, partitions a plurality of data elements into clusters of similar elements according to a given similarity measure. Clustering has been used as an efficient method in various application fields, such as satellite image analysis, market targeting through customer analysis, and large-capacity log analysis for intrusion detection.
Clustering methods in the related art consider only the case in which similar data elements form a cluster in the space defined by all the attributes of the data elements. Subspace clustering, in contrast, searches for clusters in spaces defined by subsets of the full attribute set.
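The distinction can be illustrated with a minimal sketch. The data points, the function name `diameter`, and the use of Euclidean distance as the similarity measure are assumptions chosen for illustration only and are not taken from any particular clustering method. The three points lie close together when only dimensions 0 and 1 are considered, but are far apart once dimension 2 is included.

```python
from itertools import combinations
from math import dist


def diameter(points, dims):
    """Largest pairwise Euclidean distance, measured only over the given dimensions."""
    projected = [tuple(p[d] for d in dims) for p in points]
    return max(dist(a, b) for a, b in combinations(projected, 2))


# Hypothetical data: tight in dimensions 0 and 1, spread out in dimension 2.
points = [(1.0, 2.0, 0.0), (1.1, 2.1, 50.0), (0.9, 1.9, 100.0)]

full_space = diameter(points, dims=(0, 1, 2))  # all attributes
subspace = diameter(points, dims=(0, 1))       # a subset of the attributes

print(f"diameter in full space: {full_space:.3f}")
print(f"diameter in subspace (0, 1): {subspace:.3f}")
```

A method that measures similarity only over all three attributes would be unlikely to group these points, whereas a search restricted to the subspace of dimensions 0 and 1 could.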
In real life, a data set often has many dimensions, and the value of a data element in any dimension may be missing. In such a multi-dimensional data set, a group of similar data elements, that is, a cluster, may be associated not only with all the dimensions of the data set but also with subsets of those dimensions. Several subspace clustering algorithms have been proposed to solve the above problems, but most of them need multiple scans of the data set or a massive amount of computation. As a result, these algorithms are inappropriate for an online data stream.
As a method applicable to data streams, there is a method using a sibling tree, disclosed in the article by Nam Hun Park and Won Suk Lee, “Grid-Based Subspace Clustering Over Data Streams,” in Proc. of the Sixteenth ACM Conference on Information and Knowledge Management. This is a grid-based method that finds n-dimensional clusters and then searches all the extendable (n+1)-dimensional spaces. To search for clusters, the method maintains a sibling list using a fine-grain grid on each node of a tree structure. Therefore, as the number of dimensions of the data stream increases, a large amount of processing time and memory is needed to maintain and search the exponentially increasing number of grid cells.
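The exponential growth can be made concrete with a short counting sketch. It assumes, for illustration only, that every dimension is divided into the same number of intervals and that every cell of every subspace could be materialized; the function names are hypothetical. The result is therefore an upper bound on the number of candidate grid cells, not a measurement of the cited method, which extends only from clusters it has already found.

```python
from math import comb


def cells_in_subspaces(num_dims, partitions_per_dim, subspace_dims):
    """Grid cells summed over every subspace that has exactly subspace_dims dimensions."""
    return comb(num_dims, subspace_dims) * partitions_per_dim ** subspace_dims


def total_cells(num_dims, partitions_per_dim):
    """Grid cells summed over all non-empty subspaces of a num_dims-dimensional space."""
    return sum(
        cells_in_subspaces(num_dims, partitions_per_dim, k)
        for k in range(1, num_dims + 1)
    )


# Assumed grid resolution: 10 intervals per dimension.
for d in (2, 5, 10, 20):
    print(d, total_cells(d, 10))
```

Under these assumptions the total equals (p + 1)^d - 1 for d dimensions and p intervals per dimension, so each added dimension multiplies the number of candidate cells by roughly p + 1.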