1. Field of the Invention
The present invention relates to a data clustering apparatus and method, and more particularly, to a data clustering apparatus and method that can cluster data rapidly and accurately.
2. Description of the Related Art
A cluster is a group of similar data items drawn from numerous pieces of data. Clustering refers to a technique of classifying numerous pieces of data into multiple groups such that data having similar features fall into the same group.
In existing cluster-based clustering methods, such as K-Means, K-Medoids, or Canopy, when new data is input, the distance between the new data and each of the clusters is computed, and the new data is assigned to the cluster closest to it.
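The assignment step described above can be sketched as follows. This is a minimal illustration only, assuming Euclidean distance and clusters represented by centroid coordinates; the function name `assign_to_nearest` and the sample values are hypothetical and are not taken from any particular existing method.

```python
import math

def assign_to_nearest(point, centroids):
    """Return (index, distance) of the centroid closest to `point`.

    Every centroid is visited once, so the cost of assigning a single
    new point grows linearly with the number of clusters.
    """
    best_index, best_distance = -1, math.inf
    for index, centroid in enumerate(centroids):
        distance = math.dist(point, centroid)  # Euclidean distance (assumed metric)
        if distance < best_distance:
            best_index, best_distance = index, distance
    return best_index, best_distance

# Hypothetical centroids of three existing clusters.
centroids = [(0.0, 0.0), (10.0, 10.0), (0.0, 10.0)]

# A newly input data point is compared against all three centroids.
index, distance = assign_to_nearest((1.0, 2.0), centroids)
```

Because the loop visits every cluster for every new data point, the total work scales with both the number of input points and the number of clusters.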
However, the existing clustering methods have a problem in that the amount of computation may increase greatly as the data size increases. To overcome this problem, the number of clusters may be reduced. In this case, however, feature information of the original data may be lost, making accurate data analysis difficult.
Among the existing clustering methods, a clustering method using a hierarchical algorithm, such as a K-D Tree, does not require distance computations for all clusters. However, if the number N of dimensions exceeds 10, the number of nodes that must be searched may increase drastically, resulting in slow computation. In addition, since a hierarchical structure such as a K-D Tree is not well balanced, its nodes must be rearranged periodically to restore node-to-node balance.
In addition, some data is scattered and is therefore not clustered effectively by the existing clustering methods. If such scattered data is clustered using the existing clustering methods, inconsistent clustering results are obtained each time clustering is performed. Consequently, the likelihood of re-clustering during clustering increases, which in turn increases the amount of computation.
In order to reduce the amount of computation, a data dimension reduction technique may be used. In this case, however, data may be lost and outlier data cannot be discriminated in the reduced dimensions, making accurate clustering difficult.
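The loss of outlier information under dimension reduction can be illustrated with a small sketch. Dropping the last coordinate is used here only as a crude, assumed stand-in for a dimension reduction technique; the helper `drop_last_dimension` and the sample points are hypothetical.

```python
import math

def drop_last_dimension(point):
    """Crude stand-in for dimension reduction: discard the last coordinate."""
    return point[:-1]

# Two ordinary points and one point that is an outlier only in the third dimension.
points = [(1.0, 1.0, 0.0), (1.1, 0.9, 0.0), (1.0, 1.0, 50.0)]
reduced = [drop_last_dimension(p) for p in points]

# Distance between the first point and the outlier, before and after reduction.
distance_before = math.dist(points[0], points[2])
distance_after = math.dist(reduced[0], reduced[2])
```

In the original three dimensions the third point lies far from the others, whereas after reduction it coincides with the first point and can no longer be discriminated as an outlier.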
Furthermore, as in a building energy management system (BEMS), there are an increasing number of cases in which various pieces of data are measured using many different types of sensors. However, there are few techniques for creating clusters by combining the various pieces of data measured by these sensors. Moreover, there are few techniques for rapidly and effectively clustering such data as it is continuously measured.
As described above, since the data measured by many different types of sensors in an environment such as a BEMS is both large-scale and scattered, rapid and accurate clustering is difficult to achieve. Accordingly, there is a need for techniques for creating clusters by combining data of various different types and for effectively clustering newly input data of different types.