1. Field of the Invention
The present invention relates to a heterogeneous data cluster generation apparatus and method and a data clustering method and apparatus, and more particularly, to a data clustering method and apparatus which cluster data measured by different sensors into a number of groups.
2. Description of the Related Art
A cluster is a group in which similar data among numerous data are gathered together, and clustering is the classification of numerous data into a number of groups according to similarity.
In conventional clustering methods such as K-means, K-medoids and canopy, when new data is input, the distances between the new data and all clusters are calculated to find the cluster closest to the new data. The new data is then included in the found cluster. In these methods, however, the number of clusters grows as the size of the data grows, and the amount of calculation required increases significantly as a result. If the number of clusters is reduced to overcome this problem, the data lose their original characteristic information, making it difficult to identify the data accurately.
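The cost described above can be illustrated with a minimal sketch of nearest-cluster assignment. It assumes each cluster is represented by a single centroid and that similarity is measured by Euclidean distance. The function name `assign_to_nearest` and the sample coordinates are hypothetical and are not taken from any particular conventional method.

```python
import math


def assign_to_nearest(point, centroids):
    """Return (index of the nearest centroid, number of distance computations).

    Every centroid is visited once, so the work per new data point grows
    linearly with the number of clusters.
    """
    best_index, best_dist = -1, math.inf
    comparisons = 0
    for i, c in enumerate(centroids):
        d = math.dist(point, c)  # Euclidean distance (Python 3.8+)
        comparisons += 1
        if d < best_dist:
            best_index, best_dist = i, d
    return best_index, comparisons


# Hypothetical centroids and one new data point.
centroids = [(0.0, 0.0), (10.0, 10.0), (20.0, 0.0)]
index, comparisons = assign_to_nearest((9.0, 8.5), centroids)
```

In this sketch the number of distance computations always equals the number of centroids, which is why the calculation per input grows as clusters are added.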
Of the conventional clustering methods, a clustering method using a hierarchical algorithm such as K-D Tree does not require distance calculation for all clusters. However, if the number N of dimensions becomes greater than 10, the number of nodes to be searched in the space increases geometrically, thus slowing down calculation. In addition, since a hierarchical algorithm such as K-D Tree is not self-balancing, its nodes must be rearranged periodically to keep the tree balanced.
Also, scattered data cannot be effectively clustered using the conventional clustering methods. If scattered data are clustered using these methods, different clustering results may be produced each time clustering is performed. As a result, re-clustering may be performed frequently during clustering, which, in turn, increases the amount of calculation required.
To reduce the amount of calculation, a technique of reducing the dimension of the data may be used. In this case, however, the data may lose information, and outlier data cannot be identified in the reduced dimension. Thus, accurate clustering is difficult.
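One way to read the loss of outlier information is sketched below. It assumes that the dimension is reduced simply by discarding coordinates (a plain projection, not a specific technique such as PCA), and the helper names `centroid` and `project` and all sample values are hypothetical. A point that is far from a cluster only along the discarded coordinate becomes indistinguishable from the cluster after reduction.

```python
import math


def centroid(points):
    """Coordinate-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def project(point, keep):
    """Keep only the coordinates whose indices are listed in `keep`."""
    return tuple(point[i] for i in keep)


# Hypothetical three-dimensional data: a tight cluster and one outlier
# that deviates only in the third coordinate.
normal = [(1.0, 1.0, 0.0), (1.2, 0.9, 0.1), (0.8, 1.1, -0.1)]
outlier = (1.0, 1.0, 50.0)

# Distance from the cluster center in the original space (about 50).
full_distance = math.dist(outlier, centroid(normal))

# Distance after discarding the third coordinate (about 0).
keep = (0, 1)
reduced_center = centroid([project(p, keep) for p in normal])
reduced_distance = math.dist(project(outlier, keep), reduced_center)
```

Here `full_distance` is large while `reduced_distance` is close to zero, so a distance-based check in the reduced space would treat the outlier as an ordinary member of the cluster.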
Furthermore, systems that measure various data using numerous different types of sensors, such as a building energy management system (BEMS), are becoming increasingly common. However, a technology of generating clusters by putting together various data measured by numerous different types of sensors is not available, and a technology of rapidly and effectively clustering various data continuously measured by numerous different types of sensors is also not available. Such technologies are lacking because the data measured by numerous different types of sensors in, e.g., the BEMS are massive and scattered, and it is therefore difficult to cluster the data rapidly and accurately. Accordingly, there is a demand for a technology of generating clusters by putting various heterogeneous data together and a technology of effectively clustering various newly input data.