1. Field of the Disclosure
The disclosure relates to a data clustering method for clustering similar data points into the same data cluster, and to a data clustering device, a data processing apparatus and an image processing apparatus using the same.
2. Description of Related Art
Data clustering is a multivariate data analysis technique in mathematical statistics and an unsupervised data analysis method. A main purpose of data clustering is to group original data into clusters and find a representative point of each cluster, so as to reduce the data amount and the complexity of data analysis. A data clustering method clusters data having similar features into the same cluster according to the feature distribution of the data, so that the whole data set is grouped into a plurality of clusters, and the data of different clusters can then be further analyzed. Data clustering is widely used in various domains such as data mining, pattern recognition, market segmentation, the cell formation problem, and bioinformatics, which has become popular in recent years.
Current data clustering techniques can be divided into two types: partitional clustering techniques and hierarchical clustering techniques.
A basic principle of the partitional clustering technique is to minimize the squared-distance error between each data point and the center of its cluster. Regarding a partitional clustering method, in case that n objects (or data) are provided and the number (for example, K) of clusters to be partitioned is predetermined, an initial partition result is first generated according to the K value, and then an iterative relocation technique is used to move objects from an original cluster to other clusters, so as to improve the partition result. Generally, in a good partition result, the objects within the same cluster are close or similar to each other, and the objects of different clusters are remote or different from each other. The most famous partitional clustering technique is the K-means clustering technique proposed in 1967. According to the K-means clustering technique, K cluster centers are randomly selected first, and then all of the data objects are respectively grouped to the clusters whose centers are most similar to them; after all of the data are grouped, the cluster centers are recalculated, and the above steps are repeated until the cluster center of each cluster no longer changes.
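The K-means steps described above (select initial centers, assign each point to its nearest center, recompute the centers, repeat until the centers stop changing) can be sketched as follows. This is a minimal illustration, not part of the disclosure; the two-dimensional points and the deterministic choice of the first K points as initial centers are assumptions made for simplicity (the classic technique selects them at random).

```python
def dist2(a, b):
    # Squared Euclidean distance between two points.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(cluster):
    # Coordinate-wise mean of a non-empty list of points.
    n = len(cluster)
    return tuple(sum(coord) / n for coord in zip(*cluster))

def kmeans(points, k, max_iters=100):
    # Initial centers: first k points (assumption; classically random).
    centers = [points[i] for i in range(k)]
    for _ in range(max_iters):
        # Assignment step: group each point with its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist2(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster.
        new_centers = [mean(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:  # converged: no center changed
            break
        centers = new_centers
    return centers, clusters
```

For instance, clustering the four points (0, 0), (0.1, 0.2), (10, 10), (10.2, 9.9) with k = 2 converges in a few iterations to one cluster near the origin and one near (10, 10).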
The hierarchical clustering technique is generally presented through a tree structure, and the data clustering is achieved by dividing or agglomerating the data layer-by-layer. The hierarchical clustering technique can be further divided into a hierarchical agglomerative clustering technique and a hierarchical divisive clustering technique according to the different generation methods of the tree structure.
Regarding the hierarchical agglomerative clustering technique, the clustering is implemented in a bottom-up manner. According to the hierarchical agglomerative clustering technique, each piece of data is regarded as a cluster at the initial stage. In other words, if there are n pieces of data, there will be n clusters at the initial stage. Then, the clusters are agglomerated from the bottom of the tree structure. During each agglomeration, the two closest clusters are agglomerated into a new cluster, until the number of the clusters complies with a predetermined value. Therefore, the number of the clusters is reduced by one during each agglomeration until the final clustering result is generated. Assuming m agglomerations are performed, the number of the clusters is changed from n to (n-m).
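The bottom-up procedure above can be sketched as follows. This is an illustrative sketch only; the single-linkage distance (minimum squared distance between any pair of points across two clusters) is an assumed choice of cluster distance, as the passage does not fix one.

```python
def agglomerate(points, target_k):
    # Initial stage: each piece of data is its own cluster (n clusters).
    clusters = [[p] for p in points]
    while len(clusters) > target_k:
        # Find the pair of clusters with the smallest single-linkage distance.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(sum((x - y) ** 2 for x, y in zip(a, b))
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # Agglomerate the two closest clusters: count drops by one.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

Starting from n = 5 points and performing m = 2 agglomerations leaves n - m = 3 clusters, matching the count stated above.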
Regarding the hierarchical divisive clustering technique, the clustering is implemented in a top-down manner. According to the hierarchical divisive clustering technique, all of the data is regarded as one cluster at the initial stage, and a new cluster is generated during each division until the number of the clusters complies with a predetermined value. In other words, if there are n pieces of data, there will be only one cluster at the initial stage, and after m divisions, the number of the clusters is changed from 1 to (m+1).
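The top-down counterpart can be sketched in the same style. The rule for which cluster to divide and where to split it (here: divide the largest cluster at the median of the dimension with the widest value range) is a simplistic assumption chosen for illustration; practical divisive techniques use more elaborate splitting criteria.

```python
def divide(points, target_k):
    # Initial stage: all of the data forms one cluster.
    clusters = [list(points)]
    while len(clusters) < target_k:
        # Pick the cluster with the most points to divide next (assumption).
        big = max(range(len(clusters)), key=lambda i: len(clusters[i]))
        if len(clusters[big]) < 2:
            break  # nothing left to divide
        cluster = clusters.pop(big)
        # Split along the dimension with the largest value range (assumption).
        dims = len(cluster[0])
        axis = max(range(dims),
                   key=lambda d: max(p[d] for p in cluster)
                                 - min(p[d] for p in cluster))
        cluster.sort(key=lambda p: p[axis])
        half = len(cluster) // 2
        clusters.append(cluster[:half])  # each division adds one cluster
        clusters.append(cluster[half:])
    return clusters
```

One division turns the single initial cluster into two, so after m divisions there are m + 1 clusters, as stated above.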
Therefore, all of the current data clustering techniques require iterative operations to group the data into suitable data clusters. However, in an embedded system, the memory space is limited, so that there may not be enough memory space for storing the feature values of the data points that must be recorded when the aforementioned data clustering techniques are executed. For example, assume the number of the data points for clustering is 16001, and each data point requires a 4-byte integer to store its feature value. When the aforementioned data clustering techniques are used for data clustering, 64004 bytes of memory storage space are required. However, in general, the size of the memory space of a current embedded system is only 64K bytes, so that the feature values alone nearly exhaust the memory. Therefore, during each iterative operation, additional content swapping is required to replace the original data in the memory with unprocessed data before the clustering operation can be completed. The data transmission cost of such content swapping increases as the number of the data points increases, so that the performance of the embedded system is degraded.
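The arithmetic behind the example above can be checked directly: 16001 data points at 4 bytes each against a 64K-byte memory leaves almost no headroom for the clustering operation itself, which is what forces the content swapping described.

```python
# Memory arithmetic from the example above.
num_points = 16001
bytes_per_point = 4                      # one 4-byte integer feature value
required = num_points * bytes_per_point  # bytes needed for feature values
available = 64 * 1024                    # 64K bytes of embedded memory

print(required)              # 64004
print(available - required)  # 1532 bytes left for everything else
```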