1. Field of the Invention
The present invention relates to a data clustering method and, more particularly, to a density-based data clustering method.
2. Description of the Related Art
Conventional data clustering methods are primarily based on the density of the data points. For example, given a defined radius and a minimum threshold number of data points, if the density of data points in a certain area meets the required condition, a searching operation is performed for that area. In this way, all areas that meet the required condition are determined and merged together to obtain the resulting data clusters. Representative known data clustering methods include DBSCAN, IDBSCAN and FDBSCAN, as described below:
1. DBSCAN Data Clustering Method:
The method was proposed by M. Ester et al. in 1996 and proceeds as follows. The first step is randomly selecting one data point from a plurality of data points of a data set as an initial seed data point. The second step is determining whether the number of data points within a circular range extending radially from the initial seed data point with a radius of R exceeds the minimum threshold value. If so, the data points within the range are categorized as the same cluster and are used as seed data points, and the extension operation is subsequently applied to each of the other seed data points within the circular range. The third step is repeating the second step until all data points of the data set are categorized. Because the traditional DBSCAN data clustering method performs the data clustering based on density, it is capable of filtering noise data points (data points with low density) and is suitable for irregularly patterned data points. However, this mechanism takes considerable time, as the data point density must be calculated for every data point within its own searching range. In addition, the method requires calculating the distance between a core point and each data point, so increased time consumption for data clustering is inevitable.
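The three steps described above can be sketched as follows. This is a minimal illustration rather than the reference implementation: the function names `region_query` and `dbscan`, the two-dimensional tuple representation of data points, the brute-force distance scan, and the use of "at least `min_pts` points" as the density test are all assumptions made for the example.

```python
import math
import random


def region_query(points, idx, radius):
    """Return indices of all points within `radius` of points[idx] (itself included)."""
    cx, cy = points[idx]
    return [j for j, (x, y) in enumerate(points)
            if math.hypot(x - cx, y - cy) <= radius]


def dbscan(points, radius, min_pts, seed=0):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)
    order = list(range(len(points)))
    random.Random(seed).shuffle(order)    # step 1: random initial seed points
    cluster_id = 0
    for start in order:
        if labels[start] is not UNVISITED:
            continue
        neighbors = region_query(points, start, radius)
        if len(neighbors) < min_pts:      # step 2: density test (assumed >=)
            labels[start] = NOISE
            continue
        labels[start] = cluster_id
        seeds = list(neighbors)
        while seeds:                      # extend through every seed in range
            cur = seeds.pop()
            if labels[cur] == NOISE:
                labels[cur] = cluster_id  # low-density point on a cluster border
                continue
            if labels[cur] is not UNVISITED:
                continue
            labels[cur] = cluster_id
            cur_neighbors = region_query(points, cur, radius)
            if len(cur_neighbors) >= min_pts:
                seeds.extend(cur_neighbors)
        cluster_id += 1                   # step 3: repeat for remaining points
    return labels
```

Every point that is reached triggers its own `region_query`, and each query scans the whole data set, which illustrates why the time consumption noted above grows quickly with the number of data points.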
2. IDBSCAN Data Clustering Method:
The method was proposed by B. Borah et al. in 2004 and improves upon the DBSCAN data clustering method by reducing the required data queries. The method creates 8 symbols, evenly spaced from each other, on the border of the searching range extending radially from a seed point with a radius of R. Only the data points that are closest to the 8 symbols are then selected as seed points. As a result, the number of seed points is reduced and the time consumption is reduced accordingly. However, the time saved is limited. Although the IDBSCAN data clustering method limits the seed points within the searching range of an extended seed point to no more than 8, the time consumption still increases for seed points that are close to the extended seed point, as those seed points would have larger coverage. In addition, even with no more than 8 seed points within the searching range of an extended seed point, the time consumption is still considerable because the searching ranges of two adjacent seed points overlap, resulting in repeated extension of the seed points.
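The seed selection described above can be illustrated with a short sketch. It is only an approximation of the idea under stated assumptions: the names `boundary_marks` and `select_seeds`, the placement of the first symbol at angle zero, and the choice of the single nearest data point per symbol are hypothetical details not given in the description.

```python
import math


def boundary_marks(center, radius, count=8):
    """Place `count` evenly spaced symbols on the border of the searching range."""
    cx, cy = center
    return [(cx + radius * math.cos(2 * math.pi * k / count),
             cy + radius * math.sin(2 * math.pi * k / count))
            for k in range(count)]


def select_seeds(center, neighbors, radius, count=8):
    """Keep, for each symbol, only the neighbor closest to it (at most `count` seeds)."""
    if not neighbors:
        return []
    seeds = []
    for mx, my in boundary_marks(center, radius, count):
        nearest = min(neighbors,
                      key=lambda p: math.hypot(p[0] - mx, p[1] - my))
        if nearest not in seeds:          # one point may be nearest to several symbols
            seeds.append(nearest)
    return seeds
```

Data points near the center of the searching range are never closest to a border symbol, so they are skipped as seeds; the selected seeds, however, still have searching ranges that overlap with one another, which is the source of the repeated extension mentioned above.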
3. FDBSCAN Data Clustering Method:
The method was proposed by Bing Liu et al. in 2006 and also improves upon the DBSCAN data clustering method by reducing the required data queries. The FDBSCAN method determines whether to merge two overlapping clusters into the same cluster according to the data points located within the overlapped area between them. Specifically, assume a first cluster overlaps a second cluster, with a plurality of data points located within the overlapped area of the first and second clusters. The FDBSCAN method then determines whether the number of data points within the searching range of any data point located in the overlapped area is greater than the minimum threshold value. If so, the first and second clusters are merged into the same cluster. In this way, the number of searching operations performed on data points is reduced, thereby improving over the DBSCAN method. However, the time saved is limited.
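The merging decision described above can be sketched as follows. The sketch assumes that clusters are represented as sets of data point indices, that the overlapped area corresponds to the indices shared by both sets, and that the density test is "at least `min_pts` points"; the function names `count_in_range`, `should_merge` and `merge_if_connected` are hypothetical.

```python
import math


def count_in_range(points, idx, radius):
    """Count the data points within `radius` of points[idx] (itself included)."""
    cx, cy = points[idx]
    return sum(1 for x, y in points
               if math.hypot(x - cx, y - cy) <= radius)


def should_merge(points, cluster_a, cluster_b, radius, min_pts):
    """True if any point shared by both clusters passes the density test."""
    overlap = set(cluster_a) & set(cluster_b)
    return any(count_in_range(points, i, radius) >= min_pts for i in overlap)


def merge_if_connected(points, cluster_a, cluster_b, radius, min_pts):
    """Return one merged cluster, or the two original clusters unchanged."""
    if should_merge(points, cluster_a, cluster_b, radius, min_pts):
        return [set(cluster_a) | set(cluster_b)]
    return [set(cluster_a), set(cluster_b)]
```

Only the data points in the overlapped area are queried, rather than every data point of both clusters, which is where the reduction in searching operations comes from.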
In summary, although the above-mentioned data clustering methods are capable of efficiently detecting irregular patterns and filtering noise points, the time they require for the data clustering operation is considerable.
Therefore, there is a need to improve the above data clustering methods.