1. Field of the Invention
The present invention relates to a density-based data clustering method and, more particularly, to a data clustering method that performs a data clustering operation dependent on the local data point density of a data set.
2. Description of the Related Art
Traditional data clustering methods are primarily based on the density of data points. For example, given a defined radius and a minimum threshold value for the number of data points, if the density of data points in a certain area meets the required condition (that is, the number of data points in the area is higher than the minimum threshold value), an extension and searching operation is performed for each data point located in the area. Subsequently, the areas that meet the required condition are determined and merged together to obtain a resulting data cluster. Representative known data clustering methods include DBSCAN and IDBSCAN, as described below:
1. DBSCAN Data Clustering Method:
The first step of the method is randomly selecting one of a plurality of data points from a data set, with the selected data point being regarded as an initial seed data point. The second step is determining whether the number of data points within a circular range, which is centered at the current seed data point and has a radius of R, exceeds the minimum threshold value. If so, the data points within the range are categorized into the same cluster and regarded as new seed data points. The third step is repeating the second step using the new seed data points until all data points of the data set are categorized. Because the traditional DBSCAN data clustering method performs data clustering based on the density of data points, it is capable of filtering out noise data points (data points in low-density areas) and is suitable for data points distributed in irregular patterns.
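The three steps described above may be illustrated by the following minimal Python sketch. It is an illustration under stated assumptions rather than a definitive implementation: the names dbscan, neighbors, radius and min_pts are hypothetical, the data points are assumed to be coordinate tuples, and the threshold test uses a strict "exceeds" comparison as worded above (other formulations of DBSCAN use "at least").

```python
import math
import random

NOISE = -1        # label for data points in low-density areas
UNVISITED = None  # label for data points not yet categorized


def neighbors(points, i, radius):
    """Indices of all points within `radius` of points[i], including i itself."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= radius]


def dbscan(points, radius, min_pts, rng=None):
    """Return one label per point: a cluster id (0, 1, ...) or NOISE."""
    rng = rng or random.Random(0)   # fixed seed only to make the sketch repeatable
    labels = [UNVISITED] * len(points)
    order = list(range(len(points)))
    rng.shuffle(order)              # step 1: random choice of the initial seed
    cluster_id = 0
    for start in order:
        if labels[start] is not UNVISITED:
            continue
        region = neighbors(points, start, radius)
        if len(region) <= min_pts:  # step 2: count must exceed the threshold
            labels[start] = NOISE
            continue
        labels[start] = cluster_id
        seeds = list(region)        # points in range become new seeds
        while seeds:                # step 3: repeat step 2 for every new seed
            j = seeds.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # reachable from a dense point: keep it
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster_id
            nearby = neighbors(points, j, radius)
            if len(nearby) > min_pts:
                seeds.extend(nearby)
        cluster_id += 1
    return labels


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (1, 1),
            (10, 10), (10, 11), (11, 10), (11, 11),
            (50, 50)]
    print(dbscan(data, radius=1.5, min_pts=2))
```

In this sketch every seed data point triggers its own range search over the whole data set, which is the repeated extension and searching cost that motivates the later improvements.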
2. IDBSCAN Data Clustering Method:
The method improves upon the DBSCAN data clustering method by reducing the number of extension and searching operations performed for the numerous data points. The method creates 8 symbols on the border of a circular range that is centered at a seed data point and has a radius of R, with the 8 symbols evenly spaced from each other. Based on this, the 8 data points within the circular range that are closest to the 8 symbols are determined and regarded as seed data points. Therefore, the number of seed data points is greatly reduced, thus reducing the time consumption.
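The seed selection described above may be sketched in Python as follows. This is a hedged illustration only: the function names boundary_symbols and select_seeds are hypothetical, the data points are assumed to be two-dimensional, the first symbol is assumed to lie at an angle of zero, and the current seed data point is assumed to be excluded from the candidates. Where one data point is closest to more than one symbol, it is kept once, so fewer than 8 seed data points may result.

```python
import math


def boundary_symbols(center, radius, count=8):
    """Place `count` evenly spaced symbols on the border of the circular range."""
    cx, cy = center
    return [(cx + radius * math.cos(2 * math.pi * k / count),
             cy + radius * math.sin(2 * math.pi * k / count))
            for k in range(count)]


def select_seeds(points, center_idx, radius):
    """Indices of the in-range points closest to each of the 8 symbols."""
    center = points[center_idx]
    # Candidate points: inside the circular range, excluding the seed itself.
    in_range = [j for j, q in enumerate(points)
                if j != center_idx and math.dist(center, q) <= radius]
    if not in_range:
        return []
    seeds = []
    for symbol in boundary_symbols(center, radius):
        nearest = min(in_range, key=lambda j: math.dist(points[j], symbol))
        if nearest not in seeds:    # one point may serve several symbols
            seeds.append(nearest)
    return seeds


if __name__ == "__main__":
    data = [(0.0, 0.0),                      # current seed data point
            (0.9, 0.0), (0.0, 0.9), (-0.9, 0.0), (0.0, -0.9),
            (0.1, 0.1),                      # interior point, far from the border
            (5.0, 5.0)]                      # outside the circular range
    print(sorted(select_seeds(data, center_idx=0, radius=1.0)))
```

Data points near the center of the circular range are rarely closest to any symbol, so they are not expanded again, which is where the reduction in searching operations comes from.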
Although the above traditional data clustering methods are capable of filtering out noise data points and are suitable for data points distributed in irregular patterns, the data point density within the resulting data cluster may not be even. The traditional data clustering methods are not able to further cluster the data points within the resulting data cluster based on the local data point density. To further cluster the data points within the resulting data cluster, a DD-DBSCAN data clustering method, which improves upon the previously described data clustering methods, was later proposed, as described below.
3. DD-DBSCAN Data Clustering Method:
The method mainly improves upon the traditional DBSCAN method. The method defines three parameters: a scanning radius R, a minimum threshold value (for the number of data points) and a tolerance index α. The first step of the method is randomly selecting one of a plurality of data points from a data set, with the selected data point being regarded as an initial seed data point. The second step is determining whether the number of data points within a circular range, which is centered at the current seed data point and has a radius of R, exceeds the minimum threshold value. The third step is selecting one data point other than the initial seed data point from the circular range as a reference data point, and determining whether the number of data points within a searching range of the reference data point is higher than the minimum threshold value. If so, all data points within the searching range of the reference data point are defined as secondary seed data points. In the fourth step, it is determined whether the number of data points within a searching range of each secondary seed data point is higher than the minimum threshold value. If so, it is determined whether the data point density of the searching range of each secondary seed data point is the same as that of the reference data point. If the data point density of the searching range of each secondary seed data point is the same as that of the reference data point, all data points located in the searching ranges of the reference data point and the initial seed data point are clustered together as a data cluster and treated as seed data points. The fifth step is repeating the third and fourth steps until all seed data points have been processed. The sixth step is repeating the first through fifth steps until all data points of the data set are clustered.
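The density comparison of the third and fourth steps may be sketched in Python as follows. The sketch is illustrative only and rests on assumptions that the description above does not fix: the function names are hypothetical, the data point density of a searching range is approximated by the number of data points it contains, and two densities are treated as "the same" when their counts differ by at most a fraction α of the larger count. The actual role of the tolerance index α in DD-DBSCAN may differ.

```python
import math


def count_in_range(points, i, radius):
    """Number of points within `radius` of points[i], including i itself."""
    return sum(1 for q in points if math.dist(points[i], q) <= radius)


def same_density(n_a, n_b, alpha):
    """ASSUMPTION: densities match when the counts differ by at most
    a fraction `alpha` of the larger count."""
    return abs(n_a - n_b) <= alpha * max(n_a, n_b)


def secondary_seeds_match(points, ref_idx, radius, min_pts, alpha):
    """Check steps three and four for one reference data point."""
    n_ref = count_in_range(points, ref_idx, radius)
    if n_ref <= min_pts:            # step 3: reference range must be dense
        return False
    # Secondary seeds: every other point in the reference searching range.
    secondary = [j for j, q in enumerate(points)
                 if j != ref_idx and math.dist(points[ref_idx], q) <= radius]
    for j in secondary:             # step 4: test each secondary seed
        n_j = count_in_range(points, j, radius)
        if n_j <= min_pts or not same_density(n_j, n_ref, alpha):
            return False
    return True


if __name__ == "__main__":
    # 3 x 3 grid with unit spacing; index 4 is the middle point (1, 1).
    grid = [(x, y) for x in range(3) for y in range(3)]
    print(secondary_seeds_match(grid, 4, radius=1.5, min_pts=3, alpha=0.6))
    print(secondary_seeds_match(grid, 4, radius=1.5, min_pts=3, alpha=0.2))
```

Each call performs one range search for the reference data point and one more for every secondary seed data point, which gives a sense of how the number of searching operations grows in this method.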
Although the traditional DD-DBSCAN method is capable of performing a data clustering operation according to the local data point density, it takes a considerable amount of time to operate. Therefore, there is a need to improve the above data clustering methods.