1. Field of the Invention
The present invention generally relates to a data clustering method and, more particularly, to a density-based data clustering method.
2. Description of the Related Art
A conventional density-based data clustering method is performed based on the density of a plurality of data points to be clustered. For example, under a given radius R and a threshold value, an area is searched and gradually expanded if the density of the data points in the area satisfies a predetermined condition. The expanded area is then merged with other areas whose density of data points also satisfies the predetermined condition. In this manner, the data points can be clustered. Conventionally, the representative density-based data clustering methods include the DBSCAN, IDBSCAN and FDBSCAN methods. Although the conventional methods can efficiently detect the data points with an irregular pattern and filter the noise data points, they are time-consuming.
The representative density-based data clustering methods are described as follows.
A. DBSCAN data clustering method. The method was proposed by M. Ester et al. in 1996. A first step of the method is to randomly select one of a plurality of data points of a data set as an initial seed. A second step of the method is to determine whether the number of the data points located in a searching range, which is expanded from the seed with the radius R, is larger than the threshold value. If so, the data points in the searching range are clustered together as a cluster and regarded as seeds. Accordingly, each seed in the searching range undergoes the same expansion operation as the initial seed. A third step of the method is to repeatedly perform the second step until all data points in the data set are clustered. Because the DBSCAN method clusters the data points based on their density, it can filter the noise data points and detect the data points with an irregular pattern. However, because the same density determination step must be performed repeatedly for each data point, and because the distances between a core point and the individual data points must also be calculated, the method is very time-consuming.
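The three steps above can be illustrated with a minimal Python sketch. It is not the implementation of Ester et al.; it is a simplified illustration under stated assumptions: the function names `region_query` and `dbscan`, the parameter name `min_pts` for the threshold value, the use of Euclidean distance, the "at least `min_pts`" comparison, and the sequential (rather than random) choice of the initial seed are all choices made here for clarity and reproducibility.

```python
import math
from collections import deque

NOISE = -1  # label assumed here for noise data points


def region_query(points, i, radius):
    """Return indices of all points within `radius` of points[i] (i included).

    One distance is computed per data point, which is the cost the text
    attributes to the density determination step.
    """
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= radius]


def dbscan(points, radius, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)  # None means "not visited yet"
    cluster_id = 0
    for i in range(len(points)):  # sequential seed choice (assumption)
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, radius)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        labels[i] = cluster_id
        seeds = deque(neighbors)  # every point in range becomes a seed
        while seeds:
            j = seeds.popleft()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point joins the cluster
                continue
            if labels[j] is not None:
                continue  # already clustered
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, radius)
            if len(j_neighbors) >= min_pts:
                seeds.extend(j_neighbors)  # same expansion as the initial seed
        cluster_id += 1
    return labels
```

In this sketch `region_query` is called once for every clustered point and scans the whole data set each time, which reflects the repeated density determination and distance calculation described above.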
B. IDBSCAN data clustering method. The method was proposed by B. Borah et al. in 2004. It aims at solving the problem of large time consumption of the DBSCAN data clustering method, which is caused by repeated determination and expansion operations of the seeds, by reducing the number of data queries required. The method evenly creates 8 boundary marks on a circumference of a searching range expanded from a seed with the radius R. In the searching range, there is always a data point that is closest to a given boundary mark; therefore, 8 data points closest to the 8 boundary marks can be obtained. The method selects only these 8 data points as seeds, which reduces the quantity of the seeds when compared to the DBSCAN method. As such, the number of times the expansion operation is executed can be reduced, and the time consumption of the DBSCAN data clustering method can be lowered. However, the amount of time saved is still limited.
Furthermore, because the IDBSCAN method requires selecting the 8 data points closest to the 8 boundary marks as seeds, the distance between each candidate data point and the corresponding boundary mark must be calculated. Thus, before the 8 closest data points are determined, a large number of distance calculations and comparisons are required, which limits the amount of time saved.
Moreover, although the number of the seeds in the searching range expanded from a seed is not more than 8, the searching ranges expanded from two adjacent seeds may overlap to a great extent. This results in repeated expansion operations and increases the time consumption.
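The seed selection of the IDBSCAN method described above can be illustrated with a short Python sketch. It is only a two-dimensional illustration under stated assumptions, not the procedure published by Borah et al.: the function name `select_boundary_seeds`, its parameters, the placement of the first boundary mark at angle zero, and the use of Euclidean distance are all hypothetical choices made here.

```python
import math


def select_boundary_seeds(points, center, neighbor_ids, radius, n_marks=8):
    """Pick, for each boundary mark, the neighbor closest to that mark.

    points       -- list of (x, y) data points
    center       -- (x, y) of the seed being expanded
    neighbor_ids -- indices of the data points inside the searching range
    radius       -- the radius R of the searching range
    Returns the indices of at most `n_marks` distinct seeds.
    """
    cx, cy = center
    seeds = []
    for k in range(n_marks):
        # Boundary marks are spaced evenly on the circumference.
        angle = 2 * math.pi * k / n_marks
        mark = (cx + radius * math.cos(angle), cy + radius * math.sin(angle))
        # Every neighbor is compared against every mark; this is the
        # distance calculation and comparison cost noted in the text.
        nearest = min(neighbor_ids, key=lambda j: math.dist(points[j], mark))
        if nearest not in seeds:  # one point may be closest to several marks
            seeds.append(nearest)
    return seeds
```

With n data points in the searching range, this sketch performs 8 × n distance calculations per expanded seed, which is consistent with the observation that the time saved by reducing the number of seeds is limited.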
Based on the above reasons, it is desired to provide a density-based data clustering method that does not use data points as seeds, thereby reducing the number of times the expansion operation is executed and avoiding distance calculation. Thus, improved data clustering efficiency can be provided.