1. Field of the Invention
The present invention generally relates to a data clustering method and, more particularly, to a grid-based data clustering method.
2. Description of the Related Art
Data mining allows a user to extract useful information from original data comprising a plurality of data sets, so as to discover implicit characteristics and relations among the plurality of data sets. Based on the characteristics and relations, a complete data analysis model can be established. The data analysis model can be used in a variety of fields such as business behavior analysis, spatial data analysis, document management, network intrusion analysis and so on. Potentially important information can thereby be discovered for reference by decision makers. Data mining includes data clustering methods that allow a user to quickly recognize intrinsic correlations among the data, such as consumers' purchasing behavior and age-based market segmentation. Conceptually, data clustering is a mechanism that groups data having a high similarity to each other into the same cluster based on customized dimensional characteristics.
However, as the demand for diverse services and the amount of implicit information continue to grow, the ability to process an excessively large amount of data has become an important factor in evaluating the performance of data clustering methods. Representative conventional data clustering methods are described below.
A. DBSCAN data clustering method. In a first step of the method, one of a plurality of data points contained in a data set is randomly selected as an initial seed. In a second step, it is determined whether the quantity of the data points contained in a circular coverage, which is expanded from the initial seed with a given radius, is larger than a threshold value. If so, all data points contained in the circular coverage are clustered into the same cluster. Then, these data points are taken as seeds, and the same expansion operation performed on the initial seed is performed on each of the seeds. In a third step, the second step is repeatedly performed until all data points in the data set are clustered. Since the method performs data clustering based on the density of the data points, the method can filter noise data points (the data points with low density) and can be applied to data points with an irregular pattern. However, because each data point requires the same expansion operation and density determination, the method has a long execution time.
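The three steps described above can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions rather than a reference implementation: the names `dbscan_sketch` and `region_query` are hypothetical, the points are assumed to be two-dimensional tuples, seeds are taken in index order instead of at random so that the result is reproducible, and the neighborhood search is a plain linear scan.

```python
from math import dist

NOISE = -1
UNVISITED = None


def region_query(points, i, radius):
    """Return indices of all points inside the circular coverage of points[i]."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= radius]


def dbscan_sketch(points, radius, threshold):
    """Label each point with a cluster id, or NOISE if it is in a sparse region."""
    labels = [UNVISITED] * len(points)
    cluster_id = 0
    # Assumption: index order stands in for the random seed selection.
    for start in range(len(points)):
        if labels[start] is not UNVISITED:
            continue
        neighbors = region_query(points, start, radius)
        if len(neighbors) <= threshold:
            labels[start] = NOISE  # may still be absorbed by a later cluster
            continue
        labels[start] = cluster_id
        seeds = list(neighbors)
        while seeds:
            j = seeds.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # sparse point on the edge of a cluster
                continue
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster_id
            # Every seed repeats the same expansion and density determination.
            j_neighbors = region_query(points, j, radius)
            if len(j_neighbors) > threshold:
                seeds.extend(j_neighbors)
        cluster_id += 1
    return labels
```

Because `region_query` is invoked once for every point that becomes a seed, the sketch makes the source of the long execution time visible: the number of neighborhood searches grows with the number of data points.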
B. IDBSCAN data clustering method. The method was proposed by B. Borah et al. in 2004 and aims at solving the problem of large time consumption of the DBSCAN data clustering method, which is caused by the repeated determination and expansion operations of the seeds, by reducing the number of required data queries. The method evenly creates 8 boundary symbols on the circumference of a circular coverage expanded from a seed with the radius. In the circular coverage, there is always a data point that is closest to a given boundary symbol. In total, 8 closest data points can be determined for the 8 boundary symbols. The method selects only these 8 data points as seeds, reducing the quantity of seeds and therefore the number of times the expansion operation is executed. Thus, the time consumption of the DBSCAN data clustering method can be reduced. However, the amount of time saved is still limited.
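The seed reduction described above can be sketched as follows. This is a hedged illustration, not the published implementation: the function names `boundary_marks` and `select_seeds` are hypothetical, the points are assumed to be two-dimensional tuples, and `neighbors` is assumed to already hold the data points found inside the circular coverage.

```python
from math import cos, sin, pi, dist


def boundary_marks(center, radius, count=8):
    """Place `count` evenly spaced boundary symbols on the circumference."""
    cx, cy = center
    return [
        (cx + radius * cos(2 * pi * k / count), cy + radius * sin(2 * pi * k / count))
        for k in range(count)
    ]


def select_seeds(center, neighbors, radius):
    """Keep only the neighbor closest to each boundary symbol (at most 8 seeds)."""
    seeds = []
    for mark in boundary_marks(center, radius):
        nearest = min(neighbors, key=lambda p: dist(p, mark))
        if nearest not in seeds:  # one point may be closest to several symbols
            seeds.append(nearest)
    return seeds
```

Only the returned seeds would undergo further expansion, so at most 8 expansion operations follow from each circular coverage regardless of how many data points it contains.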
C. GOD-CS data clustering method. The method is a grid-based data clustering method proposed by Tsai, C. F. and Chiu, C. S. in 2010. The method classifies the grids as high-density grids and low-density grids in order to filter the noise data points. More specifically, the method sets parameters regarding grid size and tolerance value as criteria of density determination. Then, the space containing the data points is divided into a plurality of grids based on the grid size. The density of each grid is determined according to the tolerance value. When the quantity of the data points contained in a grid is larger than the tolerance value, the grid is regarded as a high-density grid. Conversely, when the quantity of the data points contained in a grid is smaller than the tolerance value, the grid is regarded as a low-density grid. In a next step, a high-density grid is selected so that an expansion operation can be performed according to the 8 grids surrounding the high-density grid. In a final step, a low-density grid is selected to determine whether at least 5 out of the 8 grids surrounding the low-density grid are high-density grids. If so, the low-density grid is regarded as a high-density grid. If not, the low-density grid is regarded as a noise grid. In such a manner, the accuracy of the data clustering method can be improved. However, the method requires searching a large number of grids, leading to low performance.
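The grid classification and expansion described above can be sketched as follows. This is a simplified illustration and not the published GOD-CS implementation. Several points are assumptions: the name `grid_cluster_sketch` is hypothetical, the data are two-dimensional, a grid whose count equals the tolerance value is treated as low-density, only non-empty grids are examined, and the check on low-density grids is performed before the expansion so that reclassified grids join a cluster in a single pass.

```python
from collections import Counter, deque

# Offsets of the 8 grids surrounding a given grid.
NEIGHBOR_OFFSETS = [
    (dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1) if (dx, dy) != (0, 0)
]


def cell_of(point, grid_size):
    """Map a point to the integer coordinates of the grid that contains it."""
    x, y = point
    return (int(x // grid_size), int(y // grid_size))


def grid_cluster_sketch(points, grid_size, tolerance):
    """Return one cluster id per point, or -1 for points in noise grids."""
    counts = Counter(cell_of(p, grid_size) for p in points)
    high = {c for c, n in counts.items() if n > tolerance}

    # A low-density grid with at least 5 high-density neighbors is reclassified.
    promoted = set()
    for c in counts:
        if c in high:
            continue
        dense_neighbors = sum(
            (c[0] + dx, c[1] + dy) in high for dx, dy in NEIGHBOR_OFFSETS
        )
        if dense_neighbors >= 5:
            promoted.add(c)
    dense = high | promoted

    # Expand each cluster through the 8 surrounding grids (breadth-first).
    cell_label = {}
    next_id = 0
    for start in sorted(dense):
        if start in cell_label:
            continue
        cell_label[start] = next_id
        queue = deque([start])
        while queue:
            cx, cy = queue.popleft()
            for dx, dy in NEIGHBOR_OFFSETS:
                nb = (cx + dx, cy + dy)
                if nb in dense and nb not in cell_label:
                    cell_label[nb] = next_id
                    queue.append(nb)
        next_id += 1

    return [cell_label.get(cell_of(p, grid_size), -1) for p in points]
```

Each expansion step inspects all 8 surrounding grids of every high-density grid, which illustrates why the number of grid searches, and hence the running time, grows as the grid size shrinks.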
In light of these problems, it is necessary to improve the conventional data clustering methods.