1. Field of the Invention
The present invention generally relates to a data clustering method and, more particularly, to a grid-based data clustering method.
2. Description of the Related Art
As technology continues to advance, ever larger amounts of data can be stored in databases. Data mining technology allows a user to extract useful information from original data comprising a plurality of data sets, so as to discover implicit characteristics of, and relations among, those data sets. The data clustering methods provided by data mining technology allow one to quickly recognize intrinsic correlations among a plurality of data. Data with high similarity are grouped into the same cluster based on customized dimensional characteristics. At present, there are a variety of data clustering methods, such as division-based, density-based, grid-based, and hierarchical data clustering methods. Representative data clustering methods are described below.
A. DBSCAN data clustering method. The method is a density-based data clustering method that was proposed by M. Ester et al. in 1996. In a first step of the method, one of a plurality of data points contained in a data set is randomly selected as an initial seed. In a second step, it is determined whether the quantity of data points contained in a circular coverage of a given radius, centered on the initial seed, is larger than a threshold value. If so, all data points contained in the circular coverage are clustered as a cluster and acknowledged as seeds. The same expansion operation is then performed on each of the seeds to gradually expand the cluster. In a third step, the second step is repeated until all data points in the data set are clustered. Because the method performs data clustering operations based on the density of data points, it can filter out noise data points (data points with low density) and can be applied to data points distributed in an irregular pattern. However, because every data point requires the same density determination, clustering all data points takes considerable time. In addition, it is difficult to choose suitable parameter values.
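The expansion procedure described for the DBSCAN method may be illustrated by the following minimal Python sketch. It is not the implementation of M. Ester et al.; the function name `dbscan_sketch`, the parameter names `radius` and `threshold`, the label -1 for noise, and the choice to visit seeds in input order rather than at random are all assumptions made for illustration. The brute-force neighbor search also shows why every data point incurs the same density determination.

```python
from math import dist

def dbscan_sketch(points, radius, threshold):
    """Return one label per point: a cluster id (0, 1, ...) or -1 for noise.

    Hypothetical helper for illustration only.
    """
    NOISE, UNSEEN = -1, None
    labels = [UNSEEN] * len(points)

    def neighbors(i):
        # Every point inside the circular coverage of point i (point i included).
        return [j for j, q in enumerate(points) if dist(points[i], q) <= radius]

    cluster_id = 0
    # Assumption: seeds are taken in input order instead of randomly.
    for i in range(len(points)):
        if labels[i] is not UNSEEN:
            continue
        hood = neighbors(i)
        # Density determination: the count must be larger than the threshold.
        if len(hood) <= threshold:
            labels[i] = NOISE
            continue
        labels[i] = cluster_id
        seeds = [j for j in hood if j != i]
        while seeds:
            j = seeds.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id      # former noise point joins as a border point
            if labels[j] is not UNSEEN:
                continue
            labels[j] = cluster_id
            hood_j = neighbors(j)           # the same density determination, again
            if len(hood_j) > threshold:
                seeds.extend(hood_j)        # dense point: keep expanding from it
        cluster_id += 1
    return labels
```

Each call to `neighbors` scans the whole data set, so the cost grows with the square of the number of data points unless a spatial index is used.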
B. IDBSCAN data clustering method. The method was proposed by B. Borah et al. in 2004, aiming at improving the DBSCAN data clustering method. In a first step of the method, one of a plurality of data points is randomly selected as an initial seed. In a second step of the method, 8 representative points are arranged on an expanded range of the initial seed and added to a seed list as seeds, so that an expansion operation can be performed on them. In a third step of the method, the second step is repeatedly performed until all data points are clustered. The IDBSCAN data clustering method effectively reduces the time consumption of the DBSCAN data clustering method. However, the amount of time saved is limited, as the density determination is still required for the 8 representative points.
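The selection of representative points in the second step may be sketched as follows. The description above does not specify how the 8 points are chosen; this sketch assumes that 8 positions are spaced evenly on the boundary of the expanded range and that the data point nearest to each position is taken as a representative. The function name `representative_points` and its parameters are hypothetical.

```python
from math import cos, sin, pi, dist

def representative_points(seed, neighbors, radius, count=8):
    """Pick at most `count` neighbors to serve as seeds for further expansion.

    Assumption: one neighbor is chosen per evenly spaced boundary position.
    """
    chosen = []
    for k in range(count):
        angle = 2 * pi * k / count
        # Position on the boundary of the expanded range of the seed.
        target = (seed[0] + radius * cos(angle), seed[1] + radius * sin(angle))
        nearest = min(neighbors, key=lambda p: dist(p, target), default=None)
        if nearest is not None and nearest not in chosen:
            chosen.append(nearest)
    return chosen
```

Only the returned points are added to the seed list, so at most 8 density determinations follow each expansion instead of one per data point in the range.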
Generally, the above conventional data clustering methods have been criticized for their long operation times and for the difficulty of determining parameter values. In light of this, a number of data clustering methods have been proposed to overcome these defects. Here, the GOD-CS data clustering method is taken as an example for illustration.
As proposed in Taiwan Patent Publication No. 201107999 entitled “GRID-BASED DATA CLUSTERING METHOD”, the GOD-CS data clustering method is a grid-based data clustering method which combines the conventional density-based data clustering method with the division-based data clustering method. The GOD-CS data clustering method improves upon the conventional ANGEL and G-TREACLE data clustering methods. In a first step of the GOD-CS data clustering method, a space containing a data set having a plurality of data points is divided into a plurality of grids according to a given grid quantity. In a second step, a high-density grid that has not yet undergone an expansion operation is determined based on a density determination rule. The high-density grid is taken as an initial grid and added to a seed list as a seed. In a third step, a seed is selected from the seed list in order to determine whether the selected seed is a high-density grid or a low-density grid. If the selected seed is a high-density grid, the procedure proceeds to the next step. If the selected seed is a low-density grid, the seed is deleted from the seed list and the third step is re-performed. In a fourth step, all data points in the seed are clustered together as the same cluster, and the grids surrounding the seed that have not yet undergone the expansion operation are added to the seed list as seeds. Then, the central seed is deleted from the seed list, and the third step is re-performed. The procedure proceeds to a fifth step after all seeds in the seed list are processed. In the fifth step, it is determined whether all high-density grids have already undergone the expansion operation. If so, the procedure is terminated. If not, the third step is re-performed.
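The five steps above may be illustrated by the following minimal Python sketch for two-dimensional data. It is not the implementation disclosed in the cited publication. The density determination rule is assumed to be a simple count threshold (`min_density`), the grid quantity is assumed to apply per axis (`grid_count`), and the function name `grid_cluster_sketch`, the `bounds` parameter, and the label -1 for unclustered points are hypothetical.

```python
from collections import defaultdict

def grid_cluster_sketch(points, grid_count, min_density, bounds):
    """Cluster 2-D points by expanding over high-density grids.

    Returns one label per point: a cluster id (0, 1, ...) or -1.
    """
    (x_min, y_min), (x_max, y_max) = bounds
    cell_w = (x_max - x_min) / grid_count
    cell_h = (y_max - y_min) / grid_count

    # Step 1: divide the space into grid_count x grid_count grids.
    cells = defaultdict(list)
    for idx, (x, y) in enumerate(points):
        col = min(int((x - x_min) / cell_w), grid_count - 1)
        row = min(int((y - y_min) / cell_h), grid_count - 1)
        cells[(col, row)].append(idx)

    def is_dense(cell):
        # Assumed density determination rule: a plain count threshold.
        return len(cells.get(cell, ())) >= min_density

    labels = [-1] * len(points)
    expanded = set()                 # grids already placed on a seed list
    cluster_id = 0
    # Steps 2 and 5: take each high-density grid not yet expanded as an initial grid.
    for start in list(cells):
        if start in expanded or not is_dense(start):
            continue
        seeds = [start]
        expanded.add(start)
        while seeds:
            cell = seeds.pop()       # Step 3: select a seed and delete it from the list
            if not is_dense(cell):
                continue             # low-density seed: dropped without expansion
            for idx in cells[cell]:  # Step 4: cluster all points in the seed
                labels[idx] = cluster_id
            col, row = cell
            for dc in (-1, 0, 1):    # add the 8 surrounding grids as seeds
                for dr in (-1, 0, 1):
                    nb = (col + dc, row + dr)
                    if nb != cell and nb not in expanded and nb in cells:
                        expanded.add(nb)
                        seeds.append(nb)
        cluster_id += 1
    return labels
```

In this sketch the density determination is made once per grid rather than once per data point, although each high-density seed still inspects all 8 of its surrounding grids.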
In contrast to the conventional ANGEL data clustering method, the GOD-CS data clustering method reduces the time consumption and is more convenient to use owing to a simplified parameter setting procedure.
In the above conventional data clustering methods, a data cluster is expanded by searching the 8 grids surrounding a central grid (as in the GOD-CS method) or by searching all grids located in a horizontal or vertical direction of the central grid. When the data cluster is expanded to grids that can be merged, grid merging is performed to improve the noise filtering rate and the data clustering accuracy. However, searching every single grid causes many grids to be searched repeatedly, which lengthens execution times and lowers the data clustering efficiency.
In light of this problem, it is necessary to provide a grid-based data clustering method that offers high data clustering accuracy and is convenient to use.