1. Field of the Invention
The present invention relates to a data clustering method, particularly to a method for grid-based data clustering.
2. Description of the Related Art
With the progress of information technology, the amount of data stored in databases is increasing substantially. “Data mining” is generally utilized in the field of data management to identify useful information hidden in the data stored in a database and to reveal concealed features of, and relationships between, the said data, so as to establish a data-analyzing model. Moreover, through the data clustering methods of data mining, the degree of correlation between the data can be quickly obtained, and thus data that are highly similar in a given feature can be identified as belonging to the same cluster. Presently, several kinds of data clustering methods are widely used, and two general kinds of them are introduced as follows.
The “K-means” data clustering method, proposed by MacQueen in 1967, is a data clustering method based on a partitioning operation and is processed by the following steps. In the first step, a total of “k” cores “x” are randomly selected from all data within a database, with the number “k” being the required number of resulting clusters. In the second step, the distances between the cores and the other data in the database are computed, and each of the said other data is then designated to the cluster containing the nearest one of the cores according to the computed distances. In the third step, after all the data are designated, a new core for each cluster is determined by finding the datum located closest to the center of the cluster; the new core is compared with the original core of the cluster and replaces the original core if they are different. After the third step, the second and third steps are repeated if the new core and the original core differ in any one of the clusters, and the whole data clustering process terminates once the cores of all the clusters are settled. The primary advantage of the K-means data clustering method is its high clustering speed even when there is a great number of data in the database. However, owing to the randomly selected initial cores “x,” the K-means data clustering method may easily lead to different clustering results for the same database. Namely, the clustering result of the K-means data clustering method is unstable. Besides, because the designation of a datum to a cluster depends merely on the comparison of distances between the cores and the said other data, the clustering accuracy of the K-means data clustering method is usually not ideal.
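By way of illustration only, the three steps described above may be sketched as follows. This is a minimal sketch and not a definitive implementation: the function names (sq_dist, assign, closest_to_centre, kmeans_cores), the seed and max_iter parameters, the use of squared Euclidean distance, and the handling of ties and empty clusters are all assumptions introduced for the example.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, cores):
    """Second step: designate every datum to the cluster of its nearest core."""
    clusters = [[] for _ in cores]
    for p in points:
        nearest = min(range(len(cores)), key=lambda i: sq_dist(p, cores[i]))
        clusters[nearest].append(p)
    return clusters


def closest_to_centre(members):
    """Third step: pick the datum located closest to the centre of a cluster."""
    dim = len(members[0])
    centre = tuple(sum(m[d] for m in members) / len(members) for d in range(dim))
    return min(members, key=lambda m: sq_dist(m, centre))


def kmeans_cores(points, k, seed=0, max_iter=100):
    """Repeat the second and third steps until every core is settled.

    seed and max_iter are illustrative safeguards, not part of the
    description above.
    """
    rng = random.Random(seed)
    cores = rng.sample(points, k)  # first step: k randomly selected cores
    for _ in range(max_iter):
        clusters = assign(points, cores)
        # An empty cluster (possible with duplicate data) keeps its old core.
        new_cores = [closest_to_centre(m) if m else c
                     for m, c in zip(clusters, cores)]
        if new_cores == cores:  # all cores settled: terminate
            break
        cores = new_cores
    return cores, assign(points, cores)
```

Calling kmeans_cores with different seed values on the same data may return different cores and clusters, which corresponds to the instability noted above.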
Another data clustering method, named the DBSCAN data clustering method, was proposed by M. Ester et al. in 1996; it is a data clustering method based on a density-based operation and is processed by the following steps. In the first step, a core point is randomly selected from all data points within a database. In the second step, the number of data points within an area having a searching radius and centered at the core point is counted, which is called a search action, to identify whether the number of data points in the area exceeds or equals a threshold value. If the said number is less than the threshold value when the search action of the core point finishes, the core point is regarded as a noise point. Alternatively, if the said number is not less than the threshold value, the data points in the area are designated as being in the same cluster, and the other data points in the area then go through the said search action to extend the cluster. The cluster keeps extending until the number of data points within every newly searched area is less than the threshold value. In the third step, data points other than those already designated are identified and then go through the above-mentioned first and second steps until every data point is either designated to a cluster or regarded as a noise point. This conventional DBSCAN data clustering method is good at noise filtering and suits databases with irregularly arranged data points. However, because the said search action has to be performed for every data point, a long processing time is unavoidable, which is a critical drawback.
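By way of illustration only, the search action and cluster extension described above may be sketched as follows. This is a minimal sketch under stated assumptions: the names region_query, dbscan, eps (searching radius), and min_pts (threshold value) are introduced for the example, the count of an area includes its center point, and the points are visited in index order rather than randomly so that the result is reproducible.

```python
from collections import deque

NOISE = -1  # label used for noise points (an assumption of this sketch)


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def region_query(points, i, eps):
    """Search action: indices of all points within radius eps of point i."""
    return [j for j, q in enumerate(points)
            if sq_dist(points[i], q) <= eps * eps]


def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:  # already designated or marked as noise
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:
            labels[i] = NOISE
            continue
        labels[i] = cluster_id
        queue = deque(neighbours)
        while queue:  # extend the cluster through the points in the area
            j = queue.popleft()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # former noise joins as a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            nearby = region_query(points, j, eps)
            if len(nearby) >= min_pts:  # dense area: keep extending
                queue.extend(nearby)
        cluster_id += 1
    return labels
```

Because region_query scans every point for each search action, the running time of this sketch grows roughly with the square of the number of data points, which reflects the long processing time noted above.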
Accordingly, in order to overcome the unstable clustering results and the long processing time, the above-mentioned data clustering methods have been further improved.
An “ANGEL” data clustering method based on a grid-based operation, which combines a partitioning operation and a density-based operation, has recently been proposed. The ANGEL data clustering method comprises the steps of: creating a feature space having a plurality of cubes and disposing a plurality of data stored in a database into the cubes, and then defining some of the cubes as populated cubes according to the number of data disposed in the cubes; identifying whether the data within each of the populated cubes are evenly distributed, and defining the populated cubes having evenly distributed data as major cubes and those having unevenly distributed data as minor cubes; examining the minor cubes by the DBSCAN data clustering method to search for border data disposed near the borders of each minor cube, and then comparing the border data with the data in the major cubes to combine at least one of the border data with the data in the major cubes; and designating all the data combined with each other as being in the same cluster and recursively processing the above procedures to cluster all the data stored in the database. In comparison with the K-means data clustering method, the ANGEL data clustering method is better in result stability and noise filtering. Besides, in comparison with the DBSCAN data clustering method, the ANGEL data clustering method processes faster. However, it is difficult for a user to determine the initial parameters required for processing the ANGEL data clustering method according to various purposes and needs.
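By way of illustration only, the first step described above, namely disposing the data into cubes and defining the populated cubes, may be sketched as follows. This is a minimal sketch and does not reproduce the ANGEL data clustering method itself: the function name populated_cubes and the parameters cube_size (edge length of each cube) and min_count (minimum number of data for a cube to count as populated) are assumptions introduced for the example, and the later steps (major and minor cubes, border data) are not shown.

```python
from collections import defaultdict


def populated_cubes(points, cube_size, min_count):
    """Dispose points into axis-aligned cubes and keep the populated ones.

    Each cube is identified by a tuple of integer grid indices, obtained by
    floor-dividing every coordinate by cube_size (an illustrative choice).
    """
    cubes = defaultdict(list)
    for p in points:
        key = tuple(int(c // cube_size) for c in p)
        cubes[key].append(p)
    # A cube is "populated" when it holds at least min_count data.
    return {key: members for key, members in cubes.items()
            if len(members) >= min_count}
```

Both cube_size and min_count must be chosen in advance, which is an instance of the parameter-determination burden noted above.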
Therefore, a G-TREACLE data clustering method was then proposed by the inventor of the ANGEL data clustering method. It comprises density-based, grid-based, and hierarchical operations and improves the ANGEL data clustering method by replacing the DBSCAN data clustering method therein with the said hierarchical operation. In detail, similarly to the initial steps of the ANGEL data clustering method, the G-TREACLE data clustering method defines populated cubes in the same way. However, instead of identifying major and minor cubes from those populated cubes, this method defines a “Dynamic-Gradient-Threshold (DGT)” value to filter out noise data and thus identifies some of the populated cubes as border cubes that contain the border data of a cluster. Then, a searching radius and a threshold value are given for the data in each border cube to complete the hierarchical operation, and, finally, data in the same cluster are identified and grouped. Although the processing speed of this method is faster than that of the ANGEL data clustering method, there are still too many parameters for a user to determine.
As a result, regarding the above two enhanced data clustering methods, even though their clustering accuracy and processing speed are improved, they are still inconvenient to use owing to the need to determine the parameters. Hence, there is a need to improve the conventional data clustering methods.