1. Field of the Invention
The present invention relates to a method for grid-based data clustering, in which a computer creates a feature space having a plurality of cubes and obtains an optimal data clustering result through an operation that incorporates both density-based and grid-based algorithms.
2. Description of the Related Art
Generally, “data mining” is primarily utilized in the field of data management to establish a data-analyzing model for identifying concealed features of, and relationships between, the data within a database. Said established data-analyzing model is suitable for several applications, such as analyses of commercial transactions, position distribution, file management, and network intrusion, so that a user can uncover hidden and useful information as a reference source. There are six kinds of techniques for data mining, namely clustering, classification, association, time-series, regression, and sequence, with the clustering technique being the most popular in present use. Moreover, the clustering technique itself has several branches, such as the partitioning operation, the hierarchical operation, the density-based operation, and the grid-based operation. However, in execution, each of said clustering techniques has some drawbacks, as follows.
First, regarding the partitioning operation, it is processed by steps of: determining a center of all data within a database; verifying distances between the data; and clustering the data according to the verified distances. Representative algorithms for the partitioning operation are K-means, PAM, CLARA, CLARANS, etc. Although the conventional partitioning operation is fast in clustering, the clustering result is unstable and the noise data are not filtered out.
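The partitioning steps described above can be illustrated with a minimal K-means-style sketch. It is not taken from any of the cited algorithms' reference implementations; the function name `kmeans`, the caller-supplied starting centers, and the sample data are all assumptions made for illustration only.

```python
from math import dist

def kmeans(points, centers, iterations=10):
    """Hypothetical sketch: assign points to the nearest center, then re-center."""
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest center.
        labels = [min(range(len(centers)), key=lambda k: dist(p, centers[k]))
                  for p in points]
        # Update step: each center moves to the mean of its assigned points.
        new_centers = []
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(centers[k])  # keep an empty cluster's center
        if new_centers == centers:
            break  # assignments can no longer change
        centers = new_centers
    return labels, centers

# Two small groups plus one far-away noise point (illustrative data).
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (50.0, 50.0)]
labels, centers = kmeans(points, centers=[(0.0, 0.0), (10.0, 10.0)])
print(labels)  # [0, 0, 0, 0, 0, 1]
```

In this example the noise point at (50, 50) is not filtered out; it captures a cluster of its own and forces the two genuine groups to be merged, which is consistent with the drawback noted above. A different choice of starting centers can yield a different result, reflecting the instability of the partitioning operation.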
Second, regarding the hierarchical operation, it is processed by pre-constructing a tree-like hierarchical structure and thereby decomposing the data within the database, with the tree-like hierarchical structure being built through an agglomerative approach or a divisive approach. Through the agglomerative approach, the clustering result can be obtained by combining parts of the data bottom-up; through the divisive approach, the clustering result can be obtained by iteratively decomposing the data top-down. Representative algorithms for the agglomerative approach are BIRCH, CURE, ROCK, etc., and a representative algorithm for the divisive approach is CHAMELEON. However, the conventional hierarchical operation has to compare the similarity of data during combination or decomposition, which can easily lead to a long execution time.
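The bottom-up (agglomerative) approach can be sketched as follows. This is a simplified single-linkage illustration, not BIRCH, CURE, or ROCK; the function name `agglomerative`, the use of the minimum pairwise distance as the similarity measure, and the comparison counter are assumptions introduced here to make the cost of the similarity comparisons visible.

```python
from math import dist

def agglomerative(points, target_clusters):
    """Hypothetical sketch: repeatedly merge the two closest clusters."""
    clusters = [[p] for p in points]  # start with one cluster per datum
    comparisons = 0                   # number of cluster-to-cluster comparisons
    while len(clusters) > target_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                comparisons += 1
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))  # j > i, so index i stays valid
    return clusters, comparisons

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
clusters, comparisons = agglomerative(points, target_clusters=2)
print(clusters)     # [[(0.0, 0.0), (0.0, 1.0)], [(5.0, 5.0), (5.0, 6.0)]]
print(comparisons)  # 9
```

Even for four data, nine cluster comparisons are needed in this sketch, and every merge step re-examines all remaining cluster pairs, which suggests why the executing time grows quickly with the size of the database.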
Third, regarding the density-based operation, it is processed by clustering the data in accordance with the data density of an area. For example, if the data density of an area meets a predetermined criterion, a search will be executed and extended from the area, and other areas meeting the criterion will be combined, so as to form the clustering result. Representative algorithms for the density-based operation are DBSCAN, IDBSCAN, GDBSCAN, etc. Said density-based operation can detect clusters of irregular shape and filter out noise data efficiently, but it also requires a long execution time.
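The search-and-extend behavior described above can be sketched in the style of DBSCAN. This is a simplified illustration rather than a faithful reproduction of any cited algorithm; the names `density_cluster`, `eps` (search radius), and `min_pts` (density criterion) and the sample data are assumptions.

```python
from math import dist

def density_cluster(points, eps, min_pts):
    """Hypothetical DBSCAN-style sketch; returns one label per point, -1 = noise."""
    NOISE = -1
    labels = [None] * len(points)

    def neighbors(i):
        # Brute-force radius search: every call scans the whole data set.
        return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]

    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # density criterion not met
            continue
        labels[i] = cluster_id
        queue = [j for j in seeds if j != i]
        while queue:  # extend the search from the dense area
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point reached from a dense area
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            nj = neighbors(j)
            if len(nj) >= min_pts:
                queue.extend(nj)  # j is dense too, so keep expanding
        cluster_id += 1
    return labels

points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
          (8.0, 8.0), (8.2, 7.9), (8.1, 8.2), (50.0, 50.0)]
labels = density_cluster(points, eps=0.5, min_pts=3)
print(labels)  # [0, 0, 0, 1, 1, 1, -1]
```

The isolated point is labeled as noise, but each neighborhood query in this sketch scans every datum, which illustrates the long executing time attributed to the density-based operation.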
Finally, regarding the grid-based operation, it is processed by creating a feature space to represent the data within the database, dividing the feature space into a plurality of grids, and then combining adjacent grids in accordance with analysis results of the data within each grid, so as to obtain the clustering result. Moreover, the minimum unit to be clustered is the grid within the feature space, rather than the individual datum in each grid. Representative algorithms for the grid-based operation are STING, CLIQUE, etc. The clustering speed of the conventional grid-based operation is fast because the minimum clustered unit is a grid. However, the rectangular grids can result in an imprecise clustering result or a pattern with jagged edges.
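The grid-based operation can be sketched as follows for two-dimensional data. This is neither STING nor CLIQUE; the function name `grid_cluster`, the parameters `cell_size` and `min_count`, and the choice to treat the eight surrounding grids as adjacent are assumptions made for illustration.

```python
from collections import defaultdict, deque
from math import floor

def grid_cluster(points, cell_size, min_count):
    """Hypothetical sketch: cluster dense grid cells; returns labels, -1 = noise."""
    # Divide the feature space into square grids and record each grid's data.
    cells = defaultdict(list)
    for idx, (x, y) in enumerate(points):
        cells[(floor(x / cell_size), floor(y / cell_size))].append(idx)
    # A grid is kept only if it holds at least min_count data.
    dense = {c for c, members in cells.items() if len(members) >= min_count}

    cell_label = {}
    cluster_id = 0
    for start in sorted(dense):
        if start in cell_label:
            continue
        cell_label[start] = cluster_id
        queue = deque([start])
        while queue:  # combine adjacent dense grids into one cluster
            cx, cy = queue.popleft()
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    nb = (cx + dx, cy + dy)
                    if nb in dense and nb not in cell_label:
                        cell_label[nb] = cluster_id
                        queue.append(nb)
        cluster_id += 1

    # Every datum inherits the label of its grid; the grid is the clustered unit.
    labels = [-1] * len(points)
    for cell, cid in cell_label.items():
        for idx in cells[cell]:
            labels[idx] = cid
    return labels

points = [(0.1, 0.1), (0.4, 0.3), (1.2, 0.2), (1.5, 0.6),
          (5.1, 5.2), (5.3, 5.4), (9.0, 0.5)]
labels = grid_cluster(points, cell_size=1.0, min_count=2)
print(labels)  # [0, 0, 0, 0, 1, 1, -1]
```

Because the sketch only counts data per grid and compares neighboring grids, no distance between individual data is ever computed, which accounts for the speed. The resulting cluster boundaries necessarily follow the grid edges, which is the source of the jagged pattern noted above.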
Accordingly, the conventional clustering techniques suffer from several problems, such as long execution time, the presence of noise data, and imprecise clustering results. Therefore, for practical use, how to retain the advantages of the conventional clustering techniques while eliminating their drawbacks is an important topic in the relevant technical field. Hence, there is a need to improve the conventional clustering techniques.