(1) Field of Invention
The present invention relates to a system for automatic data clustering and, more particularly, to a system for automatic data clustering which utilizes bio-inspired computing models.
(2) Description of Related Art
Data clustering is the assignment of objects into groups, or clusters, such that objects within the same cluster are more similar to one another than objects from different clusters. Several data clustering techniques exist in the art; however, the majority of these techniques require users to specify the number of clusters, which prevents automatic clustering of the data. The ability to automatically cluster large data sets plays an important role in many applications, non-limiting examples of which include image analysis, data mining, biomedical data analysis, and dynamic network analysis.
The primary challenges in the field of data clustering include defining a similarity measure that is optimal for a given application and automatically determining the number of clusters. Researchers have been addressing the challenges of data clustering for many decades. As a result, there are many clustering algorithms reported in the literature. The existing techniques can be grouped into two classes. The first is distribution-based clustering, such as the AutoClass algorithm. The second is distribution-free clustering, such as the K-means method. In distribution-based clustering, one has to estimate the distributions within the data, and then use the estimated distributions to separate the data. In distribution-free clustering, on the other hand, one uses a minimal distance measure to iteratively separate the data.
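By way of illustration only, the distribution-free approach can be sketched as a minimal K-means loop. The sketch assumes Euclidean distance and a deterministic initialization from the first k points; the function name `kmeans` and its parameters are hypothetical and are not taken from any cited technique. Note that the number of clusters `k` must be supplied by the caller, which is the limitation discussed in this section.

```python
import math

def kmeans(points, k, iterations=100):
    """Illustrative K-means on a list of equal-length tuples; k is user-supplied."""
    # Assumption: seed the centers with the first k points (deterministic, for clarity).
    centers = [tuple(p) for p in points[:k]]
    labels = None
    for _ in range(iterations):
        # Assignment step: each point joins the center at minimal distance.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centers will not move again
        labels = new_labels
        # Update step: move each center to the mean of its assigned points.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centers
```

Calling `kmeans(points, 2)` on two well-separated groups of points would typically return one label per group, but the result depends on the chosen `k` and on the initialization.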
Although many techniques have been proposed for data clustering, the key issue of automatically determining the number of clusters inside the data remains unsolved. In many applications, a human operator is involved in determining the number of clusters in the data. Some reported techniques attempt to estimate the number of clusters automatically, but they have not been entirely successful. Accordingly, a robust technique that can estimate the number of clusters automatically has not been presented thus far.
Currently, there are two methods used to estimate the number of clusters in a data set. The first is to incrementally increase the number of clusters, then see which number produces the best data clustering result. The second is to treat every data point as a cluster initially, then iteratively merge the clusters until the best clustering result is achieved. Both methods depend on evaluating the quality of data clustering, which is itself the most difficult problem. Conceptually, the best clustering result should have minimal averaged distances within a cluster and maximal distances between the clusters. The first requirement favors a large number of clusters, whereas the second requirement favors a small number of clusters. Consequently, there is no single best way to determine the number of clusters. One reason for this is that there is no widely accepted method for evaluating the quality of data clustering.
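The tension between the two requirements can be illustrated with a short sketch of the second (merging) method. The sketch makes several assumptions that are not part of any cited technique: clusters are merged by closest centroids, the within-cluster measure is the mean point-to-centroid distance, and the between-cluster measure is the smallest centroid-to-centroid distance. All function names are hypothetical.

```python
import math
from itertools import combinations

def centroid(cluster):
    """Coordinate-wise mean of a list of equal-length tuples."""
    return tuple(sum(c) / len(cluster) for c in zip(*cluster))

def within(clusters):
    """Mean distance from each point to the centroid of its own cluster."""
    total = sum(math.dist(p, centroid(c)) for c in clusters for p in c)
    return total / sum(len(c) for c in clusters)

def between(clusters):
    """Smallest centroid-to-centroid distance (infinite for a single cluster)."""
    cents = [centroid(c) for c in clusters]
    return min((math.dist(a, b) for a, b in combinations(cents, 2)),
               default=math.inf)

def merge_trace(points):
    """Start with one cluster per point and merge until one cluster remains.

    Returns a list of (number_of_clusters, within, between) tuples, one per level.
    """
    clusters = [[p] for p in points]
    trace = [(len(clusters), within(clusters), between(clusters))]
    while len(clusters) > 1:
        # Assumption: merge the pair of clusters whose centroids are closest.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: math.dist(centroid(clusters[ij[0]]),
                                     centroid(clusters[ij[1]])),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
        trace.append((len(clusters), within(clusters), between(clusters)))
    return trace
```

With every point in its own cluster, the within-cluster measure is zero, so minimizing it alone favors as many clusters as there are points. The between-cluster measure, by contrast, tends to grow as clusters are merged. Neither measure by itself identifies where the merging should stop.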
Therefore, there is a continuing need for a system, method, and computer program product which automatically estimates the number of clusters through the use of bio-inspired computing models.