1. Field of the Invention
Embodiments of this invention relate generally to clustering. More particularly, an embodiment of the present invention relates to k-means clustering using t-test computation.
2. Description of Related Art
Clustering is a measure of similarity between various objects based on a mathematical formulation. Clustering is used to obtain collections of objects that are similar to one another within a cluster and dissimilar to the objects belonging to other clusters. This multivariate statistical analysis-type clustering is also known as unsupervised clustering analysis, numerical taxonomy, and classification analysis. For example, in molecular biology, clustering is used to group or sort biological genes or samples into separate clusters based on their statistical behavior so that the degree of association is strong between members of the same cluster and weak between members of different clusters. Examples of clustering techniques include Jarvis-Patrick, Agglomerative Hierarchical, Self-Organizing Map (SOM), and k-means.
K-means clustering is a simple unsupervised learning algorithm that is used to solve some well-known clustering problems. The k-means algorithm is used to generate fixed-sized, flat classifications and clusters based on distance metrics for similarity. The conventional k-means clustering algorithm follows a simplistic way to classify a given data set into a given number of clusters (e.g., k clusters) fixed a priori. Stated differently, the k-means algorithm starts with an initial partition of the cases into k clusters (e.g., a value of k is assigned at initialization). The process then continues by modifying the partition to reduce the sum of the distances of each case from the mean of the cluster to which the case belongs. One problem with the conventional k-means algorithm is that a certain initial value of k has to be assigned based only on estimation. Such a value of k is often incorrect and negatively impacts the final result.
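The conventional procedure described above can be sketched as follows. This is a minimal one-dimensional illustration, not the claimed invention: the helper name `kmeans`, the random initial partition, and the fixed iteration count are assumptions made for brevity.

```python
import random

def kmeans(points, k, iters=20):
    """Conventional k-means: k is fixed a priori, then the partition is
    iteratively modified to reduce each point's distance to its cluster mean."""
    # Initial partition: pick k distinct data points as starting means.
    means = random.sample(points, k)
    for _ in range(iters):
        # Assignment step: each point joins the cluster with the nearest mean.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: (p - means[j]) ** 2)
            clusters[nearest].append(p)
        # Update step: each mean moves to the centroid of its cluster
        # (an empty cluster keeps its previous mean).
        means = [sum(c) / len(c) if c else means[i]
                 for i, c in enumerate(clusters)]
    return means, clusters
```

On well-separated data the iteration converges quickly regardless of the random start, but the quality of the result still depends entirely on the assumed value of k.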
One way to reduce the impact of the k value is to rerun the algorithm with different randomly generated starting partitions or different initial k values. Because the number of true clusters in the data is not known, the algorithm is run with several values of k close to the number of clusters expected from the data, to determine how the sum of distances decreases with increasing values of k. However, this conventional approach of rerunning the k-means algorithm is time-consuming, inefficient, and cumbersome, and it still does not eliminate or significantly reduce the negative impact of k on the final solution.
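The rerun approach described above might look like the following sketch, which repeats a compact one-dimensional k-means for several candidate values of k and compares the resulting sum of squared distances. The function name `kmeans_cost` and the sample data are illustrative assumptions; note that every candidate k requires a full additional run, which is the inefficiency noted above.

```python
import random

def kmeans_cost(points, k, iters=20):
    """Run conventional k-means for a fixed k and return the sum of
    squared distances of each point to its assigned cluster mean.
    (Illustrative 1-D sketch, not the claimed method.)"""
    means = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda j: (p - means[j]) ** 2)].append(p)
        means = [sum(c) / len(c) if c else means[i]
                 for i, c in enumerate(clusters)]
    return sum((p - means[i]) ** 2
               for i, c in enumerate(clusters) for p in c)

# Rerun the whole algorithm for several candidate k values and
# observe how the total distance shrinks as k increases.
random.seed(1)
pts = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8, 9.0, 9.2, 8.8]
costs = {k: kmeans_cost(pts, k) for k in range(1, 5)}
```

Because the sum of distances generally keeps decreasing as k grows, comparing these costs only suggests a plausible k; it does not remove the dependence of the final solution on the chosen value.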