Data “clustering” involves grouping data such that the data in the same group (“cluster”) are more similar to each other, across one or more attributes, than to the data in other clusters. Data clustering is commonly used in many fields, including image analysis, machine learning, pattern recognition, information retrieval, and bioinformatics. Data clustering can be performed using various computational methods, including “K-means” clustering.
K-means clustering partitions a dataset of observations into K clusters in which each observation belongs to the cluster with the nearest mean, which serves as a prototype or “centroid” of the cluster. The observations may be represented using high-dimensional data vectors. As one example in the field of image recognition, the dataset may comprise a number of images of various apparel, e.g., jackets, with each data point in the dataset being a 64×64 grayscale pixel image. A K-means clustering algorithm can be used to find groups of images that represent similar-looking jackets.
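By way of a non-limiting illustration, the following Python sketch shows one way a 64×64 grayscale image might be represented as a 4096-dimensional data vector and associated with the nearest of a set of centroids. The function names (flatten_image, squared_euclidean, nearest_centroid) and the pixel values are hypothetical and chosen only for this example, and the squared Euclidean distance is assumed as the distance measure.

```python
def flatten_image(image):
    """Turn a 2-D grid of grayscale pixel values into one flat data vector."""
    return [float(pixel) for row in image for pixel in row]


def squared_euclidean(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_centroid(vector, centroids):
    """Return the index of the centroid closest to the given data vector."""
    return min(
        range(len(centroids)),
        key=lambda i: squared_euclidean(vector, centroids[i]),
    )


# Hypothetical data: a uniformly dark 64x64 image and two candidate centroids,
# one all-black (index 0) and one all-white (index 1).
dark_image = [[10] * 64 for _ in range(64)]
vector = flatten_image(dark_image)
centroids = [[0.0] * 4096, [255.0] * 4096]

print(len(vector))                          # 4096
print(nearest_centroid(vector, centroids))  # 0
```

In this sketch the dark image is associated with the all-black centroid because it is the nearer of the two prototypes.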
FIG. 1 is a flow diagram illustrating a typical K-means clustering routine 100. At block 105, given a dataset of N data vectors and a specific value K, the routine 100 may randomly classify the N data vectors into K initial clusters. At block 110, the routine 100 computes the centroids of the K clusters. Although the term “centroid” is used here, one with ordinary skill in the art will recognize that the terms “average”, “mean”, and “first moment” are equivalent. At block 115, the routine 100 determines K new clusters by associating each data vector with its nearest centroid. Various measures can be used to represent the distance between a data vector and a centroid, e.g., the Euclidean distance or the cosine distance. At decision block 120, the routine 100 determines whether an appropriate end condition has been reached. For example, the routine 100 may stop after a predetermined number of iterations, or when each centroid has moved less than a threshold distance from its position in the previous iteration. If the end condition has not been reached, the routine 100 proceeds to another iteration by returning to block 110, where the centroids of the K new clusters are determined.
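The blocks of routine 100 can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than a definitive implementation: the names (k_means, max_iters, tol, seed, distance) are hypothetical, the initial classification deals shuffled data vectors into the K clusters in turn so that no cluster starts empty, a cluster that later becomes empty is assumed to keep its previous centroid, and centroid movement is always measured with the Euclidean distance.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """One minus the cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def mean_vector(vectors):
    """Component-wise mean (centroid) of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def k_means(data, k, max_iters=100, tol=1e-6, seed=0, distance=euclidean):
    """Sketch of routine 100; returns (centroids, labels)."""
    if not 1 <= k <= len(data):
        raise ValueError("k must be between 1 and the number of data vectors")
    rng = random.Random(seed)

    # Block 105: randomly classify the N data vectors into K initial clusters.
    order = list(range(len(data)))
    rng.shuffle(order)
    labels = [0] * len(data)
    for position, index in enumerate(order):
        labels[index] = position % k

    previous = None
    for _ in range(max_iters):  # end condition: predetermined iteration count
        # Block 110: compute the centroid of each of the K clusters.
        centroids = []
        for c in range(k):
            members = [v for v, label in zip(data, labels) if label == c]
            # Assumption: an emptied cluster keeps its previous centroid.
            centroids.append(mean_vector(members) if members else previous[c])

        # Block 115: associate each data vector with its nearest centroid.
        labels = [
            min(range(k), key=lambda c: distance(v, centroids[c])) for v in data
        ]

        # Block 120: stop once every centroid moved less than the threshold.
        if previous is not None and all(
            euclidean(old, new) < tol for old, new in zip(previous, centroids)
        ):
            break
        previous = centroids

    return centroids, labels


# Hypothetical data: two well-separated groups of two-dimensional vectors.
data = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
centroids, labels = k_means(data, k=2)
print(labels)
```

Passing distance=cosine_distance selects the cosine distance at block 115 instead of the Euclidean distance. Because the initial classification is random, the cluster numbering (which group is labeled 0 and which 1) depends on the seed.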
Although K-means clustering often produces good clustering results, applying it can be computationally expensive because finding an optimal K-means clustering is “NP-hard” in general.
The headings provided herein are for convenience only and do not necessarily affect the scope or meaning of the claimed embodiments. Further, the drawings have not necessarily been drawn to scale. For example, the dimensions of some of the elements in the figures may be expanded or reduced to help improve the understanding of the embodiments. Similarly, some components and/or operations may be separated into different blocks or combined into a single block for the purposes of discussion of some of the embodiments. Moreover, while the various embodiments are amenable to various modifications and alternative forms, specific embodiments have been shown by way of example in the drawings and are described in detail below. The intention, however, is not to limit the particular embodiments described. On the contrary, the embodiments are intended to cover all modifications, equivalents, and alternatives falling within the scope of the disclosed embodiments as defined by the appended claims.