Analyzing a dataset may require determining a structure in the data. This structure may depict observations, objects, or data in the dataset as similar in some fashion. The analysis may include partitioning the dataset into a number of clusters, each cluster including those data in the dataset that are similar to some datum representative of the cluster. Certain methods of partitioning datasets, such as K-means and K-medoids, may attempt to minimize a metric, such as a sum of distances between each datum and the center of the cluster that includes the datum. The K-means method determines mean values, called centroids, as the centers of the clusters. The K-medoids method determines exemplary members of the clusters, called medoids, as the centers of the clusters.
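By way of non-limiting illustration, the distinction between a centroid and a medoid may be sketched for a single cluster as follows. The function names, the Euclidean distance, and the sample points are assumptions chosen for the example and are not drawn from any particular method described herein.

```python
import math

def centroid(points):
    """Coordinate-wise mean of the points; need not be a member of the cluster."""
    n = len(points)
    dims = len(points[0])
    return tuple(sum(p[i] for p in points) / n for i in range(dims))

def medoid(points):
    """Member of the cluster with the smallest summed distance to all members."""
    return min(points, key=lambda c: sum(math.dist(c, p) for p in points))

def total_distance(center, points):
    """Sum of distances between a candidate center and every datum in the cluster."""
    return sum(math.dist(center, p) for p in points)

# Hypothetical two-dimensional cluster (illustrative values only).
cluster = [(0.0, 0.0), (2.0, 0.0), (1.0, 1.0), (9.0, 9.0)]

c = centroid(cluster)   # a mean value, here not one of the data
m = medoid(cluster)     # always one of the data

print("centroid:", c, "in cluster:", c in cluster)
print("medoid:  ", m, "in cluster:", m in cluster)
print("summed distance to medoid:", total_distance(m, cluster))
```

In this sketch the centroid is a computed mean that does not coincide with any datum, whereas the medoid is selected from among the data themselves.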
Methods partitioning datasets around medoids may have advantages over methods partitioning datasets around centroids. Methods partitioning datasets around medoids may be more broadly applicable. Such methods may be used when the center of the cluster must fall within the data domain of the dataset. For example, the data in the dataset may be associated with a discrete-valued variable, in which case a mean may be impossible to define. Even defining a median for categorical data taking values such as “red,” “blue,” or “green” may be difficult, because such values have no natural ordering. Methods partitioning datasets around medoids may also be more robust in the presence of spurious outlying data, because a medoid is itself a member of the dataset and may be shifted less by an extreme datum than a mean would be.
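As a hedged illustration of these two points, the following sketch selects a medoid for categorical data using a simple match/mismatch dissimilarity, and compares a mean with a medoid for numeric data containing one outlier. The dissimilarity functions and the sample values are assumptions made for the example only.

```python
def medoid(items, dissimilarity):
    """Item whose summed dissimilarity to all items is smallest."""
    return min(items, key=lambda c: sum(dissimilarity(c, x) for x in items))

def mismatch(a, b):
    """Assumed dissimilarity for categorical values: 0 if equal, otherwise 1."""
    return 0 if a == b else 1

# Categorical data: a mean is undefined, but a medoid can still be selected.
colors = ["red", "red", "blue", "green"]
color_medoid = medoid(colors, mismatch)

# Numeric data with one spurious outlying value (illustrative values only).
values = [1.0, 2.0, 3.0, 4.0, 100.0]
mean_value = sum(values) / len(values)
value_medoid = medoid(values, lambda a, b: abs(a - b))

print("categorical medoid:", color_medoid)
print("mean with outlier:  ", mean_value)
print("medoid with outlier:", value_medoid)
```

Under these assumptions the mean is drawn well away from the bulk of the values by the single outlier, while the medoid remains among them.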
However, existing methods of partitioning datasets around medoids scale poorly with increasing dataset size and complexity. These methods may require significant amounts of memory, and the number of required calculations may increase with the square of the number of data in the dataset. Thus, methods and systems for analyzing datasets are needed that are robust to outliers, permit clustering of discrete and continuous data, scale efficiently to larger datasets, and execute more quickly and with less memory than existing methods known to one of skill in the art.
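The quadratic growth noted above may be illustrated with simple arithmetic. The sketch below assumes, solely for purposes of example, that a method evaluates the distance between every pair of data and stores a dense distance matrix at 8 bytes per entry; neither assumption is attributed to any specific existing method.

```python
def pairwise_distance_count(n):
    """Number of distinct pairs among n data: n * (n - 1) / 2."""
    return n * (n - 1) // 2

def dense_matrix_bytes(n, bytes_per_entry=8):
    """Memory for a full n-by-n distance matrix (assumed 8 bytes per entry)."""
    return n * n * bytes_per_entry

for n in (1_000, 10_000, 100_000):
    pairs = pairwise_distance_count(n)
    gib = dense_matrix_bytes(n) / 2**30
    print(f"n={n:>7}: {pairs:>13,} pairwise distances, {gib:8.3f} GiB dense matrix")
```

Under these assumptions, a tenfold increase in the number of data yields roughly a hundredfold increase in both the number of distance calculations and the memory for the matrix.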