A variety of graphical representations have been developed to communicate the information in a dataset more effectively to an audience. For example, a pie chart can be used to show at a glance the relative proportions of several ingredients or sub-quantities that make up a whole, and a bar chart can show the development of a metric over discrete periods of time. A scatter chart can convey information about a plurality of observations of a pair of numeric quantities: the observations are plotted on a rectilinear graph, and the viewer can apprehend characteristics of the observations that would be harder to understand from simply examining a table of numbers. For example, FIG. 4 shows a simple scatter plot of 1,000 random X,Y pairs of numbers drawn from Gaussian distributions having means at (2,0) and (−2,0) (standard deviation=1). The centers of the two sets of numbers can be identified by inspection: the two higher-density clumps of the scatter plot appear to be centered around (−2,0) and (2,0).
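As a rough illustration of the kind of data plotted in FIG. 4, the following sketch generates 1,000 X,Y pairs around the two stated centers and then estimates each clump's center. Several details are assumptions not stated in the text: the even split between the two distributions, the fixed seed, the function names, and the crude rule of splitting on the sign of X as a stand-in for visual inspection.

```python
import random


def make_two_clump_sample(n=1000, centers=((2.0, 0.0), (-2.0, 0.0)), sd=1.0, seed=0):
    """Draw n (x, y) pairs, alternating between Gaussian clumps at the given centers.

    The even split between clumps and the seed are assumptions for illustration.
    """
    rng = random.Random(seed)
    points = []
    for i in range(n):
        cx, cy = centers[i % len(centers)]
        points.append((rng.gauss(cx, sd), rng.gauss(cy, sd)))
    return points


def _mean(points):
    """Component-wise mean of a non-empty list of (x, y) pairs."""
    count = len(points)
    return (sum(x for x, _ in points) / count, sum(y for _, y in points) / count)


def clump_means(points):
    """Split the points on the sign of x and average each half.

    This is only a crude stand-in for what a viewer does by eye.
    """
    left = [p for p in points if p[0] < 0]
    right = [p for p in points if p[0] >= 0]
    return _mean(left), _mean(right)


if __name__ == "__main__":
    left_center, right_center = clump_means(make_two_clump_sample())
    print("left clump center:", left_center)
    print("right clump center:", right_center)
```

The estimated centers should land near (−2,0) and (2,0), which is the observation a viewer makes from the scatter plot without any computation.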
When more than two quantities need to be represented on a scatter chart, they may be shown as different symbols or different colors on a two-dimensional plot, or as points in a three-dimensional volume, displayed by a sophisticated apparatus or mapped to a two-dimensional image through conventional computer-graphics processing. However, a direct-view representation of large numbers of records, comprising more than four or five quantities each, is cumbersome and impractical. For example, a multivariate data set collected by British statistician and biologist Ronald Fisher and published in his 1936 paper “The Use of Multiple Measurements in Taxonomic Problems” is often used to illustrate (and test) statistical data analyses. The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor); each sample comprises measurements of the length and the width of the sepals and petals of the specimen, plus its species. A few example records are reproduced here, and the two-dimensional scatter plots of FIG. 5 show each measurement in relation to the others, with the species represented by different symbols in the plot.
Sepal Length   Sepal Width   Petal Length   Petal Width   Species
5.1            3.5           1.4            0.2           Iris-setosa
4.9            3.0           1.4            0.2           Iris-setosa
7.0            3.2           4.7            1.4           Iris-versicolor
6.4            3.2           4.5            1.5           Iris-versicolor
6.3            3.3           6.0            2.5           Iris-virginica
5.8            2.7           5.1            1.9           Iris-virginica
One of the species can be distinguished by inspection of many of the plots, but visually distinguishing all three species is virtually impossible. Nevertheless, a variety of statistical techniques (including embodiments of the invention described below) make quick work of this task—but only because the total number of data records is small.
The purpose served by a scatter graph, namely highlighting groups of similar or related observations, is also central to understanding and using data sets that contain extremely large numbers of records, each with many associated measurements, quantities or qualities. In particular, the ability to identify groups or clusters within observations that include both quantitative (numeric, often continuous) elements and qualitative (discrete) elements is of substantial value in many fields. Unfortunately, popular techniques for emulating what a human accomplishes easily by looking at a two-dimensional scatter chart are computationally expensive when performed on large datasets (e.g., millions of records) containing many characteristics per record. For example, the standard K-Medoids algorithm is an important and widely practiced clustering technique for mixed data. The basic and highest-quality K-Medoids routine is Partitioning Around Medoids, or “PAM.” PAM is an iterative algorithm that seeks to improve the clustering quality at each iteration. However, the cost of PAM is quadratic in the number of observations, a rate of growth that makes it infeasible for modern data set sizes. Existing PAM variants such as CLARA, CLARANS and pivot-based K-Medoids also suffer from various drawbacks. For example, CLARA can result in poor-quality clusters. Under the ubiquitous situation of a constant number of iterations, CLARANS is also of quadratic cost. Further, CLARANS does not provide any formal or mathematical guarantees on its per-iteration quality relative to PAM. Pivot-based K-Medoids works only for Euclidean metrics.
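To make the quadratic cost concrete, the following is a minimal sketch of a PAM-style swap search over mixed records; it is not the method of the embodiments. The text does not name a dissimilarity measure, so the Gower-style measure here is an assumption, as are the function names, the `ranges` parameter (the value range of each numeric field, or `None` for a discrete field), and the naive choice of the first k records as starting medoids in place of PAM's greedy initialization.

```python
from itertools import product


def mixed_dissimilarity(a, b, ranges):
    """Gower-style dissimilarity (an assumed choice, not taken from the text).

    Numeric fields contribute |a - b| / range; discrete fields contribute 0 or 1.
    """
    total = 0.0
    for x, y, r in zip(a, b, ranges):
        if r is None:                      # discrete (qualitative) field
            total += 0.0 if x == y else 1.0
        elif r:                            # numeric field with a non-zero range
            total += abs(x - y) / r
    return total / len(ranges)


def pam(records, k, ranges, max_iter=100):
    """Return (medoid indices, cluster labels, total cost) for a PAM-style search."""
    n = len(records)
    # Pairwise table with n * n entries: the quadratic growth noted in the text.
    d = [[mixed_dissimilarity(records[i], records[j], ranges) for j in range(n)]
         for i in range(n)]

    def cost(medoids):
        # Sum over all records of the dissimilarity to the nearest medoid.
        return sum(min(d[i][m] for m in medoids) for i in range(n))

    medoids = list(range(k))               # naive start (assumption; see lead-in)
    best = cost(medoids)
    for _ in range(max_iter):
        best_swap = None
        # Try replacing each medoid with each non-medoid; keep the best improvement.
        for pos, cand in product(range(k), range(n)):
            if cand in medoids:
                continue
            trial = medoids[:pos] + [cand] + medoids[pos + 1:]
            c = cost(trial)
            if c < best - 1e-12:
                best, best_swap = c, trial
        if best_swap is None:              # no swap improves the clustering
            break
        medoids = best_swap
    labels = [min(range(k), key=lambda j: d[i][medoids[j]]) for i in range(n)]
    return medoids, labels, best


if __name__ == "__main__":
    # Hypothetical mixed records: one numeric field and one discrete field.
    data = [(1.0, "a"), (1.2, "a"), (0.9, "a"), (5.0, "b"), (5.1, "b"), (4.8, "b")]
    print(pam(data, k=2, ranges=[4.2, None]))
```

Even before any swaps are evaluated, the pairwise table requires one dissimilarity per pair of records, so a data set of a million records would need on the order of a trillion entries.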
Approaches that reduce the computational complexity of clustering mixed data while offering theoretical guarantees on the quality of the results may be of substantial value in many fields.