1. Field of the Invention
The present invention relates to the field of computing. More particularly, the present invention relates to a method and an apparatus for performing dynamic exploratory visual data analysis.
2. Description of the Related Art
Clustering, the grouping together of similar data points in a data set, is a widely used procedure for statistically analyzing data. Practical applications of clustering include unsupervised classification and taxonomy generation, nearest-neighbor searching, scientific discovery, vector quantization, text analysis and navigation, data reduction and summarization, supermarket database analysis, customer/market segmentation, and time series analysis.
One of the more popular techniques for clustering data in a data set is the k-means algorithm, which generates a minimum variance grouping of data by minimizing the sum of squared Euclidean distances from cluster centroids. The popularity of the k-means algorithm is based on its ease of interpretation, simplicity of implementation, scalability, speed of convergence, parallelizability, adaptability to sparse data, and ease of out-of-core implementation. Variations of the k-means algorithm exist for numerical, categorical and mixed attributes. Variations of the k-means algorithm also exist for similarity measures other than Euclidean distance.
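As an illustration of the minimum variance grouping described above, the following is a minimal sketch of one common form of the k-means iteration (alternating nearest-centroid assignment and centroid update). The function names, the seeded random initialization, and the iteration cap are assumptions made for this example only and are not taken from any particular implementation.

```python
import random

def squared_distance(p, q):
    # Squared Euclidean distance between two points given as tuples.
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid(points):
    # Coordinate-wise mean of a non-empty list of points.
    return tuple(sum(c) / len(points) for c in zip(*points))

def assign(points, centroids):
    # Place each point in the cluster of its nearest centroid.
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)),
                      key=lambda j: squared_distance(p, centroids[j]))
        clusters[nearest].append(p)
    return clusters

def k_means(points, k, max_iterations=100, seed=0):
    # Assumed initialization: k distinct data points chosen at random.
    centroids = random.Random(seed).sample(points, k)
    for _ in range(max_iterations):
        clusters = assign(points, centroids)
        # An empty cluster keeps its previous centroid.
        updated = [centroid(c) if c else centroids[j]
                   for j, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return centroids, assign(points, centroids)

def sum_of_squares(centroids, clusters):
    # The quantity k-means tries to reduce: total squared distance
    # from every point to the centroid of its own cluster.
    return sum(squared_distance(p, m)
               for m, c in zip(centroids, clusters) for p in c)
```

Calling `k_means(points, 2)` on a list of coordinate tuples returns the centroids and the corresponding clusters; `sum_of_squares` then reports the objective value for that grouping. The result is a local minimum that can depend on the initial centroids.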
Statistical and computational issues associated with the k-means algorithm have received considerable attention. The same cannot be said, however, for another key ingredient for multidimensional data analysis: visualization, or the exploratory data analysis based on dynamic computer graphics.
Conventional exploratory data analysis techniques use unsupervised dimensionality reduction methods for processing multidimensional data sets. Examples of popular conventional unsupervised dimensionality reduction methods used for projecting high-dimensional data to fewer dimensions for visualization include truncated singular value decomposition, projection pursuit, Sammon mapping, multi-dimensional scaling and a nonlinear projection method based on Kohonen's topology preserving maps.
Truncated singular value decomposition is a global, linear projection methodology that is closely related to principal component analysis (PCA). Projection pursuit combines both global and local properties of multi-dimensional data sets to find useful and interesting projections. For example, see J. Friedman et al., A projection pursuit algorithm for exploratory data analysis, IEEE Transactions on Computers, C-23, pp. 881-890, 1974; P. Huber, Projection pursuit (with discussion), Annals of Statistics, 13, pp. 435-525, 1985; and D. Cook et al., Grand tour and projection pursuit, Journal of Computational and Graphical Statistics, 4(3), pp. 155-172, 1995.
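To make the linear projection idea concrete, the sketch below estimates the leading right singular vector of a data matrix by power iteration and projects each data point onto it, which corresponds to a singular value decomposition truncated to a single component. This is an illustrative assumption rather than a method from the cited references; the function names are hypothetical, and the data are not mean-centered (centering first would give the first principal component instead).

```python
import math

def top_singular_direction(rows, iterations=200):
    # Power iteration on A^T A, where A is the data matrix whose rows
    # are the data points. Assumes the starting vector is not
    # orthogonal to the leading direction.
    dim = len(rows[0])
    v = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iterations):
        scores = [sum(r[j] * v[j] for j in range(dim)) for r in rows]  # A v
        w = [sum(s * r[j] for s, r in zip(scores, rows))               # A^T (A v)
             for j in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return v
        v = [x / norm for x in w]
    return v

def project_1d(rows, direction):
    # Coordinate of each data point along the chosen direction.
    return [sum(a * b for a, b in zip(r, direction)) for r in rows]
```

Because the same direction is applied to every data point, the projection is global and linear, which is what makes it easy to interpret compared with the nonlinear mappings discussed next.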
Sammon mapping and multi-dimensional scaling are each nonlinear projection methods used for projecting multi-dimensional data to fewer dimensions. For details regarding Sammon mapping, see J. W. Sammon, A nonlinear mapping for data structure analysis, IEEE Transactions on Computers, Vol. 18, pp. 401-409, 1969. For details regarding multi-dimensional scaling, see J. B. Kruskal, Nonmetric multidimensional scaling: A numerical method, Psychometrika, Vol. 29, pp. 115-129, 1964, and J. B. Kruskal, Multidimensional scaling and other methods for discovering structure, Statistical Methods for Digital Computers, K. Enslein et al. editors, Wiley, pp. 296-339, 1977.
The basic idea for Sammon mapping and multi-dimensional scaling is to minimize the mean-squared difference between interpoint distances in the original space and interpoint distances in the projected space. The nonlinear mappings produced by Sammon's method and multi-dimensional scaling are difficult to interpret and are generally computationally expensive.
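The objective described above can be sketched as follows. The first function computes the mean-squared difference between interpoint distances in the original and projected spaces; the second computes the normalized variant commonly associated with Sammon's method, in which each squared difference is divided by the original distance. The function names are assumptions made for this example, and the second function assumes that no two original points coincide.

```python
import math
from itertools import combinations

def mean_squared_distance_error(original, projected):
    # Average, over all pairs of points, of the squared difference
    # between the original and the projected interpoint distance.
    pairs = list(combinations(range(len(original)), 2))
    total = sum(
        (math.dist(original[i], original[j])
         - math.dist(projected[i], projected[j])) ** 2
        for i, j in pairs
    )
    return total / len(pairs)

def sammon_stress(original, projected):
    # Sammon-style stress: squared differences weighted by the inverse
    # of the original distance, normalized by the sum of original distances.
    pairs = list(combinations(range(len(original)), 2))
    scale = sum(math.dist(original[i], original[j]) for i, j in pairs)
    weighted = sum(
        (math.dist(original[i], original[j])
         - math.dist(projected[i], projected[j])) ** 2
        / math.dist(original[i], original[j])
        for i, j in pairs
    )
    return weighted / scale
```

Evaluating either function visits every pair of points, so a single evaluation grows quadratically with the number of data points, and an optimizer must evaluate it repeatedly. This is one reason such mappings are computationally expensive.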
A recently proposed nonlinear projection method for visualizing high-dimensional data as a two-dimensional image uses Kohonen's topology preserving maps. See, for example, M. A. Kraaijveld et al., A nonlinear projection method based on Kohonen's topology preserving maps, IEEE Transactions on Neural Networks, Vol. 6(3), pp. 548-559, 1995. For background regarding Kohonen's topology preserving maps, see T. Kohonen, Self-Organization and Associative Memory, Springer-Verlag, 1989. This approach generates only a single two-dimensional projection and not a set of projections, so it does not appear possible to construct guided tours based on this approach.
What is needed is a way to visualize a multi-dimensional data set in relation to clusters that have been produced by the k-means algorithm. What is also needed is a way to visually understand the proximity relationship between the cluster centroids of a data set.