1 Introduction
Analysis of high-dimensional data is encountered in numerous pattern recognition applications. Often, a small number of dimensions suffices to explain the high-dimensional data. Dimensionality reduction methods such as principal components analysis (PCA) [R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. Wiley-Interscience, 2nd edition, 2000] and multidimensional scaling (MDS) [I. Borg and P. Groenen. Modern multidimensional scaling: Theory and applications. Springer Verlag, New York, 1997] are often used to obtain a low-dimensional representation of the data, a commonly used pre-processing stage in pattern recognition.
Multidimensional scaling (MDS) is a generic name for a family of algorithms that, given a matrix representing the pairwise distances between a set of points in some abstract metric space, attempt to find a representation of these points in a low-dimensional (typically Euclidean) space. The distances in the target space should be as close to the original ones as possible. MDS algorithms are of great importance in the field of multidimensional data analysis. Originally introduced in the field of psychology (Warren S. Torgerson. Multidimensional scaling: I. Theory and method. Psychometrika, 17:401-419, 1952; Joseph B. Kruskal. Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika, 29:1-27, 1964), MDS algorithms have since been applied in a variety of fields. The most common applications include dimensionality reduction (Eric L. Schwartz, Alan Shaw, and Estarose Wolfson. A numerical solution to the generalized mapmaker's problem: Flattening nonconvex polyhedral surfaces. IEEE Trans. Pattern Anal. Mach. Intell., 11:1005-1008, November 1989; Joshua B. Tenenbaum, Vin de Silva, and John C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319-2323, December 2000), visualization and analysis of data (for example, financial data (15; 33; 30)), information retrieval (Brian T. Bartell, Garrison W. Cottrell, and Richard K. Belew. Latent semantic indexing is an optimal special case of multidimensional scaling. In SIGIR '92: Proceedings of the 15th annual international ACM SIGIR conference on Research and development in information retrieval, pages 161-167, New York, N.Y., USA, 1992. ACM Press), graph visualization (Emden R. Gansner, Yehuda Koren, and Stephen C. North. Graph drawing by stress majorization. In János Pach, editor, Graph Drawing, volume 3383 of Lecture Notes in Computer Science, pages 239-250. 
Springer, 2004), texture mapping in computer graphics (Gil Zigelman, Ron Kimmel, and Nahum Kiryati. Texture mapping using surface flattening via multidimensional scaling. IEEE Transactions on Visualization and Computer Graphics, 8(2):198-207, 2002), bioinformatics (Ronald A. Fisher. The systematic location of genes by means of crossover observations. The American Naturalist, 56:406-411, 1922; Yoshihiro Taguchi and Yoshitsugu Oono. Relational patterns of gene expression via non-metric multidimensional scaling analysis. Bioinformatics, 21(6):730-740(11), March 2005; Payel Das, Mark Mol, Hernan Stamati, Lydia E. Kavraki, and Cecilia Clementi. Low-dimensional, free-energy landscapes of protein-folding reactions by nonlinear dimensionality reduction. Proc. Natl. Acad. Sci. USA, 103(26):9885-9890, June 2006), and other applications (Keith T. Poole. Nonparametric unfolding of binary choice data. Political Analysis, 8(3):211-237, March 2000; Shusaku Tsumoto and Shoji Hirano. Visualization of rule's similarity using multidimensional scaling. In ICDM, pages 339-346, 2003; Ka W. Cheung and Hing C. So. A multidimensional scaling framework for mobile location using time-of-arrival measurements. IEEE Transactions on Signal Processing, 53(2):460-470, 2005). More recently, MDS methods have been brought into the computer vision community as efficient methods for non-rigid shape analysis and recognition (Asi Elad and Ron Kimmel. On bending invariant signatures for surfaces. IEEE Trans. Pattern Anal. Mach. Intell., 25(10):1285-1295, 2003; Alex M. Bronstein, Michael M. Bronstein, and Ron Kimmel. Expression-invariant face recognition via spherical embedding. In Proc. IEEE International Conf. Image Processing (ICIP), 2005). The data sets encountered in the above applications are often of a very large size. At the same time, the nonlinear and non-convex nature of MDS problems tends to make their solution computationally demanding. 
As a result, MDS algorithms tend to be slow, which makes their practical application in large-scale problems challenging or even prohibitive. A number of low-complexity algorithms that find an approximate solution to an MDS problem have recently been proposed for large-scale settings (Christos Faloutsos and King-Ip Lin. FastMap: A fast algorithm for indexing, data-mining and visualization of traditional and multimedia datasets. In Michael J. Carey and Donovan A. Schneider, editors, Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data, pages 163-174, San Jose, Calif., 22-25 May 1995; Matthew Chalmers. A linear iteration time layout algorithm for visualizing high-dimensional data. In IEEE Visualization, pages 127-132, 1996; Jason Tsong-Li Wang, Xiong Wang, King-Ip Lin, Dennis Shasha, Bruce A. Shapiro, and Kaizhong Zhang. Evaluating a class of distance-mapping algorithms for data mining and clustering. In KDD '99: Proceedings of the fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 307-311, New York, N.Y., USA, 1999. ACM Press; Vin de Silva and Joshua B. Tenenbaum. Global versus local methods in nonlinear dimensionality reduction. In Suzanna Becker, Sebastian Thrun, and Klaus Obermayer, editors, Advances in Neural Inf. Proc. Sys., pages 705-712. MIT Press, 2002; Alistair Morrison, Greg Ross, and Matthew Chalmers. Fast multidimensional scaling through sampling, springs and interpolation. Information Visualization, 2(1):68-77, 2003; Tynia Yang, Jinze Liu, Leonard McMillan, and Wei Wang. A fast approximation to multidimensional scaling. In IEEE Workshop on Computation Intensive Methods for Computer Vision, 2006; Ulrik Brandes and Christian Pich. Eigensolver methods for progressive multidimensional scaling of large data. In Michael Kaufmann and Dorothea Wagner, editors, Graph Drawing, Karlsruhe, Germany, Sep. 18-20, 2006, pages 42-53. Springer, 2007). 
Yet, some applications (for example, the representation of the intrinsic geometry of non-rigid shapes in computer vision) require a (numerically) exact solution, which makes approximate MDS algorithms inappropriate. Recently, Bronstein et al. proposed an efficient multigrid solver for the exact MDS problem, used for non-rigid shape recognition (Michael M. Bronstein, Alex M. Bronstein, R. Kimmel, and I. Yavneh. Multigrid multidimensional scaling. Numerical Linear Algebra with Applications, Special issue on multigrid methods, 13(2-3):149-171, March-April 2006). The method showed a significant improvement over traditional exact MDS algorithms; it made, however, heavy use of the underlying problem geometry. The generalization of this method to generic MDS problems, where the underlying geometry is not given explicitly, is not straightforward.
Multidimensional scaling can also be considered an alternative to factor analysis. In general, the goal of MDS is to detect meaningful underlying dimensions that allow analyzing the data in terms of an intermediate representation based on observed similarities or dissimilarities (distances) between the investigated objects. In factor analysis, the similarities between objects (e.g., variables) are expressed in the correlation matrix; MDS, on the other hand, operates directly on the observed similarities or dissimilarities.
MDS is a general name for algorithms that try to find a point configuration in some metric space given a set of inter-point distances (dissimilarities). These algorithms try to minimize a cost function penalizing the deviation of inter-point distances in the point configuration from the input distances. This cost function is usually of the form
F(X) = \sum_{\substack{i=1 \\ j>i}}^{N} w_{ij}\, f\left( d_{ij}(X),\ (D)_{ij},\ X \right)
wherein f is the error cost associated with the distortion of a single distance, w_ij is the weight of a specific point pair, d_ij(X) is the inter-point distance in the new point configuration, and (D)_ij is the given dissimilarity between points i and j.
One specific choice for the function f is a squared error cost. When d_ij(X) is the Euclidean distance, this choice yields the STRESS cost function, also known as least-squares scaling:
\mathrm{STRESS}(X) = \sum_{\substack{i=1 \\ j>i}}^{N} w_{ij} \left( d_{ij}(X) - (D)_{ij} \right)^2
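To make the criterion concrete, the STRESS above can be evaluated directly. The following minimal NumPy sketch (an illustration only, not the solver of any cited work; the function name `stress` is ours) sums the weighted squared deviations over all pairs i < j:

```python
import numpy as np

def stress(X, D, W=None):
    """Weighted STRESS of a configuration X (N x m) against a given
    dissimilarity matrix D (N x N); W defaults to all-ones weights."""
    N = X.shape[0]
    if W is None:
        W = np.ones((N, N))
    # Pairwise Euclidean distances d_ij(X) in the current configuration.
    diff = X[:, None, :] - X[None, :, :]
    dX = np.sqrt((diff ** 2).sum(axis=2))
    # Sum over the upper triangle, matching the double index i = 1..N, j > i.
    iu = np.triu_indices(N, k=1)
    return float((W[iu] * (dX[iu] - D[iu]) ** 2).sum())
```

When D contains the exact Euclidean distances of X itself, the STRESS vanishes; any perturbation of the configuration makes it strictly positive.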
Other alternatives include the squared error between squared distance values (the SSTRESS cost function), the STRAIN cost function, and non-metric MDS cost functions, such as functions penalizing violations of the rank ordering of the distances (ordinal MDS), to name a few [I. Borg and P. Groenen. Modern multidimensional scaling: Theory and applications. Springer Verlag, New York, 1997].
Specifically, the MDS algorithm lends itself to the case where only a notion of similarity or dissimilarity exists between points in the data set, without any initial representing coordinates. These dissimilarities can be exact quantities, such as geometric data, or vague, as in many applications of MDS in psychology or marketing research. The dissimilarities are usually specific to the application domain (Cecilio Mar-Molinero and Carlos Serrano-Cinca. Bank failure: a multidimensional scaling approach. The European Journal of Finance, 7(2), June 2001; Shusaku Tsumoto and Shoji Hirano. Visualization of rule's similarity using multidimensional scaling. In Proc. Third IEEE International Conference on Data Mining (ICDM'03), pages 339-346, November 2003). In some domains, MDS is used on distances that do not translate to a concrete quantity but are rather of a more abstract origin, such as in graph visualization (Emden R. Gansner, Yehuda Koren, and Stephen C. North. Graph drawing by stress majorization. In Proc. 12th Int. Symp. Graph Drawing (GD'04), volume 3383 of Lecture Notes in Computer Science, pages 239-250. Springer Verlag, 2004).
One drawback of the MDS algorithm is its time complexity, O(N²) per iteration. It would be very useful to reduce the computational cost of the MDS algorithm, thereby allowing it to be used in new applications and scenarios.
1.1 Using MDS for Nonlinear Dimensionality Reduction
Often a more meaningful set of dissimilarities can be computed using only dissimilarities between closely related data elements. Typically, such local neighborhood distances are summed into global distances between distant data elements. This results in an approximation of the geodesic lengths [E. L. Schwartz, A. Shaw, and E. Wolfson. A numerical solution to the generalized mapmaker's problem: Flattening nonconvex polyhedral surfaces. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11:1005-1008, November 1989; J. B. Tenenbaum, V. de Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319-2323, December 2000]. Such algorithms allow using MDS as a nonlinear dimensionality reduction method. Whereas applying linear dimensionality reduction methods to nonlinear data may result in a distorted representation, nonlinear dimensionality reduction (NLDR) methods attempt to describe a given high-dimensional data set of points as a low-dimensional manifold, by a nonlinear map preserving certain properties of the data. This kind of analysis has applications in numerous fields, such as color perception, pathology tissue analysis [R. R. Coifman, S. Lafon, A. B. Lee, M. Maggioni, B. Nadler, F. Warner, and S. W. Zucker. Geometric diffusions as a tool for harmonic analysis and structure definition of data. Proceedings of the National Academy of Sciences, 102(21):7426-7431, May 2005], enhancement of Magnetic Resonance Imaging (MRI) images [R. M. Diaz and A. Q. Arencibia, editors. Coloring of DT-MRI Fiber Traces using Laplacian Eigenmaps, Las Palmas de Gran Canaria, Spain, Feb. 24-28, 2003. Springer Verlag], face recognition [A. M. Bronstein, M. M. Bronstein, and R. Kimmel. Three-dimensional face recognition. International Journal of Computer Vision (IJCV), 64(1):5-30, August 2005], and biochemistry [Y. Keller, S. Lafon, and M. Krauthammer. Protein cluster analysis via directed diffusion. 
In The fifth Georgia Tech International Conference on Bioinformatics, November 2005], to mention a few.
As the input data, we assume to be given N points in the M-dimensional Euclidean space ℝ^M. The points constitute the vertices of a proximity graph with the set of edges E; the points z_i, z_j are neighbors if (i, j) ∈ E. The data points are samples of an m-dimensional manifold ℳ ⊂ ℝ^M, where M ≫ m. The manifold is assumed to have a parametrization, represented by the smooth bijective map φ: C ⊂ ℝ^m → ℳ.
The goal is to find a set of points {x_i}_{i=1}^N ⊂ ℝ^m representing the parametrization. We will use the N×m matrix X to represent the coordinates of the points in the parametrization space. Many NLDR techniques attempt to find an m-dimensional representation of the data while preserving some local invariant. This, however, makes them vulnerable to noise in the data. Such methods include locally linear embedding (LLE) [S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290:2323-2326, 2000], Laplacian eigenmaps [M. Belkin and P. Niyogi. Laplacian eigenmaps and spectral techniques for embedding and clustering. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems, volume 14, pages 585-591, Cambridge, Mass., 2002. MIT Press], Hessian LLE [C. Grimes and D. L. Donoho. Hessian eigenmaps: Locally linear embedding techniques for high-dimensional data. Proceedings of the National Academy of Sciences, 100(10):5591-5596, May 2003], diffusion maps [R. R. Coifman, S. Lafon, A. B. Lee, M. Maggioni, B. Nadler, F. Warner, and S. W. Zucker. Geometric diffusions as a tool for harmonic analysis and structure definition of data. Proceedings of the National Academy of Sciences, 102(21):7426-7431, May 2005], local tangent space alignment [Z. Zhang and H. Zha. Principal manifolds and nonlinear dimensionality reduction via tangent space alignment. SIAM Journal on Scientific Computing, 23(3), 2005], semidefinite embedding [K. Q. Weinberger and L. K. Saul. Unsupervised learning of image manifolds by semidefinite programming. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, volume 2, pages 988-995, Washington D.C., 2004. IEEE Computer Society], and other algorithms.
Another class of algorithms preserves global invariants, such as the geodesic distances d_ij, approximated, for example, as shortest paths on the proximity graph. Unlike local approaches, the Isomap algorithm [E. L. Schwartz, A. Shaw, and E. Wolfson. A numerical solution to the generalized mapmaker's problem: Flattening nonconvex polyhedral surfaces. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11:1005-1008, November 1989; J. B. Tenenbaum, V. de Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319-2323, December 2000] considers both local and global invariants: the lengths of geodesics between the points on the manifold. Short geodesics are assumed to be equal to Euclidean distances, and longer ones are approximated by shortest-path lengths on the proximity graph, using standard graph search methods such as Dijkstra's algorithm [T. H. Cormen, C. E. Leiserson, and R. L. Rivest. Introduction to Algorithms. MIT Press and McGraw-Hill, 1990; E. W. Dijkstra. A note on two problems in connexion with graphs. Numerische Mathematik, 1:269-271, 1959]. Isomap then uses multidimensional scaling (MDS) [I. Borg and P. Groenen. Modern multidimensional scaling: Theory and applications. Springer Verlag, New York, 1997] to find an m-dimensional Euclidean representation of the data such that the Euclidean distances between points are as close as possible to the corresponding geodesic ones, for example, using the least-squares criterion (STRESS),
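The geodesic-approximation step described above can be sketched with a plain Dijkstra search over the proximity graph. The following self-contained Python sketch is an illustration only; the `edges` dictionary, mapping neighbor pairs (i, j) ∈ E to their local Euclidean distances, is a hypothetical input format we introduce here:

```python
import heapq

def geodesic_distances(n, edges):
    """Approximate geodesic distances as shortest-path lengths on a
    proximity graph with n vertices. `edges` maps pairs (i, j) to local
    (Euclidean) distances; the graph is treated as undirected.
    Returns an n x n list of lists D with D[i][j] the path length."""
    # Build an adjacency list from the edge set E.
    adj = [[] for _ in range(n)]
    for (i, j), w in edges.items():
        adj[i].append((j, w))
        adj[j].append((i, w))
    INF = float("inf")
    D = [[INF] * n for _ in range(n)]
    for s in range(n):
        # Standard Dijkstra search from source s.
        D[s][s] = 0.0
        heap = [(0.0, s)]
        while heap:
            d, u = heapq.heappop(heap)
            if d > D[s][u]:
                continue  # stale heap entry
            for v, w in adj[u]:
                if d + w < D[s][v]:
                    D[s][v] = d + w
                    heapq.heappush(heap, (d + w, v))
    return D
```

Running Dijkstra from every source costs O(N(|E| + N) log N) with a binary heap; the resulting matrix D is then fed to MDS as the input dissimilarities.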
X^* = \operatorname*{argmin}_{X \in \mathbb{R}^{N \times m}} \sum_{i<j} w_{ij} \left( d_{ij}(X) - (D)_{ij} \right)^2

where d_ij(X) = \|x_i - x_j\|_2 is the Euclidean distance between points x_i and x_j in ℝ^m.
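One standard way to minimize this least-squares criterion is iterative majorization (SMACOF), covered in the Borg and Groenen book cited above. The following NumPy sketch assumes unit weights w_ij = 1 and is only an illustration of the Guttman transform X ← (1/N) B(X) X, not the solver developed in this work:

```python
import numpy as np

def smacof(D, m=2, iters=200, seed=0):
    """Minimize the unit-weight STRESS by SMACOF majorization.
    B(X) has off-diagonal entries -D_ij / d_ij(X) and diagonal entries
    equal to minus the corresponding row sums (Guttman transform)."""
    N = D.shape[0]
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((N, m))  # random initial configuration
    for _ in range(iters):
        diff = X[:, None, :] - X[None, :, :]
        dX = np.sqrt((diff ** 2).sum(axis=2))
        with np.errstate(divide="ignore", invalid="ignore"):
            B = np.where(dX > 0, -D / dX, 0.0)
        np.fill_diagonal(B, 0.0)
        np.fill_diagonal(B, -B.sum(axis=1))
        X = B @ X / N  # majorization update
    return X
```

For dissimilarities that are exactly realizable in ℝ^m, the iteration drives the STRESS toward zero, recovering the configuration up to a rigid transformation.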
The main advantage of Isomap is that it uses global geometric invariants, which are less sensitive to noise than local ones. Yet, its underlying assumption is that ℳ is isometric to C ⊂ ℝ^m with the induced metric d_C, that is, δ(z_i, z_j) = d_C(x_i, x_j) for all i, j = 1, . . . , N. If C is convex, the restricted metric d_{ℝ^m}|_C coincides with the induced metric d_C and Isomap succeeds in recovering the parametrization of ℳ. Otherwise, C no longer has Euclidean geometry and MDS cannot be used. The assumption of convexity of C appears to be too restrictive, as many data manifolds have a complicated topology that violates this assumption. Donoho and Grimes [C. Grimes and D. L. Donoho. When does Isomap recover the natural parameterization of families of articulated images? Technical Report 2002-27, Department of Statistics, Stanford University, Stanford, Calif. 94305-4065, 2002] showed examples of data in which C is non-convex, and pointed out that Isomap fails in such cases. As part of our work, we suggest a way to overcome this limitation of Isomap (G. Rosman, A. M. Bronstein, M. M. Bronstein, and R. Kimmel. In B. Rosenhahn, R. Klette, and D. Metaxas, editors, Human Motion: Modeling, Tracking, Capture and Animation).