The above referenced applications are incorporated herein by reference in their entireties.
1. Field of the Invention
The present invention is directed to data analysis and, more particularly, to representation of proximity data in multi-dimensional space.
2. Related Art
Multidimensional scaling (MDS) and non-linear mapping (NLM) are techniques for generating display maps, including non-linear maps, of objects wherein the distances between the objects represent relationships between the objects.
MDS and NLM were introduced by Torgerson, Psychometrika, 17:401 (1952); Kruskal, Psychometrika, 29:115 (1964); and Sammon, IEEE Trans. Comput., C-18:401 (1969) as a means to generate low-dimensional representations of psychological data. Multidimensional scaling and non-linear mapping are reviewed in Schiffman, Reynolds and Young, Introduction to Multidimensional Scaling, Academic Press, New York (1981); Young and Hamer, Multidimensional Scaling: History, Theory and Applications, Erlbaum Associates, Inc., Hillsdale, N.J. (1987); and Cox and Cox, Multidimensional Scaling, Number 59 in Monographs in Statistics and Applied Probability, Chapman-Hall (1994). The contents of these publications are incorporated herein by reference in their entireties.
MDS and NLM (these are generally the same, and are hereafter collectively referred to as MDS) represent a collection of methods for visualizing proximity relations of objects by distances of points in a low-dimensional Euclidean space. Proximity measures are reviewed in Hartigan, J. Am. Statist. Ass., 62:1140 (1967), which is incorporated herein by reference in its entirety.
In particular, given a finite set of vectorial or other samples A={ai, i=1, . . . , k}, a relationship function rij=r(ai, aj), with ai, aj∈A, which measures the similarity or dissimilarity between the i-th and j-th objects in A, and a set of images X={x1, . . . , xk; xi∈Rm} of A on an m-dimensional display plane (Rm being the space of all m-dimensional vectors of real numbers), the objective is to place the images xi onto the display plane in such a way that their Euclidean distances dij=∥xi−xj∥ approximate as closely as possible the corresponding values rij. This projection, which in many cases can only be made approximately, is carried out in an iterative fashion by minimizing an error function which measures the difference between the original, rij, and projected, dij, distance matrices of the original and projected vector sets.
Several such error functions have been proposed, most of which are of the least-squares type, including Kruskal's 'stress':

$$S = \sqrt{\frac{\sum_{i<j}^{k} (r_{ij} - d_{ij})^2}{\sum_{i<j}^{k} r_{ij}^2}} \qquad \text{EQ. 1}$$

Sammon's error criterion:

$$E = \frac{\sum_{i<j}^{k} \frac{(r_{ij} - d_{ij})^2}{r_{ij}}}{\sum_{i<j}^{k} r_{ij}} \qquad \text{EQ. 2}$$

and Lingoes' alienation coefficient:

$$K = \sqrt{\frac{\sum_{i<j}^{k} (r_{ij}\, d_{ij})^2}{\sum_{i<j}^{k} d_{ij}}} \qquad \text{EQ. 3}$$

where dij=∥xi−xj∥ is the Euclidean distance between the images xi and xj on the display plane.
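By way of illustration only, EQ. 1 and EQ. 2 may be evaluated as in the following minimal sketch. The function names `kruskal_stress` and `sammon_error`, the nested-list layout of the relationship matrix `r`, and the square-root form assumed for EQ. 1 are assumptions made for this example; all off-diagonal relationships are assumed to be known and strictly positive.

```python
import math
from itertools import combinations


def kruskal_stress(r, x):
    """Kruskal's stress (EQ. 1), assuming the square-root form.

    r: k-by-k matrix of relationships r[i][j] (only i < j is read)
    x: list of k image coordinates (tuples or lists of equal length)
    """
    pairs = list(combinations(range(len(x)), 2))  # all pairs with i < j
    num = sum((r[i][j] - math.dist(x[i], x[j])) ** 2 for i, j in pairs)
    den = sum(r[i][j] ** 2 for i, j in pairs)
    return math.sqrt(num / den)


def sammon_error(r, x):
    """Sammon's error criterion (EQ. 2); requires r[i][j] > 0 for i != j."""
    pairs = list(combinations(range(len(x)), 2))
    num = sum((r[i][j] - math.dist(x[i], x[j])) ** 2 / r[i][j] for i, j in pairs)
    den = sum(r[i][j] for i, j in pairs)
    return num / den


# Relationships of a 3-4-5 right triangle and two candidate sets of images.
r = [[0, 3, 4],
     [3, 0, 5],
     [4, 5, 0]]
exact = [(0, 0), (3, 0), (0, 4)]      # reproduces every r[i][j]
distorted = [(0, 0), (3, 0), (0, 3)]  # one image displaced

print(kruskal_stress(r, exact), sammon_error(r, exact))
print(kruskal_stress(r, distorted), sammon_error(r, distorted))
```

Both functions visit every pair i<j once, so a single evaluation of either error function already requires k(k−1)/2 distance computations.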
Generally, the solution is found in an iterative fashion by:
(1) computing or retrieving from a database the relationships rij;
(2) initializing the images xi;
(3) computing the distances of the images dij and the value of the error function (e.g. S, E or K in EQ. 1-3 above);
(4) computing a new configuration of the images xi using a gradient descent procedure, such as Kruskal's linear regression or Guttman's rank-image permutation; and
(5) repeating steps 3 and 4 until the error is minimized within some prescribed tolerance.
For example, the Sammon algorithm minimizes EQ. 2 by iteratively updating the coordinates xi using EQ. 4:

$$x_{pq}(m+1) = x_{pq}(m) - \lambda\, \Delta_{pq}(m) \qquad \text{EQ. 4}$$

where m is the iteration number, xpq is the q-th coordinate of the p-th image xp, λ is the learning rate, and

$$\Delta_{pq}(m) = \frac{\partial E(m) / \partial x_{pq}(m)}{\left| \partial^2 E(m) / \partial x_{pq}(m)^2 \right|} \qquad \text{EQ. 5}$$

The partial derivatives in EQ. 5 are given by:

$$\frac{\partial E(m)}{\partial x_{pq}(m)} = -2\, \frac{\sum_{j=1,\, j \neq p}^{k} \frac{r_{pj} - d_{pj}}{r_{pj}\, d_{pj}} (x_{pq} - x_{jq})}{\sum_{i<j}^{k} r_{ij}} \qquad \text{EQ. 6}$$

$$\frac{\partial^2 E(m)}{\partial x_{pq}(m)^2} = -2\, \frac{\sum_{j=1,\, j \neq p}^{k} \frac{1}{r_{pj}\, d_{pj}} \left[ (r_{pj} - d_{pj}) - \frac{(x_{pq} - x_{jq})^2}{d_{pj}} \left( 1 + \frac{r_{pj} - d_{pj}}{d_{pj}} \right) \right]}{\sum_{i<j}^{k} r_{ij}} \qquad \text{EQ. 7}$$
The mapping is obtained by repeated evaluation of EQ. 2, followed by modification of the coordinates using EQ. 4 and 5, until the error is minimized within a prescribed tolerance.
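By way of illustration only, the Sammon refinement of EQ. 4 through EQ. 7 may be sketched as follows. The function names, the default learning rate of 0.3, the fixed iteration count used in place of a tolerance test, the small constant `eps` that guards against division by zero, and the choice to update all images simultaneously from the previous configuration are assumptions made for this example and are not prescribed above.

```python
import math
import random


def sammon_error(r, x):
    """Sammon's error criterion (EQ. 2); requires r[i][j] > 0 for i != j."""
    k = len(x)
    pairs = [(i, j) for i in range(k) for j in range(i + 1, k)]
    c = sum(r[i][j] for i, j in pairs)
    return sum((r[i][j] - math.dist(x[i], x[j])) ** 2 / r[i][j]
               for i, j in pairs) / c


def sammon_map(r, dim=2, rate=0.3, iterations=200, seed=0, eps=1e-9):
    """Refine random images with EQ. 4-7; returns (images, error history)."""
    rng = random.Random(seed)
    k = len(r)
    x = [[rng.random() for _ in range(dim)] for _ in range(k)]  # step (2)
    c = sum(r[i][j] for i in range(k) for j in range(i + 1, k))
    history = [sammon_error(r, x)]                               # step (3)
    for _ in range(iterations):
        new_x = [row[:] for row in x]
        for p in range(k):
            # Distances from image p to every image in the old configuration.
            d = [max(math.dist(x[p], x[j]), eps) for j in range(k)]
            for q in range(dim):
                grad = 0.0  # numerator sum of EQ. 6
                curv = 0.0  # numerator sum of EQ. 7
                for j in range(k):
                    if j == p:
                        continue
                    rpj, dpj = r[p][j], d[j]
                    diff = x[p][q] - x[j][q]
                    grad += (rpj - dpj) / (rpj * dpj) * diff
                    curv += ((rpj - dpj) - diff * diff / dpj
                             * (1.0 + (rpj - dpj) / dpj)) / (rpj * dpj)
                grad *= -2.0 / c
                curv *= -2.0 / c
                # EQ. 4 with the step of EQ. 5.
                new_x[p][q] = x[p][q] - rate * grad / max(abs(curv), eps)
        x = new_x                                                # step (4)
        history.append(sammon_error(r, x))
    return x, history


# Relationships of the four corners of a unit square.
s = math.sqrt(2.0)
r = [[0, 1, 1, s],
     [1, 0, s, 1],
     [1, s, 0, 1],
     [s, 1, 1, 0]]
images, history = sammon_map(r)
print(f"initial E = {history[0]:.4f}, final E = {history[-1]:.4f}")
```

The three nested loops over p, q and j make the cost of each iteration proportional to the square of the number of objects.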
The general refinement paradigm above is suitable for relatively small data sets but has one important limitation that renders it impractical for large data sets. This limitation stems from the fact that the computational effort required to compute the gradients (i.e., step (4) above) scales with the square of the size of the data set. For relatively large data sets, this quadratic time complexity makes even a partial refinement intractable.
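By way of illustration only, the quadratic growth can be seen by counting the distinct pairs i<j that a single full evaluation of the error function or its gradients must visit. The helper name `pair_count` is an assumption made for this example.

```python
def pair_count(k: int) -> int:
    """Number of distinct object pairs (i < j) among k objects."""
    return k * (k - 1) // 2


for k in (100, 10_000, 1_000_000):
    print(f"{k:>9} objects -> {pair_count(k):>15,} pairs per iteration")
```

For 100 objects the count is 4,950 pairs, whereas for one million objects it is 499,999,500,000 pairs for every iteration of steps (3) and (4).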
What is needed is a system, method and computer program product for representing proximity data in a multi-dimensional space, that scales favorably with the number of objects and that can be applied to both small and large data sets. Moreover, what is needed is a system, method and computer program product that can be effective with missing data and/or data containing bounded or unbounded uncertainties, noise or errors.
The present invention is a system, method and computer program product for representing precise or imprecise measurements of similarity/dissimilarity (relationships) between objects preferably as distances between points in a multi-dimensional space that represent the objects. The algorithm uses self-organizing principles to iteratively refine an initial (random or partially ordered) configuration of points using stochastic relationship/distance errors. The data can be complete or incomplete (i.e. some relationships between objects may not be known), exact or inexact (i.e. some or all relationships may be given in terms of allowed ranges or limits), symmetric or asymmetric (i.e. the relationship of object A to object B may not be the same as the relationship of B to A) and may contain systematic or stochastic errors.
The relationships between objects may be derived directly from observation, measurement, a priori knowledge, or intuition, or may be determined directly or indirectly using any suitable technique for deriving proximity (relationship) data.
The present invention iteratively analyzes sub-sets of objects in order to represent them in a multi-dimensional space that represents relationships between the objects.
In an exemplary embodiment, the present invention iteratively analyzes sub-sets of objects using conventional multi-dimensional scaling or non-linear mapping algorithms.
In another exemplary embodiment, relationships are defined as pair-wise relationships or pair-wise similarities/dissimilarities between pairs of objects and the present invention iteratively analyzes a pair of objects at a time. Preferably, sub-sets are evaluated pair-wise, as a double-nested loop.
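By way of illustration only, a pair-at-a-time refinement organized as a double-nested loop may be sketched as follows. The update rule is not specified in this passage. The rule used here (moving both images of a randomly selected pair along the line joining them, in proportion to the difference between their relationship and their current distance), the linearly decreasing learning rate, and all names and default values are hypothetical choices made for this example. Unknown relationships are represented by `None` and are skipped.

```python
import math
import random


def pairwise_refine(r, dim=2, cycles=50, steps_per_cycle=200,
                    rate_start=1.0, rate_end=0.01, seed=0, eps=1e-9):
    """Refine random images one pair at a time (hypothetical update rule)."""
    rng = random.Random(seed)
    k = len(r)
    x = [[rng.random() for _ in range(dim)] for _ in range(k)]
    for cycle in range(cycles):                      # outer loop
        # Assumed schedule: learning rate falls linearly over the cycles.
        frac = cycle / max(cycles - 1, 1)
        rate = rate_start + (rate_end - rate_start) * frac
        for _ in range(steps_per_cycle):             # inner loop
            i, j = rng.sample(range(k), 2)           # one random pair
            if r[i][j] is None:                      # relationship unknown
                continue
            d = math.dist(x[i], x[j])
            # Each image takes half of the correction, scaled by the rate.
            scale = rate * 0.5 * (r[i][j] - d) / (d + eps)
            for q in range(dim):
                delta = scale * (x[i][q] - x[j][q])
                x[i][q] += delta
                x[j][q] -= delta
    return x


# Relationships of a 3-4-5 right triangle.
r = [[0, 3, 4],
     [3, 0, 5],
     [4, 5, 0]]
images = pairwise_refine(r)
for i in range(3):
    for j in range(i + 1, 3):
        print(i, j, r[i][j], round(math.dist(images[i], images[j]), 3))
```

Each inner step touches only two images, so the cost of a step does not depend on the number of objects. The total effort is set by the number of cycles and steps chosen rather than by the number of pairs.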
In the following discussion, the terms relationship, similarity and dissimilarity are used to denote a relationship between a pair of objects. The term display map is used to denote a collection of images on an n-dimensional space that represents the original objects. The term distance is used to denote a distance between images on a display map that correspond to the objects.
Examples of the present invention are provided herein, including examples of the present invention implemented with chemical compound data and relationships. It is to be understood, however, that the present invention is not limited to the examples presented herein. The present invention can be implemented in a variety of applications.
For example, while the specific embodiment described herein utilizes distances between points to represent similarity/dissimilarity between objects, the invention is intended and adapted to utilize any display attribute to represent similarity/dissimilarity between objects, including but not limited to font, size, color, grey scale, italics, underlining, bold, outlining, border, etc. For example, the similarity/dissimilarity between two objects may be represented by the relative sizes of points that represent the objects.