1. Field of the Invention
The invention described herein relates to information representation, information cartography and data mining. The present invention also relates to pattern analysis and representation, and, in particular, representation of object relationships in a multidimensional space.
2. Related Art
Reducing the dimensionality of large multidimensional data sets is an important objective in many data mining applications. High-dimensional spaces are sparse (Bellman, R. E., Adaptive Control Processes, Princeton University Press, Princeton (1961)), counter-intuitive (Wegman, E., J. Ann. Statist. 41:457–471 (1970)), and inherently difficult to understand, and their structure cannot be easily extracted with conventional graphical techniques. However, experience has shown that, regardless of origin, multivariate data in R^d are almost never truly d-dimensional. That is, the underlying structure of the data is almost always of dimensionality lower than d. Extracting that structure into a low-dimensional representation has been the subject of countless studies over the past 50 years, and several techniques have been devised and popularized through the widespread availability of commercial statistical software. These techniques are divided into two main categories: linear and nonlinear.
Perhaps the most common linear dimensionality reduction technique is principal component analysis, or PCA (Hotelling, H., J. Edu. Psychol. 24:417–441; 498–520 (1933)). PCA reduces a set of partially cross-correlated data into a smaller set of orthogonal variables with minimal loss in the contribution to variation. The method has been extensively tested and is well understood, and several effective algorithms exist for computing the projection, ranging from singular value decomposition to neural networks (Oja, E., Subspace Methods of Pattern Recognition, Research Studies Press, Letchworth, England (1983); Oja, E., Neural Networks 5:927–935 (1992); Rubner, J., and Tavan, P., Europhys. Lett. 10:693–698 (1989)). PCA makes no assumptions about the probability distributions of the original variables, but is sensitive to outliers, missing data, and poor correlations due to poorly distributed variables. More importantly, the method cannot deal effectively with nonlinear structures, curved manifolds, and arbitrarily shaped clusters.
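As a rough illustration of the linear projection described above, the following sketch extracts the leading principal component of a small data set. It is not one of the cited algorithms: for brevity it uses power iteration on the sample covariance matrix rather than singular value decomposition, and the function name `pca_first_component` and the toy data are assumptions made only for this example.

```python
import math
import random

def pca_first_component(rows, iters=200, seed=0):
    """Return (unit direction, scores) for the leading principal component."""
    n, d = len(rows), len(rows[0])
    # Center each variable on its mean.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Power iteration (assumed here in place of SVD) converges to the
    # eigenvector of the covariance matrix with the largest eigenvalue.
    rng = random.Random(seed)
    v = [rng.random() for _ in range(d)]
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Project every centered sample onto the principal direction.
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores

# Toy data: points scattered tightly around the line y = 2x.
points = [(float(t), 2.0 * t + 0.1 * (-1) ** t) for t in range(10)]
direction, scores = pca_first_component(points)
```

The two correlated variables are replaced by a single score per sample, and the recovered direction lies close to the line along which the data vary most.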
A more general methodology is Friedman's exploratory projection pursuit (EPP) (Friedman, J. H., and Tukey, J. W., IEEE Trans. Computers 23:881–890 (1974); Friedman, J. H., J. Am. Stat. Assoc. 82:249–266 (1987)). This method searches multidimensional data sets for interesting projections or views. The “interestingness” of a projection is typically formulated as an index, and is numerically maximized over all possible projections of the multivariate data. In most cases, projection pursuit aims at identifying views that exhibit significant clustering and reveal as much of the non-normally distributed structure in the data as possible. The method is general, and includes several well-known linear projection techniques as special cases, including principal component analysis (in this case, the index of interestingness is simply the sample variance of the projection). Once an interesting projection has been identified, the structure that makes the projection interesting may be removed from the data, and the process can be repeated to reveal additional structure. Although projection pursuit attempts to express some nonlinearities, if the data set is high-dimensional and highly nonlinear it may be difficult to visualize it with linear projections onto a low-dimensional display plane, even if the projection angle is carefully chosen.
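The idea of maximizing an index over projections can be sketched for the simplest possible case: two-dimensional data projected onto a line. The exhaustive search over angles below is an assumption chosen for clarity (practical implementations use numerical optimization), and the names `variance_index` and `pursue_projection` are hypothetical. With the sample variance as the index, the search recovers the principal component direction, as noted above.

```python
import math

def variance_index(values):
    """Index of interestingness: sample variance of the projected values."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)

def pursue_projection(points, index, steps=1800):
    """Scan projection angles in [0, pi) and return the one maximizing the index."""
    best_angle, best_score = 0.0, -math.inf
    for s in range(steps):
        angle = math.pi * s / steps
        ux, uy = math.cos(angle), math.sin(angle)
        # Project every point onto the unit vector at this angle.
        score = index([x * ux + y * uy for x, y in points])
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle, best_score

# Toy data lying along the line y = 2x.
points = [(float(t), 2.0 * t) for t in range(-5, 6)]
angle, score = pursue_projection(points, variance_index)
```

Replacing `variance_index` with an index that rewards clustering or departure from normality would turn the same loop into a search for other kinds of structure.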
Several approaches have been proposed for reproducing the nonlinear structure of higher-dimensional data spaces. The best-known techniques are self-organizing maps, auto-associative neural networks, multidimensional scaling, and nonlinear mapping.
Self-organizing maps or Kohonen networks (Kohonen, T., Self-Organizing Maps, Springer-Verlag, Heidelberg (1996)) were introduced by Kohonen in an attempt to model intelligent information processing, i.e., the ability of the brain to form reduced representations of the most relevant facts without loss of information about their interrelationships. Kohonen networks belong to a class of neural networks known as competitive learning or self-organizing networks. Their objective is to map a set of vectorial samples onto a two-dimensional lattice in a way that preserves the topology and density of the original data space. The lattice points represent neurons which receive identical input, and compete in their activities by means of lateral interactions. The main application of self-organizing maps is in visualizing complex multivariate data on a two-dimensional plot, and in creating abstractions reminiscent of those obtained from clustering methodologies. These reduced representations can subsequently be used for a variety of pattern recognition and classification tasks.
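A minimal sketch of the competitive learning scheme described above follows. The Gaussian neighbourhood function, the linear decay of the learning rate and radius, the lattice size, and the names `train_som` and `map_sample` are all assumptions made for illustration and are not taken from the cited work.

```python
import math
import random

def train_som(data, rows=4, cols=4, epochs=30, rate0=0.5, radius0=2.0, seed=0):
    """Fit a rows x cols lattice of weight vectors to the data."""
    rng = random.Random(seed)
    dim = len(data[0])
    # One weight vector per lattice node, keyed by its (row, col) position.
    weights = {(i, j): [rng.random() for _ in range(dim)]
               for i in range(rows) for j in range(cols)}
    for epoch in range(epochs):
        # Shrink the learning rate and neighbourhood radius over time.
        frac = 1.0 - epoch / epochs
        rate, radius = rate0 * frac, max(radius0 * frac, 0.5)
        for x in rng.sample(data, len(data)):
            # Competition: the winner is the node whose weights are closest to x.
            bmu = min(weights, key=lambda node: math.dist(weights[node], x))
            for node, w in weights.items():
                lattice_d2 = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
                h = math.exp(-lattice_d2 / (2.0 * radius * radius))
                # Lateral interaction: lattice neighbours of the winner also move toward x.
                weights[node] = [wv + rate * h * (xv - wv) for wv, xv in zip(w, x)]
    return weights

def map_sample(weights, x):
    """Return the lattice position (row, col) onto which sample x is mapped."""
    return min(weights, key=lambda node: math.dist(weights[node], x))

rng = random.Random(42)
data = [[rng.random() for _ in range(3)] for _ in range(100)]
weights = train_som(data)
position = map_sample(weights, data[0])
```

Each three-dimensional sample is thereby reduced to a position on the two-dimensional lattice, which can be plotted directly.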
Another methodology is that of auto-associative neural networks (DeMers, D., and Cottrell, G., Adv. Neural Info. Proces. Sys. 5:580–587 (1993); Garrido, L., et al., Int. J. Neural Sys. 6:273–282 (1995)). These are multi-layer feed-forward networks trained to reproduce their inputs as desired outputs. They consist of an input and an output layer containing as many neurons as the number of input dimensions, and a series of hidden layers having a smaller number of units. In the first part of the network, each sample is reorganized, mixed, and compressed into a compact representation encoded by the middle layer. This representation is then decompressed by the second part of the network to reproduce the original input. Auto-associative networks can be trained using conventional back-propagation or any other related technique available for standard feed-forward architectures. A special version of the multilayer perceptron, known as a replicator network (Hecht-Nielsen, R., Science 269:1860–1863 (1995)), has been shown to be capable of representing its inputs in terms of their “natural coordinates”. These correspond to coordinates in an m-dimensional unit cube that has been transformed elastically to fit the distribution of the data. Although in practice it may be difficult to determine the inherent dimensionality of the data, the method could, in theory, be used for dimensionality reduction using a small value of m.
The aforementioned techniques can be used only for dimension reduction. A more broadly applicable method is multidimensional scaling (MDS) or nonlinear mapping (NLM). This approach emerged from the need to visualize a set of objects described by means of a similarity or distance matrix. The technique originated in the field of mathematical psychology (see Torgerson, W. S., Psychometrika, 1952, and Kruskal, J. B., Psychometrika, 1964, both of which are incorporated by reference in their entirety), and has two primary applications: 1) reducing the dimensionality of high-dimensional data in a way that preserves the original relationships of the data objects, and 2) producing Cartesian coordinate vectors from data supplied directly in the form of similarities or proximities, so that they can be analyzed with conventional statistical and data mining techniques.
Given a set of k objects, a symmetric matrix, rij, of relationships between these objects, and a set of images on an m-dimensional display plane {yi, i=1, 2, . . . , k; yi∈Rm}, the problem is to place yi onto the plane in such a way that their Euclidean distances dij=∥yi−yj∥ approximate as closely as possible the corresponding values rij. The quality of the projection is determined using a loss function such as Kruskal's stress:
$$S = \sqrt{\frac{\sum_{i<j} (d_{ij} - r_{ij})^2}{\sum_{i<j} r_{ij}^2}} \qquad (1)$$
which is numerically minimized in order to find the optimal configuration. The actual embedding is carried out in an iterative fashion by: 1) generating an initial set of coordinates yi, 2) computing the distances dij, 3) finding a new set of coordinates yi using a steepest descent algorithm such as Kruskal's linear regression or Guttman's rank-image permutation, and 4) repeating steps 2 and 3 until the change in the stress function falls below some predefined threshold.
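The four-step iteration above can be sketched as follows. This is a simplified illustration rather than Kruskal's or Guttman's procedure: step 3 is replaced by plain gradient descent on the sum of squared differences between dij and rij (the denominator of the stress does not depend on the coordinates), and the stress is assumed to be the square-root-normalized form. The function names, step size, and tolerance are hypothetical.

```python
import math
import random

def distances(coords):
    """Pairwise Euclidean distances d_ij for all i < j."""
    k = len(coords)
    return {(i, j): math.dist(coords[i], coords[j])
            for i in range(k) for j in range(i + 1, k)}

def kruskal_stress(coords, r):
    """Stress of a configuration against target relationships r[(i, j)], i < j."""
    d = distances(coords)
    num = sum((d[p] - r[p]) ** 2 for p in r)
    den = sum(r[p] ** 2 for p in r)
    return math.sqrt(num / den)

def mds(r, k, m=2, rate=0.05, tol=1e-12, max_iter=10000, seed=0):
    rng = random.Random(seed)
    # Step 1: generate an initial set of coordinates.
    y = [[rng.uniform(-1.0, 1.0) for _ in range(m)] for _ in range(k)]
    prev = kruskal_stress(y, r)
    cur = prev
    for _ in range(max_iter):
        # Step 2: compute the current distances.
        d = distances(y)
        # Step 3 (assumed variant): gradient of sum (d_ij - r_ij)^2 w.r.t. y.
        grad = [[0.0] * m for _ in range(k)]
        for (i, j), dij in d.items():
            if dij == 0.0:
                continue  # coincident points give no usable direction
            g = 2.0 * (dij - r[(i, j)]) / dij
            for a in range(m):
                delta = g * (y[i][a] - y[j][a])
                grad[i][a] += delta
                grad[j][a] -= delta
        y = [[y[i][a] - rate * grad[i][a] for a in range(m)] for i in range(k)]
        # Step 4: stop once the change in stress falls below the threshold.
        cur = kruskal_stress(y, r)
        if abs(prev - cur) < tol:
            break
        prev = cur
    return y, cur

# Three objects whose relationships form a 3-4-5 triangle.
r = {(0, 1): 3.0, (0, 2): 4.0, (1, 2): 5.0}
coords, final_stress = mds(r, k=3)
```

Every iteration visits all k(k−1)/2 pairs, which is the source of the quadratic cost of methods of this kind.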
A particularly popular implementation is Sammon's nonlinear mapping algorithm (Sammon, J. W. IEEE Trans. Comp., 1969). This method uses a modified stress function:
$$E = \frac{\sum_{i<j}^{k} \frac{[r_{ij} - d_{ij}]^2}{r_{ij}}}{\sum_{i<j}^{k} r_{ij}} \qquad (2)$$
which is minimized using steepest descent. The initial coordinates, yi, are determined at random or by some other projection technique such as principal component analysis, and are updated using Eq. 3:
$$y_{ij}(t+1) = y_{ij}(t) - \lambda \Delta_{ij}(t) \qquad (3)$$
where yij denotes the j-th coordinate of image yi, t is the iteration number and λ is the learning rate parameter, and
$$\Delta_{ij}(t) = \frac{\partial E(t) / \partial y_{ij}(t)}{\left| \partial^2 E(t) / \partial y_{ij}(t)^2 \right|} \qquad (4)$$
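The update of Eqs. 3 and 4 can be sketched as below. The closed-form first and second partial derivatives of E are derived here directly from Eq. 2 and are not quoted from the source. Eq. 4 is assumed to take the absolute value of the second derivative in its denominator. The function names, the floor `eps` that guards against division by zero, and the toy data are likewise assumptions of this example.

```python
import math

def sammon_error(y, r):
    """Sammon's error E (Eq. 2) for coordinates y and relationship matrix r."""
    k = len(y)
    pairs = [(i, j) for i in range(k) for j in range(i + 1, k)]
    c = sum(r[i][j] for i, j in pairs)
    return sum((r[i][j] - math.dist(y[i], y[j])) ** 2 / r[i][j] for i, j in pairs) / c

def sammon_derivatives(y, r, p, q, eps=1e-12):
    """First and second partial derivatives of E with respect to y[p][q]."""
    k = len(y)
    c = sum(r[i][j] for i in range(k) for j in range(i + 1, k))
    g1 = g2 = 0.0
    for j in range(k):
        if j == p:
            continue
        d = max(math.dist(y[p], y[j]), eps)
        diff = y[p][q] - y[j][q]
        resid = r[p][j] - d
        g1 += resid / (r[p][j] * d) * diff
        g2 += (resid - diff * diff / d * (1.0 + resid / d)) / (r[p][j] * d)
    return -2.0 / c * g1, -2.0 / c * g2

def sammon_step(y, r, lam=0.3, eps=1e-12):
    """One simultaneous update of every coordinate (Eqs. 3 and 4)."""
    new_y = [row[:] for row in y]
    for p in range(len(y)):
        for q in range(len(y[0])):
            first, second = sammon_derivatives(y, r, p, q)
            new_y[p][q] = y[p][q] - lam * first / max(abs(second), eps)
    return new_y

# Toy data: four 3-D points, their distance matrix, and an initial 2-D layout.
points3d = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.5), (0.0, 1.0, 1.0), (1.0, 1.0, 0.2)]
r = [[math.dist(a, b) for b in points3d] for a in points3d]
y0 = [[0.0, 0.0], [1.0, 0.1], [0.1, 1.0], [0.9, 1.1]]
y = y0
for _ in range(20):
    y = sammon_step(y, r)
```

Dividing by the magnitude of the second derivative rescales the step for each coordinate separately, so the update can become unstable where that derivative is close to zero. The value of λ is therefore usually kept well below 1.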
There is a wide variety of MDS algorithms involving different error functions and optimization heuristics, which are reviewed in Schiffman, Reynolds and Young, Introduction to Multidimensional Scaling, Academic Press, New York (1981); Young and Hamer, Multidimensional Scaling: History, Theory and Applications, Erlbaum Associates, Inc., Hillsdale, N.J. (1987); Cox and Cox, Multidimensional Scaling, Number 59 in Monographs in Statistics and Applied Probability, Chapman-Hall (1994); and Borg, I., and Groenen, P., Modern Multidimensional Scaling, Springer-Verlag, New York (1997). The contents of these publications are incorporated herein by reference in their entireties. Different forms of NLM will be discussed in greater detail below.
Unfortunately, the quadratic nature of the stress function (Eqs. 1 and 2, and their variants), whose number of pairwise terms grows as the square of the number of objects, makes these algorithms impractical for large data sets containing more than a few hundred to a few thousand items. Several approaches have been devised to reduce the complexity of the task. Chang and Lee (Chang, C. L., and Lee, R. C. T., IEEE Trans. Syst., Man, Cybern., 1973, SMC-3, 197–200) proposed a heuristic relaxation approach in which a subset of the original objects (the frame) is scaled using a Sammon-like methodology, and the remaining objects are then added to the map by adjusting their distances to the objects in the frame. An alternative approach proposed by Pykett (Pykett, C. E., Electron. Lett., 1978, 14, 799–800) is to partition the data into a set of disjoint clusters, and map only the cluster prototypes, i.e., the centroids of the pattern vectors in each class. In the resulting two-dimensional plots, the cluster prototypes are represented as circles whose radii are proportional to the spread in their respective classes. Lee, Slagle and Blum (Lee, R. C. T., Slagle, J. R., and Blum, H., IEEE Trans. Comput., 1977, C-27, 288–292) proposed a triangulation method which restricts attention to only a subset of the distances between the data samples. This method positions each pattern on the plane in a way that preserves its distances from the two nearest neighbors already mapped. An arbitrarily selected reference pattern may also be used to ensure that the resulting map is globally ordered. Biswas, Jain and Dubes (Biswas, G., Jain, A. K., and Dubes, R. C., IEEE Trans. Pattern Anal. Machine Intell., 1981, PAMI-3(6), 701–708) later proposed a hybrid approach which combined the ability of Sammon's algorithm to preserve global information with the efficiency of Lee's triangulation method.
While the triangulation can be computed quickly compared to conventional MDS methods, it tries to preserve only a small fraction of relationships, and the projection may be difficult to interpret for large data sets.
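The geometric core of a triangulation step, placing one new pattern so that its distances to two already-mapped patterns are preserved, can be sketched as a circle intersection. This is only an illustration of that single step, not the published method. In particular, the rule for choosing between the two mirror-image candidates with a reference pattern, and the names `candidate_positions` and `place_point`, are assumptions of this example.

```python
import math

def candidate_positions(p1, p2, r1, r2):
    """Two plane positions at distance r1 from p1 and r2 from p2."""
    dx, dy = p2[0] - p1[0], p2[1] - p1[1]
    d = math.hypot(dx, dy)
    if d == 0.0:
        raise ValueError("the two mapped neighbors must be distinct")
    # Signed distance from p1 to the foot of the perpendicular on the p1->p2 axis.
    a = (r1 * r1 - r2 * r2 + d * d) / (2.0 * d)
    # Height above the axis; clamped to 0 when the circles do not intersect,
    # in which case both distances cannot be preserved exactly.
    h = math.sqrt(max(r1 * r1 - a * a, 0.0))
    fx, fy = p1[0] + a * dx / d, p1[1] + a * dy / d
    return ((fx - h * dy / d, fy + h * dx / d),
            (fx + h * dy / d, fy - h * dx / d))

def place_point(p1, p2, r1, r2, ref, r_ref):
    """Pick the candidate whose distance to a reference pattern best matches r_ref."""
    return min(candidate_positions(p1, p2, r1, r2),
               key=lambda c: abs(math.dist(c, ref) - r_ref))

# New pattern at distance 3 from (0, 0) and 5 from (4, 0); the reference
# pattern at (0, 10) is expected to lie at distance 7 from it.
new_point = place_point((0.0, 0.0), (4.0, 0.0), 3.0, 5.0, (0.0, 10.0), 7.0)
```

Only three distances are consulted per pattern, which is why the procedure is fast and also why most of the original relationships play no part in the resulting map.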
The methods described above are iterative in nature, and do not provide an explicit mapping function that can be used to project new, unseen patterns in an efficient manner. The first attempt to encode a nonlinear mapping as an explicit function is due to Mao and Jain (Mao, J., and Jain, A. K., IEEE Trans. Neural Networks 6(2):296–317 (1995)). They proposed a 3-layer feed-forward neural network with n input and m output units, where n and m are the number of input and output dimensions, respectively. The system is trained using a special back-propagation rule that relies on errors that are functions of the inter-pattern distances. However, because only a single distance is examined during each iteration, these networks require a very large number of iterations and converge extremely slowly.
An alternative methodology is to employ Sammon's nonlinear mapping algorithm to project a small random sample of objects from a given population, and then “learn” the underlying nonlinear transform using a multilayer neural network trained with the standard error back-propagation algorithm or some other equivalent technique (see, for example, Haykin, S., Neural Networks: A Comprehensive Foundation, Prentice-Hall, 1998). Once trained, the neural network can be used in a feed-forward manner to project the remaining objects in the plurality of objects, as well as new, unseen objects. Thus, for a nonlinear projection from n to m dimensions, a standard 3-layer neural network with n input and m output units is used. Each n-dimensional object is presented to the input layer, and its coordinates on the m-dimensional nonlinear map are obtained by the respective units in the output layer (Pal, N. R., and Eluri, V. K., IEEE Trans. Neural Net., 1142–1154 (1998)).
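The second stage of this approach, learning an explicit function from a mapped sample and then projecting unseen objects, can be sketched as follows. The class `MappingNetwork`, its hidden-layer size, learning rate, and full-batch training loop are assumptions for illustration. The training targets below are a made-up smooth function standing in for the coordinates that a nonlinear mapping of the sample would supply.

```python
import math
import random

class MappingNetwork:
    """3-layer feed-forward net: n inputs -> h tanh hidden units -> m linear outputs."""

    def __init__(self, n, h, m, seed=0):
        rng = random.Random(seed)
        # The last entry of each weight row is the bias term.
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n + 1)] for _ in range(h)]
        self.w2 = [[rng.uniform(-0.5, 0.5) for _ in range(h + 1)] for _ in range(m)]

    def _hidden(self, x):
        return [math.tanh(sum(w * v for w, v in zip(row, x)) + row[-1]) for row in self.w1]

    def project(self, x):
        """Feed-forward pass: n-dimensional object -> m-dimensional map coordinates."""
        hid = self._hidden(x)
        return [sum(w * v for w, v in zip(row, hid)) + row[-1] for row in self.w2]

    def loss(self, xs, ts):
        return sum(sum((o - t) ** 2 for o, t in zip(self.project(x), tgt))
                   for x, tgt in zip(xs, ts)) / (2.0 * len(xs))

    def train(self, xs, ts, rate=0.1, epochs=500):
        """Full-batch error back-propagation; returns the final training loss."""
        for _ in range(epochs):
            g1 = [[0.0] * len(row) for row in self.w1]
            g2 = [[0.0] * len(row) for row in self.w2]
            for x, tgt in zip(xs, ts):
                hid = self._hidden(x)
                out = [sum(w * v for w, v in zip(row, hid)) + row[-1] for row in self.w2]
                err = [o - t for o, t in zip(out, tgt)]
                # Propagate the output error back through the tanh hidden layer.
                back = [(1.0 - hv * hv) * sum(err[k] * self.w2[k][j] for k in range(len(err)))
                        for j, hv in enumerate(hid)]
                for k, e in enumerate(err):
                    for j, hv in enumerate(hid + [1.0]):
                        g2[k][j] += e * hv
                for j, b in enumerate(back):
                    for i, xv in enumerate(list(x) + [1.0]):
                        g1[j][i] += b * xv
            scale = rate / len(xs)
            self.w1 = [[w - scale * g for w, g in zip(wr, gr)] for wr, gr in zip(self.w1, g1)]
            self.w2 = [[w - scale * g for w, g in zip(wr, gr)] for wr, gr in zip(self.w2, g2)]
        return self.loss(xs, ts)

rng = random.Random(1)
sample = [[rng.random() for _ in range(3)] for _ in range(30)]
# Hypothetical stand-in for the 2-D map coordinates of the sample.
sample_map = [[x[0] + x[1] ** 2, x[2] - x[0] * x[1]] for x in sample]
net = MappingNetwork(n=3, h=6, m=2)
loss_before = net.loss(sample, sample_map)
loss_after = net.train(sample, sample_map)
new_coords = net.project([0.2, 0.5, 0.9])  # an object that was not in the sample
```

Once training is finished, projecting an object costs a single forward pass and requires no distances to other objects.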
The distinct advantage of this approach is that it captures the nonlinear mapping relationship in an explicit function, and allows the scaling of additional patterns as they become available, without the need to reconstruct the entire map. It does, however, rely on conventional MDS methodologies to construct the nonlinear map of the training set, and therefore the method is inherently limited to relatively small samples.
Hence there is a need for a method that can efficiently process large data sets, e.g., data sets containing hundreds of thousands to millions of items.
Moreover, as with the methods of Mao and Jain (Mao, J., and Jain, A. K., IEEE Trans. Neural Networks 6(2):296–317 (1995)) and Pal and Eluri (Pal, N. R., and Eluri, V. K., IEEE Trans. Neural Net., 1142–1154 (1998)), a method is needed that is incremental in nature and allows the mapping of new samples as they become available, without the need to reconstruct an entire map.