This application pertains to the art of artificial intelligence, and more particularly to a system for organizing a large body of pattern data so as to facilitate understanding of its features.
The subject system has particular application to analysis of acquired, empirical data, such as chemical characteristic information, and will be described with particular reference thereto. However, it will be appreciated that the subject system is suitably adapted to analysis of any set of related data so as to allow for visualization and understanding of the constituent elements thereof.
It is difficult to make sense out of a large body of multi-featured pattern data. Actually the body of data need not be large; even a set of 400 patterns each of six features would be quite difficult to "understand." A concept of self-organization has to do with that type of situation and can be understood in terms of two main approaches to that task. In one case, an endeavor is directed to discovering how the data are distributed in pattern space, with the intent of describing large bodies of patterns more simply in terms of multi-dimensional clusters or in terms of some other distribution, as appropriate. This is a dominant concern underlying the Adaptive Resonance Theory (ART) and other cluster analysis approaches.
In a remaining case, effort is devoted to dimension reduction. The corresponding idea is that the original representation, having a large number of features, is redundant in its representation, with several features being near repetitions of each other. In such a situation, a principal feature extraction which is accompanied by dimension reduction may simplify the description of each and all the patterns. Clustering is suitably achieved subsequently in the reduced dimension space. The Karhunen-Loeve (K-L) transform, neural-net implementations of the K-L transform, and the auto-associative mapping approach are all directed to principal component analysis (PCA), feature extraction and dimension reduction.
In actuality, the two streams of activity are not entirely independent. For example, the ART approach has a strong "winner-take-all" mechanism in forming its clusters. It is suitably viewed as "extracting" the principal prototypes, and forming a reduced description in terms of these few principal prototypes. The feature map approach aims at collecting similar patterns together through lateral excitation-inhibition so that patterns with similar features are mapped into contiguous regions in a reduced-dimension feature map. That method both clusters and reduces dimensions. The common aim is to let data self-organize into a simpler representation.
A new approach to this same task of self-organization is described herein. The idea is that data be subjected to a nonlinear mapping from the original representation to one of reduced dimensions. Such mapping is suitably implemented with a multilayer feedforward neural net. Net parameters are learned in an unsupervised manner based on the principle of conservation of the total variance in the description of the patterns.
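The variance-conservation principle can be illustrated with a minimal sketch. The text above does not give the objective function, so the form used here (half the squared difference between the total variance of the outputs and that of the inputs) is an assumption made for illustration only, and the names `total_variance` and `variance_mismatch` are hypothetical.

```python
import numpy as np

def total_variance(patterns):
    """Sum over features of the variance taken across patterns (rows)."""
    return float(np.sum(np.var(patterns, axis=0)))

def variance_mismatch(inputs, outputs):
    """Assumed objective: half the squared gap between output and input total variance."""
    return 0.5 * (total_variance(outputs) - total_variance(inputs)) ** 2

rng = np.random.default_rng(0)
inputs = rng.normal(size=(400, 6))           # 400 patterns of six features each

# An untrained nonlinear map to 2-D generally does not conserve the total variance.
weights = rng.normal(size=(6, 2))            # hypothetical, untrained net weights
outputs = np.tanh(inputs @ weights)
print("mismatch, untrained map:", variance_mismatch(inputs, outputs))

# A rotation of the full space conserves total variance up to rounding error.
rotation, _ = np.linalg.qr(rng.normal(size=(6, 6)))
print("mismatch, rotation:", variance_mismatch(inputs, inputs @ rotation))
```

Under this assumed objective, learning would adjust the net weights to drive the mismatch toward zero; the sketch only evaluates the quantity and does not perform that learning.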
The concept of dimension reduction is somewhat strange in itself. It raises the question of how a reduced-dimension description of a body of pattern data can be representative of the original body of data. The answer is known for the linear case, but is more difficult to detail in the general nonlinear case.
A start of the evolution leading to the subject invention may be marked by noting the concept of principal component analysis (PCA) based on the Karhunen-Loeve (K-L) transform. Eigenvectors of a data co-variance matrix provide a basis for an uncorrelated representation of associated data. Principal components are those which have larger eigenvalues, namely those features (in transformed representation) which vary greatly from pattern to pattern. If only a few eigenvalues are large, then a reduced dimension representation is suitably fashioned in terms of those few corresponding eigenvectors, and nearly all of the information in the data would still be retained. That utilization of the Karhunen-Loeve transform for PCA purposes has been found to be valuable in dealing with many non-trivial problems. But in pattern recognition, it has a failing insofar as what is retained is not necessarily that which helps interclass discrimination.
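The linear case can be sketched as follows. The example data (400 patterns of six features generated from two underlying variables) and the function name `kl_transform` are hypothetical, chosen only to show eigenvectors of the covariance matrix serving as a reduced basis.

```python
import numpy as np

def kl_transform(patterns, n_components):
    """Project patterns (rows) onto the leading eigenvectors of their covariance matrix."""
    centered = patterns - patterns.mean(axis=0)
    covariance = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(covariance)    # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]                # reorder so the largest come first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    basis = eigvecs[:, :n_components]
    retained = eigvals[:n_components].sum() / eigvals.sum()
    return centered @ basis, eigvals, retained

rng = np.random.default_rng(0)
latent = rng.normal(size=(400, 2))                   # two underlying variables
mixing = rng.normal(size=(2, 6))                     # spread over six redundant features
patterns = latent @ mixing + 0.01 * rng.normal(size=(400, 6))

reduced, eigvals, retained = kl_transform(patterns, n_components=2)
print("eigenvalues:", np.round(eigvals, 4))
print("fraction of variance retained:", retained)
```

Because only two eigenvalues are large in this constructed example, the two-component representation retains nearly all of the variance, and the transformed features are uncorrelated.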
Subsequent and somewhat related developments sought to link the ideas of PCA, K-L transform and linear neural networks. Such efforts sought to accomplish a linear K-L transform through neural-net computing, with fully-connected multilayer feedforward nets with the backpropagation algorithm for learning the weights, or with use of a Generalized Hebbian Learning algorithm. In this system, given a correct objective function, weights for the linear links to any of the hidden layer nodes may be noted to be the components of an eigenvector of the co-variance matrix. Earlier works also described how principal components may be found sequentially, and how that approach may avoid a tedious task of evaluating all the elements of a possibly very large co-variance matrix.
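A linear net that finds principal components sequentially, without forming the covariance matrix for learning, can be sketched with Sanger's rule, a common form of the generalized Hebbian algorithm. The learning rate, epoch count, and test data are assumptions for illustration, and `generalized_hebbian` is a hypothetical name; the covariance matrix is computed at the end only to check the result.

```python
import numpy as np

def generalized_hebbian(patterns, n_components, lr=0.002, epochs=50, seed=0):
    """Sanger's rule: dW = lr * (y x^T - lower_triangle(y y^T) W), one pattern at a time."""
    rng = np.random.default_rng(seed)
    centered = patterns - patterns.mean(axis=0)
    weights = 0.1 * rng.normal(size=(n_components, centered.shape[1]))
    for _ in range(epochs):
        for x in centered[rng.permutation(len(centered))]:
            y = weights @ x                              # linear node outputs
            weights += lr * (np.outer(y, x) - np.tril(np.outer(y, y)) @ weights)
    return weights

rng = np.random.default_rng(0)
rotation, _ = np.linalg.qr(rng.normal(size=(4, 4)))
scales = np.array([2.0, 1.0, 0.5, 0.2])                  # hypothetical feature spreads
patterns = (rng.normal(size=(400, 4)) * scales) @ rotation

weights = generalized_hebbian(patterns, n_components=2)

# Compare the first weight vector with the leading eigenvector (check only).
_, eigvecs = np.linalg.eigh(np.cov(patterns, rowvar=False))
alignment = abs(weights[0] @ eigvecs[:, -1]) / np.linalg.norm(weights[0])
print("cosine between first weight vector and leading eigenvector:", alignment)
```

Each row of `weights` plays the role of the linear links into one node; with a suitable learning rate the rows tend toward the leading eigenvectors in order.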
The earlier works begged the question of what might be achieved if the neurons in the networks were allowed to also be nonlinear. Other efforts sought to address that question. In one case, the original data pattern vectors are subjected to many layers of transformation in a multilayer feedforward net, but one with nonlinear internal layer nodes. An output layer of such a net has the same number of nodes as the input layer and an objective is to train the net so that the output layer can reproduce the input for all inputs. This provides a so-called auto-associative learning configuration. In addition, one of the internal layers serves as a bottle-neck layer, having possibly a drastically reduced number of nodes. Now, since the outputs from that reduced number of nodes can closely regenerate the input, in all cases, the nodes in the bottle-neck layer might be considered to be a set of principal components. That may prove to be an acceptable viewpoint, except for the fact that the solutions attained in such learning are not unique and differ radically depending on initial conditions and the order in which the data patterns are presented in the learning phase. Although the results are interesting, there is no unique set of principal components.
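The auto-associative configuration, and its dependence on initial conditions, can be illustrated with a deliberately small sketch: one nonlinear bottleneck layer of two nodes and a linear output layer, trained by full-batch gradient descent. The architecture, learning rate, and synthetic five-feature data are assumptions for illustration and are much simpler than the many-layer nets discussed above.

```python
import numpy as np

def train_autoassociative(patterns, n_bottleneck=2, lr=0.05, epochs=3000, seed=0):
    """Train input -> tanh bottleneck -> linear output to reproduce the input."""
    rng = np.random.default_rng(seed)
    n_features = patterns.shape[1]
    w_in = 0.5 * rng.normal(size=(n_features, n_bottleneck))
    b_in = np.zeros(n_bottleneck)
    w_out = 0.5 * rng.normal(size=(n_bottleneck, n_features))
    b_out = np.zeros(n_features)
    losses = []
    for _ in range(epochs):
        hidden = np.tanh(patterns @ w_in + b_in)
        output = hidden @ w_out + b_out
        error = output - patterns
        losses.append(float(np.mean(error ** 2)))
        # Backpropagate the mean squared reconstruction error.
        grad_output = 2.0 * error / error.size
        grad_hidden = (grad_output @ w_out.T) * (1.0 - hidden ** 2)
        w_out -= lr * hidden.T @ grad_output
        b_out -= lr * grad_output.sum(axis=0)
        w_in -= lr * patterns.T @ grad_hidden
        b_in -= lr * grad_hidden.sum(axis=0)
    codes = np.tanh(patterns @ w_in + b_in)              # bottleneck-layer outputs
    return codes, losses

# Hypothetical data: five features that are functions of two underlying variables.
rng = np.random.default_rng(0)
u, v = rng.uniform(-1.0, 1.0, size=(2, 400))
patterns = np.column_stack([u, v, u * v, u ** 2, np.sin(2.0 * v)])
patterns = (patterns - patterns.mean(axis=0)) / patterns.std(axis=0)

codes_a, losses_a = train_autoassociative(patterns, seed=1)
codes_b, losses_b = train_autoassociative(patterns, seed=2)
print("reconstruction error, first and last epoch:", losses_a[0], losses_a[-1])
print("largest difference between the two sets of codes:", np.abs(codes_a - codes_b).max())
```

The two runs differ only in their random initial weights, yet they yield different bottleneck representations of the same patterns, which is the non-uniqueness noted above.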
In another earlier feature map approach, dimension reduction is attained in yet another manner. A reduced-dimension space is suitably defined as two dimensional. The reduced-dimension space is then spanned by a grid of points and a reference pattern vector is attached to each of those grid points. These reference vectors are chosen randomly from the same pattern space as that of the problem. Then the pattern vectors of the problem are allocated to the grid points of the reduced-dimension space on the basis of similarity to the reference vector attached to the grid. This leads to a biology-inspired aspect of the procedure, namely that of lateral excitation-inhibition. When a pattern vector is allocated to a grid point, at first the allocation would essentially be at random, because of that grid point happening to have a reference vector most similar to the pattern vector. But once that allocation is made, the reference vector is modified to be even more like that of the input pattern vector and furthermore, all the reference vectors of the laterally close grid points are modified to be more similar to that input pattern also. In this way, matters are soon no longer left to chance; patterns which are similar in the original pattern space are in effect collected together in reduced-dimension space. Depending on chance, sometimes two or more rather disparate zones can be built up for patterns which could have been relegated to contiguous regions if things had progressed slightly differently. On the other hand, results of that nature may not be detrimental to the objectives of the computational task.
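The feature map procedure above can be sketched as follows. The grid size, decay schedules, and data are assumptions for illustration, and a Gaussian neighborhood on the grid is used here as a simplified stand-in for lateral excitation-inhibition; `train_feature_map` and `allocate` are hypothetical names.

```python
import numpy as np

def train_feature_map(patterns, grid_shape=(8, 8), epochs=20, lr0=0.5, radius0=3.0, seed=0):
    """Attach a reference vector to each grid point and pull neighbors toward each input."""
    rng = np.random.default_rng(seed)
    rows, cols = grid_shape
    coords = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    # Reference vectors drawn at random from the pattern space (here, from the patterns).
    refs = patterns[rng.integers(0, len(patterns), size=rows * cols)].astype(float)
    n_steps = epochs * len(patterns)
    step = 0
    for _ in range(epochs):
        for idx in rng.permutation(len(patterns)):
            x = patterns[idx]
            frac = step / n_steps
            lr = lr0 * (1.0 - frac)                      # learning rate decays over time
            radius = radius0 * (1.0 - frac) + 0.5        # neighborhood shrinks over time
            winner = np.argmin(np.sum((refs - x) ** 2, axis=1))
            grid_dist2 = np.sum((coords - coords[winner]) ** 2, axis=1)
            influence = np.exp(-grid_dist2 / (2.0 * radius ** 2))
            refs += lr * influence[:, None] * (x - refs) # winner and lateral neighbors move
            step += 1
    return coords, refs

def allocate(patterns, refs):
    """Index of the grid point whose reference vector is most similar to each pattern."""
    dist2 = ((patterns[:, None, :] - refs[None, :, :]) ** 2).sum(axis=2)
    return dist2.argmin(axis=1)

rng = np.random.default_rng(1)
patterns = rng.normal(size=(200, 6))                     # hypothetical six-feature patterns
coords, refs = train_feature_map(patterns)
winners = allocate(patterns, refs)
print("grid position of the first pattern:", coords[winners[0]])
```

Each pattern ends up at a discrete grid point, and different seeds can yield differently arranged maps, consistent with the role of chance described above.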
The ART approach to self-organization of data can be mentioned in this context because the MAX-NET implements a winner-take-all approach in building up clusters and there is indeed lateral inhibition even though it is not related to the distance between cluster centers in cluster space. There is data compression but no dimension reduction.
According to a first aspect of the present invention, the above-noted problems, and others, are addressed to provide a system for autonomous reduction of pattern dimension data to a largely unambiguous, two-dimensional representation using an extremely efficient system.
It is appreciated that many tasks in engineering involve the process of extracting useful information from unorganized raw data. However, as discussed above, it is a challenging task to make sense out of a large set of multidimensional data. The difficulty mainly lies in the fact that the inter-pattern relationship cannot be readily grasped. Visual display has been one of the most useful tools to guide this kind of analysis. Unfortunately, such a display cannot be directly realized in a meaningful manner for data of more than three dimensions.
As indicated above, the complexity of raw data must be reduced in order to understand the meaning thereof. Generally, two major categories of approaches are used to tackle this problem. In the first category, information such as the Euclidean distance between data patterns is used to infer how the data patterns are distributed in the multidimensional space, using methods such as clustering or Kohonen's self-organizing map (SOM). The emphasis of these methods is to describe large amounts of data patterns more concisely with cluster attributes or some other distributions.
The second category of approaches emphasizes the reduction of dimensions, i.e., the reduction of the number of features necessary to describe each and all of the data patterns. The idea is that perhaps the dimensions of the original data space are not all independent of each other, i.e. these dimensions may be some complicated functions of just a few independent inherent dimensions albeit not necessarily among those known. Accordingly, the objective is to use this reduced-dimension space to describe the patterns. Some methods belonging to this category are linear principal component analysis (PCA) through the Karhunen-Loeve (K-L) transform, neural-net implementations of PCA, the autoassociative mapping approach and the non-linear variance-conserving (NLVC) mapping. These methods generally try to map the high-dimensional space to the lower one. There are also methods to do the reverse. An example is generative topographic mapping (GTM), described in a paper by C. M. Bishop, M. Svensen and C. K. I. Williams entitled "GTM: The generative topographic mapping."
However, it should be appreciated that the two categories discussed above are not entirely distinct. Clustering could be used subsequently in the reduced-dimension space to further help the comprehension of the data. The SOM approach collects similar patterns together through lateral excitation-inhibition in a reduced-dimension feature map. Therefore, SOM both clusters and reduces dimension.
Linear PCA methods are already limited by their linear nature. For the other methods mentioned above, either the high-dimensional data are mapped to discrete grid points in the lower-dimensional space, or the appearance of the lower-dimensional map depends closely on the initial (usually random) choice of mapping parameters, or both.
The grid point maps are usually useful in applications such as classification and encoding where exact relative positions of the data points are not of critical importance as long as close points in original data space remain close in the map. For example, the GTM approach starts with a grid of points in the lower dimension and a set of non-linear basis functions, which are assumed to be radially symmetric Gaussians evenly distributed in the lower dimensional space. A mapping of the grid points from the lower dimension to the higher dimension is assumed to be a linear weighted sum of those basis functions. Then, the probability density of the higher dimension is proposed to be formed by radially symmetric Gaussians centered on those grid points just mapped to the higher dimension. In Bishop's works on GTM, it is assumed that Bayes' rule can be used to invert the mapping and to estimate the responsibility of each grid point to the distribution in the higher dimensional space. The likelihood of data points in the higher dimension can then be re-estimated with the responsibility information. By optimizing this result to give the distribution of the known data points in the higher dimension, the iterative learning procedure of the weight parameters of the mapping and width parameters of the Gaussians forming the density distribution is obtained. A lower dimensional map of the data points for viewing can be generated by the responsibility information upon convergence of the learning. Provided that the mapping function is smooth and continuous, adjacent points in the lower dimension will map to adjacent points in the higher dimension. But the reverse is not necessarily true since for a given data point in the higher dimension the responsibilities of the Gaussians on grid points may be multi-modal due to the shape of the manifold generated by the mapping function.
Instead of being the responsibility of one or a few adjacent grid points, the data point may be the responsibility of several distant grid points on the lower dimensional map. Although such a map may still be useful for some classification and similar purposes, it would be inappropriate to use this kind of a map for optimization since it would be difficult to interpret interpolation between grid points on such a map. Other grid point maps such as those obtained by SOM, may also have the same type of difficulty in interpreting interpolation between grid points.
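The responsibility computation in the GTM approach discussed above can be sketched as a single application of Bayes' rule. This sketch assumes equal prior weight on every grid point and a common inverse variance `beta` for the Gaussians in the higher dimension, omits any bias term, and uses random, untrained mapping weights; the iterative learning of the weights and widths is not shown. The name `gtm_responsibilities` and all parameter values are hypothetical.

```python
import numpy as np

def gtm_responsibilities(data, grid, centers, width, weights, beta):
    """Posterior probability of each grid point for each data point."""
    # Radially symmetric Gaussian basis functions evaluated at the grid points: (K, M).
    grid_to_center = ((grid[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    basis = np.exp(-grid_to_center / (2.0 * width ** 2))
    # Linear weighted sum of basis functions maps grid points to data space: (K, D).
    mapped = basis @ weights
    # Bayes' rule with equal priors and isotropic Gaussians centered on mapped points.
    dist2 = ((data[:, None, :] - mapped[None, :, :]) ** 2).sum(axis=2)   # (N, K)
    log_post = -0.5 * beta * dist2
    log_post -= log_post.max(axis=1, keepdims=True)      # stabilize the exponentials
    resp = np.exp(log_post)
    resp /= resp.sum(axis=1, keepdims=True)
    return resp, mapped

rng = np.random.default_rng(0)
side = np.linspace(-1.0, 1.0, 10)
grid = np.array([(a, b) for a in side for b in side])            # 100 grid points in 2-D
center_side = np.linspace(-1.0, 1.0, 4)
centers = np.array([(a, b) for a in center_side for b in center_side])   # 16 basis centers
weights = rng.normal(size=(16, 5))                               # untrained, for illustration
data = rng.normal(size=(50, 5))                                  # hypothetical 5-D data

resp, mapped = gtm_responsibilities(data, grid, centers, width=0.5, weights=weights, beta=2.0)
map_positions = resp @ grid                                      # responsibility-weighted 2-D positions
print("grid points with the two largest responsibilities for the first data point:")
print(grid[np.argsort(resp[0])[-2:]])
```

When the responsibilities of a data point are spread over distant grid points, the responsibility-weighted position in `map_positions` can fall between them, which is one way to see why interpolation on such a map is difficult to interpret.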
Although a non-linear PCA type mapping such as the autoassociative mapping or NLVC mapping does not have the interpolation difficulty, the appearance of the lower dimensional map is usually dependent on the choice of initial parameters. This dependence is described below using NLVC mapping as an example. To obtain a map with good distribution of data points, a number of trials may be necessary until a satisfactory one can be found.
According to a second aspect of the present invention, the foregoing complexity-reduction problems, as well as others, are addressed. In this regard, an approach referred to as Equalized Orthogonal Mapping (EOM) is described herein. This approach falls into the second category and is developed with interpolation capability and reduced dependence on initial parameters in mind.
The EOM approach can be implemented through a backpropagation learning process. The detailed equations for this procedure are derived and described below. Examples of use of EOM in obtaining reduced dimension maps and comparisons with the SOM and NLVC approaches are also described. Moreover, results are given for two situations. In one case the input data is seemingly of 5 dimensions but is actually 2-D in nature. In another case, the mapping is applied to a body of gasoline blending data and potential use of the resulting map for optimization is demonstrated.
It should be appreciated that while the following description of the present invention is directed to mapping in cases where the reduced-dimension representation is 2-D, so that the representation can be easily visualized, the present invention is suitable for other dimensions as well.