1. Technical Field
The present invention generally relates to high dimensional data and, in particular, to methods for mining and visualizing high dimensional data through Gaussianization.
2. Background Description
Density Estimation in high dimensions is very challenging due to the so-called xe2x80x9ccurse of dimensionalityxe2x80x9d. That is, in high dimensions, data samples are often sparsely distributed. Thus, density estimation requires very large neighborhoods to achieve sufficient counts. However, such large neighborhoods could cause neighborhood-based techniques, such as kernel methods and nearest neighbor methods, to be highly biased.
The exploratory projection pursuit density estimation algorithm (hereinafter also referred to as the xe2x80x9cexploratory projection pursuitxe2x80x9d) attempts to overcome the curse of dimensionality by constructing high dimensional densities via a sequence of univariate density estimates. At each iteration, one finds the most non-Gaussian projection of the Xcurrent data, and transforms that direction to univariate Gaussian. The exploratory projection pursuit is described by J. H. Friedman, in xe2x80x9cExploratory Projection Pursuitxe2x80x9d, J. American Statistical Association, Vol. 82, No. 397, pp. 249-66, 1987.
Recently, independent component analysis has attracted a considerable amount of attention. Independent component analysis attempts to recover the unobserved independent sources from linearly mixed observations. This seemingly difficult problem can be solved by an information maximization approach that utilizes only the independence assumption on the sources. Independent component analysis can be applied for source recovery in digital communication and in the xe2x80x9ccocktail partyxe2x80x9d problem. A review of the current status of independent component analysis is described by Bell et al., in xe2x80x9cA Unifying Information-Theoretic Framework for Independent Component Analysisxe2x80x9d, International Journal on Mathematics and Computer Modeling, 1999. Independent component analysis has been posed as a parametric probabilistic model, and a maximum likelihood EM algorithm has been derived, by H. Attias, in xe2x80x9cIndependent Factor Analysisxe2x80x9d, Neural Computation, Vol. 11, pp. 803-51, May 1999.
Parametric density models, in particular Gaussian mixture density models, are the most widely applied models in large scale high dimensional density estimation because they offer decent performance with a relatively small number of parameters. In fact, to limit the number of parameters in large tasks such as automatic speech recognition, one assumes only mixtures of Gaussians with diagonal covariances. There are standard EM algorithms to estimate the mixture coefficients and the Gaussian means and covariance. However, in real applications, these parametric assumptions are often violated, and the resulting parametric density estimates can be highly biased. For example, mixtures of diagonal Gaussians       p    ⁡          (      χ      )        =            ∑              i        =        1            1        ⁢          xe2x80x83        ⁢                  π        i            ⁢              G        ⁡                  (                      χ            ,                          μ              i                        ,                          Σ              i                                )                    
roughly assume that the data is clustered, and within each cluster the dimensions are independent and Gaussian distributed. However, in practice, the dimensions are often correlated within each cluster. This leads to the need for modeling the covariance of each mixture component. The following xe2x80x9csemi-tiedxe2x80x9d covariance has been proposed:
xcexa3i=Axcex9iAT 
where A is shared and for each component, xcex9i is diagonal. This semi-tied co-variance is described by: M. J. F. Gales, in xe2x80x9cSemi-tied Covariance Matrices for Hidden Markov Modelsxe2x80x9d, IEEE Transactions Speech and Audio Processing, Vol. 7, pp. 272-81, May 1999; and R. A. Gopinath, in xe2x80x9cConstrained Maximum Likelihood Modeling with Gaussian Distributionsxe2x80x9d, Proc. of DARPA Speech Recognition Workshop, Feb. 8-11, Lansdowne, Va., 1998. Semi-tied covariance has been reported in the immediately preceding two articles to significantly improve the-performance of large vocabulary continuous speech recognition systems.
Accordingly, there is a need for a method that transforms high dimensional data into a standard Gaussian distribution which is computationally efficient.
The present invention is directed to methods for high dimensional data mining and visualization via Gaussianization. These methods are premised upon a method referred to herein as xe2x80x9cGaussianizationxe2x80x9d, which transforms high dimensional data into a standard Gaussian distribution. The Gaussianization method is an iterative method that converges. As is shown herein, the Gaussianization method is computationally more efficient than prior art methods that transform high dimensional data into Gaussian variables.
An iterative expectation maximization (EM) method is also provided herein, which increases the auxiliary function Q of the EM method with each iteration. The EM method of the invention may be advantageously employed in the Gaussianization method of the invention to obtain convergence.
According to a first aspect of the invention, a method is provided formining high dimensional data. The method includes the step of linearly transforming the high dimensional data into less dependent coordinates, by applying a linear transform of n rows by n columns to the high dimensional data. Each of the coordinates are marginally Gaussianized, the Gaussianization being characterized by univariate Gaussian means, priors, and variances. The transforming and Gaussianizing steps are iteratively repeated until the coordinates converge to a standard Gaussian distribution. The coordinates of all iterations are arranged hierarchically to facilitate data mining. The arranged coordinates are then mined.
According to a second aspect of the invention, the transforming step further comprises the step of applying an iterative maximum likelihood expectation maximization (EM) method to the high dimensional data.
According to a third aspect of the invention, the method further comprises the step of computing a log likelihood of the high dimensional data, prior to the transforming step.
According to a fourth aspect of the invention, the EM method comprises the step of computing an auxiliary function Q of the EM method based upon the log likelihood of the high dimensional data. The univariate Gaussian priors are updated, to maximize the auxiliary function Q. The univariate Gaussian variances, the linear transform, and the univariate Gaussian means, are respectively updated to maximize the auxiliary function Q. The transform is updated in the preceding step row by row. The second updating step is repeated, until the auxiliary function Q converges to a local maximum. The computing step and the second updating step are repeated, until the log likelihood of the high dimensional data converges to a local maximum.
According to a fifth aspect of the invention, the linear transform is fixed, when the univariate Gaussian variances are updated.
According to a sixth aspect of the invention, the univariate Gaussian variances are fixed, when the linear transform is updated.
According to a seventh aspect of the invention, the linear transform is fixed, when the univariate Gaussian means are updated.
According to an eighth aspect of the invention, the arranging step hierarchically arranges the coordinates of all the iterations in a tree structure.
According to a ninth aspect of the invention, a method is provided for visualizing high dimensional data. The method includes the step of linearly transforming the high dimensional data into less dependent coordinates, by applying a linear transform of n rows by n columns to the high dimensional data. Each of the coordinates are marginally Gaussianized, the Gaussianization being characterized by univariate Gaussian means, priors, and variances. The transforming and Gaussianizing steps are iteratively repeated until the coordinates converge to a standard Gaussian distribution. The coordinates of all iterations are arranged hierarchically into high dimensional data sets to facilitate data visualization. The high dimensional data sets are then visualized.
These and other aspects, features and advantages of the present invention will become apparent from the following detailed description of preferred embodiments, which is to be read in connection with the accompanying drawings.