This invention relates generally to data modeling and analysis such as principal component analysis, and more particularly to Bayesian principal component analysis.
Data modeling has become an important tool in solving complex and large real-world computerizable problems. Applications of data modeling include data compression, density estimation, and data visualization. A data modeling technique used for these and other applications is principal component analysis (PCA), which has proven to be a popular technique for data modeling applications such as data compression, image analysis, visualization, pattern recognition, regression, and time-series prediction. Other data modeling applications in which PCA can be applied include density modeling of emission densities in speech recognition, clustering of data for data mining applications, and building class-conditional density models for handwriting recognition.
A common definition of PCA is that, for a set D of observed d-dimensional data vectors {t_n}, n ∈ {1, . . . , N}, the q principal axes w_j, j ∈ {1, . . . , q}, are those orthonormal axes onto which the retained variance under projection is maximal. As those of ordinary skill within the art can appreciate, it can be shown that the vectors w_j are given by the q dominant eigenvectors (those with the largest associated eigenvalues) of the sample covariance matrix S = Σ_n (t_n − t̄)(t_n − t̄)^T/N, such that S w_j = λ_j w_j, where t̄ is the sample mean. The vector x_n = W^T(t_n − t̄), where W = (w_1, w_2, . . . , w_q), is thus a q-dimensional reduced representation of the observed vector t_n.
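The conventional eigendecomposition just described can be sketched in a few lines of numpy. This is purely an illustration of the prior-art technique, not the claimed invention; the function name pca and its arguments are chosen here for illustration only:

```python
import numpy as np

def pca(T, q):
    """Return the q principal axes W (columns) and the reduced
    representations X for an (N, d) data matrix T whose rows are t_n."""
    t_bar = T.mean(axis=0)                 # sample mean t-bar
    centered = T - t_bar
    S = centered.T @ centered / len(T)     # sample covariance S
    eigvals, eigvecs = np.linalg.eigh(S)   # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:q]  # q dominant eigenvectors
    W = eigvecs[:, order]                  # d x q matrix W = (w_1, ..., w_q)
    X = centered @ W                       # rows are x_n = W^T (t_n - t_bar)
    return W, X
```

Because the eigenvectors of a symmetric matrix are orthonormal, the columns of W returned by this sketch satisfy the orthonormality requirement in the definition above.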
A limitation of conventional PCA is that it does not define a probability distribution. However, as described in the reference M. E. Tipping and C. M. Bishop, Probabilistic principal component analysis (1997), PCA can be reformulated as the maximum likelihood solution of a specific latent variable model. This solution is referred to as probabilistic PCA. As with conventional PCA, however, the model utilized provides no mechanism for determining the value of the latent-space dimensionality q. For q = d−1 the model is equivalent to a full-covariance Gaussian distribution, while for q less than d−1 it represents a constrained Gaussian distribution in which the variance in the remaining d−q directions is modeled by a single parameter σ². Thus, the choice of q corresponds to a problem in model complexity optimization. If data is plentiful, then cross-validation to compare all possible values of q offers a possible approach. However, this can quickly become intractable for mixtures of probabilistic PCA models if each component is desired to have its own q value.
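The maximum likelihood solution of probabilistic PCA referred to above has a closed form in terms of the eigendecomposition of the sample covariance S: W is built from the q dominant eigenvectors, and σ² is the average of the discarded eigenvalues. The following numpy sketch (illustrative only; the function name ppca_ml is an assumption, and the arbitrary rotation of the solution is taken to be the identity) shows this closed form:

```python
import numpy as np

def ppca_ml(T, q):
    """Closed-form maximum-likelihood probabilistic PCA (per Tipping and
    Bishop): returns W_ml and sigma2_ml for an (N, d) data matrix T."""
    N, d = T.shape
    centered = T - T.mean(axis=0)
    S = centered.T @ centered / N
    eigvals, eigvecs = np.linalg.eigh(S)
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]  # descending order
    sigma2 = eigvals[q:].mean()          # variance lost in the d-q directions
    # scale each retained eigenvector by sqrt(lambda_j - sigma2)
    W = eigvecs[:, :q] @ np.diag(np.sqrt(eigvals[:q] - sigma2))
    return W, sigma2
```

Note that this sketch, like conventional PCA, still requires q to be supplied by the user, which is exactly the model-selection difficulty discussed above.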
For these and other reasons, there is a need for the present invention.
The invention relates to Bayesian principal component analysis. In one embodiment, a computer-implemented method for performing Bayesian PCA includes inputting a data model; receiving a prior distribution of the data model; determining a posterior distribution; generating output data based on the posterior distribution (such as a data model, a plurality of principal components, and/or a distribution); and outputting the output data. In another embodiment, a computer-implemented method includes inputting a mixture of a plurality of data spaces; determining a maximum number of principal components for each of the data spaces within the mixture; and outputting the maximum number of principal components for each of the data spaces within the mixture.
Thus, the invention provides for a Bayesian treatment of PCA. A prior distribution, such as P(μ, W, σ²), is received over the parameters of the inputted data model. The corresponding posterior distribution, such as P(μ, W, σ²|D), is then obtained, for example, by multiplying the prior distribution by the likelihood function and normalizing. In one embodiment, the output data is generated by obtaining a predictive density, by marginalizing over the parameters, so that
P(t|D) = ∫∫∫ P(t|μ, W, σ²) P(μ, W, σ²|D) dμ dW dσ².
To implement this framework, embodiments of the invention address two issues: the choice of prior distribution, and the formulation of a tractable algorithm. Thus, embodiments of the invention control the effective dimensionality of the latent space (corresponding to the number of retained principal components). Furthermore, embodiments of the invention avoid discrete model selection and instead utilize continuous hyper-parameters to determine automatically an appropriate effective dimensionality for the latent space as part of the process of Bayesian inference.
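One illustrative sketch of such an approach (an assumption for purposes of explanation, not necessarily the exact procedure of the invention) places an automatic-relevance-determination style Gaussian prior on W, with one continuous precision hyper-parameter alpha_i per column. An EM-style iteration then re-estimates these hyper-parameters; columns of W driven to zero no longer contribute, so the number of surviving columns is an effective latent-space dimensionality determined from the data rather than by discrete model selection. The function name bayesian_pca_dims and the pruning threshold are illustrative assumptions:

```python
import numpy as np

def bayesian_pca_dims(T, q_max=None, n_iter=200, prune_tol=1e-4):
    """Illustrative ARD-style sketch: EM for probabilistic PCA with a
    per-column Gaussian prior on W whose precisions alpha_i are
    continuously re-estimated; returns the count of surviving columns,
    i.e. an effective latent-space dimensionality."""
    N, d = T.shape
    if q_max is None:
        q_max = d - 1                       # start from the maximal model
    centered = T - T.mean(axis=0)
    S = centered.T @ centered / N           # sample covariance
    eigvals, eigvecs = np.linalg.eigh(S)
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]
    sigma2 = max(eigvals[q_max:].mean(), 1e-12)
    W = eigvecs[:, :q_max] * np.sqrt(np.maximum(eigvals[:q_max] - sigma2, 1e-12))
    alpha = np.ones(q_max)                  # continuous hyper-parameters
    for _ in range(n_iter):
        M_inv = np.linalg.inv(W.T @ W + sigma2 * np.eye(q_max))
        # E-step sufficient statistics, summed over the N data points
        sum_tx = N * (S @ W @ M_inv)        # sum_n t_n <x_n>^T
        sum_xx = N * (sigma2 * M_inv + M_inv @ W.T @ S @ W @ M_inv)
        # MAP M-step: the ARD prior contributes sigma2 * diag(alpha)
        W_new = sum_tx @ np.linalg.inv(sum_xx + sigma2 * np.diag(alpha))
        sigma2 = max(np.trace(S - S @ W @ M_inv @ W_new.T) / d, 1e-12)
        W = W_new
        # re-estimate hyper-parameters; large alpha_i switches off column i
        alpha = d / np.maximum((W ** 2).sum(axis=0), 1e-12)
    return int(np.sum((W ** 2).sum(axis=0) > prune_tol))
```

In this sketch, a column whose squared norm shrinks acquires a large precision alpha_i, which shrinks it further on the next iteration, so superfluous directions are suppressed continuously during inference rather than compared discretely by cross-validation.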