1. Field of the Invention
The invention pertains to the field of multivariate data analysis. More particularly, the invention pertains to methods of separating multivariate data clusters in lower dimensional space.
2. Description of Related Art
Modern medical procedures often generate more data on a single patient than a human being alone can fully visualize, interpret, or assimilate. Humans are comfortable perceiving three-dimensional (3D) images, albeit as a combination of the two-dimensional (2D) images from each eye, and can perceive their transformation in time, but higher-dimensional (ND) space is not readily conceived. Thus, in a field such as medicine, where one is often confronted with highly multi-dimensional data, one typically relies on a physician's intuition and expertise to integrate all of the varied test results, observations, and patient history details, each of which is weighted differently in the physician's mind, such that a proper, or at least a most-likely, diagnosis can be determined.
Diagnostic medical assays are often judged by their so-called receiver operating characteristic (ROC) curve, in which the sensitivity of an assay is plotted against one minus its specificity (i.e., the false-positive rate) as the decision threshold is varied, and the area under the curve (AUC) helps objectively determine whether a given assay is clinically useful (see J. A. Hanley, “Receiver operating characteristic (ROC) methodology: the state of the art”, Crit. Rev. Diagn. Imaging, vol. 29, no. 3, 1989, pp. 307-35). The AUC value is normalized such that a perfect assay would have an AUC value of 1, whereas an assay that performs no better than chance would have an AUC value of 0.5. Typically, an AUC value of 0.75 or higher is considered to be clinically useful.
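By way of illustration only, the AUC of an assay may be estimated from scored samples as in the following minimal sketch. The function names `roc_points` and `auc`, the convention that a label of 1 denotes a diseased sample, and the use of the trapezoidal rule are assumptions made for this example and are not taken from the cited reference.

```python
def roc_points(scores, labels):
    """Return (false_positive_rate, sensitivity) pairs, one per threshold.

    Assumption: labels are 1 (diseased) or 0 (healthy), and a higher
    score means the assay leans toward "diseased".
    """
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one diseased and one healthy sample")

    points = [(0.0, 0.0)]
    # Sweep the decision threshold from the highest score to the lowest.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy scores, invented for illustration only.
perfect = auc(roc_points([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))
mixed = auc(roc_points([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0]))
```

In this sketch the first toy assay ranks every diseased sample above every healthy one, while the second misorders one of its four diseased/healthy pairs, so its AUC falls below 1.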
A host of statistical tools is available to assist in analyzing large data sets, including regression analysis, cluster analysis, and principal component analysis, not only in the medical field, but also in finance, meteorology, astronomy, and any other field in which the dimensionality of the relevant data necessitates a computerized analysis of its patterns (see A. D. Flouris and J. Duffy, “Applications of artificial intelligence systems in the analysis of epidemiological data”, Eur. J. Epidemiol., vol. 21, no. 3, 2006, pp. 167-70; and G. Hellenthal and M. Stephens, “Insights into recombination from population genetic variation”, Curr. Opin. Genet. Dev., vol. 16, no. 6, 2006, pp. 565-72). However, statisticians do not currently employ tools that allow one to ideally weight the contributions of high-dimensional vectors in order to produce a clearly visible, lower-dimensional separation of data clusters.
A number of algorithms have been proposed for efficiently reducing data complexity and for properly analyzing data cluster subpopulations, using either unsupervised, semi-supervised, or supervised approaches (see W. P. Hanage and D. M. Aanensen, “Methods for data analysis”, Methods Mol. Biol., vol. 551, 2009, pp. 287-304; W. Shannon et al., “Analyzing microarray data using cluster analysis”, Pharmacogenomics, vol. 4, no. 1, 2003, pp. 41-52; H. Stenlund et al., “Orthogonal projections to latent structures discriminant analysis modeling on in situ FT-IR spectral imaging of liver tissue for identifying sources of variability”, Anal. Chem., vol. 80, no. 18, 2008, pp. 6898-906; and G. Yona et al., “Comparing algorithms for clustering of expression data: how to assess gene clusters”, Methods Mol. Biol., vol. 541, 2009, pp. 479-509). These range from simple statistical evaluations (e.g., Gaussian distributions, minimum covariance estimators) to more complex weighted functions (e.g., support vector data description, lp distance, orthogonal projections to latent structures), but there is typically a trade-off between ease of use and the quality of the results that they provide. More complex multivariate analysis methods generally provide greater diagnostic/predictive power, but are often too advanced to be implemented or understood by a typical biomedical researcher, even one with access to complicated, and often quite expensive, multivariate data analysis software.
There is a need in the art for a method of finding a mathematical transformation that produces distinct, observable data clustering to allow future, unknown data to be categorized easily and reliably. In the medical field, this would involve finding a transformation matrix, or set of transformation matrices, that, when applied to the appropriate data vector that has been gathered for an individual patient, results in a highly accurate diagnosis of their disease state or states, inasmuch as the acquired data were relevant factors in determining the presence or absence of such states. Reducing data dimensionality for this kind of straightforward hypothesis testing would be applicable to other fields, including, but not limited to, economics, finance, insurance, meteorology, and astronomy.
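As a purely illustrative sketch of what applying such a transformation matrix to a patient's data vector could look like, the following example projects a four-dimensional vector into two dimensions and assigns it to the nearest cluster centroid. The matrix `T`, the centroids, the patient values, and the nearest-centroid rule are all hypothetical stand-ins chosen for this example; the text above does not specify how the transformation is found or how the categorization is performed.

```python
def project(matrix, vector):
    """Apply a (k x n) transformation matrix to an n-dimensional data vector."""
    if any(len(row) != len(vector) for row in matrix):
        raise ValueError("matrix column count must match vector length")
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def nearest_centroid(point, centroids):
    """Return the label whose centroid is closest to point (squared Euclidean)."""
    def sq_dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    return min(centroids, key=lambda label: sq_dist(point, centroids[label]))


# Hypothetical 2 x 4 transformation matrix: invented weights that map four
# measurements down to two coordinates. Not taken from the source.
T = [[1.0, 0.0, 0.5, 0.0],
     [0.0, 1.0, 0.0, -0.5]]

# Hypothetical cluster centers in the reduced 2D space.
centroids = {"healthy": [0.0, 0.0], "disease": [3.0, 1.0]}

# Hypothetical four-measurement data vector for one patient.
patient = [2.0, 1.5, 1.0, 1.0]

low_dim = project(T, patient)
label = nearest_centroid(low_dim, centroids)
```

The point of the sketch is only that, once a suitable matrix is known, categorizing a new, unknown data vector reduces to a matrix-vector product followed by a simple comparison in the lower-dimensional space.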