Classification of tissues historically depended on an examination of the gross morphology and histology of the tissues. The inadequacy of these methods for classification of tissues that are similar in appearance, such as different tumors arising from the same tissue or organ, is well known. More recently, methods of tissue analysis and classification have been developed that rely on genetic analysis of tissues. While these genetic analytical methods are more powerful for distinguishing tissue types, and are simple to practice, they also generate quantities of data that are orders of magnitude greater than those produced by classical histology methods. For example, gene expression profiling of tissues by the application of microarray technology simultaneously determines the expression of thousands of genes. Identifying, from the raw data, the data useful in tissue classification is a problem that requires a solution that can be applied in clinical settings. Present statistical and data display methods are not satisfactory for this application.
Thus it is of key importance, in analytical processes for classification of unknown tissues, to reduce large quantities of multivariate data to a quantity that can be readily analyzed. A typical treatment of multivariate data is to generate binary statistics known to those in the art. For example, a binary regression of two data sets associated with two objects will produce a statistical relationship between the two objects within a statistical confidence.
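By way of illustration only, a binary regression of the kind referred to above might be sketched as an ordinary least-squares fit between the measurements of two objects, together with a Pearson correlation coefficient. The function name `pairwise_regression` and the example values are hypothetical, and the sketch makes no attempt to compute a confidence interval or p-value; it is not drawn from any particular prior art tool.

```python
import math

def pairwise_regression(x, y):
    """Least-squares fit y ~ slope * x + intercept, plus Pearson r.

    x and y hold the same variables (e.g. genes) measured in two objects.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length data sets with >= 2 values")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("a data set with zero variance cannot be regressed")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r

# Hypothetical measurements of five variables in two objects.
object_a = [1.0, 2.0, 3.0, 4.0, 5.0]
object_b = [2.1, 3.9, 6.2, 8.0, 9.8]
slope, intercept, r = pairwise_regression(object_a, object_b)
print(f"slope={slope:.3f} intercept={intercept:.3f} r={r:.3f}")
```

A single pairwise statistic of this kind summarizes only two objects at a time, which is one reason such methods scale poorly to thousands of variables and many objects.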
Prior art data analysis tools have focused on describing objects of diagnostic relevance exhaustively, which in turn has required binary statistical methods to “collapse” large volumes of data for decision making. These methodologies require sophisticated analysis and interpretation by the end user. The prior art thus lacks methods that analyze large quantities of multivariate data using statistically proven techniques while still providing for ready interpretation of the data.
Highly parallel quantitative measurement systems are increasingly available for analysis of complex biological systems, but the lack of practical data reduction systems limits their application. For example, microarray (RNA) expression analysis is used for phenotyping of human tissues; quantitative measurement of thousands of RNAs (“variables”) in a single tissue (“object”) is used to assign group membership (“classify”). Typically, a very large number of variables are measured in a training set comprising two different tissue types, and a subset of variables that best distinguishes the two tissue types is identified through some statistical process. This can yield dozens, or even hundreds, of variables that are individually imperfect, but collectively effective, for classification of future (test) objects that were not part of the training set.
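The variable selection step described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the "statistical process" is a per-variable Welch t-statistic and that the variables with the largest absolute statistic are retained; the text does not name a particular test. The names `welch_t` and `select_top_variables`, the gene labels, and the data values are all hypothetical.

```python
import math
from statistics import mean, variance

def welch_t(group_a, group_b):
    """Welch t-statistic for one variable measured in two tissue types."""
    diff = mean(group_a) - mean(group_b)
    se = math.sqrt(variance(group_a) / len(group_a)
                   + variance(group_b) / len(group_b))
    if se == 0:
        # No spread in either group: identical means give no separation,
        # different means give perfect separation.
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / se

def select_top_variables(tissue_a, tissue_b, k):
    """Return the k variables whose |t| best separates the two tissues.

    tissue_a and tissue_b map a variable name to its list of
    training-set measurements in that tissue type.
    """
    scores = {name: welch_t(tissue_a[name], tissue_b[name])
              for name in tissue_a}
    ranked = sorted(scores, key=lambda name: abs(scores[name]), reverse=True)
    return [(name, scores[name]) for name in ranked[:k]]

# Hypothetical training set: three variables, four objects per tissue type.
tissue_a = {"gene1": [5.0, 5.2, 4.9, 5.1],
            "gene2": [1.0, 3.0, 2.0, 2.5],
            "gene3": [8.0, 7.5, 8.2, 7.9]}
tissue_b = {"gene1": [5.1, 5.0, 5.2, 4.8],
            "gene2": [6.0, 7.5, 6.5, 7.0],
            "gene3": [7.8, 8.1, 7.7, 8.0]}

for name, t in select_top_variables(tissue_a, tissue_b, k=2):
    print(f"{name}: t = {t:.2f}")
```

In practice the retained subset may contain dozens or hundreds of variables, so the output of such a step is itself a multivariate data set that still has to be presented to the end user in an interpretable form.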
Once a predictive model is constructed from a suitable training set, it is desirable to have a simple method to generalize this model to a variety of variable measurement systems (different manufacturers' gene arrays, for example). This is not readily possible if the different platforms have independent units of measurement and different sensitivity thresholds. Diagnostic and prognostic evaluation of tissues is hampered by this cross-platform incompatibility. Similar problems exist with respect to sample-to-sample, run-to-run and user-to-user variability when using the same device or an identical device.
In addition, the variables are not easily evaluated for any sample, even with the training set data as a comparison. For example, gene expression data frequently is provided as a table describing the increases in expression of the set of genes. This type of display requires the end user to compare large sets of data for the expression of multiple genes and make judgments as to the phenotype of a tissue based on these multiple parameters. Accordingly, there is a need to develop methods of obtaining and analyzing data that describe a tissue more accurately and more precisely for diagnostic evaluation. Further, there is a need for methods of displaying multivariate data without reference to the specific units of data generated by specific analytical devices. There also is a need for methods and devices to display the data for rapid and simplified evaluation by end users, particularly in a clinical setting.