The invention pertains to the field of biostatistics, and more particularly to methods of classifying high dimensional biological data.
With the wealth of gene expression data from microarrays (such as high density oligonucleotide arrays and cDNA arrays), prediction, classification, and clustering techniques are used for analysis and interpretation of the data. Developments in the field of proteomics are expected to generate vast amounts of protein expression data by quantitating the amounts of a large number of different proteins within a cell or tissue. One can easily imagine carrying out experiments to generate large volumes of data that correlate, e.g., the expression patterns of proteins, mRNAs, cellular complements of membrane lipids, or other metabolic factors to a biologic response (e.g., sensitivity of a cell to a drug), to one of two biologic states (e.g., normal or disease states), or to one of a number of biologic states (e.g., one of a number of different tumor types). One challenge of dealing with the large numbers of variables sampled using microarray technologies is developing methods to extract meaningful information from the data that can be used to predict or classify the biological state or response of a sample. Such methods would dramatically improve our ability to apply genomics or proteomics data to improve medical diagnoses and treatments.
The use of global gene expression data from microarrays for human cancer research is relatively new (DeRisi et al., 1996). However, since the introduction of DNA microarray technology to quantitate the expression of thousands of genes simultaneously (Schena et al., 1995; Lockhart et al., 1996), there has been increasing activity in the area of cancer classification or discrimination. For example, Golub et al. (1999) used a weighted voting scheme for the molecular classification of acute leukemia based on gene expression monitoring from Affymetrix high-density oligonucleotide arrays. Also using Affymetrix oligonucleotide arrays, Alon et al. (1999) used a clustering technique based on the deterministic-annealing algorithm to classify cancer and normal colon tissues. Scherf et al. (2000) and Ross et al. (2000) used classical clustering techniques such as average-linkage to cluster tumor tissues from various sites of origin: colon, renal, ovarian, breast, prostate, lung, and central nervous system, as well as leukemias and melanomas. The popular method of support vector machines (“SVM”) introduced by Vapnik was applied to the classification of tumor and normal ovarian tissues by Furey et al. (2000). The use of gene expression profiles to distinguish between negative and positive for BRCA1 mutation (as well as negative and positive for BRCA2 mutation) in hereditary breast cancer was described by Hedenfalk et al. (2001). Some other important applications in human cancer include classifying diffuse large B-cell lymphoma (“DLBCL”) (Alizadeh et al., 2000), mammary epithelial cells and breast cancer (Perou et al., 1999, 2000), and skin cancer melanoma (Bittner et al., 2000) based on gene expression data. Dudoit et al. (2000) and Ben-Dor et al. (2000) presented comparative studies of classification methods applied to various cancer gene expression data.
These techniques have also helped to identify previously undetected subtypes of cancer (Golub et al., 1999; Alizadeh et al., 2000; Bittner et al., 2000; Perou et al., 2000). The problem of deriving useful “predictions” from high dimensional data may arise in various other applications as well, such as, e.g., using expression array data and Kaplan-Meier survival curves to predict the survival duration of patients with germinal center B-like DLBCL as compared to those with activated B-like DLBCL (Ross et al., 2000).
Gene expression data from DNA microarrays is characterized by many measured variables (genes) on only a few observations (experiments), although both the number of experiments and the number of genes per experiment are growing rapidly. The number of genes on a single array usually is in the thousands, so the number of variables p easily exceeds the number of observations N. Although the number of measured genes is large, there may only be a few underlying gene components that account for much of the data variation; for instance, only a few linear combinations of a subset of genes may account for nearly all of the response variation. Unfortunately, it is exceedingly difficult to determine which genes are members of the subset given the large number of genes, p, and the small number of observations, N. The fact that experiments such as, e.g., microarray experiments are characterized by many measured variables (e.g., genes), p, on only relatively few observations or samples, N, renders all statistical methods requiring N>p of no direct use.
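By way of illustration only, the following Python sketch shows one reason why N<p is problematic for methods that must invert the p-by-p matrix X^T X (for example, ordinary least squares regression). The toy matrix X, with N=3 samples and p=5 variables, and the helper names matrix_rank and gram are hypothetical and are not taken from any of the cited works. Because the rank of X^T X cannot exceed N, that matrix is singular whenever N<p.

```python
from fractions import Fraction


def matrix_rank(rows):
    """Rank of a matrix by Gaussian elimination in exact rational arithmetic."""
    m = [[Fraction(v) for v in row] for row in rows]
    n_rows, n_cols = len(m), len(m[0])
    rank = 0
    for col in range(n_cols):
        # Find a row at or below the current pivot position with a nonzero entry.
        pivot = next((r for r in range(rank, n_rows) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        # Eliminate the entries below the pivot.
        for r in range(rank + 1, n_rows):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
        if rank == n_rows:
            break
    return rank


def gram(rows):
    """Compute X^T X for a samples-by-variables matrix X."""
    cols = list(zip(*rows))
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]


# Hypothetical toy data: N = 3 samples (rows), p = 5 genes (columns).
X = [
    [2, 1, 0, 3, 1],
    [1, 4, 2, 0, 1],
    [0, 1, 5, 2, 3],
]
N, p = len(X), len(X[0])
G = gram(X)            # p-by-p matrix
rank_G = matrix_rank(G)

# rank_G is at most N, so the p-by-p matrix G has no inverse when N < p.
print(f"N={N}, p={p}, rank(X^T X)={rank_G}, invertible={rank_G == p}")
```

Exact fractions are used here only to keep the rank computation free of floating-point tolerance choices; with real expression data a numerical rank estimate would be used instead.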
While this problem has been described with reference to gene expression data from DNA microarrays, similar issues arise with any type of biological data in which the number of variables measured exceeds the number of observations, and the methods of the invention are applicable to many different types of biological data. Thus, there is a need in the art for methods of dealing with such “high dimensional” data (i.e., data that are statistically underdetermined because there are fewer observations, N, than the number of variables, p) to allow classification of biological samples. Methods are needed for binary classification (e.g., to discriminate between two classes such as normal and cancer samples, and between two types of cancers) based on high dimensional data obtained from the sample. Methods also are needed for classification or discrimination of more than two groups or classes (“multi-class”). The need for multi-class discrimination methodologies is apparent in many microarray experiments where various cancer types are simultaneously considered. The present invention addresses these and other shortcomings in the art by providing statistical methods of analyzing biological data to permit accurate classification of samples. The invention uses the method of partial least squares (“PLS”) (for binary classification) or the method of multivariate partial least squares (“MPLS”) (for multi-class classification) as a dimension reduction technique, followed by a classification step.
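By way of illustration only, the two-stage approach described above (dimension reduction by PLS followed by a classification step) can be sketched in Python for the binary case. The sketch assumes a NIPALS-style PLS1 computation with the class label coded as 0 or 1, and it assumes a nearest-centroid rule as the classification step. The classifier, the toy data, and all function names are hypothetical choices made for this example and should not be read as the specific procedure of the invention.

```python
import math


def column_means(rows):
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def pls1_fit(X, y, n_components):
    """NIPALS-style PLS1 (assumed variant): returns column means, weights, loadings."""
    x_mean = column_means(X)
    y_mean = sum(y) / len(y)
    E = [[v - m for v, m in zip(row, x_mean)] for row in X]  # centered X
    f = [v - y_mean for v in y]                              # centered y
    n, p = len(E), len(x_mean)
    weights, loadings = [], []
    for _ in range(n_components):
        # Weight vector: direction in variable space proportional to X^T y.
        w = [sum(E[i][j] * f[i] for i in range(n)) for j in range(p)]
        norm = math.sqrt(sum(v * v for v in w))
        if norm < 1e-12:
            break  # nothing left in X that covaries with y
        w = [v / norm for v in w]
        t = [sum(a * b for a, b in zip(row, w)) for row in E]  # sample scores
        tt = sum(v * v for v in t)
        load = [sum(E[i][j] * t[i] for i in range(n)) / tt for j in range(p)]
        q = sum(a * b for a, b in zip(f, t)) / tt
        # Deflate X and y so the next component captures new variation.
        E = [[E[i][j] - t[i] * load[j] for j in range(p)] for i in range(n)]
        f = [f[i] - q * t[i] for i in range(n)]
        weights.append(w)
        loadings.append(load)
    return x_mean, weights, loadings


def pls1_scores(model, x):
    """Project one sample onto the fitted components (p values -> K scores)."""
    x_mean, weights, loadings = model
    e = [v - m for v, m in zip(x, x_mean)]
    scores = []
    for w, load in zip(weights, loadings):
        t = sum(a * b for a, b in zip(e, w))
        e = [a - t * b for a, b in zip(e, load)]
        scores.append(t)
    return scores


def centroid_fit(scores, labels):
    """Assumed classification step: one centroid per class in score space."""
    return {lab: column_means([s for s, l in zip(scores, labels) if l == lab])
            for lab in set(labels)}


def centroid_predict(centroids, s):
    return min(centroids,
               key=lambda lab: sum((a - b) ** 2 for a, b in zip(s, centroids[lab])))


# Hypothetical toy data: N = 4 samples, p = 6 genes, two classes (0 and 1).
X = [
    [1.0, 0.9, 0.2, 0.5, 0.3, 0.7],
    [1.2, 1.1, 0.4, 0.4, 0.2, 0.6],
    [3.0, 2.8, 0.3, 0.6, 0.4, 0.5],
    [3.2, 3.1, 0.1, 0.5, 0.3, 0.8],
]
y = [0, 0, 1, 1]

model = pls1_fit(X, y, n_components=2)            # reduce p = 6 to at most 2 scores
train_scores = [pls1_scores(model, row) for row in X]
centroids = centroid_fit(train_scores, y)
train_pred = [centroid_predict(centroids, s) for s in train_scores]
new_pred = centroid_predict(centroids,
                            pls1_scores(model, [2.9, 3.0, 0.2, 0.5, 0.3, 0.6]))
print(train_pred, new_pred)
```

In this sketch the classifier operates only on the low-dimensional scores, so the classification step itself no longer requires N>p in the original variable space. Any other classifier suited to a small number of predictors could be substituted for the nearest-centroid rule.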