The detection of changes in biological states, particularly the early detection of diseases, has been a central focus of the medical research and clinical community. The prior art includes examples of efforts to extract diagnostic information from the data streams formed by physical or chemical analysis of tissue samples. These techniques are generically termed “data-mining.” The data streams that have been mined are typically of two forms: analysis of the levels of mRNA expression by hybridization to DNA oligonucleotide arrays (“DNA microarrays”) and analysis of the levels of proteins present in a cell or in a serum sample, wherein the proteins are characterized either by molecular weight using mass spectroscopy or by a combination of molecular weight and charge using a 2-D gel technique.
Rajesh Parekh and colleagues have described protein-based data-mining diagnosis of hepatocellular cancer using serum or plasma samples (WO 99/41612), breast cancer using tissue samples (WO 00/55628) and rheumatoid arthritis using serum or plasma samples (WO 99/47925). In each publication, a two-dimensional gel analysis is performed. The analysis consists of measuring the levels of individual proteins as determined by the 2-D gels and identifying those proteins that are elevated or depressed in the malignant tissue as compared to the normal tissue.
Liotta and Petricoin (WO 00/49410) provide additional examples of protein-based diagnostic methods using both 2-D gels and mass spectroscopy. However, the analysis of Liotta and Petricoin is similar to that of Parekh in that it consists of a search for specific tumor markers. Efforts to identify tumor markers have also been made using DNA microarrays. Loging, W. T., 2000, Genome Res. 10, 1393-1402, describes efforts to identify tumor markers by DNA microarrays in glioblastoma multiforme. Hedenfalk, I., et al., 2001, New England J. Med. 344, 539, report efforts to identify tumor markers that distinguish the hereditary forms of breast cancer resulting from BRCA1 and BRCA2 mutations from each other and from common idiopathic breast cancer by data-mining of DNA microarray data.
Alon et al., 1999, PNAS 96, 6745-50, describe the use of DNA microarray techniques to identify clusters of genes that have coordinated levels of expression in comparisons of colonic tumor samples and normal colonic tissue. These studies did, in fact, identify genes that were relatively over- or under-expressed in the tumor compared to normal tissue. However, the clustering algorithm was not designed to identify diagnostic patterns of gene expression other than a tumor-marker-type pattern.
Data-mining efforts directed towards indicators other than tumor markers have been used for diagnosis. These efforts routinely employ pattern recognition methods to identify individual diagnostic markers or classify relationships between data sets. The use of pattern recognition methods to classify genes into categories based on correlated expression under a variety of different conditions was pioneered by Eisen, M., et al., 1998, PNAS 95, 14863-68; Brown, MPS, et al., 2000, PNAS 97, 262-67 and Alter, O., et al., 2000, PNAS 97, 10101-06. In general, these techniques use a vector space in which each vector corresponds to a gene or location on the DNA microarray. Each vector is composed of scalars that individually correspond to the relative levels of expression of the gene under a variety of different conditions. Thus, for example, Brown et al. analyze vectors in a 79-dimensional vector space where each dimension corresponds to a time point in a stage of the yeast life-cycle and each of 2,467 vectors corresponds to a gene. The pattern recognition algorithms are used to identify clusters of genes whose expression is correlated with each other. Because the primary concern is the correlation of gene expression, the metric that is employed in the pattern recognition algorithms of Eisen et al. and related works is a Pearson coefficient or inner product type metric, not a Euclidean distance metric. Once clustering is established, the significance of each cluster is determined by noting any common, known properties of the genes of a cluster. The inference is made that the heretofore uncharacterized genes found in the same cluster may share one or more of these common properties.
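The distinction drawn above between a Pearson coefficient type metric and a Euclidean distance metric can be illustrated with a minimal sketch. The three expression profiles below are hypothetical values chosen only for illustration and are not taken from any cited work; the function names are likewise assumptions. The sketch shows that two vectors with the same shape but different amplitude are perfectly correlated yet far apart in Euclidean terms, whereas two vectors of similar magnitude but different shape are close in Euclidean terms yet less well correlated.

```python
import math

def pearson(u, v):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (norm_u * norm_v)

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical expression levels of three genes under four conditions.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [10.0, 20.0, 30.0, 40.0]  # same shape as gene_a, ten times the amplitude
gene_c = [1.5, 1.5, 3.5, 3.5]      # similar magnitude to gene_a, different shape

r_ab = pearson(gene_a, gene_b)    # correlation ranks gene_b nearest to gene_a
r_ac = pearson(gene_a, gene_c)
d_ab = euclidean(gene_a, gene_b)
d_ac = euclidean(gene_a, gene_c)  # Euclidean distance ranks gene_c nearest to gene_a
```

Under the correlation metric gene_b is the closer neighbour of gene_a, while under the Euclidean metric gene_c is closer, which is why the choice of metric determines which genes end up in the same cluster.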
The pattern recognition techniques of Eisen et al. were applied by Alizadeh and Staudt to the diagnosis of types of malignancy. Alizadeh and Staudt began by constructing vectors, each corresponding to a gene, having scalars that correspond to the relative level of expression of the gene under some differentiation condition, e.g., resting peripheral blood lymphocytes or mitogen-stimulated T cells. The pattern recognition algorithm then clusters the genes according to the correlation of their expression and defines a pattern of expression characteristic of each differentiation state. Samples of diffuse large B-cell lymphomas (DLBCL) were then analyzed by hybridization of mRNA to the same DNA microarrays as used to determine the gene clusters. DLBCL were found to have at least two different gene expression patterns, each characteristic of a normal differentiation state. The prognosis of the DLBCL was found to correlate with the characteristic differentiation state. Thus, the diagnostic question posed and answered in Alizadeh and Staudt was not whether a sample was benign or malignant but rather what type or subtype of malignancy it was, as determined by identifying the type of differentiated cell having a pattern of gene expression most similar to that of the malignancy. Alizadeh et al., 2000, Nature 403, 503-511. Similar techniques have been used to distinguish between acute myeloid leukemia and acute lymphocytic leukemia. Golub, T. R., et al., 1999, Science 286, 531-537.
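The idea of assigning a malignancy to the differentiation state whose expression pattern it most resembles can be sketched as follows. This is a deliberately simplified illustration and not the procedure of the cited publications: it assumes that each differentiation state has already been summarized as a single reference pattern and that similarity is measured by a Pearson coefficient. The state labels, the numerical values and the helper name `most_similar_state` are hypothetical.

```python
import math

def pearson(u, v):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (norm_u * norm_v)

def most_similar_state(sample, reference_patterns):
    """Return the label of the reference pattern best correlated with the sample."""
    return max(reference_patterns,
               key=lambda label: pearson(sample, reference_patterns[label]))

# Hypothetical reference patterns: expression of five genes in two
# normal differentiation states (values are illustrative only).
reference_patterns = {
    "state_A": [5.0, 4.0, 1.0, 0.5, 3.0],
    "state_B": [1.0, 0.5, 4.0, 5.0, 3.0],
}

# Hypothetical expression of the same five genes in a malignant sample.
sample = [4.2, 3.9, 1.4, 1.0, 2.8]

label = most_similar_state(sample, reference_patterns)
```

Note that the sketch classifies a sample already known to be malignant; it says nothing about whether a sample is benign or malignant, which mirrors the limitation described above.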
Accordingly, it can be seen that data-mining methods based on physical or chemical analyses having large numbers of data points, i.e., greater than 1,000, are of two types: data-mining to identify individual markers such as genes or proteins having expression levels that are increased or suppressed in malignant cells of a defined type compared to normal cells; and data-mining wherein a pattern of known gene expression characteristic of a normal differentiated cell type is used to classify a known malignant cell according to the normal cell type it most closely resembles.
Thus, there is a need for methods that can determine biological states using biological data other than single markers (such as tumor markers) or gene expression clusters. Usually, the role that single markers play in the pathology of a disease must be known and established, quite often at great cost, prior to the analysis of a biological sample. Additionally, these markers are often localized in internal organs or tumors, and complex, invasive, localized biopsies must be performed to obtain biological samples containing such markers. Given the complexity of biological states such as a disease, there is an exceptional need for the ability to diagnose biological states using complex data inherent to such biological states without prior extensive knowledge of the relationship of molecules present in such samples to each other.
Additionally, gene expression cluster analysis is limited in scope because such analysis incorporates an analysis of all expressed genes irrespective of whether the expression of such genes is causative or merely influenced by the causative action of those genes that are characteristic of the biological state. The clustering analysis does not incorporate solely those genes that are characteristic of the biological state of interest, but uses the entire range of data emanating from the assay, thus making it complex and cumbersome. Furthermore, gene expression analysis must involve nucleic acid extraction methods, making it complex and time-consuming. Pattern recognition algorithms, when applied, are also difficult to use because the metric employed for the correlation of gene expression is a complex Pearson coefficient or inner product type metric, and not a simple Euclidean distance metric.
In contrast to the prior art, the current invention discovers optimal hidden molecular patterns as subsets within a larger complex data field, whereby the pattern itself is discriminatory between biological states. Thus, the current invention avoids all the aforementioned problems associated with the analytical methods disclosed in the prior art, and has the ability to discover heretofore unknown diagnostic patterns. Such hidden molecular patterns are present in data streams derived from health data, clinical data, or biological data. Biological data may be derived from simple biological fluids, such as serum, blood, saliva, plasma, nipple aspirates, synovial fluids, cerebrospinal fluids, sweat, urine, fecal matter, tears, bronchial lavage, swabbings, needle aspirates, semen, vaginal fluids, pre-ejaculate, etc., making routine sampling easy, although the expression of such molecular patterns is characteristic of disease states of remote organs. No prior knowledge of specific tumor markers or the relationship of molecules present in the biological sample to each other is required or even desired. The current invention also discloses methods of data generation and analysis. Such methods of data analysis incorporate optimization algorithms in which the molecular patterns are recognized, and subjected to a fitness test in which the fitness pattern that best discriminates between biological states is chosen for the analysis of the biological samples.
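The notion of selecting, from a larger data field, the subset of data points whose pattern best discriminates between biological states can be illustrated by the following minimal sketch. It is an illustration under stated assumptions and not the method of the invention: a plain random search stands in for the optimization algorithm, the fitness test is assumed to be the fraction of samples assigned to the correct state by a nearest-centroid rule using a Euclidean distance, and the data stream is synthetic, with 12 hypothetical data points per sample of which positions 3 and 7 differ between the two states. All function and variable names are hypothetical.

```python
import math
import random

def centroid(rows):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def fitness(subset, samples, labels):
    """Fraction of samples whose nearest class centroid (Euclidean distance,
    computed only on the chosen subset of data points) matches their label."""
    projected = [[row[i] for i in subset] for row in samples]
    classes = sorted(set(labels))
    centroids = {
        c: centroid([p for p, lab in zip(projected, labels) if lab == c])
        for c in classes
    }
    correct = 0
    for p, lab in zip(projected, labels):
        predicted = min(classes, key=lambda c: math.dist(p, centroids[c]))
        if predicted == lab:
            correct += 1
    return correct / len(samples)

def search(samples, labels, subset_size, trials, seed=0):
    """Random search (a stand-in optimizer): try random subsets and keep the fittest."""
    rng = random.Random(seed)
    n_points = len(samples[0])
    best_subset, best_score = None, -1.0
    for _ in range(trials):
        subset = tuple(sorted(rng.sample(range(n_points), subset_size)))
        score = fitness(subset, samples, labels)
        if score > best_score:
            best_subset, best_score = subset, score
    return best_subset, best_score

# Synthetic data: 20 samples per state, 12 data points per sample.
# Only positions 3 and 7 carry a state-dependent shift (an assumption
# made for illustration).
rng = random.Random(1)
informative = (3, 7)
samples, labels = [], []
for label, shift in (("state_0", 0.0), ("state_1", 6.0)):
    for _ in range(20):
        row = [rng.uniform(-1.0, 1.0) for _ in range(12)]
        for i in informative:
            row[i] += shift
        samples.append(row)
        labels.append(label)

best_subset, best_score = search(samples, labels, subset_size=2, trials=200)
```

In this sketch the search requires no prior knowledge of which data points are informative; the fitness test alone drives the selection. A practical implementation would evaluate fitness on samples held out from the search rather than on the same samples, which the sketch omits for brevity.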