Advances in biological and computational methods are providing numerous opportunities for improving human health. Two areas of particular importance are genomic and proteomic variation. Human beings, like members of many species, share a remarkable degree of genetic similarity. It is estimated that the DNA of any two people chosen at random is approximately 99.9% identical. It is the variation in the remaining ~0.1% that is responsible for heritable traits that confer upon us our individually recognizable features such as hair/eye/skin color, body shape and size, facial characteristics, personality traits, and so forth. In addition to these rather obvious distinguishing features, the genetic variations referred to above also confer upon us certain other characteristics, some desirable and others not, such as susceptibility to specific diseases (e.g. cancer, heart disease, diabetes, etc.), or conversely, traits which may help to protect us from some diseases (e.g. genetically low cholesterol, which may help protect us from cardiovascular disease).
The recent sequencing and publication of the entire human genome, as discussed for example by Collins, F. S., et al. (1987, Cytogenet. Cell Genet. 46:597), has set the stage for a level of understanding of causes and cures for many human diseases not heretofore possible. There are many diseases which have undisputed genetic factors, yet for which the specific genes or combinations of genes remain undiscovered. This is due in part to the fact that only some diseases have essentially monogenic bases, meaning that a single gene is essentially responsible for the disease and that individuals who carry the gene have a high likelihood of developing the disease. In contrast, many more diseases are presumably polygenic in origin, meaning that there are two or more (possibly many) genes whose simultaneous presence, and possible interactions, are required to cause the disease. The discovery of these polygenic systems will be facilitated by the results from the completion of sequencing of the human genome and by the present invention, as described herein.
Since the human genome appears to be highly similar across individuals, it is of interest to determine the nature of the variations from one individual to another. Research has determined that genetic variations tend to occur at individual nucleotides, rather than over large lengths of nucleotides. These variations have come to be known as Single Nucleotide Polymorphisms, or “SNPs”. Moreover, most genetic variations do not seem to occur at arbitrary locations, but at relatively highly-conserved locations spread (not necessarily uniformly) over the genome on the order of once in every few hundred to few thousand locations. In one aspect, this opens the possibility of creating a SNP map by comparing the genomes of several (or many) individuals. If the intervening conserved nucleotides are ignored, then the remaining SNPs represent the actual differences among individuals. This is significant from a computational perspective since this reduces the number of locations to be processed from approximately three billion for the entire genome to perhaps a few million for only the SNPs. It is now possible to see how the knowledge of the sequence of the human genome is significant: it represents, among other things, a constant reference against which the genomes of individuals may be compared to determine sites of polymorphism.
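The reduction described above, from every position in the genome to only the polymorphic sites, can be illustrated with a minimal sketch. It assumes the individual sequences are already aligned to the reference and have equal length; the function names `snp_positions` and `snp_map` and the toy sequences are hypothetical and serve only as illustration.

```python
from typing import Dict, List


def snp_positions(reference: str, genomes: List[str]) -> List[int]:
    """Return 0-based positions where at least one genome differs from the reference."""
    if any(len(g) != len(reference) for g in genomes):
        raise ValueError("all sequences must be aligned to the same length")
    return [
        i
        for i, ref_base in enumerate(reference)
        if any(g[i] != ref_base for g in genomes)
    ]


def snp_map(reference: str, genomes: Dict[str, str]) -> Dict[str, str]:
    """Reduce each genome to its alleles at the polymorphic sites only."""
    sites = snp_positions(reference, list(genomes.values()))
    return {name: "".join(seq[i] for i in sites) for name, seq in genomes.items()}


# Toy data (hypothetical): a 10-base reference and three aligned individuals.
reference = "ACGTACGTAC"
genomes = {
    "person_1": "ACGTACGTAC",
    "person_2": "ACGAACGTAC",
    "person_3": "ACGAACGTCC",
}

sites = snp_positions(reference, list(genomes.values()))
reduced = snp_map(reference, genomes)
print(sites)    # polymorphic positions
print(reduced)  # each individual described by SNP alleles only
```

In this toy case ten positions per individual shrink to two, which mirrors on a small scale the reduction from roughly three billion positions to a few million SNPs.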
Until the present invention, most methods of SNP analysis have relied on determining Linkage Disequilibrium (LD) among SNPs, as described for example by Reich et al. (2001, Nature 411:199-204). LD is essentially a measure of non-randomness in the distribution of alleles among individuals. In population genetics studies, genetically homogeneous populations are used specifically to limit sources of genetic variation and thus focus attention on factors responsible for disease. In contrast, participants in drug trials are genetically much more heterogeneous, making perception of causal genetic factors more difficult; individual SNPs have insufficient correlation with phenotype (disease, drug response, adverse reaction) to be easily detected. In any case, a key limitation of most methods of determining LD is that SNPs are compared in a pairwise fashion. Extended relationships among arbitrary numbers of SNPs are still very difficult to determine. Patterns of SNPs, on the other hand, should be significantly more specifically correlated with phenotype than any of the individual SNPs comprising them. This is the subject of the present invention, in part, disclosed in more detail elsewhere herein.
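To make the pairwise nature of LD concrete, the following minimal sketch computes two widely used LD statistics for a single pair of biallelic SNPs: the coefficient D = p_AB - p_A * p_B and the squared correlation r^2. It is not taken from the cited reference; the function name `pairwise_ld` and the 0/1 allele coding are assumptions made for illustration, and phased haplotypes are assumed to be available.

```python
from typing import List, Tuple


def pairwise_ld(haplotypes: List[Tuple[int, int]]) -> Tuple[float, float]:
    """Return (D, r_squared) for two biallelic SNPs.

    Each haplotype is a pair (allele at SNP A, allele at SNP B), coded 0 or 1.
    """
    n = len(haplotypes)
    if n == 0:
        raise ValueError("at least one haplotype is required")
    p_a = sum(a for a, _ in haplotypes) / n            # frequency of allele 1 at SNP A
    p_b = sum(b for _, b in haplotypes) / n            # frequency of allele 1 at SNP B
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                               # deviation from independence
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both SNPs must be polymorphic in the sample")
    return d, d * d / denom


# Hypothetical samples: perfectly correlated alleles versus independent alleles.
linked = [(1, 1), (1, 1), (0, 0), (0, 0)]
unlinked = [(1, 1), (1, 0), (0, 1), (0, 0)]

d_linked, r2_linked = pairwise_ld(linked)
d_unlinked, r2_unlinked = pairwise_ld(unlinked)
print(d_linked, r2_linked)
print(d_unlinked, r2_unlinked)
```

Because the statistic is defined for one pair at a time, examining n SNPs in this way requires n(n-1)/2 comparisons and still says nothing directly about relationships among three or more SNPs.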
Another method that is used to render LD calculations tractable is the use of a so-called “candidate gene” approach, wherein SNPs within a local region of a chromosome or SNPs within genes that are thought to act together (e.g. on the basis of biochemical pathway analysis) are compared. Such an analysis can be thought of as being model-driven, in that it is presumed a priori that certain genes comprise the pool of possible interactions based upon some model—in this case for example those genes within a certain locale in a chromosome, or those genes that are known to interact by means of their participation in a common biochemical pathway. Clearly, there may be circumstances that violate these model assumptions. Any method that restricts analysis to only those genes consistent with the model will be blind to combinations of polymorphisms that lie outside the scope of the model. Conversely, an analysis method that is model-independent and capable of examining all possible interactions will be able to find unforeseen (i.e. un-modeled) interactions.
Due in part to familial relationships through generations (i.e. the fact that relationships among individuals are not strictly random but are to some extent correlated) it will be appreciated that certain patterns of polymorphisms tend to recur. These recognizable patterns are referred to as “haplotypes”. Methods used to discover patterns have been described, for example, by Hitt et al. (U.S. Patent Application Publication No. US2003/0004402A1) and Wall et al. (U.S. Patent Application Publication No. US2004/0052412A1). It has further been shown by Daly et al. (2001, Nat. Genet. 29:229-32) that in some cases SNPs within a certain locale tend to be correlated. This leads to the concept of a “haplotype block”, in which the SNPs within the block take on only a small number of the possible permutations. For example, it has been shown in some cases that, though there may be thousands of theoretically possible SNP permutations within a block, only a handful are actually observed. This has to do with the molecular mechanisms underlying genetic variation.
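The gap between theoretically possible and actually observed patterns within a haplotype block can be illustrated with a short sketch. It assumes biallelic SNPs coded as 0 or 1, so that a block of k SNPs admits 2^k patterns; the sample data below are invented for illustration and are not drawn from any cited study.

```python
from collections import Counter

# Hypothetical block of 12 biallelic SNPs observed in 10 chromosomes.
observed = (
    ["000000000000"] * 5
    + ["111111000000"] * 3
    + ["000000111111"] * 2
)

block_length = len(observed[0])
possible = 2 ** block_length          # every 0/1 pattern over the block
counts = Counter(observed)            # how often each pattern actually occurs
distinct = len(counts)

print(f"{distinct} of {possible} possible patterns observed")
for pattern, count in counts.most_common():
    print(pattern, count)
```

Here only 3 of 4096 possible patterns occur, so each chromosome in the block could be labeled by its haplotype rather than by twelve separate SNP values.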
In the area of proteomics, serum biomarkers provide an attractive method of screening for disease since they are noninvasive and relatively inexpensive. Such tests can also be used as adjuncts to other screening or diagnostic tests and help establish prognosis, response to therapy, and risk of recurrence. Although sensitive cancer diagnostics based on biological fluids have great potential, only a handful of useful serological tumor markers have been identified to date. One of the most successful of these is PSA (Mikolajczyk et al., 2000, Cancer Res. 60:756-759; Bangma et al., 1995, Urology 46:779-784). Other serum markers include CEA (carcinoembryonic antigen), CA19-9 in gastrointestinal tumors, and CA125 in ovarian cancer (Carl et al., 1990, Tumor Biol. 11:88; Kouri et al., 1992, J. Surg. Oncol. 49:78-85; Hunter et al., 1990, Am. J. Obstet. Gynecol. 163:1164-1167), but these and most other known serological markers are not sufficiently specific. Most known serological markers have been discovered in an ad hoc, indirect manner; e.g., proteins observed to be over-expressed in a tumor or secreted into culture medium by tumor cells were subsequently tested in patients' sera. However, this ad hoc, single-protein-at-a-time approach has met with limited success and is not an optimal strategy.
Recently, proteomic techniques have been successfully applied to the identification of changes in protein expression that correlate with early stage cancers. For example, the Surface Enhanced Laser Desorption/Ionization mass spectrometry (SELDI MS) technique was used to identify changes in blood protein mass spectra in patients with ovarian cancer (Petricoin et al., 2002, Lancet 359:572-577). Mortality from ovarian cancer is high, often because it is diagnosed at late, incurable stages. That study showed that the earlier, more curable stages of ovarian cancer are associated with changes in blood protein MS spectra, and that these changes could form the basis for early detection. However, the study also generated substantial criticism concerning SELDI's limited reproducibility and the difficulty of identifying the proteins responsible for diagnostic signals. Despite these concerns, the study clearly demonstrates that: 1) early markers of cancer do exist in serum and 2) the best diagnostics are likely to be patterns of multiple markers (biosignatures) rather than single proteins.
In similar studies, SELDI MS has been used to distinguish control subjects from patients with prostate cancer (Wright et al., 1999, Prostate Cancer Prostatic Dis. 2:264-276; Adam et al., 2002, Cancer Res. 62:3609-3614), pancreatic cancer (Valerio et al., 2001, Rapid Commun. Mass Spectrom. 15:2420-2425), and breast cancer (Li et al., 2002, Clin. Chem. 48:1296-1304).
Existing analysis methods have proven effective in detecting causes of monogenic diseases, but polygenic diseases are far more complex because varying multiples of interacting genes are responsible for the disease. Whole-genome pattern discovery finds correlated SNPs regardless of genomic distance and therefore can detect these gene interactions. This capability will greatly accelerate the exploitation of the genome for healthcare purposes.
The successful and efficient use of extensive genomic and proteomic data, as well as the successful and beneficial application of such data to the diagnosis, treatment, and therapeutic benefit of humans, as well as other animals, requires methods that can both identify and resolve patterns, conflicts and signals within such data. The present invention meets these needs.