1. Field of the Invention
The present invention relates generally to methods for identifying genetic features of a particular complex genetic trait, and more particularly to software-based methods which utilize statistical analyses for identifying one or more haplotype systems, alleles of which are useful for predicting a particular complex genetic trait.
2. Description of the Related Art
Human beings differ by at most 0.1% of the three billion letters of DNA present in the human genome. Though we are 99.9% identical in genetic sequence, it is the 0.1% that determines our uniqueness. Our individuality is apparent from visual inspection: almost anyone can recognize that people have different facial features, heights and colors, and that these features are, to some extent, heritable (e.g. sons and daughters tend to resemble their parents more than strangers do).
Few realize, however, that our individuality extends to our disease status, or an ability or inability to respond to and metabolize particular drugs. Drug-reaction traits are only one example of a complex genetic trait. Drugs are referred to as “xenobiotics” because they are chemical compounds that are not naturally found in the human body. Xenobiotic metabolism genes make proteins whose sole purpose is to detoxify foreign compounds present in the human body, and they evolved to allow humans to degrade and excrete harmful chemicals present in many foods (such as tannins and alkaloids from which many drugs are derived).
Because variability in drug metabolism enzyme sequences is known to explain most of the variability in drug response, it can be tested whether single nucleotide polymorphisms (SNPs) within the common xenobiotic metabolism genes are linked to variable drug response. To do this, thousands of SNP markers in hundreds of xenobiotic metabolism genes can be surveyed. From learning why some people respond well to a drug (i.e. they have certain SNPs) while others do not (i.e. they do not have the certain SNPs), classifier tests can be developed. Classifier tests include chemicals called “probes” that help determine the sequence of a person at the SNP locus. The classifier test can determine the suitability of the patient for a drug before it is ever prescribed. This is commonly referred to as a “personalized drug prescription”.
Detailed analysis of SNPs and haplotype systems is required prior to developing these tests. A “haplotype system” is a coined term in the present application which describes the set of diploid (2 per person) phase-known haplotype combinations of alleles for a given set of SNP loci in the world population. A haplotype may be viewed as a particular gene flavor. Just as there are many flavors of candy in a candy store, there are many gene flavors in the human population. “Phase” refers to a linear string of sequence along a chromosome. Humans have two copies of each chromosome, one derived from the mother and one derived from the father.
Assume that a person has, in their genome, the diploid sequences shown below in Text Illustration 1.
Position:  1  2  3  4  5  6  7  8  9 10 11 12 13 14
Person 1:  A  G  T  C  T  G  C  C  C  C  A  T  G  C
           A  C  T  C  T  G  C  C  C  A  A  T  G  G
Text Illustration 1. A Hypothetical String of DNA Sequence in a Hypothetical Person.
The “sense strand” is shown for both the paternal and maternal chromosome. This pair of sequences is called a diploid pair which represents a small segment of the three billion nucleotide letters that make up the individual's genome. Positions 2 and 10 indicate positions where people (and in fact this person) exhibit variability. Each position of variability is known as a SNP (single nucleotide polymorphism), and there are two of them shown in Text Illustration 1. Assume that positions 2 and 10 are the only SNPs in this region of the human genome. In this case, people are identical in genetic sequence at all other letters in the string. Thus, in the entire human race, only an A is observed at position 1, either a G or a C at position 2, only a T at position 3, and so on. By convention, person 1 is called a G/C heterozygote at SNP1 and a C/A heterozygote at SNP2.
Text Illustration 1 can be re-written as shown below in Text Illustration 2.
Person 1: GC
          CA
Text Illustration 2. A more convenient way to represent Person 1 than Text Illustration 1, where only the variable nucleotides are shown. The GC refers to the sequence of Person 1's maternal chromosome (reading the sense strand only) and the CA refers to the sequence of Person 1's paternal chromosome (reading the sense strand only).
In Text Illustration 2, the non-SNP nucleotide positions are omitted for convenience. Text Illustration 2 conveys every bit as much information about the sequence of Person 1 as does Text Illustration 1, because it is assumed in genetics that unwritten nucleotides are not variable. Although there are seven nucleotide letters in between SNP 1 (at position 2) and SNP 2 (at position 10), they are the same in everybody and are therefore already known de facto.
The genotype in Text Illustration 2 can be represented in yet another way, shown below in Text Illustration 3.
Person 1: GC/CA
Text Illustration 3. Haplotype pair as written by convention for Person 1.
The sequences GC and CA are called haplotypes. Person 1, as does everyone, has two haplotypes: one GC haplotype and one CA haplotype. Thus, this individual can be referred to as a GC/CA individual. One haplotype is derived from the mother (maternal) and the other is derived from the father (paternal). It is not known from this representation whether the paternal haplotype is the GC or the CA haplotype.
When a scientist reads genetic data from people, they generally only read the positions that are different from person to person. This process is called “genotyping”.
Although it would be very convenient to read that person 1 has a GC sequence in this region of their maternal chromosome and a CA sequence at their paternal chromosome, it is most practical technically to read the diploid pair of nucleotide letters at SNP 1 and the diploid pair of letters at SNP2 independently.
What a scientist reads, therefore, is shown below in Text Illustration 4.
Person 1: SNP1: (G/C) SNP2: (C/A)
Text Illustration 4. Genotype reading from Person 1. The person has a G and a C at SNP1, and a C and an A at SNP2.
From Text Illustrations 1, 2, and 3 it can be seen that the person is a GC/CA individual, as written by genetic convention. From the representation shown in Text Illustration 4, however, this is more difficult to identify since the SNP genotypes can be combined in several different ways. For example, it is not known whether the individual has the GC/CA haplotype pair or the GA/CC haplotype pair; all that is known is that the individual has a G and C at SNP1 and a C and A at SNP2. It is possible, however, to use well-known statistical methods to infer that the person indeed harbors the GC/CA haplotype pair rather than the GA/CC pair. Once this inference is made, Text Illustration 4 contains every bit as much information as do Text Illustrations 1 through 3. The genotypes shown in Text Illustration 4 are called “phase-unknown” genotypes because it is not clear (before inference) whether the SNP genotypes are components of GC/CA or GA/CC haplotype pairs. After the phase has been determined as GC and CA, the GC/CA pair is referred to as a “phase-known” genotype pair.
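The phase ambiguity described above can be made concrete with a short sketch. The function name `compatible_haplotype_pairs` and the tuple-based genotype encoding are illustrative assumptions, not part of any existing tool; the sketch only lists the haplotype pairs that are consistent with a set of phase-unknown genotypes and does not perform statistical phase inference.

```python
from itertools import product

def compatible_haplotype_pairs(genotypes):
    """Return every unordered haplotype pair consistent with unphased genotypes.

    `genotypes` holds one 2-tuple of alleles per SNP, e.g. [("G", "C"), ("C", "A")].
    (Hypothetical helper and encoding, used for illustration only.)
    """
    pairs = set()
    # For each SNP, choose which allele sits on the first chromosome;
    # the remaining allele is then assigned to the second chromosome.
    for choice in product((0, 1), repeat=len(genotypes)):
        first = "".join(g[c] for g, c in zip(genotypes, choice))
        second = "".join(g[1 - c] for g, c in zip(genotypes, choice))
        # Sort so that GC/CA and CA/GC count as the same pair.
        pairs.add(tuple(sorted((first, second))))
    return sorted(pairs)

# Person 1 from Text Illustration 4: G/C at SNP1 and C/A at SNP2.
print(compatible_haplotype_pairs([("G", "C"), ("C", "A")]))
# → [('CA', 'GC'), ('CC', 'GA')]
```

For a person who is heterozygous at k SNPs, this enumeration yields 2^(k-1) candidate pairs, which is why phase is usually inferred statistically rather than by inspection.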
By definition, haplotypes are comprised of phase-known genotype combinations. Haplotype pairs are comprised of pairs of phase-known genotype combinations. In the example given (Text Illustrations 1–4), there are 2 SNPs within a stretch of 14 nucleotide letters of DNA from a particular segment of the genome. In actual practice, however, genes are much longer than 14 nucleotide letters and a SNP is generally found once every few hundred nucleotide letters.
Regardless of its length in nucleotide letters, a gene containing 4 SNPs has six 2-locus haplotype systems, four 3-locus haplotype systems, and one 4-locus haplotype system. In FIG. 1, a gene 100 with a plurality of SNPs 102 is illustrated in a second example to help describe the concepts regarding a haplotype system. In this second example, gene 100 is one thousand nucleotides long and shown as a horizontal block. Arrows which extend from SNPs 102 to gene 100 identify four nucleotide positions within the gene sequence that may be different in different individuals. On the other hand, the remaining 996 nucleotides are identical in different individuals of the world population. Virtually all known SNP loci are bi-allelic, meaning that there are only two possible nucleotides found at that position in the population.
For the purposes of this example, the bi-allelic sites will be defined as SNP1=(A/T), SNP2=(G/A), SNP3=(C/T) and SNP4=(C/T). Given the laws of probability, this gene 100 has
$$\sum_{j=2}^{n} {}^{n}C_{j}, \quad \text{where} \quad {}^{n}C_{j} = \frac{n!}{j!\,(n-j)!}$$
possible n-locus haplotype systems, where n>1. One of these haplotype systems is:
SNP1:SNP2:SNP3:SNP4
which is a four-locus haplotype system. Given that SNP1=(A/T), SNP2=(G/A), SNP3=(C/T), and SNP4=(C/T), there are several constituent haplotypes that are part of this haplotype system. For example:
AGCC
AGTT
TGCC
etc.
Another haplotype system (a two-locus system) is:
SNP2:SNP4
Given that SNP1=(A/T), SNP2=(G/A), SNP3=(C/T) and SNP4=(C/T), there are several constituent haplotypes that are part of this particular haplotype system:
GC
GT
AC
AT
Each one of these haplotype systems has many different haplotype constituents that can be combined into an even larger number of haplotype pairs. For example, the SNP2:SNP4 haplotype system is represented within individuals (according to the laws of independent assortment) as the GC/GC pair, the GC/GT pair, the GC/AC pair, etc.
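As a check on the count of haplotype systems for the four-SNP gene 100, the following sketch enumerates every subset of two or more SNPs. The helper name `haplotype_systems` is an assumption made for illustration; the closed form 2^n − n − 1 follows from summing the binomial coefficients and removing the empty and single-SNP subsets.

```python
from itertools import combinations
from math import comb

def haplotype_systems(snps):
    """List every haplotype system (subset of two or more SNP loci)."""
    return [
        ":".join(subset)
        for size in range(2, len(snps) + 1)
        for subset in combinations(snps, size)
    ]

snps = ["SNP1", "SNP2", "SNP3", "SNP4"]
systems = haplotype_systems(snps)
n = len(snps)

print(len(systems))                               # → 11
print(sum(comb(n, j) for j in range(2, n + 1)))   # → 11 (6 + 4 + 1)
print(2 ** n - n - 1)                             # → 11 (closed form)
print("SNP2:SNP4" in systems)                     # → True
```

The same closed form gives 32,752 candidate systems for a gene with 15 SNPs, which illustrates how quickly exhaustive manual testing becomes impractical.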
Ignoring dispersive genetic forces such as recombination and mutation which have shaped the genetic structure of the population, the sequence at one SNP is assumed to be independent of the sequence at other SNPs. This means that there are several possible haplotypes in the population of human beings for an N-locus haplotype system. In fact, from probability theory there are 2^N possibilities. For example, for a four-locus haplotype system where position 1 is A/T, position 2 is G/A, position 3 is C/T, and position 4 is C/T, there are 2^4 = 16 possibilities:
AGCC, AGCT, AGTC, AGTT, AACC, AACT, AATC, AATT, TGCC, TGCT, TGTC, TGTT, TACC, TACT, TATC, TATT
In actual practice, however, there are usually fewer haplotypes in the population than one would expect because systematic genetic forces (such as population bottlenecks, random genetic drift and selection) have contributed to shape the structure of our population. This complication is important for the process of haplotype inference, but will be ignored as it does not significantly impact the present analysis.
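The sixteen theoretical haplotypes of the four-locus system can be generated mechanically. This is a minimal sketch assuming the bi-allelic sites defined earlier (A/T, G/A, C/T, C/T) and independence between loci; the function name `all_haplotypes` is illustrative.

```python
from itertools import product

def all_haplotypes(alleles_per_locus):
    """Enumerate every allele combination across independent bi-allelic loci."""
    return ["".join(combo) for combo in product(*alleles_per_locus)]

# Alleles at SNP1..SNP4 as defined in the example.
loci = [("A", "T"), ("G", "A"), ("C", "T"), ("C", "T")]
haplotypes = all_haplotypes(loci)

print(len(haplotypes))   # → 16
print(haplotypes[:4])    # → ['AGCC', 'AGCT', 'AGTC', 'AGTT']
```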
As described earlier, a given individual has both a maternal and paternal copy of each chromosome to form a diploid pair. The genotype of any human being, with respect to the haplotype system, is written as a pair. A person written as AGCC/TATT, for example, contains one haplotype derived from the father and one from the mother. Since there are 16 possible haplotypes, there are n + [n!/(r!×(n−r)!)] (where n = the number of haplotypes, and r = 2 for pairs) possible diploid haplotype combinations in the human population. Thus, from 4 SNPs, we see how there can be 16 + 120 = 136 types of people in the population; some are AGCC/AGCC, others are AGCC/AGCT, others AGCC/AGTT, and so on. When the number of SNPs is larger than 4, the numbers quickly become unmanageable. For example, if there are 8 SNPs in a gene, there are 256 possible haplotypes and 32,896 possible pairs of haplotypes in the population.
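Evaluating the expression n + n!/(r!(n−r)!) directly for n = 16 and r = 2 gives 16 homozygous pairs plus 120 heterozygous pairs, or 136 diploid combinations. The sketch below confirms this by brute-force enumeration and repeats the calculation for 8 SNPs; `diploid_pair_count` is a hypothetical helper name.

```python
from itertools import combinations_with_replacement, product
from math import comb

def diploid_pair_count(n_haplotypes):
    """n homozygous pairs plus C(n, 2) heterozygous pairs."""
    return n_haplotypes + comb(n_haplotypes, 2)

# Brute-force check for the four-SNP example (16 haplotypes).
haplotypes = ["".join(h) for h in product("AT", "GA", "CT", "CT")]
pairs = list(combinations_with_replacement(haplotypes, 2))

print(len(pairs))                    # → 136
print(diploid_pair_count(16))        # → 136
print(diploid_pair_count(2 ** 8))    # → 32896 for an 8-SNP gene
```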
Using conventional analysis, scientists can sometimes determine whether a given haplotype system is useful for predicting disease status by determining whether trait-affected and non-affected individuals have different haplotypes for a given haplotype system. For example, consider a haplotype system with the possible values GC, GA, CA, CC. If a scientist notes that people who respond well to an anti-cancer drug always have the GC/GC haplotype pair, this scientist has identified the GA, CA and CC haplotypes as risk markers for non-response to the drug. However, this is a relatively simple haplotype system having only four constituents.
Now consider a ten-SNP haplotype system where one SNP is the cause of a non-response trait. Referring to FIG. 2, haplotype pair data 200 from four people for a ten-locus haplotype system in a region of the genome relevant to an anti-cancer drug response are shown. Each of these positions illustrates a bi-allelic variant within a larger block of DNA sequence. The nucleotide letters that are the same from person to person are omitted by convention. The letters in column 2 for persons 2 and 4 denote sequence variants 202 that cause a non-response to the anti-cancer drug. Response status is shown in the last column.
The four person group of data shown in FIG. 2 may be representative of a larger group of patients. Conventionally, a scientist would first obtain genotypes for each patient at these ten positions and infer haplotypes for these persons as shown in FIG. 2. The scientist would then segregate responders from non-responders and measure whether there were statistically significant differences in haplotype constitution between the two groups. In the example of FIG. 2, persons 2 and 4 would be in the responder group and persons 1 and 3 would be in the non-responder group. Visually comparing the two groups, it is apparent that only position 2 sequences are distinctive between them: responders have 2 G's at position 2 and non-responders have 2 C's, while the sequence for the other positions is not different between the groups.
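The group comparison described above can be sketched as follows. The four haplotype pairs are hypothetical stand-ins, not the data of FIG. 2, and the function names are illustrative. A position is reported only when no genotype observed in one group also appears in the other, which is a simplification of a proper statistical test.

```python
def genotype_at(person, index):
    """Unordered pair of alleles a person carries at one position."""
    hap_a, hap_b, _ = person
    return tuple(sorted((hap_a[index], hap_b[index])))

def distinctive_positions(people, group_a, group_b):
    """Return 1-based positions whose genotypes never overlap between groups."""
    length = len(people[0][0])
    hits = []
    for i in range(length):
        seen_a = {genotype_at(p, i) for p in people if p[2] == group_a}
        seen_b = {genotype_at(p, i) for p in people if p[2] == group_b}
        if seen_a and seen_b and seen_a.isdisjoint(seen_b):
            hits.append(i + 1)
    return hits

# Hypothetical ten-locus haplotype pairs (NOT the FIG. 2 data).
people = [
    ("ACTTGCATGC", "ACTAGCATGC", "non-responder"),
    ("AGTTGCATGC", "AGTAGCATGC", "responder"),
    ("ACTTGCATGC", "ACTAGCATGC", "non-responder"),
    ("AGTTGCATGC", "AGTAGCATGC", "responder"),
]

print(distinctive_positions(people, "responder", "non-responder"))  # → [2]
```

In this toy data every person is heterozygous at position 4, yet that position is not reported because the same T/A genotype occurs in both groups.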
Under conventional analysis, however, most genetics researchers do not work at the level of the gene haplotype. About three quarters of researchers who study genetic variation focus on individual SNPs and attempt to draw associations between SNP genotypes and traits. This is called a simple genetics approach, with which there are two problems. First, these studies generally suffer from lack of statistical power to detect associations, a power that is imparted to haplotype studies by systematic genetic forces that have shaped the genetic structure of our modern day population. Second, they are inappropriate for solving complex genetic issues. Because most human traits are complex functions of intragenic (sets of SNPs and ploidy issues) and intergenic (i.e. multiple gene-gene interactions) factors, this is a serious limitation.
On the other hand, about one quarter of geneticists perform their work at higher levels of complexity. These geneticists consider genetic determinants at the level of the haplotype, rather than the SNP, and infer phase using computational methods or directly through biochemical means. Regardless of how phase is determined, haplotype systems are usually defined based on convenience. If a gene has 30 SNPs distributed throughout its sequence, for example, a researcher would likely select a small number of these SNPs as components of a haplotype system for study. This selection process is sometimes based on whether the SNP causes a coding (amino acid) change in the expressed protein, or rather based on the fact that the chosen SNPs cover the gene sequence well from 5′ to 3′ end. The problem with this approach is that it is somewhat arbitrary and leaves most of the SNPs in the gene untested even though they may be linked, within the context of a specific combination, to the trait under study.
Most human genes have about 30–50 SNPs. Thus, if variants for such a gene were the cause of the non-response trait, and this variability could be ascribed to one or two SNPs, most of the haplotype systems chosen for study would be worthless for predicting the trait (given the laws of probability). In other words, the alleles from haplotypes, comprised of those SNPs, would not be statistically associated with the trait. (The latter point is slightly complicated by a concept called linkage disequilibrium, but it does not significantly impact the argument presented.) This follows from the observation that there are a large number of possible haplotypes incorporating these SNPs (i.e. 2^30 and 2^50 for 30- and 50-SNP haplotype systems, respectively) and an even larger number of haplotype pairs in the human population for each gene. The reason why single-SNP analysis should not be relied upon is that a SNP allele may be more rigorously associated with a trait within the context of a combination of other SNPs than on its own (which is frequently found to be the case), due to the genetic structure of the population.
What this means for scientists trying to solve vexing disease and drug-response traits is that there is a large amount of data to sift through in drawing statistical associations between haplotypes, or haplotype pairs, and commercially relevant human traits. For most human genes, the number of haplotype systems that could possibly be invoked to explain variable traits in the human population is far larger than the number that actually explain them. This poses a tremendous statistical barrier for current day genetic research.
As is apparent, a significant problem with conventional methods is that there is no logic or computer software that exists to predict which sets of SNPs define the optimal haplotype system for understanding the trait. In some cases, a short haplotype system may prove optimal. In other cases, a long haplotype system may prove optimal. In either case, there is no way to predict which will be the case.
A long haplotype system may best explain the variability in a certain trait due to the complexity of the trait. For example, assume a trait is associated with and caused by the coincidence of 4 minor SNP variants such that a haplotype with minor alleles at (at least) any two of these four SNP positions is required in order for the trait to be expressed, and only people with the haplotype comprised of at least 2 minor alleles at these SNP locations reveal the trait. Also assume that research scientists are trying to understand the genetics of this trait. The scientists know there are 15 SNPs in this gene, but due to the large number of possible haplotype systems they have randomly chosen only a few for analysis.
Further assume that one of these chosen haplotype systems has only 2 of the 4 SNPs. When the trait-affected and non-affected groups are partitioned, and the haplotype constitution of each group is visually inspected, the scientists would indeed notice that minor alleles for these 2 SNPs were found only in the affected group. Also, there would be many affected individuals who did not have minor alleles at these 2 SNP locations, or had minor alleles at only one of the 2 SNP locations. In fact, because it is known that at least 2 minor alleles at the 4 SNP locations are required for the affected status, these individuals must have minor alleles at one or both of the other 2 SNPs that were not part of the haplotype system. In this case, a longer, more complicated haplotype system would be optimal for describing the relationship between the gene and the trait.
On the other hand, a short haplotype system may best explain the variability of certain traits for two main reasons. First, short haplotype systems have fewer possible haplotypes and fewer diploid haplotype combinations than do long haplotype systems. Geneticists do not have the luxury of genotyping whole populations and usually rely on cohorts that are representative of the population. For certain traits, these cohorts may be limited in size for several reasons. When studied with long complicated haplotype systems, these cohorts produce numerous genetic classes of sample sizes that are too small to prove that they are related to the trait. It is well known to those skilled in the art of statistical genetic analysis that, given a constant study sample size, the larger the number of possible classes, the lower the sample size within each class. Small sample sizes in haplotype classes of complicated haplotype systems could conceal a statistical relationship even if the haplotype system is the optimal system for describing the relationship of the gene with the trait. Thus, in genetics, the “statistical power” of long, complicated haplotype systems can be lower than that of smaller ones.
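The dilution of sample size across genetic classes can be illustrated numerically. The sketch below assumes a hypothetical cohort of 500 subjects and, unrealistically, that every diploid haplotype class is equally frequent; real populations carry far fewer classes, so the figures are only a worst-case illustration. The function names are assumptions made for this example.

```python
def diploid_classes(n_snps):
    """Number of possible haplotype pairs for a system of bi-allelic SNPs."""
    n_haplotypes = 2 ** n_snps
    return n_haplotypes * (n_haplotypes + 1) // 2

def mean_subjects_per_class(cohort_size, n_snps):
    """Average class size if subjects were spread evenly over all classes."""
    return cohort_size / diploid_classes(n_snps)

# Hypothetical cohort of 500 subjects.
for n_snps in (2, 4, 8):
    print(n_snps, diploid_classes(n_snps),
          round(mean_subjects_per_class(500, n_snps), 3))
# → 2 10 50.0
# → 4 136 3.676
# → 8 32896 0.015
```

With the same cohort, a two-SNP system averages 50 subjects per class, while an eight-SNP system averages well under one, which is the loss of statistical power described above.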
Secondly, short haplotype systems can more concisely explain trait variance when a specific sub-region of a gene is relevant for the trait. In this case, if a small domain of a gene causes a particular trait, a small haplotype system comprised of SNPs found within this domain would be expected to genetically define the trait better than a larger, more complicated system incorporating these same SNPs. This is because SNPs found in other regions are not relevant for the trait, and serve to only complicate the analysis. In many cases, variance among these irrelevant SNPs can statistically conceal the associations of the relevant ones.
Some geneticists work strictly within the context of “whole gene” haplotypes. A common argument for this approach is that no functionally relevant SNPs can be missed. Because of both the low statistical sample size within each genotype class and the fact that irrelevant SNPs can conceal the statistical significance of relevant SNPs, this method is far from optimal. Other geneticists select SNPs that span a gene from end to end and attempt to identify functionally relevant haplotypes using an approach that tracks unseen variants embedded in the structure of a haplotype cladogram. A haplotype cladogram is an evolutionary tree describing how the haplotypes relate to one another in sequence, and over evolutionary time. Although this approach sometimes provides good results, it performs sub-optimally in cases where statistical sample size is a consideration as well as in cases where the biology of the trait is a function of a small domain within the gene. It is also subject to statistical limitations imposed by the specific SNP loci selected for analysis.
Thus, identifying the set of SNPs that most efficiently explain the variance of a trait is a crucial, but non-trivial task for developing complex genetics classifiers. Haplotype systems are “genetic features” in that they can be used, to an extent, to distinguish among individuals and groups of individuals. The present application coins this term to represent haplotype systems as component pieces of a given complex genetics puzzle (i.e., a typical human trait). The best, most informative haplotype systems are crucial for any effort to identify genetic features of adequate predictive value for use in a clinically useful classifier test. Complex genetic solutions developed from sub-optimal haplotype systems (i.e. SNP combinations that explain less of the trait variance than contributed by the gene within which they are found) are restricted in utility and accuracy by the limitations of the constituent haplotype systems.
Thus, there are important reasons to find the optimal haplotype system that explains a trait for developing a classifier test. This optimal haplotype system may be a short one for certain traits and genes, but a long one for others. A haplotype system with 16 SNPs covering an entire gene may be the optimal system for a given trait and a given gene, for example, but a short 2 SNP haplotype system may be the optimal system for describing the relationship between this same gene and a different trait. In fact, there are no consistent rules a scientist can use to predict what sort of haplotype system should be selected in any given situation. The identification of the optimal haplotype system is in some ways a matter of trial and error, but given the large number of possible haplotypes for even short haplotype systems, it is not a task which should solely involve human analysis and inspection.
The difficulty is that computational tools for this process do not currently exist, and it is this need that is addressed by the inventive methods and apparatus described in the present application. On the other hand, there are various existing software applications that could serve as individual components of such a pipeline system. For example, consider the inventive “feature extraction” method. Some existing programs are designed for calculating whether alleles of a given haplotype system are useful for resolving between trait classes. For example, see Raymond, M. and F. Rousset, “An exact test for population differentiation,” 1995, Evolution 49(6), 1280–1283. However, there are no software applications which incorporate such a method into a systematic feature extraction process.
Other conventional software applications make the above-described test somewhat more convenient for the geneticist. For example, the Arlequin™ software program is one such program. These applications, however, require numerous manual manipulations. For example, the Arlequin™ program requires the user to retrieve SNP data for a given SNP combination for inspection and to create a text input file containing the genotype and phenotype data relevant for the inspection. It takes about thirty minutes, for example, for a scientist skilled in the art to retrieve this data and create the file. When the “Exact test” of the Arlequin™ program is completed, the user would have to create a second file for the next SNP combination, and so on.
Given that patients are genotyped at several tens of SNPs per gene, tens of thousands of possible SNP combinations need to be tested in order to assure that the optimal combination has been identified (assuming that a useful system for that gene does indeed exist). This would require many months of the scientist's time. Even still, this work would only address a single gene. When additional genes are added to the analysis, the process would take an average scientist years to perform using currently available software tools and algorithms. What is needed is a software pipeline system that takes care of each of these manipulations automatically. Rather than forcing a scientist to spend years creating text files and logging results, a software system is needed which performs such processing in minutes. This system should integrate a combination of statistical tests, algorithms, and software applications into an automated informatics platform.
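The kind of automation called for above can be sketched as a loop that scores every haplotype system without manual file preparation. This is not the inventive pipeline and not the exact test of Raymond and Rousset: a plain Pearson chi-square statistic is swapped in as a stand-in scoring function, the subject data are hypothetical, and all function names are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

def chi_square(table):
    """Pearson chi-square statistic for counts keyed by (row, column)."""
    rows, cols = Counter(), Counter()
    for (r, c), n in table.items():
        rows[r] += n
        cols[c] += n
    total = sum(rows.values())
    stat = 0.0
    for r in rows:
        for c in cols:
            expected = rows[r] * cols[c] / total
            stat += (table.get((r, c), 0) - expected) ** 2 / expected
    return stat

def scan_haplotype_systems(subjects, n_snps):
    """Score every system of two or more SNPs; highest statistic first.

    `subjects` holds (haplotype_a, haplotype_b, trait) tuples with phased,
    equal-length haplotype strings.
    """
    results = []
    for size in range(2, n_snps + 1):
        for loci in combinations(range(n_snps), size):
            table = Counter()
            for hap_a, hap_b, trait in subjects:
                sub_a = "".join(hap_a[i] for i in loci)
                sub_b = "".join(hap_b[i] for i in loci)
                pair = "/".join(sorted((sub_a, sub_b)))
                table[(pair, trait)] += 1
            results.append((chi_square(table), loci))
    return sorted(results, reverse=True)

# Hypothetical phased data for a four-SNP gene (illustration only).
subjects = [
    ("AGCC", "AGCT", "responder"),
    ("AGCC", "TGCC", "responder"),
    ("AACC", "AACT", "non-responder"),
    ("TACC", "AACC", "non-responder"),
]
results = scan_haplotype_systems(subjects, 4)
scores = {loci: stat for stat, loci in results}
print(len(results))      # → 11 systems scored
print(scores[(2, 3)])    # → 0.0 (SNP3:SNP4 carries no signal here)
```

A raw statistic is used only to keep the sketch short. With so few subjects, any system that places each person in a unique class scores highly, so a real pipeline would need an exact or permutation-based significance test and a correction for the number of systems examined.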
Other components of the software system have ideological and practical counterparts in existing methodologies. One or more software-based statistical tests may be used to evaluate a haplotype system as a genetic feature. Ideas for one of these tests were first propounded by Raymond and Rousset. See, e.g., Raymond, M. and F. Rousset, “An Exact Test For Population Differentiation”, Evolution 49(6), 1280–1283, 1995. As we have described earlier, however, if a scientist desired to use Raymond and Rousset's algorithm to do the type of work we have described, it would take them years to do a job that the inventive platform system would take only days to do. Ideas for another test, the F-statistic test, were first propounded by Fisher. See Fisher, R. A., “The Logic of Inductive Inference,” Journal of the Royal Statistical Society 98:39–54, 1935.
The modeling algorithms and software applications that function downstream of the haplotype feature extraction system are also novel applications of existing methods for genetic analysis. Correspondence analysis for complex genetic analysis is believed to be a novel and non-obvious methodology, although correspondence analysis has previously been used by sociologists to model sociological variables and by mechanical engineers to model physical variables. This is also true for the linear and quadratic techniques, as well as the classification tree techniques, for complex genetics analysis. The process of drawing haplotype cladograms (part of a geometric modeling method) was introduced by Templeton et al., 1995. Although methods for drawing these haplotype cladograms have been previously described, it is believed that a method for encoding and plotting haplotypes in geometrical space, based on their position within a haplotype cladogram, for the extraction of complex genetics information, is also novel and non-obvious.
Other relevant publications include Shou M, Lu, T, Drausz, K., Sai, Y., Yang, T., Korzekwa, K R., Gonzalez, F., Gelboin, H., 2000, “Use of inhibitory monoclonal antibodies to assess the contribution of cytochromes P450 to human drug metabolism,” Eur J Pharmacol 394(2–3):199–209; and Dai, D., Zeldin, D C, Blaisdell, J., Chanas, B., Coulter, S., Ghanayem, B., Goldstein, J., 2001, “Polymorphisms in human CYP2C8 decrease metabolism of the anticancer drug paclitaxel and arachidonic acid,” Pharmacogenetics 11(7):597–607.
Accordingly, what are needed are methods and apparatus for quickly, efficiently, and accurately identifying associations between genetic features (e.g. haplotype systems) and genetic traits of individuals.