The purpose of genetics classification is to be able to accurately classify individuals into one of a plurality of trait classes (e.g. brown, blue, green, etc.) associated with a particular genetic trait (e.g. eye color). The present application relates to the use of complex genetics analysis and software to create or construct accurate genetics classification tests. Such classification tests have valuable applications, especially in the fields of personalized medicine and criminal forensics.
Human beings differ only by up to 0.1% of the three billion letters of DNA present in the human genome. Though we are 99.9% identical in genetic sequence, it is the 0.1% that determines our uniqueness. Our individuality is apparent from visual inspection—almost anyone can recognize that people have different facial features, heights and colors, and that these features are, to some extent, heritable (e.g. sons and daughters tend to resemble their parents more than strangers do).
Few realize, however, that our individuality extends to an ability or inability to respond to and metabolize particular drugs. Drugs are referred to as “xenobiotics” because they are chemical compounds that are not naturally found in the human body. Xenobiotic metabolism genes make proteins whose sole purpose is to detoxify foreign compounds present in the human body, and they evolved to allow humans to degrade and excrete harmful chemicals present in many foods (such as tannins and alkaloids from which many drugs are derived).
Because variability in drug metabolism enzyme sequences is known to explain most of the variability in drug response, it can be tested whether single nucleotide polymorphisms (SNPs) within the common xenobiotic metabolism genes are linked to variable drug response. To do this, thousands of SNP markers in hundreds of xenobiotic metabolism genes can be surveyed. From learning why some people respond well to a drug (i.e. they have certain SNPs) while others do not (i.e. they do not have the certain SNPs), classifier tests can be developed. Classifier tests include chemicals called “probes” that help determine the sequence of a person at the SNP letters. The classifier test can determine the suitability of the patient for a drug before it is ever prescribed. This is commonly referred to as a “personalized drug prescription”.
Detailed analyses of SNPs and haplotype systems are required prior to developing these tests. A “haplotype system” is a coined term in the present application which describes the set of diploid (2 per person) phase-known haplotype combinations of alleles for a given set of SNP loci. A haplotype may be viewed as a particular gene flavor. Just as there are many flavors of candy in a candy store, there are many gene flavors in the human population. “Phase” refers to a linear string of sequence along a chromosome. Humans have two copies of each chromosome, one derived from the mother and one derived from the father.
Assume that a person has, in their genome, the diploid sequences shown below in Text Illustration 1.
TEXT ILLUSTRATION 1
A hypothetical string of DNA sequence in a hypothetical person.
Position: 1 2 3 4 5 6 7 8 9 10 11 12 13 14
Person 1: A G T C T G C C C  C  A  T  G  G
          A C T C T G C C C  A  A  T  G  G
The “sense strand” is shown for both the paternal and maternal chromosome. This pair of sequences is called a diploid pair which represents a small segment of the three billion nucleotide letters that make up the individual's genome. Positions 2 and 10 indicate positions where people (and in fact this person) exhibit variability. Each position of variability is known as a SNP (single nucleotide polymorphism), and there are two of them shown in Text Illustration 1. Assume that positions 2 and 10 are the only SNPs in this region of the human genome. In this case, people are identical in genetic sequence at all other letters in the string. Thus, in the entire human race, only an A is observed at position 1, either a G or a C at position 2, only a T at position 3, and so on. By convention, person 1 is called a G/C heterozygote at SNP1 and a C/A heterozygote at SNP2.
Text Illustration 1 can be re-written as shown below in Text Illustration 2.
TEXT ILLUSTRATION 2
A more convenient way to represent Person 1 than Text Illustration 1, where only the variable nucleotides are shown. The GC refers to the sequence of Person 1's maternal chromosome (reading the sense strand only) and the CA refers to the sequence of Person 1's paternal chromosome (reading the sense strand only).
Person 1: GC CA
In Text Illustration 2, the non-SNP nucleotide positions are omitted for convenience. Text Illustration 2 conveys every bit as much information about the sequence of Person 1 as does Text Illustration 1, because it is assumed in genetics that unwritten nucleotides are not variable. Although there are seven nucleotide letters in between SNP1 (at position 2) and SNP2 (at position 10), they are the same in everybody and are therefore already known de facto by reference to the consensus human genome sequence for the region represented by the sequence.
The genotype in Text Illustration 2 can be represented in yet another way, as shown below in Text Illustration 3.
TEXT ILLUSTRATION 3
Haplotype pair as written by convention for Person 1.
Person 1: GC/CA
The sequences GC and CA are called haplotypes. Person 1, as does everyone, has two haplotypes: one GC haplotype and one CA haplotype. Thus, this individual can be referred to as a GC/CA individual. One haplotype is derived from the mother (maternal) and the other is derived from the father (paternal). It is not known from this representation whether the paternal haplotype is the GC or the CA haplotype.
When scientists read genetic data from people, they generally read only the positions that differ from person to person. This process is called “genotyping”. Although it would be very convenient to read that person 1 has a GC sequence in this region of their maternal chromosome and a CA sequence on their paternal chromosome, it is technically most practical to read the diploid pair of nucleotide letters at SNP1 and the diploid pair of letters at SNP2 independently.
What a scientist reads, therefore, is shown below in Text Illustration 4.
TEXT ILLUSTRATION 4
Genotype reading from Person 1. The person has a G and a C at SNP1, and a C and an A at SNP2.
Person 1: SNP1: (G/C) SNP2: (C/A)
From Text Illustrations 1, 2, and 3 it can be seen that the person is a GC/CA individual, as written by genetic convention. From the representation shown in Text Illustration 4, however, this is more difficult to identify since the SNP genotypes can be combined in several different ways. For example, it is not known whether the individual has the GC/CA haplotype pair or the GA/CC haplotype pair; all that is known is that the individual has a G and C at SNP1 and a C and A at SNP2. It is possible, however, to use well-known statistical methods to infer that the person indeed harbors the GC/CA haplotype pair rather than the GA/CC pair (e.g. Stephens, M., Smith, N. and P. Donnelly. (2001). A new statistical method for haplotype reconstruction from population data. Am. J. Hum. Genet. 68:978-989.). With phase so inferred, Text Illustration 4 contains every bit as much information as do Text Illustrations 1 through 3. The genotypes shown in Text Illustration 4 are called “phase-unknown” genotypes because it is not clear (before inference) whether the SNP genotypes are components of GC/CA or GA/CC haplotype pairs. After the phase has been determined as GC and CA, each haplotype is referred to as a “phase-known” genotype pair.
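The ambiguity described above can be made concrete with a short sketch. It only enumerates the haplotype pairs that are consistent with a set of phase-unknown SNP genotypes; it does not perform the statistical phase inference of Stephens et al. The function name and input layout are assumptions made for illustration.

```python
from itertools import product

def compatible_haplotype_pairs(genotypes):
    """Return every unordered haplotype pair consistent with unphased genotypes.

    `genotypes` holds one 2-tuple of alleles per SNP, e.g. [("G", "C"), ("C", "A")].
    """
    pairs = set()
    # At each SNP, choose which allele sits on the first chromosome;
    # the remaining allele then sits on the second chromosome.
    for choice in product((0, 1), repeat=len(genotypes)):
        first = "".join(g[c] for g, c in zip(genotypes, choice))
        second = "".join(g[1 - c] for g, c in zip(genotypes, choice))
        # Sort within the pair so that GC/CA and CA/GC are counted once.
        pairs.add(tuple(sorted((first, second))))
    return sorted(pairs)

# Person 1 from Text Illustration 4: G/C at SNP1 and C/A at SNP2.
person_1 = [("G", "C"), ("C", "A")]
for a, b in compatible_haplotype_pairs(person_1):
    print(f"{a}/{b}")
```

For Person 1 the sketch lists the two candidates discussed above, GC/CA and GA/CC, each printed with its two haplotypes in alphabetical order. Choosing between them is the job of a phasing method and is outside the scope of this example.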
By definition, haplotypes are composed of phase-known genotype combinations. Haplotype pairs are composed of pairs of phase-known genotype combinations. In the example given (Text Illustrations 1-4), there are 2 SNPs within a stretch of 14 nucleotide letters of DNA from a particular segment of the genome. In actual practice, however, genes are much longer than 14 nucleotide letters and a SNP is generally found once every few hundred nucleotide letters.
Regardless of its length in nucleotide letters, a gene containing 4 SNPs has a large number of 2-locus haplotype systems, a smaller number of 3-locus haplotype systems, and one 4 locus haplotype system. In FIG. 1, a gene 100 with a plurality of SNPs 102 is illustrated in a second example to help describe the concepts regarding a haplotype system. In this second example, gene 100 is one thousand nucleotides long and shown as a horizontal block. Arrows which extend from SNPs 102 to gene 100 identify four nucleotide positions within the gene sequence that may be different in different individuals. On the other hand, the remaining 996 nucleotides are identical in different individuals of the world population. Virtually all known SNP loci are bi-allelic, meaning that there are only two possible nucleotides found at that position in the population.
For the purposes of this example, the bi-allelic sites will be defined as SNP1=(A/T), SNP2=(G/A), SNP3=(C/T) and SNP4=(C/T). Given the laws of probability, this gene 100 has 2^4 = 16 possible haplotype systems. One of these haplotype systems is:
SNP1:SNP2:SNP3:SNP4
which is a four-locus haplotype system. Given that SNP1=(A/T), SNP2=(G/A), SNP3=(C/T), and SNP4=(C/T), there are several constituent haplotypes that are part of this haplotype system. For example:
AGCC AGTT TGCC etc.
Another haplotype system (a two-locus system) is:
SNP2:SNP4
Given that SNP1=(A/T), SNP2=(G/A), SNP3=(C/T) and SNP4=(C/T), there are several constituent haplotypes that are part of this particular haplotype system:
GC GT AC AT
Each one of these haplotype systems has many different haplotype constituents that can be combined into an even larger number of haplotype pairs. For example, the SNP2:SNP4 haplotype system contains the GC/GC pair, the GC/GT pair, the GC/AC pair, etc.
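The relationship between SNPs, haplotype systems and constituent haplotypes can be sketched in a few lines. The sketch treats a haplotype system as a subset of the SNP loci, which is how the examples above use the term; the function names are illustrative assumptions. By default it lists only systems of two or more loci, so it returns 11 of the 2^4 = 16 subsets of four SNPs (the remainder being the four single-SNP subsets and the empty set).

```python
from itertools import combinations, product

# Bi-allelic sites as defined in the example above.
SNPS = {"SNP1": "AT", "SNP2": "GA", "SNP3": "CT", "SNP4": "CT"}

def haplotype_systems(snps, min_loci=2):
    """List every haplotype system (subset of loci) with at least `min_loci` loci."""
    names = list(snps)
    return [combo
            for k in range(min_loci, len(names) + 1)
            for combo in combinations(names, k)]

def constituent_haplotypes(snps, system):
    """List every haplotype that can occur within the given haplotype system."""
    return ["".join(alleles)
            for alleles in product(*(snps[name] for name in system))]

systems = haplotype_systems(SNPS)
print(len(systems))                                    # 6 two-locus + 4 three-locus + 1 four-locus
print(constituent_haplotypes(SNPS, ("SNP2", "SNP4")))  # the GC, GT, AC, AT constituents
```

Passing `min_loci=0` recovers all 16 subsets, which matches the 2^4 count in the text.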
Because dispersive genetic forces such as recombination have shaped the genetic structure of the population, the sequence at one SNP is assumed to be independent of the sequence at other SNPs as a base assumption. This means that there are several possible haplotypes in the population of human beings for an N-locus haplotype system. In fact, from probability theory there are 2^N possibilities. For example, for a four-locus haplotype system where position 1 is A/T, position 2 is G/A, position 3 is C/T, and position 4 is C/T, there are 2^4 = 16 possibilities:
AGCC, AGCT, AGTC, AGTT, AACC, AACT, AATC, AATT, TGCC, TGCT, TGTC, TGTT, TACC, TACT, TATC, TATT
In actual practice, however, there are usually fewer haplotypes in the population than one would expect because systematic genetic forces (such as population bottlenecks, random genetic drift and selection) have also contributed to shape the structure of our population. This complication will be ignored as it does not significantly impact the present analysis.
As described earlier, a given individual has both a maternal and paternal copy of each chromosome to form a diploid pair. The genotype of any human being, with respect to the haplotype system, is written as a pair. A person written as AGCC/TATT, for example, contains one haplotype derived from the father and one from the mother. Since there are n = 16 possible haplotypes, and a pair may contain the same haplotype twice, there are n(n+1)/2 = (16 × 17)/2 = 136 possible diploid haplotype combinations in the human population. Thus, from 4 SNPs, we see how there can be 136 types of people in the population; some are AGCC/AGCC, others are AGCC/AGCT, others AGCC/AGTT, and so on. When the number of SNPs is larger than 4, the numbers quickly become unmanageable. For example, if there are 8 SNPs in a gene, there are 256 possible haplotypes and tens of thousands of possible pairs of haplotypes in the population.
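The growth in the number of diploid combinations can be checked by direct enumeration. The sketch below assumes one particular counting convention: pairs are unordered (AGCC/TATT is the same person type as TATT/AGCC) and homozygous pairs such as AGCC/AGCC are included. Under that assumption, n haplotypes give n(n + 1)/2 pairs. The function names are illustrative.

```python
from itertools import combinations_with_replacement, product

def all_haplotypes(sites):
    """Enumerate every haplotype for a list of bi-allelic sites such as ["AT", "GA"]."""
    return ["".join(h) for h in product(*sites)]

def diploid_pairs(haplotypes):
    """Enumerate unordered haplotype pairs, homozygous pairs included."""
    return list(combinations_with_replacement(haplotypes, 2))

def pair_count(n_snps):
    """Closed-form count of diploid pairs for n_snps independent bi-allelic SNPs."""
    n = 2 ** n_snps          # number of possible haplotypes
    return n * (n + 1) // 2  # unordered pairs with repetition

sites = ["AT", "GA", "CT", "CT"]  # the four-locus example from the text
haps = all_haplotypes(sites)
pairs = diploid_pairs(haps)
print(len(haps), len(pairs))

# Enumeration and the closed form should agree for the four-locus case.
assert len(pairs) == pair_count(4)
for n_snps in (4, 8, 12):
    print(n_snps, 2 ** n_snps, pair_count(n_snps))
```

The loop at the end shows how quickly the pair count grows as SNPs are added, which is the practical obstacle the paragraph describes.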
Using conventional analysis, scientists can sometimes determine whether a given haplotype system is useful for predicting disease status by determining whether trait-affected and non-affected individuals have different haplotypes for a given haplotype system. For example, consider a haplotype system with the possible values GC, GA, CA, CC. If a scientist notes that people who respond well to an anti-cancer drug always have the GC/GC haplotype pair, this scientist has identified the GA, CA and CC haplotypes as risk markers for non-response to the drug. However, this is a relatively simple haplotype system having only four constituents.
Now consider a ten SNP haplotype system where one SNP is the cause of a non-response trait. Referring to FIG. 2, haplotype pair data 200 from four people for an eight haplotype system in a region of the genome relevant to an anti-cancer drug response are shown. Each of these positions illustrates a bi-allelic variant within a larger block of DNA sequence. The nucleotide letters that are the same from person to person are omitted by convention. The letters in column 2 for persons 1 and 3 denote sequence variants 202 (C/C) that cause a non-response to the anti-cancer drug. Response status is shown in the last column.
The four person group of data shown in FIG. 2 may be representative of a larger group of patients. Conventionally, a scientist would first obtain genotypes for each patient at these ten positions and infer haplotypes for these persons as shown in FIG. 2. The scientist would then segregate responders from non-responders and measure whether there were statistically significant differences in haplotype constitution between the two groups. In the example of FIG. 2, persons 2 and 4 would be in the responder group and persons 1 and 3 would be in the non-responder group. Visually comparing the two groups, it is apparent that only position 2 sequences are distinctive between them: non-responders have 2 C's at position 2 and responders have another combination, such as G/G, while the sequence for the other positions is not different between the groups.
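The visual comparison described above can be sketched as code. The data below are invented for illustration and are not the data of FIG. 2; they use five positions rather than ten to keep the example short. The sketch flags positions whose genotypes never overlap between the two groups, which is a deliberately crude stand-in for the statistical test of haplotype differences that a real study would use.

```python
# Hypothetical data: two haplotypes per person over the same SNP positions,
# plus a response label. These values are invented for this sketch.
PEOPLE = {
    "person 1": ("ACGTA", "ACGTT", "non-responder"),
    "person 2": ("AGGTA", "ACGTT", "responder"),
    "person 3": ("ACGTT", "ACGTA", "non-responder"),
    "person 4": ("AGGTA", "AGGTT", "responder"),
}

def genotypes_by_position(people, label):
    """For each position, collect the unordered genotypes seen in one group."""
    group = [(h1, h2) for h1, h2, status in people.values() if status == label]
    n_positions = len(group[0][0])
    return [{tuple(sorted((h1[i], h2[i]))) for h1, h2 in group}
            for i in range(n_positions)]

def distinctive_positions(people, group_a, group_b):
    """Return 1-based positions whose genotype sets are disjoint between groups."""
    a = genotypes_by_position(people, group_a)
    b = genotypes_by_position(people, group_b)
    return [i + 1 for i, (ga, gb) in enumerate(zip(a, b)) if ga.isdisjoint(gb)]

print(distinctive_positions(PEOPLE, "responder", "non-responder"))
```

With these invented values, only position 2 separates the groups: both non-responders are C/C there, while the responders are C/G or G/G.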
Under conventional analysis, however, most genetics researchers do not work at the level of the gene haplotype. About three quarters of researchers who study genetic variation focus on individual SNPs and attempt to draw associations between SNP genotypes and traits. This is called a simple genetics approach, with which there are two problems. First, these studies generally suffer from a lack of statistical power to detect associations, a power that is imparted to haplotype studies by systematic genetic forces that have shaped the genetic structure of our modern-day population. Second, they are inappropriate for solving complex genetic issues. Because most human traits are complex functions of intragenic (sets of SNPs and ploidy issues) and intergenic (i.e. multiple gene-gene interactions) factors, this is a serious limitation.
On the other hand, about one quarter of geneticists perform their work at higher levels of complexity. These geneticists consider genetic determinants at the level of the haplotype, rather than the SNP, and infer phase using computational methods or directly through biochemical means. Regardless of how phase is determined, haplotype systems are usually defined based on convenience. If a gene has 30 SNPs distributed throughout its sequence, for example, a researcher would likely select a small number of these SNPs as components of a haplotype system for study. This selection process is sometimes based on whether the SNP causes a coding (amino acid) change in the expressed protein, or rather based on the fact that the chosen SNPs cover the gene sequence well from 5′ to 3′ end. The problem with this approach is that it is somewhat arbitrary and leaves most of the SNPs in the gene untested even though they may be linked to the trait under study.
Most human genes have about 30-50 SNPs. Thus, if variants for such a gene were the cause of the non-response trait, and this variability could be ascribed to one or two SNPs, most of the haplotype systems chosen for study would be worthless for predicting the trait (given the laws of probability). In other words, the constituent haplotypes would not be statistically associated with the trait. (The latter point is slightly complicated by a concept called linkage disequilibrium, but it does not significantly impact the argument presented.) This follows from the observation that there are a large number of possible haplotypes incorporating these SNPs (i.e. 2^30 to 2^50 for 30- and 50-SNP haplotype systems, respectively) and an even larger number of haplotype pairs in the human population for each gene.
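The claim that most conveniently chosen haplotype systems would miss the causal SNP can be illustrated with a small calculation. The sketch assumes that a k-locus haplotype system is picked uniformly at random from a gene's SNPs and, like the argument above, ignores linkage disequilibrium. The function names and the choice of 30 SNPs with a single causal SNP are illustrative assumptions.

```python
from math import comb

def systems_of_size(n_snps, k):
    """Number of distinct k-locus haplotype systems in a gene with n_snps SNPs."""
    return comb(n_snps, k)

def fraction_containing(n_snps, k, n_causal=1):
    """Fraction of k-locus systems that include at least one causal SNP."""
    without_causal = comb(n_snps - n_causal, k)
    return 1 - without_causal / comb(n_snps, k)

# A gene with 30 SNPs, one of which is assumed to cause the trait.
for k in (2, 3, 5):
    print(k, systems_of_size(30, k), round(fraction_containing(30, k), 3))
```

With one causal SNP the fraction reduces to k/n, so a three-SNP system drawn from 30 SNPs contains the causal site only one time in ten.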
What this means for scientists trying to solve vexing disease and drug-response traits is that there is a large amount of data to sift through in drawing statistical associations between haplotypes, or haplotype pairs, and commercially relevant human traits. For most human genes, the number of haplotype systems that could possibly be invoked to explain variable traits in the human population is far larger than the number that actually explain them. This poses a tremendous statistical barrier for current day genetic research. Furthermore, traits are oftentimes caused by several genes interacting together (i.e. they are “complex”). After identifying optimal haplotype systems within a plurality of genes, the question then becomes how all of these genes work together to cause the trait.
Eye Color. Iris pigmentation is a complex genetic trait that has long interested geneticists and anthropologists but is yet to be completely understood. Eumelanin (brown pigment) is a light absorbing polymer synthesized in specialized lysosomes called melanosomes in a specialized cell type called melanocytes. Within the melanosomes, the tyrosinase (TYR) gene product catalyzes the rate-limiting hydroxylation of tyrosine (to 3,4-dihydroxyphenylalanine or DOPA) and oxidation of the resulting product (to DOPAquinone) to form the precursor for eumelanin synthesis. Though TYR is centrally important, pigmentation in animals is not simply a Mendelian function of TYR (or any other) gene sequences. In fact, study of the transmission genetics for pigmentation traits in man and various model systems suggests that variable pigmentation is a function of multiple, heritable factors whose interactions appear to be quite complex (Akey et al., 2001; Brauer and Chopra, 1978; Bito et al., 1997; Sturm et al., 2001; Box et al., 1997; Box et al., 2001a). For example, unlike human hair color (Sturm et al., 2001), there appears to be no dominance component for mammalian iris color determination (Brauer and Chopra, 1978), and no correlation between skin, hair and iris color within or between individuals of a given population. In contrast, between-population comparisons show good concordance; populations with darker average iris color also tend to exhibit darker average skin tones and hair colors. These observations suggest that the genetic determinants for pigmentation in the various tissues are distinct, and that these determinants have been subject to a common set of systematic forces that have shaped their distribution in the world's various populations.
At the cellular level, variable iris color in healthy humans is the result of the differential deposition of melanin pigment granules within a fixed number of stromal melanocytes in the iris (Imesch et al., 1997). The density of granules appears to reach genetically determined levels by early childhood and usually remains constant throughout later life (though, see Bito et al., 1997). Pedigree studies in the mid-seventies suggested iris color variation is a function of two loci: a single locus responsible for de-pigmentation of the iris, not affecting skin or hair, and another pleiotropic gene for reduction of pigment in all tissues (Brues, 1975).
Most of what has been learned about pigmentation has been derived from molecular genetics studies of rare pigmentation defects in man and model systems such as mouse and Drosophila. For example, dissection of the oculocutaneous albinism (OCA) trait in humans has shown that most pigmentation defects are due to lesions in one gene (TYR) resulting in their designation as tyrosinase (TYR) negative OCAs (Oetting and King, 1999; Oetting and King, 1993; Oetting and King, 1992; Oetting and King, 1991; see Albinism database at the World Wide Web address cbc.umn.edu/tad/). TYR catalyzes the rate-limiting step of melanin biosynthesis and the degree to which human irises are pigmented correlates well with the amplitude of TYR message levels (Lindsey et al., 2001). Nonetheless, the complexity of OCA phenotypes has illustrated that TYR is not the only gene involved in iris pigmentation (Lee et al., 1994). Though most TYR-negative OCA patients are completely de-pigmented, dark-iris albino mice (C44H) and their human type IB oculocutaneous counterparts exhibit a lack of pigment in all tissues except for the iris (Schmidt and Beermann, 1994). Study of a number of other TYR-positive OCA phenotypes has shown that, in addition to TYR, the oculocutaneous 2 (OCA2) (Durham-Pierre et al., 1994; Durham-Pierre et al., 1996; Gardner et al., 1992; Hamabe et al., 1991), tyrosinase like protein (TYRP1) (Chintamaneni et al., 1991; Abbott et al., 1991; Boissy et al., 1996), melanocortin receptor (MC1R) (Robbins et al., 1993; Smith et al., 1998; Flanagan et al., 2000) and adaptin 3B (AP3B) loci (Ooi et al., 1997), as well as other genes (reviewed by Sturm 2001), are necessary for normal human iris pigmentation.
In Drosophila, iris pigmentation defects have been ascribed to mutations in over 85 loci contributing to a variety of cellular processes in melanocytes (Ooi et al., 1997; Lloyd et al., 1998) but mouse studies have suggested that about 14 genes preferentially affect pigmentation in vertebrates (reviewed in Sturm 2001), and that disparate regions of the TYR and other OCA genes are functionally inequivalent for determining the pigmentation in different tissues.
Though the pigmentation genes are well-documented, until this work, merely a handful of SNP alleles were known to be weakly associated with natural distributions of iris colors in the healthy Caucasian population. The reason for this is that most work attempting to describe natural variation in iris colors has focused on simple genetics approaches, such as single SNP analysis in single genes including the TYR 0, MC1R (Valverde et al., 1997) and ASIP ( ) genes. By developing new complex genetics methodologies and adopting a systematic approach for identifying and modeling genetic features of variable iris color, the present work analyzed the problem through a more complex genetics lens than previous studies by others. Nevertheless, most of the results agree with previous literature.
Though the TYR expression product is the rate-limiting step in the catalytic chain leading to the synthesis of eumelanin from tyrosine, previous studies by others have belied the “simplistic” hypothesis that TYR polymorphism is a principal (i.e. penetrant) component underlying normal variation of human pigmentation (Sturm). Our study also failed to identify penetrant genetic features of variable iris color in the TYR gene. In addition, the systematic approach for identifying penetrant genetic features independently confirmed that the “red hair” SNP alleles described by Valverde et al., 1995 and Koppula et al., 1997 are indeed associated with iris colors. However, even these simple gene-wise analyses have been extended by the present findings. While there are no SNPs or haplotypes within the TYR gene associated with iris color, TYR alleles are important within a complex genetics context for the inference of iris colors. While the two “red hair” SNPs are indeed associated with natural iris colors (in Irish individuals), they seem to be most strongly associated with Caucasian iris colors within the multilocus context of another coding change in the MC1R gene, and even then, they represent merely one stroke of a larger portrait.
In fact, one important point to be taken from the work described herein is that speaking of variable iris color on the level of individual genes is illogical due to the complexity of the trait. The fact of the matter is, neither TYR nor MC1R, nor for that matter any of the other genes we surveyed, is very important for predicting iris colors on its own. This was indicated by the Bayesian conditional probabilities obtained, which for even the most strongly associated alleles (the penetrant genetic features), were too low for their use as independent classifiers. Since the variance of any complex phenotype is a function of additive, dominance, and epistatic genetic variance (in addition to environmental variance), any good complex genetics classifier must capture each of these three components when making inferences, and the present classifier developed seems to be able to do this. The additive component is captured most efficiently through the analysis of multilocus alleles (haplotypes) and the dominance component is captured by expressing individuals as vectors whose components are encodings of multilocus genotypes for each important region.
Though research on pigment mutants has made clear that a small subset of genes is largely responsible for catastrophic pigmentation defects in mice and man, it remains unclear whether or how common SNPs in these genes contribute towards (or are linked to) natural variation in human iris color. A brown-iris locus was localized to an interval containing the MC1R gene (Eiberg and Mohr, 1996), and specific polymorphisms in the MC1R gene have been shown to be associated with red hair and blue iris color in relatively isolated Irish populations (Robbins et al., 1993; Smith et al., 1998; Flanagan et al., 2000; Valverde et al., 1995; Koppula et al., 1997). An ASIP polymorphism was also recently described that may be associated with both brown iris and hair color (Kanetsky et al., 2002). However, the penetrance of each of these alleles is low and in general, they appear to explain but a very small amount of the overall variation in iris colors within the human population (Spritz et al., 1995). Studies such as these for associating genes and traits are gene-centric in that alleles descriptive of variant gene loci are considered as definitive and focal objects.
To date, these methods have not worked well. Because most human traits are complex and genetic wholes are oftentimes greater than the sum of their parts, innovative genomics-based study designs and analytical methods for screening genetic data in silico are needed that are respectful of genetic complexity (for example, the components of dominance and epistatic genetic variance).
Correspondence Analysis. As a methodology for multidimensional analysis, one might consider using correspondence analysis (COA) to find relationships between haplotype systems in various genes and genetic traits. COA is used to create a spatial representation of a data matrix in such a manner that associations within and between variables can be discerned. COA has been described by various authors, most notably by J. P. Benzecri in his “Correspondence Analysis Handbook” published in 1992 (Statistics: textbooks and monographs, Volume 125, Marcel Dekker, Inc., New York, N.Y.) and by Greenacre, M. J. in his “Theory and application of correspondence analysis” handbook (Academic, London, 1st Edition). The methods described by Messrs. Benzecri and Greenacre are applicable to various data having non-negative counts and non-negative continuous measurements. Special considerations and approaches, however, must be made for the analysis of genomics data, and specifically for population genetic data.
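As a rough illustration of the data matrix that COA operates on, the sketch below computes the standardized residuals of a contingency table of non-negative counts, together with the total inertia (the chi-square statistic divided by the grand total). A full COA would go on to take a singular value decomposition of this residual matrix to obtain the plotting axes; that step is omitted here. The table of haplotype counts by iris color class is invented for illustration, and the function name is an assumption.

```python
def correspondence_residuals(table):
    """Return the standardized residual matrix and total inertia of a count table.

    `table` is a list of rows of non-negative counts with no all-zero row or column.
    """
    total = sum(sum(row) for row in table)
    row_mass = [sum(row) / total for row in table]
    col_mass = [sum(col) / total for col in zip(*table)]
    # Residual = (observed proportion - expected proportion) / sqrt(expected proportion),
    # where the expected proportion assumes rows and columns are independent.
    residuals = [
        [(n / total - r * c) / (r * c) ** 0.5 for n, c in zip(row, col_mass)]
        for row, r in zip(table, row_mass)
    ]
    inertia = sum(x * x for row in residuals for x in row)
    return residuals, inertia

# Hypothetical counts: rows are haplotypes, columns are iris color classes.
counts = [
    [30, 10, 5],   # haplotype 1: brown, blue, green
    [8, 25, 7],    # haplotype 2
    [12, 9, 14],   # haplotype 3
]
residuals, inertia = correspondence_residuals(counts)
print(round(inertia, 4))
```

An inertia near zero would indicate that haplotypes and trait classes are close to independent, in which case a COA plot would show little structure.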
COA generally provides the canvas upon which various interpretations can be painted. Various discriminants have been used with COA plots in order to formulate rules for making predictions. For example, in one study of medical relevance, clouds of data were generated for patients receiving a particular therapy, conforming to various attribute values of medical relevance. Patient survival was one of the axes of a plot of variable profiles whose simplex lines were well correlated with this axis. The goal of the study was to enable the classification of a patient based on a COA of various qualitative and quantitative attributes into the cloud of patients to which the individual was most similar so that the patient's survival “value” given the therapy could be learned.
Within the field of molecular biology, several authors have used COA or similar methods for drawing associations between gene expression and cellular state. For example, see Fellenberg, K. et al., Correspondence analysis applied to microarray data. PNAS 98(19):10781-10786; and Alter, O. et al., Singular value decomposition for genome-wide expression data processing and modeling. PNAS 97(18):10101-10106. These applications required various normalization routines in order to avoid biasing the analysis by considering genes expressed with vastly different amplitudes. Only Alter, Patrick Brown, and David Botstein applied a singular value decomposition method for an analysis of gene expression data. Their method scaled down the dimensions of complex data by decomposition onto principal axes. Their method showed that singular value decomposition provides a useful mathematical framework for processing and modeling genome-wide expression data, which was not directly related to population genetics where parameters are measured differently.
However, gene expression data is inherently different from population genetic data. Gene expression is a measure of amplitude, while population genetic data is a measure of state. Not only does this require different measures for standardization and normalization, but the parameters used to describe population genetic data are different. For example, linkage disequilibrium is a parameter that is only useful for describing relationships between genetic states and cannot be used for gene expression analysis. The ability to analyze encoded genetic states in terms of linkage disequilibrium constants, or other genetic parameters such as allele frequencies, haplotype cladogram positions, etc., is an important feature which differs significantly from previous applications of COA in biology. Gene expression analysis also requires a filtration of insignificant “eigengenes” or rows of genes that do not differ significantly along columns (hybridization or cellular states). Compare this to an application of COA as a modeling tool for genetic factors that are already known from other analytical techniques to be features of phenotype states—that is, row values are already known to not be independently distributed with respect to column values.
Good computational tools for genetic modeling do not currently exist, and it is this need that is addressed by the inventive methods and apparatus described in the present application.