The purpose of genetics classification is to be able to accurately classify individuals into one of a plurality of genetic trait classes (e.g. brown, blue, green, etc.) associated with a particular genetic trait (e.g. eye color). A genetics classification test should be able to identify with precision which trait class an individual may fall into based on a genetic sample taken from the individual. The present application relates to the use of complex genetics analysis and software to create or construct accurate genetics classification tests. Such classification tests have highly valuable applications, especially in the fields of personalized medicine and criminal investigation.
The present application relates more particularly to “blind” genetics classification. “Blind genetics classification” is the classification of individual genetic samples that were not used in constructing the actual classification tool. If 1000 samples were being considered, for example, a classification model may be built from 900 of them. Classification of the 900 individuals will perform at one level (depending on the genes used), and blind classification of the remaining 100 should perform as well. However, the blind classification may not perform well at all, depending on how well the classification model generalizes.
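The 900/100 arrangement described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `split_blind` and `accuracy`, the fixed seed, and the integer sample IDs are hypothetical and do not come from the application itself.

```python
import random


def split_blind(samples, n_blind, seed=0):
    """Partition samples into a model-building set and a blind (held-out) set."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    return shuffled[n_blind:], shuffled[:n_blind]


def accuracy(classify, labelled_samples):
    """Fraction of (genotype, trait_class) pairs that classify() gets right."""
    correct = sum(1 for genotype, trait in labelled_samples
                  if classify(genotype) == trait)
    return correct / len(labelled_samples)


# Hypothetical IDs standing in for 1000 genotyped individuals.
sample_ids = list(range(1000))
build_set, blind_set = split_blind(sample_ids, n_blind=100)
print(len(build_set), len(blind_set))
```

A model would be constructed from `build_set` only; comparing `accuracy` on `build_set` with `accuracy` on `blind_set` then indicates how well the model generalizes.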
The problem with many of the existing classification methods used in genetics or genomics analysis is that they build models that fit the data used to construct them, but the models tend not to generalize very well. That is to say, the models produced are “over-fit” to the data. This is not surprising when one recognizes that these methods were not developed with the specific requirements of complex genetics analysis in mind. Linear discriminant methods and Bayesian probability models, for example, overestimate the importance of single genotype associations, which means they are sensitive to dominance and even additive effects but ignore the higher-order, non-linear interactions that are referred to in genetics as epistasis. Upon blind challenge, they therefore underperform on the same set of data when compared to the present inventive techniques described herein.
Some methods measure complex genetics parameters (the so-called “parametric” methods) and, for this reason, are encumbered with many limitations. Some of these define the additive, dominance, and interactive contribution of gene variants based on trait value. To measure these values, the programs use regression analysis. These methods build models that appear to be highly accurate when tested against samples that went into construction of the model but, for unclear reasons, they do not tend to generalize as well for blind classification as the present methods described herein, perhaps because parameter estimation is particularly sensitive to inadequate sample sizes.
Assume that the trait of skin color is a function of two genes A and B, and that each gene has various forms (i.e. haplotypes) in the population: A1, A2, . . . , An, and B1, B2, . . . , Bn. It may be that A1 always specifies dark skin, but A2 specifies dark skin when paired with B1 and light skin when paired with B2. In this case, the influence of A1 is said to be dominant and the influence of A2 is said to be interactive. Each human being has two copies of every gene. One person may have A1/A1 and B1/B1, whereas another person may have A1/A2 and B1/B3. If an individual has no copies of A3, the skin color may be darker than average; if the individual has one copy of A3, the skin color may always be medium; and if the individual has two copies of A3, the skin color may be very light. In this case, A3 is said to have an additive effect on skin color.
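The additive case above depends only on how many copies of A3 an individual carries. A minimal Python sketch follows, assuming genotypes are written as slash-separated strings such as "A1/A3"; the helper name `allele_copies` and the lookup table are illustrative only.

```python
def allele_copies(genotype, allele):
    """Count how many of the two gene copies in a genotype are the given allele."""
    return genotype.split("/").count(allele)


# Additive effect of A3 as described in the text: 0, 1, or 2 copies.
ADDITIVE_A3_EFFECT = {0: "darker than average", 1: "medium", 2: "very light"}

for genotype in ("A1/A1", "A1/A3", "A3/A3"):
    copies = allele_copies(genotype, "A3")
    print(genotype, copies, ADDITIVE_A3_EFFECT[copies])
```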
Even though the genes that determine a trait may be known with confidence, using them to make accurate classifications for the trait is another matter altogether. By analogy, just because one has a complete set of puzzle pieces, it is not immediately clear how they should be put together to constitute an image. Most human traits are a function of additive, dominance, and interactive influences amongst several genes, and breaking the impact of genes on traits into these three influences helps geneticists understand how traits are determined by specific gene variants and combinations of gene variants. Understanding how each form of each gene participates in the determination of a trait is a fundamental goal for genetics researchers.
Knowing this, it is possible to classify an individual into trait classes. In the fields of variable drug response and disease predisposition, such ability has enormous social and economic implications. Various methods for using gene sequences to predict traits have been previously developed, including linear discriminant analysis and Bayesian classifications. Unfortunately, these methods do little to address the subtleties of gene-by-gene influences, and they do not fully capture the impact of individual genotypes.
Consider an example which describes why it is difficult to make genetics classifications even though the genes impacting the trait are known with confidence. In this example, two genes A and B are specified. Assume the following sample “counts” for 658 people relative to skin shade.
TABLE A. Gene A genotypes and skin shades in Caucasians.

Genotype    Dark    Medium    Light
A1/A1        101         4        3
A1/A2         50         5        2
A1/A3         10        23       22
A2/A2        102       101       59
A2/A3         20        31       20
A3/A3          2        45       58

It can be determined from this data that people with the A1 genotype usually have Dark skin, but sometimes Medium or Light, and people with A2 usually do not have Light skin. Making classifications based on this knowledge results in the misclassification of only 28 A1 individuals, but 81 A3 individuals. In this case, it is better to make classification rules based on genotypes, such as A1/A1 being not Light (correct 105/108 times), A1/A2 being not Light (correct 55/57 times), etc.
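The genotype-level figures quoted above (105/108 and 55/57) can be reproduced directly from the Table A counts. The following Python sketch is illustrative only; the dictionary layout and the helper name `not_shade_accuracy` are assumptions made for this example.

```python
# Counts copied from Table A: genotype -> (Dark, Medium, Light).
TABLE_A = {
    "A1/A1": (101, 4, 3),
    "A1/A2": (50, 5, 2),
    "A1/A3": (10, 23, 22),
    "A2/A2": (102, 101, 59),
    "A2/A3": (20, 31, 20),
    "A3/A3": (2, 45, 58),
}
SHADES = ("Dark", "Medium", "Light")


def not_shade_accuracy(counts, excluded):
    """Return (correct, total) for the rule 'this genotype is not <excluded>'."""
    total = sum(counts)
    return total - counts[SHADES.index(excluded)], total


for genotype in ("A1/A1", "A1/A2"):
    correct, total = not_shade_accuracy(TABLE_A[genotype], "Light")
    print(f"{genotype} not Light: {correct}/{total}")

# Sanity check: the six rows account for all 658 sampled individuals.
print(sum(sum(counts) for counts in TABLE_A.values()))
```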
Now consider a B gene with the following counts:
TABLE B. Gene B genotypes and skin shades in Caucasians.

Genotype    Dark    Medium    Light
B1/B1         51        24       33
B1/B2         50        10        2
B1/B3         30        13       12
B2/B2         52       141       69
B2/B3         10        36       25
B3/B3         12        25       68

A consideration of such B gene variants along with A variants may enable better classification. In this case, those with a B1 tend to have Dark color and those with B3 average a Lighter color, but do the 10 B2/B2 individuals with Dark color have a particular gene A genotype that distinguishes them from other B2/B2 individuals? For real genetics problems, it is rarely the case that those that were misclassified using gene A are correctly classified using gene B. Oftentimes up to 10 more genes are required to explain all of the variability in the data, which is one example of why it is difficult to make genetics classifications even though the genes impacting the trait are known with confidence.
It has been observed that it is often the specific combination of A and B alleles that helps make accurate classifications. However, the way these combinations relate to trait value can be unpredictable. For this reason, observation is crucial for good genetics classification, and it is upon observation that the present inventive techniques described herein rely. For example, assume that the combinations of A and B gene variants provide a table with the following counts (shown in part):
TABLE C. Genotype combinations.

Combination        Dark    Medium    Light
A1/A1 + B1/B2        50         2        0
A1/A3 + B1/B2         0        10        0
Etc.              . . .     . . .    . . .

From Table C, it would appear that the A1/A1+B1/B2 combination is always predictive for “not Light” and usually predictive for Dark, and the A1/A3+B1/B2 combination is always predictive for “Medium” color. One of these results is not surprising, but the other is. From Tables A and B, we see that both A1/A1 and B1/B2 are linked with Dark color on their own, so it is no surprise that people with the combination A1/A1+B1/B2 almost always have Dark color. In contrast, A1/A3 appears to be linked with no particular color on its own, and B1/B2 is linked with Dark, but the A1/A3+B1/B2 combination is linked with Medium color. In this case, the presence of the A1/A3 genotype explains why some of the B1/B2 individuals in Table B are not Dark, and the rule “the presence of B1/B2 indicates Dark unless A1/A3 is present” would have a higher blind classification accuracy than the rule “B1/B2 indicates Dark”.
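The two rules can be compared on the partial counts shown in Table C. The Python sketch below is illustrative only. It scores the rules against the tabulated counts themselves, not against blind samples, and it assumes that the exception rule predicts Medium when A1/A3 is present, which is what Table C suggests but the quoted rule does not state. All function names are hypothetical.

```python
# Partial counts copied from Table C:
# (gene A genotype, gene B genotype) -> (Dark, Medium, Light).
TABLE_C = {
    ("A1/A1", "B1/B2"): (50, 2, 0),
    ("A1/A3", "B1/B2"): (0, 10, 0),
}
SHADES = ("Dark", "Medium", "Light")


def rule_b_only(a_genotype, b_genotype):
    """'B1/B2 indicates Dark'; returns None when the rule does not apply."""
    return "Dark" if b_genotype == "B1/B2" else None


def rule_with_exception(a_genotype, b_genotype):
    """'B1/B2 indicates Dark unless A1/A3 is present' (Medium assumed then)."""
    if b_genotype != "B1/B2":
        return None
    return "Medium" if a_genotype == "A1/A3" else "Dark"


def score(rule, table):
    """Return (correct, total) over all table rows where the rule makes a call."""
    correct = total = 0
    for (a_genotype, b_genotype), counts in table.items():
        predicted = rule(a_genotype, b_genotype)
        if predicted is None:
            continue
        correct += counts[SHADES.index(predicted)]
        total += sum(counts)
    return correct, total


print("B1/B2 -> Dark:", score(rule_b_only, TABLE_C))
print("B1/B2 -> Dark unless A1/A3:", score(rule_with_exception, TABLE_C))
```

On these two rows the simple rule is right for 50 of 62 individuals, while the rule with the A1/A3 exception is right for 60 of 62.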
When a variant pair associated with one extreme of a trait is paired with another gene variant pair linked with the same extreme, an intermediate trait may sometimes result. Sometimes two variants that are not linked to the trait at all on their own together determine a specific trait value. The interaction between gene variants to influence trait value is called epistasis. These types of unexpected results are not unusual in genetics. Other data have suggested that this type of scenario is not at all uncommon, which illustrates that the present inventive techniques described herein are an important advance.
If A's influence and B's influence are known, then how is it that the influence of A+B cannot always be predicted prior to observation? In other words, how is it possible for epistasis to exist? Most dynamic biochemical pathways and their influences are complex. The product of each gene is part of myriad complex biochemical networks, and modification of a gene product in a dynamic biochemical pathway may have a small or large effect on the function of the pathway, depending on the position of the gene in the pathway and the type of modification. Many biochemical networks intersect with others, adding to the complexity and unpredictability that modifications in one pathway can produce. Most geneticists agree that linking a gene variant to a trait depends on observation rather than conjecture or inference from biochemical research. In other words, genetics observations do not always conform to expectations, not because the observations are inaccurate but because genetics is very complex. It is very advantageous to learn how these modifications and variants participate in trait formation through observation. The present inventive techniques described herein are a tool for such observation.
With most human traits, certain variant combinations (called genotypes, such as A1/A1 or A1/A3) may be highly predictive for a trait, while other combinations are not. Certain variants (such as A1 or A3) may be predictive on their own, to varying extents specific to the variant, while others may not be. Certain combinations of genotypes may be linked to a trait. However, each individual genotype in such a combination may not be linked to the trait at all, may be linked in ways that are expected based on the combination linkage, or may be linked in ways one would not expect based on the combination linkage. These are the complex issues that population geneticists must contend with when attempting to make practical applications of their research.