Field of the Invention
The present invention relates to DNA sequencing and more particularly to interpretation of the many genetic variants generated in any sequencing project.
An automated computational system for producing known and predicted information about genetic variants, otherwise known as variant annotations, is also described.
Description of the Related Art
Advances in high-throughput DNA sequencing technologies have enabled the identification of millions of genetic variants in an individual human genome. Reductions in sequencing costs and increases in sequencing efficiency have brought these capabilities within the grasp of individual laboratories looking to use DNA sequencing as a powerful tool in their research endeavors, yet very few laboratories have the computational expertise and infrastructure to make sense of the genetic variants identified through these studies. While increasingly sophisticated tools continue to be developed for sequence assembly and variant calling, interpretation of the massive number of genetic variants generated by any sequencing project remains a major challenge. This problem is especially pronounced in the interpretation of noncoding variants, which likely explain a major proportion of heritability in common complex diseases. Because of the extreme difficulty and computational burden associated with interpreting regulatory variants and variations across collections of genes, genome sequencing studies have focused on the analysis of non-synonymous coding variants in single genes. This strategy has been effective in identifying mutations associated with rare and severe familial disorders; however, analysis of all types of variants must be made accessible to the research community in order to address the locus and allelic heterogeneity that almost certainly underlies most common disease predisposition.
The availability of high-throughput DNA sequencing technologies has enabled nearly comprehensive investigations into the number and types of sequence variants possessed by individuals in different populations and with different diseases. For example, not only is it now possible to sequence a large number of genes in hundreds if not thousands of people, but it is also possible to sequence entire individual human genomes in the pursuit of inherited disease-causing variants or somatic cancer-causing variants. Whole genome sequencing may become a relatively routine procedure in the near future as high-throughput sequencing costs and efficiency continue to improve. In fact, as costs continue to decline, high-throughput sequencing is expected to become a commonly used tool, not only in human phenotype-based sequencing projects, but also in forward genetics applications in model organisms and in the diagnosis of diseases previously considered to be idiopathic, for which there are already some striking examples.
One particularly vexing problem that has accompanied the development and application of high-throughput sequencing is making sense of the millions of variants identified per genome. Recent successes at identifying variants associated with disease have generally been executed under clever yet restricted conditions. For example, a number of re-sequencing studies have focused on the identification of causal variants at significant genome-wide association study (GWAS) loci and have identified excesses of non-synonymous variants in nearby candidate genes. However, this approach has three limitations. First, these potentially causal variants tend not to explain much more of the heritability than the GWAS tag SNP itself. Second, a large proportion (~80%) of GWAS hits are in intergenic regions with no protein-coding elements nearby. Third, even with extremely large study populations, the GWAS strategy is not likely to individually identify tag SNPs that explain even half the heritability of common diseases.
Nevertheless, GWAS has plenty left to offer in terms of identification of significant, or at least suspicious, candidate loci for re-sequencing studies. Other sequencing strategies have successfully identified non-synonymous variants associated with familial and/or severe disorders. However, if highly penetrant variants contribute to common disease predisposition, they should be detectable by linkage analysis. Linkage and straightforward association strategies have not identified the majority of variants predisposing to common diseases, where variable penetrance, allelic and locus heterogeneity, epistasis, gene-gene interactions, and regulatory variation play a more important yet elusive role. If sequence-based association studies are to successfully identify variants associated with common diseases and expand our understanding of the heritable factors involved in disease predisposition, investigators must be armed with the tools necessary for identification of moderately penetrant disease-causing variants, outside of GWAS hits, and beyond simple protein-coding changes. In fact, the identification and interpretation of variants associated with inherited but not strongly familial disease is a crucial step in translating sequencing efforts into a truly significant impact on public health.
If one accepts the rare variant hypothesis of disease predisposition, then one would expect rare variants predisposing to disease to be associated with high relative risk. Because of their low frequency, however, simple univariate analyses in which each variant is tested individually for association with disease will require extremely large sample sizes to achieve sufficient power. This problem is compounded tremendously if disease predisposition results from the interaction and combination of extremely rare variants segregating and encountering one another throughout the population. Variant collapsing strategies have been shown to be a powerful approach to rare variant analysis; however, collapsing methods are extremely sensitive to the inclusion of noncausal variants within collapsed sets.
The key to unlocking the power of variant collapsing methods, and facilitating sequence-based disease association studies in reasonable study sizes and at reasonable cost, is a logical approach to forming collapsed sets. In fact, regardless of the allelic frequency and penetrance landscape underlying common disease-predisposing variants, set-based analyses can expose what simple linkage or association studies have failed to reveal.
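The sensitivity of collapsing methods to the composition of the collapsed set can be illustrated with a minimal sketch. The code below is not the method of the present invention; it assumes a simple carrier-based collapsing scheme (each individual is flagged if any variant in the set is present) followed by a one-sided Fisher's exact test, and all function names, variant labels, and genotype data are hypothetical and chosen only for illustration.

```python
from math import comb

def collapse_carriers(genotypes, variant_set):
    """Return one 0/1 flag per individual: 1 if the individual carries a
    minor allele at any variant in variant_set (hypothetical data layout:
    each individual is a dict mapping variant label -> allele count)."""
    return [int(any(person.get(v, 0) > 0 for v in variant_set))
            for person in genotypes]

def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact p-value for the 2x2 table
    [[a, b], [c, d]] = [[case carriers, case non-carriers],
                        [control carriers, control non-carriers]],
    testing whether carriers are enriched among cases."""
    n_cases = a + b
    n_carriers = a + c
    n_total = a + b + c + d
    denom = comb(n_total, n_cases)
    upper = min(n_cases, n_carriers)
    # Sum hypergeometric probabilities of seeing >= a carriers among cases.
    tail = sum(comb(n_carriers, k) * comb(n_total - n_carriers, n_cases - k)
               for k in range(a, upper + 1))
    return tail / denom

def collapsed_p_value(cases, controls, variant_set):
    """Collapse a variant set into carrier flags and test cases vs. controls."""
    case_flags = collapse_carriers(cases, variant_set)
    control_flags = collapse_carriers(controls, variant_set)
    a = sum(case_flags)
    c = sum(control_flags)
    return fisher_exact_greater(a, len(case_flags) - a,
                                c, len(control_flags) - c)

# Toy, made-up genotypes: v1..v3 stand in for causal rare variants,
# n1 and n2 for noncausal variants that occur in cases and controls alike.
cases = [{"v1": 1}, {"v2": 1}, {"v3": 1}, {"v1": 1}, {"n1": 1}, {}]
controls = [{"n1": 1}, {"n2": 1}, {}, {}, {}, {"n1": 1}]

p_causal = collapsed_p_value(cases, controls, {"v1", "v2", "v3"})
p_diluted = collapsed_p_value(cases, controls, {"v1", "v2", "v3", "n1", "n2"})

print(f"causal-only set: p = {p_causal:.3f}")
print(f"set with noncausal variants: p = {p_diluted:.3f}")
```

In this toy data, no single variant is carried by more than two individuals, so per-variant tests would have little power, while the collapsed causal set separates cases from controls. Adding the two noncausal variants to the set raises the p-value from about 0.03 to about 0.27, which is the dilution effect described above.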
Recent successes in clinical genome sequencing, especially in family-based studies of individuals with rare, severe and likely single-gene disorders, have highlighted the potential for genome sequencing to greatly improve molecular diagnosis and clinical decision making. However, these successes have relied on large bioinformatics teams and in-depth literature surveys, an approach that is neither scalable nor rapid. The adoption of genome sequencing among the clinical community at large requires, among other things, the ability to rapidly identify a small set of candidate disease-causing (i.e., likely pathogenic) mutations from among the tens to hundreds of genes harboring variants consistent with plausible functional effects, inheritance patterns and population frequencies. Presented herein is a framework for the identification of rare disease-causing mutations, with a focus on a phenotype-informed network ranking (PIN Rank) algorithm for ordering candidate disease-causing mutations identified from genome sequencing. The proposed algorithm's accuracy in prioritizing variations is demonstrated by applying it to a number of test cases in which the true causative variant is known.