Whole genome sequencing of related individuals provides opportunities for the investigation of human recombination and of compound heterozygous loci contributing to Mendelian disease traits, as well as for error control. The recent publication of low-coverage sequencing data from large numbers of unrelated individuals offers a broad catalog of genetic variation in three major population groups that is complementary to deep sequencing of related individuals. (Durbin, R. M., et al. A map of human genome variation from population-scale sequencing. Nature 467, 1061-1073 (2010). This and all other cited references are incorporated herein by reference in their entirety.) Recently, investigators used a family-sequencing approach to fine map recombination sites, and combined broad population genetic variation data with phased family variant data to identify putative compound heterozygous loci associated with the autosomal recessive Miller syndrome. (Roach, J. C., et al. Analysis of genetic inheritance in a family quartet by whole-genome sequencing. Science 328, 636-639 (2010).)
One of the challenges to the interpretation of massively parallel whole genome sequence data is the assembly and variant calling of sequence reads against the human reference genome. Although de novo assembly of genome sequences from raw sequence reads represents an alternative approach, computational limitations and the large amount of mapping information encoded in relatively invariant genomic regions make this an unattractive option presently. The NCBI human reference genome in current use (Pruitt, K. D., Tatusova, T. & Maglott, D. R. NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res 35, D61-65 (2007).) is derived from DNA samples from a small number of anonymous Caucasian donors and, therefore, represents a small sampling of the broad array of human genetic variation. Additionally, this reference sequence contains both common and rare disease risk variants (Chen, R. & Butte, A. J. The reference human genome demonstrates high risk of type 1 diabetes and other disorders. Pac Symp Biocomput, 231-242 (2011).) and may bias interpretation of genetic variants aligned and called against the sequence.
With the advent of the human genome project and the draft sequences of human genomic DNA came promises to revolutionize personalized health care by tailoring risk modification, medications, and health surveillance to patients' individual genetic backgrounds. High-throughput whole genome sequencing (“Next generation sequencing”) has revolutionized the study of genetic variation and facilitated a precipitous drop in the cost per quantum of genetic variation data generated. This trend has thus far outstripped Moore's law by a factor of two, opening the door for population-wide genome sequencing. As such, technologies for interpretation of the massive amounts of genetic data produced with each genome sequence must advance in step.
One of the upstream challenges to interpretation of this genetic data is the reliance on a human reference genome sequence to 1) identify the genomic location of the billions of short (30 to 500 base pair) sequence reads produced in a massively parallel fashion by high-throughput sequencing, and 2) identify variation in other individuals from this “normal” sequence.
Another barrier to the realization of the goal of genome interpretation pipelines is the difficulty of assigning “phase,” or the parental origin of genetic variants, in sequencing studies, because genotypes, not haplotypes, are reported at each position. The assessment of compound heterozygous and multigenic disease risk, the integration of sex-specific risk inheritance, and the integration of genetic background in areas of high recombination, such as the human leukocyte antigen (HLA) loci, will be crucial to understanding genome-wide risk in related individuals, because these analyses are wholly reliant on phased haplotype data.
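By way of illustration only, the distinction between genotype and phased haplotype can be sketched in a few lines of Python. The sketch assumes a simple parent-child trio, biallelic single-nucleotide sites, and error-free genotype calls; the function names `phase_child` and `is_compound_het` are hypothetical and are not drawn from the cited references. It shows how Mendelian inheritance can resolve the parental origin of a child's alleles at some sites, and how phased sites within one gene might then be checked for a compound heterozygous configuration.

```python
from typing import List, Optional, Set, Tuple

Genotype = Tuple[str, str]   # unordered pair of alleles at one site
Phased = Tuple[str, str]     # (maternal allele, paternal allele)


def phase_child(child: Genotype, mother: Genotype,
                father: Genotype) -> Optional[Phased]:
    """Assign parental origin to a child's alleles by Mendelian rules.

    Returns (maternal_allele, paternal_allele), or None when the site
    is ambiguous (e.g. all three individuals heterozygous) or when the
    genotypes are inconsistent with Mendelian inheritance.
    """
    a, b = child
    candidates = set()
    # Try both assignments of the child's two alleles to the parents.
    for mat, pat in ((a, b), (b, a)):
        if mat in mother and pat in father:
            candidates.add((mat, pat))
    if len(candidates) == 1:
        return candidates.pop()
    return None


def is_compound_het(phased_sites: List[Phased],
                    damaging: List[Set[str]]) -> bool:
    """Check phased sites in one gene for a compound heterozygous pattern.

    Illustrative assumption: the pattern holds when a damaging allele
    lies on the maternal haplotype at one site and on the paternal
    haplotype at a different site.
    """
    mat_hits = {i for i, (m, _) in enumerate(phased_sites) if m in damaging[i]}
    pat_hits = {i for i, (_, p) in enumerate(phased_sites) if p in damaging[i]}
    return any(i != j for i in mat_hits for j in pat_hits)


# Site 1: only the father carries G, so the child's G is paternal.
site1 = phase_child(("A", "G"), ("A", "A"), ("A", "G"))
# Site 2: only the mother carries T, so the child's T is maternal.
site2 = phase_child(("C", "T"), ("C", "T"), ("C", "C"))
# All three heterozygous: parental origin cannot be resolved.
ambiguous = phase_child(("A", "G"), ("A", "G"), ("A", "G"))
```

In this sketch the unphased genotypes at the two sites look identical in kind (both heterozygous), yet only the phased result reveals that the two variant alleles lie on different parental haplotypes, which is the information a compound heterozygous assessment requires.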
A further barrier to the realization of the goal of genome interpretation pipelines is the difficulty of assigning risk estimates to the millions of genetic variants present in any individual. These variants include between 50 and 100 potentially damaging mutations in genes associated with Mendelian (“single gene”) disorders, the majority of which are of uncertain significance.
Therefore, there is a need for improved methods for analyzing genome sequence data.