Early DNA sequencing techniques, such as chain-termination methods, provided reliable solutions for reading individual DNA fragments. See Sanger, F. et al. (1977) DNA sequencing with chain-terminating inhibitors. Proc. Natl. Acad. Sci. U.S.A. 74, 5463-5467. While these first-generation technologies are effective for sequencing target genes, applying them to entire chromosomes or genomes is costly and time-consuming. For example, the first sequencing of a human genome, which was accomplished using the Sanger method, cost hundreds of millions of dollars and took over a decade to complete. This high cost was largely due to the sequential nature of first-generation sequencing methods; each fragment had to be individually read and manually assembled to construct a full genome.
Next-generation sequencing (NGS) technologies have significantly reduced the cost of DNA sequencing by parallelizing DNA fragment reading. Some NGS methods are capable of performing millions of sequence reads concurrently, generating data for millions of base pairs in a matter of hours. See Hall, N. (2007) Advanced sequencing technologies and their wider impact in microbiology. The Journal of Experimental Biology, 209, 1518-1525. Many NGS technologies have been proposed; they employ different chemical processes, produce different read lengths, and have demonstrated different levels of accuracy. See Metzker, M. (2010) Sequencing technologies—the next generation. Nature Reviews, Genetics, Volume 11, 31-46; see also Shendure, J. et al. (2008) Next-generation DNA sequencing. Nature Reviews, Biotechnology, Volume 26, Number 10, 1135-1145.
NGS methods generally involve separating a DNA sample into fragments and reading the nucleotide sequence of those fragments in parallel. The data generated by this process includes read data for each of those fragments, each of which contains a continuous sequence of nucleotide bases (G, A, T, C). However, while the arrangement of bases within a given fragment read is known, the arrangement of the fragment reads with respect to each other is not. Thus, to determine the sequence of a larger DNA strand (such as a gene or chromosome), read data from multiple fragments must be aligned. This alignment is relative to other read fragments, and may include overlapping fragments, depending upon the particular NGS method used. Some NGS methods use computational techniques and software tools to carry out read data alignment.
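As an illustration of how overlapping fragment reads can be placed relative to each other, the following minimal sketch finds the longest suffix of one read that matches the prefix of another and merges the two. The function names, the `min_overlap` parameter, and the example reads are hypothetical and chosen only for illustration; the sketch assumes error-free reads from the same strand, which practical alignment tools do not.

```python
from typing import Optional


def overlap_length(left: str, right: str, min_overlap: int = 3) -> int:
    """Return the length of the longest suffix of `left` that equals a
    prefix of `right`, or 0 if no overlap of at least `min_overlap`
    bases exists. `min_overlap` is assumed to be at least 1."""
    max_len = min(len(left), len(right))
    # Try the longest possible overlap first and shrink until one matches.
    for size in range(max_len, min_overlap - 1, -1):
        if left[-size:] == right[:size]:
            return size
    return 0


def merge_reads(left: str, right: str, min_overlap: int = 3) -> Optional[str]:
    """Merge two reads on their suffix/prefix overlap.
    Returns None when the reads do not overlap sufficiently."""
    size = overlap_length(left, right, min_overlap)
    if size == 0:
        return None
    # Keep all of `left` and append only the non-overlapping tail of `right`.
    return left + right[size:]


# Two hypothetical reads that share the four bases "CAGG".
read_a = "GATTACAGG"
read_b = "CAGGTTC"
merged = merge_reads(read_a, read_b)
```

Here `merged` is the single longer sequence formed by joining the two reads on their shared bases. Exact matching is used only to keep the sketch short; tolerating mismatches caused by read errors or true variants requires scoring-based alignment instead.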
Accurate sequence read alignment is the first step in identifying genetic variations in a sample genome. The diverse nature of genetic variation can cause alignment algorithms and techniques to align sequence reads to incorrect locations within the genome. Furthermore, the read process used to generate sequence reads may be complex and susceptible to errors. Thus, many sequence read alignment techniques can misalign a sequence read within a genome, which can lead to incorrect detection of variants in subsequent analyses.
Once the read data has been aligned, that aligned data may be analyzed to determine the nucleotide sequence for a gene locus, gene, or an entire chromosome. However, differences in nucleotide values among overlapping read fragments may be indicative of a variant, such as a single-nucleotide polymorphism (SNP) or an insertion or deletion (INDEL), among other possible variants. For example, if read fragments that overlap at a particular locus differ, those differences might be indicative of a heterozygous SNP. As another example, if overlapping read fragments agree at a single nucleotide position but differ from a reference genome, that locus may carry a homozygous SNP with respect to that reference genome. Accurate determination of such variants is an important aspect of genome sequencing, since those variants could represent mutations, genes that cause particular diseases, and/or otherwise serve to genotype a particular DNA sample.
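The two examples above can be sketched as a simple classification of the bases that aligned reads report at one position. This is a minimal illustration and not the method of any particular variant caller: the function name, the labels it returns, and the `min_fraction` threshold (a hypothetical cutoff for ignoring bases that are likely read errors) are all assumptions.

```python
from collections import Counter


def classify_site(observed_bases: str, reference_base: str,
                  min_fraction: float = 0.2) -> str:
    """Classify one genomic position from the bases that overlapping
    reads report there, compared against the reference base."""
    counts = Counter(observed_bases.upper())
    depth = sum(counts.values())
    if depth == 0:
        return "no coverage"

    # Keep only bases seen in at least `min_fraction` of the reads;
    # rarer bases are treated as probable read errors (an assumption).
    alleles = {base for base, n in counts.items() if n / depth >= min_fraction}
    ref = reference_base.upper()

    if alleles == {ref}:
        return "matches reference"
    if len(alleles) == 1:
        # Reads agree with each other but differ from the reference.
        return "homozygous SNP candidate"
    if len(alleles) == 2:
        # Reads disagree with each other at this position.
        return "heterozygous SNP candidate"
    return "ambiguous"


# Hypothetical pileups at a position whose reference base is "A".
all_alt = classify_site("GGGGGGGG", "A")
mixed = classify_site("AAAAGGGG", "A")
one_error = classify_site("AAAAAAAAAG", "A")
```

In this sketch, a single discordant base among ten reads falls below the threshold and is ignored. Production callers typically weigh base and mapping qualities statistically rather than applying a fixed cutoff.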
The demand for high-efficiency, low-cost DNA sequencing has increased in recent years. Although NGS technologies have dramatically improved upon first-generation technologies, the highly parallelized nature of NGS techniques has presented challenges not encountered in earlier sequencing technologies. Errors in the read process can adversely impact the alignment of the resulting read data, and can subsequently lead to inaccurate sequence determinations. Furthermore, read errors can lead to erroneous detection of variants.
Currently, there are different approaches to discovering genetic variation from next-generation sequencing data. They fall broadly into two categories: (1) mapping-based approaches, which rely on a sophisticated aligner to place the reads properly on the genome (e.g., Samtools and FreeBayes), and (2) assembly-based approaches, which attempt to discover new haplotypes in the reads by assembling them into various types of graphs (e.g., HaplotypeCaller and Platypus). Because none of the current approaches is capable of resolving such data in sufficient detail, there exists a need for a method of identifying and quantifying genomic data.
A more comprehensive and accurate understanding of both the human genome as a whole and the genomes of individuals will improve medical diagnoses and treatment. NGS technologies have reduced the time and cost of sequencing an individual's genome, which provides the potential for significant improvements to medicine and genetics in ways that were previously not feasible. Understanding genetic variation among humans provides a framework for understanding genetic disorders and Mendelian diseases. However, discovering these genetic variations depends upon reliable read data and accurate read sequence alignment.