There is a need for improved methods for determining the parental contribution to the genomes of higher organisms, i.e., haplotype phasing of genomes. Methods for haplotype phasing, including computational methods and experimental phasing, are reviewed in Browning and Browning, Nature Reviews Genetics 12:703-714, 2011.
Most mammals, including humans, are diploid, with half of the homologous chromosomes being derived from each parent. Many plants have genomes that are polyploid. For example, wheat species (Triticum spp.) have a ploidy ranging from diploid (Einkorn wheat) to tetraploid (emmer and durum wheat) to hexaploid (spelt wheat and common wheat [T. aestivum]).
The context in which variations occur on each individual chromosome can have profound effects on the expression and regulation of genes and other transcribed regions of the genome. Further, determining if two potentially detrimental mutations occur within one or both alleles of a gene is of paramount clinical importance. For plant species, knowledge of the parental genetic contribution is important for breeding progeny with desirable traits.
Current methods for whole-genome sequencing lack the ability to separately assemble parental chromosomes in a cost-effective way and describe the context (haplotypes) in which variations co-occur. Simulation experiments show that chromosome-level haplotyping requires allele linkage information across a range of at least 70-100 kb. This cannot be achieved with existing technologies that use amplified DNA, which are limited to reads of less than 1000 bases due to difficulties in uniform amplification of long DNA molecules and loss of linkage information in sequencing. Mate-pair technologies can provide an equivalent to the extended read length but are limited to less than 10 kb due to inefficiencies in making such DNA libraries (due to the difficulty of circularizing DNA longer than a few kb in length). This approach also needs extreme read coverage to link all heterozygotes.
Single molecule sequencing of greater than 100 kb DNA fragments would be useful for haplotyping if processing such long molecules were feasible, if the accuracy of single molecule sequencing were high, and detection/instrument costs were low. This is very difficult to achieve on short molecules with high yield, let alone on 100 kb fragments.
Most recent human genome sequencing has been performed on short read-length (<200 bp), highly parallelized systems starting with hundreds of nanograms of DNA. These technologies are excellent at generating large volumes of data quickly and economically. Unfortunately, short reads, often paired with small mate-gap sizes (500 bp-10 kb), eliminate most SNP phase information beyond a few kilobases (McKernan et al., Genome Res. 19:1527, 2009). Furthermore, it is very difficult to maintain long DNA fragments in multiple processing steps without fragmenting as a result of shearing.
At the present time, four personal genomes, those of J. Craig Venter (Levy et al., PLoS Biol. 5:e254, 2007), a Gujarati Indian (HapMap sample NA20847; Kitzman et al., Nat. Biotechnol. 29:59, 2011), and two Europeans (Max Planck One [MP1]; Suk et al., Genome Res., 2011; http://genome.cshlp.org/content/early/2011/09/02/gr.125047.111.full.pdf; and HapMap sample NA12878; Duitama et al., Nucl. Acids Res. 40:2041-2053, 2012), have been sequenced and assembled as diploid. All have involved cloning long DNA fragments into constructs in a process similar to the bacterial artificial chromosome (BAC) sequencing used during construction of the human reference genome (Venter et al., Science 291:1304, 2001; Lander et al., Nature 409:860, 2001). While these processes generate long phased contigs (N50s of 350 kb [Levy et al., PLoS Biol. 5:e254, 2007], 386 kb [Kitzman et al., Nat. Biotechnol. 29:59-63, 2011] and 1 Mb [Suk et al., Genome Res. 21:1672-1685, 2011]), they require a large amount of initial DNA and extensive library processing, and are too expensive to use in a routine clinical environment.
Additionally, whole chromosome haplotyping has been demonstrated through direct isolation of metaphase chromosomes (Zhang et al., Nat. Genet. 38:382-387, 2006; Ma et al., Nat. Methods 7:299-301, 2010; Fan et al., Nat. Biotechnol. 29:51-57, 2011; Yang et al., Proc. Natl. Acad. Sci. USA 108:12-17, 2011). These methods are useful for long-range haplotyping but have yet to be used for whole-genome sequencing; they require preparation and isolation of whole metaphase chromosomes, which can be challenging for some clinical samples.
There is also a need for improved methods for obtaining sequence information from mixtures of organisms such as in metagenomics (e.g., gut bacteria or other microbiomes). There is also a need for improved methods for genome sequencing and assembly, including de novo assembly (with no or minimal use of a reference sequence), or assembly of genomes that include various types of repeat sequences, including resolution of pseudogenes, copy number variations and structural variations, especially in cancer genomes.
We have described long fragment read (LFR) methods that enable an accurate assembly of separate sequences of parental chromosomes (i.e., complete haplotyping) in diploid genomes at significantly reduced experimental and computational costs and without cloning into vectors and cell-based replication. LFR is based on the physical separation of long fragments of genomic DNA (or other nucleic acids) across many different aliquots such that there is a low probability of any given region of the genome of both the maternal and paternal component being represented in the same aliquot. By placing a unique identifier in each aliquot and analyzing many aliquots in the aggregate, DNA sequence data can be assembled into a diploid genome, e.g., the sequence of each parental chromosome can be determined. LFR does not require cloning fragments of a complex nucleic acid into a vector, as in haplotyping approaches using large-fragment (e.g., BAC) libraries. Nor does LFR require direct isolation of individual chromosomes of an organism. In addition, LFR can be performed on an individual organism and does not require a population of the organism in order to accomplish haplotype phasing. LFR methods have been described in U.S. patent application Ser. Nos. 12/329,365 and 13/447,087, U.S. Pat. Publications US 2011-0033854 and 2009-0176234, and U.S. Pat. Nos. 7,901,890, 7,897,344, 7,906,285, 7,901,891, and 7,709,197, all of which are hereby incorporated by reference in their entirety.
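The low probability of maternal and paternal fragments from the same region sharing an aliquot can be illustrated with a rough calculation. The Python sketch below is an illustration only, not the procedure of the cited applications. It assumes that fragments are distributed independently and uniformly across aliquots, so that the number of fragments from one parental haplotype covering a given locus in one aliquot is approximately Poisson distributed. The function names and the example values (384 aliquots, 20 diploid cell equivalents of input DNA) are hypothetical choices made for the example.

```python
import math


def prob_haplotype_present(coverage: float) -> float:
    """P(at least one fragment of ONE parental haplotype covers a locus
    in a single aliquot), assuming Poisson-distributed fragment counts
    with mean `coverage` (assumption, not a measured quantity)."""
    return 1.0 - math.exp(-coverage)


def prob_both_haplotypes(coverage: float) -> float:
    """P(maternal AND paternal fragments of the same locus share an
    aliquot), treating the two haplotypes as independent."""
    p = prob_haplotype_present(coverage)
    return p * p


def prob_exactly_one_haplotype(coverage: float) -> float:
    """P(exactly one of the two haplotypes covers the locus in an
    aliquot); these are the aliquots that carry clean phase information."""
    p = prob_haplotype_present(coverage)
    return 2.0 * p * (1.0 - p)


if __name__ == "__main__":
    # Hypothetical example values, chosen only for illustration.
    n_aliquots = 384
    cell_equivalents = 20  # each diploid cell contributes one copy per haplotype

    # Mean per-haplotype coverage of a locus in one aliquot.
    coverage = cell_equivalents / n_aliquots

    both = prob_both_haplotypes(coverage)
    one = prob_exactly_one_haplotype(coverage)
    print(f"per-haplotype coverage per aliquot: {coverage:.4f}")
    print(f"P(both haplotypes in one aliquot):  {both:.5f}")
    print(f"P(exactly one haplotype):           {one:.5f}")
    print(f"mixed fraction of occupied aliquots: {both / (both + one):.4f}")
```

Under these assumptions, lowering the amount of DNA per aliquot or raising the number of aliquots reduces the per-aliquot coverage, and the probability of a mixed aliquot falls roughly with the square of that coverage.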