The present disclosure is in the field of DNA amplification and sequencing, and microfluidic processing devices. Ideally, genome sequencing can deliver the genome and the epigenome sequence of a single cell with 100% accuracy and end-to-end contiguity at low cost.
It is not clear whether current nanopore sequencing technologies can deliver the read lengths, accuracy and throughput needed for rapid de novo genome sequencing at low cost (Branton et al. 2008; Cherf et al. 2012; Clarke et al. 2009; Kumar et al. 2012; Manrao et al. 2012; McNally et al. 2010; Wallace et al. 2010; Wanunu 2012). Due to the inefficiency in capturing DNA into nanopores (Branton et al. 2008; Wanunu et al. 2010), it would not be feasible to sequence the genome of a single cell without some sample preparation, including fragmentation and amplification. The current generation of sequencers, which are mostly based on sequencing by synthesis using DNA polymerases, is remarkable in terms of sequencing throughput and accuracy (e.g. close to 1 trillion bases per run with 99.9% raw accuracy for most reads for the Illumina HiSeq 2500) despite the relatively short reads (a few hundred bases or much shorter). The per-base sequencing cost has also been brought down drastically at a rapid pace. However, many technical challenges remain to be overcome to achieve the genome sequence quality, in terms of per-base accuracy, contiguity of the assembly and complete phasing of haplotypes, that is required for personalized medicine (Baker et al. 2012; Marx 2013).
First, the assembly of genomes with highly repetitive sequences using short reads (a few hundred bases or shorter) produced by these high-throughput sequencers is extremely challenging (Baker et al. 2012; Bradnam et al. 2013; Li et al. 2010; Marx 2013; Salzberg et al. 2012; Treangen et al. 2012). De novo sequencing and assembly of diploid genomes with full haplotype resolution is even more difficult. Second, the accuracy that can be achieved with current sequencing technologies is still relatively low (a consensus error rate of 1 error in 10 million is the best reported (Peters et al. 2012)). Sequencing errors are primarily due to limitations of the sequencing chemistry, which at best has a raw read accuracy of 99.9% (i.e. an error rate of 10⁻³), and to errors introduced by the sample preparation process, in particular DNA amplification by DNA polymerases, which usually have an error rate no better than 10⁻⁶.
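To put the error rates above in perspective, the following minimal sketch converts each per-base error rate into an expected number of erroneous bases per genome. The diploid genome size of about 6 billion bases is an illustrative assumption that is not stated in this disclosure, as are the function and variable names and the simplification that errors are independent and uniformly distributed.

```python
def expected_errors(genome_bases: int, error_rate: float) -> float:
    """Expected number of erroneous bases, assuming independent per-base errors."""
    return genome_bases * error_rate


# Assumption (not from the text): a diploid human genome of roughly 6e9 bases.
DIPLOID_HUMAN_BASES = 6_000_000_000

# Per-base error rates quoted in the paragraph above.
ERROR_RATES = {
    "raw read (99.9% accuracy)": 1e-3,
    "polymerase amplification (best case)": 1e-6,
    "best reported consensus (1 in 10 million)": 1e-7,
}

for label, rate in ERROR_RATES.items():
    errors = expected_errors(DIPLOID_HUMAN_BASES, rate)
    print(f"{label}: about {errors:,.0f} expected errors per genome")
```

Under these assumptions, even the best reported consensus error rate would leave on the order of hundreds of erroneous bases in a single diploid genome.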
Single-cell de novo genome sequencing is even more challenging because the current technologies require DNA input from the equivalent of many cells (20-10,000 depending on the platform) (Kalisky et al. 2011). Yet the ability to sequence the genome of single cells has very important applications in basic biomedical research and even greater impact on the application of genome sequencing in clinical practice (Kalisky et al. 2011). For example, it allows for the comprehensive characterization of the cellular heterogeneity that underlies normal cellular differentiation and diseases such as cancer (Ma et al. 2012; Navin et al. 2011; Navin et al. 2011; Potter et al. 2013; Powell et al. 2012), the very early detection of cancer using circulating tumor cells or fine needle biopsies, mutation detection (Lu et al. 2012; Wang et al. 2012), and genetic screening by whole-genome sequencing of single cells extracted from early-stage human embryos prior to implantation in IVF clinics (Lorthongpanich et al. 2013; Martin et al. 2013; Zhang et al. 2013). In the latter case, only one or very few cells are available, and sequencing and haplotype accuracy is paramount as the results will directly impact the life of a newborn. Genetic defects in both alleles of the maternal and paternal chromosomes need to be identified with the utmost accuracy.
Before de novo single-cell genome sequencing, the genomic DNA can be amplified. Ideally, the method used amplifies the entire genome from a cell with complete coverage and very little bias. Few technologies are available for this purpose. The commonly used MDA (Multiple Displacement Amplification) method (Dean et al. 2002; Lage et al. 2003) usually results in very large bias in coverage, with up to four orders of magnitude of variation, and frequent dropout of certain sequences. MALBAC (Multiple Annealing and Looping Based Amplification Cycles) (Lu et al. 2012; Zong et al. 2012) and MIDAS (MIcrowell Displacement Amplification System) (Gole et al. Nature Biotech. In press) for whole-genome amplification of single cells are better (Fan et al. 2011; Gole et al. Nature Biotech. In press; Zong et al. 2012), but they still have problematic limitations in terms of sequence coverage and bias, and amplification errors (mutations and creation of chimeras). These limitations result in incomplete assembly, waste, and greater sequencing cost (by one or more orders of magnitude), since many-fold coverage is required to capture the low-abundance sequences. Numerous mutations and chimeras also lead to assembly and sequencing errors (Lasken et al. 2007; Voet et al. 2013). In addition, none of these technologies offers mechanisms for resolving haplotypes.
These technologies were derived, at least conceptually, from the seminal rolling circle amplification (RCA) technology (Lizardi et al. 1998). Amplification by RCA is essentially error-free because the same original circular DNA template is repeatedly copied through a rolling circle strand-displacement mechanism from a single primer using a high-fidelity DNA polymerase. We have developed a method for sequence- and length-independent linear DNA amplification using nicking endonuclease-mediated strand displacement amplification (Joneja et al. 2011). The use of nicking endonucleases is not ideal since there are many recognition sequences in the genome. Long Range Strand Displacement Amplification (LR-SDA) technology, described herein, is designed to overcome the limitations described above by using a unique mechanism. LR-SDA is radically different from other methods in that free primers are removed from the reaction solution and no free 3′ ends are produced in the process, preventing chimera formation. LR-SDA enables essentially error-free amplification of DNA in very long overlapping fragments, which facilitates the accurate sequencing and haplotyping of genome sequences.
A new generation of sequencing technologies has enabled DNA sequencing at unprecedented throughput and accuracy, and has also drastically brought down the per-base sequencing cost. What is needed is the ability to acquire contiguity information to phase haplotypes and assemble genomes de novo, and to improve the consensus read accuracy to the point that a genome can be sequenced with a complete, error-free, end-to-end assembly.