Advances in DNA sequencing hold the promise to standardize and develop non-invasive molecular diagnosis to improve prenatal care, transplantation efficacy, cancer and other disease detection and individualized treatment. Currently, patients with predisposing or early disease are not identified, and those with disease are not given the best treatment—all because of failures at the diagnostic level.
In the cancer field, there is a need to develop such technology for early detection, guiding therapy, and monitoring for recurrence—all from a blood sample. This includes the need to develop: (i) high sensitivity detection of single base mutation, small insertion, and small deletion mutations in known genes (when present at 1% to 0.01% of cell-free DNA); (ii) high sensitivity detection of promoter hypermethylation and hypomethylation (when present at 1% to 0.01% of cell-free DNA); (iii) accurate quantification of tumor-specific mRNA, lncRNA, and miRNA isolated from tumor-derived exosomes or RISC complex, or circulating tumor cells in blood; (iv) accurate quantification of tumor-specific copy changes in DNA isolated from circulating tumor cells; (v) accurate quantification of mutations, promoter hypermethylation and hypomethylation in DNA isolated from circulating tumor cells. All these (except quantification of tumor-specific copy changes in DNA isolated from circulating tumor cells) require focusing the sequencing on targeted genes or regions of the genome. Further, determination of the sequence information or methylation status from both strands of the original fragment provides critically needed confirmation of rare events.
Normal plasma contains nucleic acids released from normal cells undergoing normal physiological processes (e.g. exosome release, apoptosis). There may be additional release of nucleic acids under conditions of stress, inflammation, infection, or injury. In general, DNA released from apoptotic cells is an average of 160 bp in length, while DNA from fetal cells is an average of about 140 bp. Plasma from a cancer patient contains nucleic acids released from cancer cells undergoing abnormal physiological processes, as well as nucleic acids within circulating tumor cells (CTCs). Likewise, plasma from a pregnant woman contains nucleic acids released from fetal cells.
There are a number of challenges for developing reliable diagnostic and screening tests. The first challenge is to distinguish those markers emanating from the tumor or fetus that are indicative of disease (e.g. early cancer) from the same markers emanating from normal tissue. There is also a need to balance the number of markers examined and the cost of the test against the specificity and sensitivity of the assay. This challenge must account for the biological variation in diseases such as cancer. In many cases the assay should serve as a screening tool, requiring the availability of secondary diagnostic follow-up (e.g. colonoscopy, amniocentesis). Compounding the biological problem is the need to reliably detect nucleic acid sequence mutation or promoter methylation differences, or reliably quantify DNA or RNA copy number, either from a very small number of initial cells (e.g. from CTCs) or when the cancer-specific or fetus-specific signal is in the presence of a majority of nucleic acid emanating from normal cells. Finally, there is the technical challenge of distinguishing true signal resulting from detecting the desired disease-specific nucleic acid differences from false signal generated from normal nucleic acids present in the sample and from false signal generated in the absence of the disease-specific nucleic acid differences.
By way of an example, consider the challenge of detecting, in plasma, the presence of circulating tumor DNA harboring a mutation in the p53 gene or a methylated promoter region. Such a sample will contain a majority of cell-free DNA arising from normal cells, where the tumor DNA may only comprise 0.01% of the total cell-free DNA. Thus, if one were to attempt to find the presence of such mutant DNA by total sequencing, one would need to sequence 100,000 genomes to identify 10 genomes harboring the mutations. This would require sequencing 300,000 GB of DNA, a task beyond the reach of current sequencing technology, not to mention the enormous data-management issues. To circumvent this problem, many groups have attempted to capture specific target regions or to PCR amplify the regions in question. Sequence capture has suffered from dropout, such that perhaps 90-95% of the desired sequences are captured, but some desired fragments are missing. Alternatively, PCR amplification carries the risk of introducing a rare error that is indistinguishable from a true mutation. Further, PCR loses methylation information. While bisulfite treatment has traditionally been used to determine the presence of promoter methylation, it is also destructive of the DNA sample and lacks the ability to identify multiple methylation changes in cell-free DNA.
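The arithmetic behind the whole-genome estimate above can be sketched as follows. This is an illustrative calculation only; the function names are hypothetical, and the genome size of 3 gigabases is an assumed round figure inferred from the totals quoted in the text.

```python
# Back-of-the-envelope estimate of the total-sequencing burden.
# ASSUMPTION: ~3 gigabases per human genome (round figure implied by the text).
GENOME_SIZE_GB = 3


def genomes_needed(tumor_fraction: float, mutant_genomes_wanted: int) -> int:
    """Genomes that must be sequenced to expect the requested number of mutant genomes."""
    return round(mutant_genomes_wanted / tumor_fraction)


def total_gigabases(n_genomes: int, genome_size_gb: float = GENOME_SIZE_GB) -> float:
    """Total sequence, in gigabases, for the given number of genomes."""
    return n_genomes * genome_size_gb


if __name__ == "__main__":
    # Tumor DNA at 0.01% of cell-free DNA, aiming to observe 10 mutant genomes.
    n = genomes_needed(tumor_fraction=0.0001, mutant_genomes_wanted=10)
    print(n, total_gigabases(n))
```

Under these assumptions the sketch reproduces the figures in the text: 100,000 genomes and 300,000 gigabases.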
There are a number of different approaches for reducing error rate and improving the accuracy of sequencing runs. A consensus accuracy may be achieved in the presence of high error rates by sequencing the same region of DNA over and over again. However, a high error rate makes it extremely difficult to identify a sequence variant in low abundance, for example when trying to identify a cancer mutation in the presence of normal DNA. Therefore, a low error rate is required to detect a mutation in relatively low abundance.
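A rough sketch of why the raw error rate matters for low-abundance variants is given below. It assumes a simplified model in which every erroneous read at a position is counted against the variant, regardless of which wrong base was called; the function name and the model are illustrative and are not taken from any of the cited methods.

```python
def expected_read_counts(depth: int, variant_fraction: float, error_rate: float):
    """Expected true-variant reads and erroneous reads at a single position.

    ASSUMPTION: simplified model; all errors at the position are pooled,
    and errors occur only on reads from non-variant molecules.
    """
    true_reads = depth * variant_fraction
    error_reads = depth * (1 - variant_fraction) * error_rate
    return true_reads, error_reads


if __name__ == "__main__":
    # Variant at 0.01% abundance, read 100,000 times, at two raw error rates.
    for rate in (1e-2, 7.6e-6):
        true_reads, error_reads = expected_read_counts(100_000, 0.0001, rate)
        print(f"error rate {rate:g}: true={true_reads:.2f} errors={error_reads:.2f}")
```

With a 1% error rate the expected erroneous reads outnumber the true variant reads by roughly a hundredfold in this model, whereas at an error rate on the order of 10⁻⁶ the true reads dominate.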
The first approach, termed tagged-amplicon deep sequencing (TAm-Seq) (Forshew et al., “Noninvasive Identification and Monitoring of Cancer Mutations by Targeted Deep Sequencing of Plasma DNA,” Sci Transl Med. 4(136):136 (2012)), is based on designing primers to amplify 5995 bases that cover select regions of cancer-related genes, including TP53, EGFR, BRAF, and KRAS. This approach is able to identify mutations in the p53 gene at frequencies of 2% to 65%. In this approach, primers are designed to pre-amplify the DNA (for 15 cycles) in a multiplexed reaction with many PCR primers. This creates both desired and undesired products, so it is followed by single-plex PCR to further amplify each of the desired products. The fragments are subjected to a final barcoding PCR prior to standard next-generation sequencing. The advantage of this approach is that it uses time-tested multiplexed PCR-PCR, which is unparalleled for amplification of low numbers of starting nucleic acids. The disadvantage is that this approach is unable to distinguish a true mutation from a PCR error introduced in the early rounds of amplification. Thus, while the sensitivity of 2% (i.e. detecting one mutant allele in 50 wt alleles) is sufficient for evaluating late-stage cancers prior to making a treatment decision, it is not sensitive enough for early detection.
A variation of the first approach is termed Safe-Sequencing System “Safe-SeqS” (Kinde et al., “Detection and Quantification of Rare Mutations with Massively Parallel Sequencing,” Proc Natl Acad Sci USA 108(23):9530-5 (2011)), where linkers are ligated onto the ends of randomly sheared genomic DNA. The approach demonstrated that the vast majority of mutations described from genomic sequencing are actually errors, and reduced presumptive sequencing errors by at least 70-fold. Likewise, an approach called ultrasensitive deep sequencing (Narayan et al., “Ultrasensitive Measurement of Hotspot Mutations in Tumor DNA in Blood Using Error-suppressed Multiplexed Deep Sequencing,” Cancer Res. 72(14):3492-8 (2012)) appends barcodes onto primers for a nested PCR amplification. Presumably, a similar system of appending barcodes, which depends on bioinformatics tools, was developed to detect rare mutations and copy number variations (Talasaz, A.; Systems and Methods to Detect Rare Mutations and Copy Number Variation, US Patent Application US 2014/0066317 A1, Mar. 6, 2014). Paired-end reads are used to cover the region containing the presumptive mutation. This method was used to track known mutations in plasma of patients with late-stage cancer. These approaches require many reads to establish consensus sequences. These methods require extending across the target DNA, and thus it would be impossible to distinguish a true mutation from a polymerase-generated error, especially when copying across a damaged base, such as deaminated cytosine. Finally, these methods do not provide information on methylation status of CpG sites within the fragment.
The second approach, termed Duplex sequencing (Schmitt et al., “Detection of Ultra-Rare Mutations by Next-Generation Sequencing,” Proc Natl Acad Sci USA 109(36):14508-13 (2012)), is based on using duplex linkers containing 12-base randomized tags. By amplifying both top and bottom strands of input target DNA, a given fragment obtains a unique identifier (composed of 12 bases on each end) such that it may be tracked via sequencing. Sequence reads sharing a unique set of tags are grouped into paired families with members having strand identifiers in either the top-strand or bottom-strand orientation. Each family pair reflects the amplification of one double-stranded DNA fragment. Mutations present in only one or a few family members represent sequencing mistakes or PCR-introduced errors occurring late in amplification. Mutations occurring in many or all members of one family in a pair arise from PCR errors during the first round of amplification, such as might occur when copying across sites of mutagenic DNA damage. On the other hand, true mutations present on both strands of a DNA fragment appear in all members of a family pair. Whereas artifactual mutations may co-occur in a family pair with a true mutation, all except those arising during the first round of PCR amplification can be independently identified and discounted when producing an error-corrected single-strand consensus sequence. The sequences obtained from each of the two strands of an individual DNA duplex can then be compared to obtain the duplex consensus sequence, which eliminates remaining errors that occurred during the first round of PCR. The advantage of this approach is that it unambiguously distinguishes true mutations from PCR errors or from mutagenic DNA damage, and achieves an extraordinarily low error rate of 3.8×10⁻¹⁰. The disadvantage of this approach is that many fragments need to be sequenced in order to get at least five members of each strand in a family pair (i.e. minimum of 10 sequence reads per original fragment, but often requiring far more due to fluctuations). Further, the method has not been tested on cfDNA, which tends to be smaller than fragments generated from intact genomic DNA, and thus would require sequencing more fragments to cover all potential mutations. Finally, the method does not provide information on methylation status of CpG sites within the fragment.
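The family-pair logic described above can be illustrated with a minimal sketch. It is not the published Duplex sequencing pipeline. It assumes reads are already aligned, of equal length, and expressed in top-strand orientation. The function names, the `(tag, strand, sequence)` input layout, and the agreement threshold are hypothetical choices made for illustration.

```python
from collections import Counter, defaultdict


def single_strand_consensus(reads, min_family_size=5, min_agreement=0.6):
    """Collapse one strand family into a consensus; 'N' where agreement is too low.

    ASSUMPTION: reads are pre-aligned and equal in length. The default family
    size follows the five-member figure in the text; the agreement threshold
    is an arbitrary illustrative value.
    """
    if len(reads) < min_family_size:
        return None  # family too small to trust
    consensus = []
    for column in zip(*reads):
        base, count = Counter(column).most_common(1)[0]
        consensus.append(base if count / len(column) >= min_agreement else "N")
    return "".join(consensus)


def duplex_consensus(top, bottom):
    """Keep a base only where both single-strand consensuses agree."""
    return "".join(a if a == b and a != "N" else "N" for a, b in zip(top, bottom))


def call_duplex(tagged_reads, min_family_size=5, min_agreement=0.6):
    """Group (tag, strand, sequence) records by tag and build duplex consensuses."""
    families = defaultdict(lambda: {"top": [], "bottom": []})
    for tag, strand, seq in tagged_reads:
        families[tag][strand].append(seq)
    results = {}
    for tag, fam in families.items():
        top = single_strand_consensus(fam["top"], min_family_size, min_agreement)
        bottom = single_strand_consensus(fam["bottom"], min_family_size, min_agreement)
        if top is not None and bottom is not None:
            results[tag] = duplex_consensus(top, bottom)
    return results


if __name__ == "__main__":
    reads = (
        # Late error in a single top-strand read: removed at the single-strand step.
        [("t1", "top", s) for s in ("ACGT", "ACGT", "ACTT")]
        + [("t1", "bottom", "ACGT")] * 3
        # First-round PCR error on the top strand only: masked at the duplex step.
        + [("t2", "top", "ATGT")] * 3
        + [("t2", "bottom", "ACGT")] * 3
        # True mutation present on both strands: retained.
        + [("t3", "top", "ACGA")] * 3
        + [("t3", "bottom", "ACGA")] * 3
    )
    print(call_duplex(reads, min_family_size=3))
```

The example uses families of three reads per strand purely to keep the input short; the cost discussed in the text comes from needing enough reads per strand for every original fragment.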
The third approach, termed smMIP for Single molecule molecular inversion probes (Hiatt et al., “Single Molecule Molecular Inversion Probes for Targeted, High-Accuracy Detection of Low-Frequency Variation,” Genome Res. 23(5):843-54 (2013)), combines single molecule tagging with multiplex capture to enable highly sensitive detection of low-frequency subclonal variation. The method claims an error rate of 2.6×10⁻⁵ in clinical specimens. The disadvantage of this approach is that many fragments need to be sequenced in order to get at least five members of each strand in a family pair (i.e. minimum of 10 sequence reads per original fragment, but often requiring far more due to fluctuations). Also, the method requires extending across the target DNA, and thus it would be impossible to distinguish a true mutation from a polymerase-generated error, especially when copying across a damaged base, such as deaminated cytosine. Further, the method has not been tested on cfDNA, which tends to be smaller than fragments generated from intact genomic DNA, and thus would require sequencing more fragments to cover all potential mutations. Finally, the method does not provide information on methylation status of CpG sites within the fragment.
The fourth approach, termed circle sequencing (Lou et al., “High-throughput DNA Sequencing Errors are Reduced by Orders of Magnitude Using Circle Sequencing,” Proc Natl Acad Sci USA 110(49):19872-7 (2013); see also Acevedo et al., “Mutational and Fitness Landscapes of an RNA Virus Revealed Through Population Sequencing,” Nature 505(7485):686-90 (2014); and Acevedo and Andino, “Library Preparation for Highly Accurate Population Sequencing of RNA Viruses,” Nat Protoc. 9(7):1760-9 (2014)), is based on shearing DNA or RNA to about 150 bases, denaturing to form single strands, circularizing those single strands, using random hexamer primers and phi29 DNA polymerase for rolling circle amplification (in the presence of Uracil-DNA glycosylase and Formamidopyrimidine-DNA glycosylase), re-shearing the products to about 500 bases, and then proceeding with standard next-generation sequencing. The advantage of this approach is that the rolling circle amplification makes multiple tandem copies of the original target DNA, such that a polymerase error may appear in only one copy, but a true mutation appears in all copies. The read families average 3 copies in size because the copies are physically linked to each other. The method also uses Uracil-DNA glycosylase and Formamidopyrimidine-DNA glycosylase to remove targets containing damaged bases, to eliminate such errors. The advantage of this technology is that it takes the sequencing error rate from a current level of about 0.1 to 1×10⁻², to a rate as low as 7.6×10⁻⁶. The latter error rate is now sufficient to distinguish cancer mutations in plasma in the presence of 100 to 10,000-fold excess of wild-type DNA. A further advantage is that 2-3 copies of the same sequence are physically linked, allowing for verification of a true mutation from sequence data generated from a single fragment, as opposed to at least 10 fragments using the Duplex sequencing approach.
However, the method does not provide the ability to determine copy number changes, nor provide information on methylation status of CpG sites within the fragment.
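The idea that a polymerase error appears in only one tandem copy while a true mutation appears in all copies can be sketched as a per-position majority vote across the linked copies. This is a simplified illustration, not the published circle sequencing software. It assumes the repeat unit length is already known and that the read starts at a copy boundary. The function name is hypothetical.

```python
from collections import Counter


def tandem_consensus(read: str, unit_length: int, min_copies: int = 2):
    """Majority-vote consensus across tandem copies contained in one read.

    ASSUMPTION: unit_length is known and copies are in register. In practice
    the repeat period must be inferred from the read itself.
    """
    copies = [
        read[i:i + unit_length]
        for i in range(0, len(read) - unit_length + 1, unit_length)
    ]
    if len(copies) < min_copies:
        return None  # a single copy cannot be cross-checked
    consensus = []
    for column in zip(*copies):
        base, count = Counter(column).most_common(1)[0]
        # Require a strict majority; otherwise mask the position.
        consensus.append(base if count > len(column) / 2 else "N")
    return "".join(consensus)


if __name__ == "__main__":
    # Three linked copies; the third carries a copying error at position 2.
    print(tandem_consensus("ACGTAC" + "ACGTAC" + "ACTTAC", unit_length=6))
    # Two copies that disagree: the position is masked rather than called.
    print(tandem_consensus("ACGTAC" + "ACTTAC", unit_length=6))
```

With only two copies a disagreement can be flagged but not resolved, which is one reason the number of linked copies per read matters.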
The fifth approach, developed by Complete Genomics (Drmanac et al., “Human Genome Sequencing Using Unchained Base Reads on Self-Assembling DNA Nanoarrays,” Science 327(5961):78-81 (2010)) is based on using ligation reads on nanoball arrays. About 400 nucleotides of genomic DNA are circularized with linkers, cleaved, recircularized with additional linkers, and ultimately recircularized to contain about four linkers. The DNA undergoes rolling circle amplification using phi 29 DNA polymerase to generate nanoballs. These are then placed onto an array, and sequenced using a ligation-based approach. The salient point of this approach, of relevance herein, is that multiple tandem copies of the same sequence may be generated and subsequently sequenced off a single rolling circle amplification product. Since the same sequence is interrogated multiple times by either ligase or polymerase (by combining rolling circle with other sequencing by synthesis approaches), the error rate per base may be significantly reduced. As such, sequencing directly off a rolling circle product provides many of the same advantages of the circle sequencing approach described above.
The sixth approach, termed SMRT—single molecule real time-sequencing (Flusberg et al., “Direct Detection of DNA Methylation During Single-Molecule, Real-Time Sequencing,” Nat Methods 7(6):461-5 (2010)) is based on adding hairpin loops onto the ends of a DNA fragment, and allowing a DNA polymerase with strand-displacement activity to extend around the covalently closed loop, providing sequence information on the two complementary strands. Specifically, single molecules of polymerase catalyze the incorporation of fluorescently labeled nucleotides into complementary nucleic acid strands. The polymerase slows down or “stutters” when incorporating a nucleotide opposite a methylated base, and the resulting fluorescence pulses allow direct detection of modified nucleotides in the DNA template, including N6-methyladenine, 5-methylcytosine and 5-hydroxymethylcytosine. The accuracy of the approach has improved, especially as the polymerase may traverse around the closed loop several times, allowing for determination of a consensus sequence. Although the technique is designed to provide sequence information on “dumbbell” shaped substrates (containing mostly the two complementary sequences of a linear fragment of DNA), it may also be applied to single-stranded circular substrates.
The present invention is directed at overcoming these and other deficiencies in the art.