Amplicon-based targeted sequencing
Next Generation Sequencing (NGS) has been an active area of focus for a large number of organizations. Commercial corporations and Research and Development (R&D) outfits perform NGS of tumor samples in order to determine the presence of genetic/genomic alterations in the DNA or RNA of patient samples. A key application of interest is the determination of somatic alterations in tumor biopsy samples from cancer patients.
Such alterations can be used to determine the tumor type and disease aggressiveness, and have been shown to correlate with the patient's clinical response to different therapies. In some cases, the efficacy of existing therapies is directly linked to the presence of specific alterations such as Kirsten Rat Sarcoma (KRAS) and Epidermal Growth Factor Receptor (EGFR) mutations. In general, somatic mutation detection is effectively used by physicians for therapy selection, prognosis and diagnosis.
Targeted sequencing for somatic mutation detection refers to the selection of only certain portions of the genome that are to be sequenced. This is often achieved by over-amplifying certain portions of the genome, typically consisting of a finite number of contiguous sequences from 70 to 200 bases in length. These sequences are termed amplicons. There may be hundreds to thousands of amplicons assembled as part of an amplicon panel that covers the genes important to a certain type of cancer.
The advantage of amplicon sequencing is the ability to sequence at a higher depth, for a lower price, by concentrating on regions of the genome where alterations are likely to occur. Organizations offering targeted-sequencing-based somatic mutation detection on a commercial scale include Foundation Medicine, and cancer center sequencing labs at outfits such as MD Anderson, Cleveland Clinic, and Stanford Cancer Center.
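The depth advantage can be illustrated with a back-of-the-envelope calculation. The sketch below assumes, purely for illustration, that every read lands on target and that coverage is uniform. The read count, read length and genome size are hypothetical inputs, and the 40 kb panel size is borrowed from the typical panel size mentioned later in this section.

```python
def mean_depth(num_reads, read_length_bp, target_size_bp):
    """Average depth = total sequenced bases / size of the targeted region.

    Assumes all reads are on target and spread uniformly (an idealization).
    """
    return num_reads * read_length_bp / target_size_bp


# Hypothetical sequencing run: 10 million reads of 150 bp each.
READS, READ_LEN = 10_000_000, 150

panel_depth = mean_depth(READS, READ_LEN, 40_000)          # 40 kb amplicon panel
genome_depth = mean_depth(READS, READ_LEN, 3_100_000_000)  # ~3.1 Gb whole genome

print(f"panel:  {panel_depth:,.0f}x")
print(f"genome: {genome_depth:.2f}x")
```

Under these assumptions the same sequencing output that covers a whole genome at less than 1× covers the panel at tens of thousands of reads per position.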
There are two important limitations to both targeted sequencing and other sequencing for the determination of somatic mutations/alterations:
(1) Insufficient Availability of Tissue
This type of sequencing requires a tumor biopsy. Traditional biopsy procedures often carry significant risks and loss of quality of life for the patient, and can only be performed a few times during the disease progression cycle. If a sample is compromised for any reason, it is often impossible to obtain a second tissue sample from the same patient. Furthermore, in some cases, a traditional biopsy is not feasible because the tumor is located in an inaccessible region.
(2) Low Tumor Content
While the introduction of Fine Needle Aspirate (FNA) procedures has reduced the risks and discomfort associated with biopsies, the resulting samples are much less abundant and contain a variable ratio of tumor-derived to normal tissue DNA. Most commercially available tests require at least 20% tumor to normal tissue content, as reported in the Non-Patent Literature (NPL) reference “Development and validation of a clinical cancer genomic profiling test based on massively parallel DNA sequencing”, dated November, 2013 by Frampton et al., and appearing in Nature Biotechnology, Volume 31, Number 11.
The tumor purity requirement is dictated by the limits of standard sequencing and variant calling, which do not function well below 5% Allele/Allelic Frequency (AF). Making calls below 5% AF leads to a high number of false positive (FP) calls. Therefore, typical diagnostic pipelines only call mutants if they are above 5% AF. Some tests go down to 3% AF, but do not call below that level since significant numbers of FP calls would be made. These errors preclude the use of samples where tumor material is not sufficient, i.e. below the 20% tumor to normal tissue ratio.
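One way to see how a tumor purity requirement relates to an AF calling floor is the simplified model sketched below. It is an illustrative assumption, not something stated in the cited reference: the variant is taken to be heterozygous in a diploid region, so only half of the alleles in a carrying cell are mutant, and `clonal_fraction` (a hypothetical parameter) is the share of tumor cells that carry the variant.

```python
def expected_allele_fraction(tumor_purity, clonal_fraction=1.0):
    """Expected AF of a heterozygous somatic variant in a diploid region.

    tumor_purity:    fraction of cells in the sample that are tumor cells
    clonal_fraction: fraction of tumor cells carrying the variant (assumed)
    """
    if not (0.0 <= tumor_purity <= 1.0 and 0.0 <= clonal_fraction <= 1.0):
        raise ValueError("fractions must lie in [0, 1]")
    return tumor_purity * clonal_fraction / 2.0


def above_calling_floor(allele_fraction, calling_floor=0.05):
    """True if the variant sits at or above the pipeline's AF calling floor."""
    return allele_fraction >= calling_floor


for purity in (0.40, 0.20, 0.05):
    af = expected_allele_fraction(purity)
    print(f"purity {purity:.0%}: expected AF {af:.1%}, "
          f"callable at 5% floor: {above_calling_floor(af)}")
```

Under this model a fully clonal variant in a 20% pure sample appears at about 10% AF, which leaves some margin above a 5% floor for subclonal variants, while a 5% pure sample falls below the floor.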
Because of the above limitations, it is apparent that higher sensitivity and specificity sequencing will be beneficial to tumor biopsy profiling where the biopsies have low tumor content. That is one shortcoming of the prior art that the instant invention addresses. The instant approach leads to a higher percentage of measurable samples.
Liquid Biopsies and NGS
The limitations of solid tumor biopsies include their high cost, associated complications and inability to track tumor progression over time. To address these limitations, several non-invasive avenues of obtaining tumor-derived nucleic acids (RNA, DNA) have been proposed. Starting samples obtained from the patient include, but are not limited to, blood or blood components, urine, stool samples, pleural fluid, ascites, or sputum. The chief advantage of a minimally invasive biopsy (or a liquid biopsy) is that samples are easily obtained at minimal risk to the patient.
The samples can also be obtained at many time points during diagnosis and treatment. If somatic variants can be accurately detected in such samples, it is possible to track the changes in tumor mutation burden over time, because the variants demonstrate correlation to mutations present in the primary tumor. Furthermore, such minimally invasive or non-invasive testing can even be used pre-diagnosis, as a screening tool for the general population.
A key challenge for liquid biopsies is the very low tumor content as compared to a tumor biopsy, ranging from <0.1% AF to about 10% AF in advanced patients. Liquid biopsy should be taken to include all liquid sample types, including cell free DNA (cfDNA) and circulating tumor cells (CTCs) that have a background of wild type DNA from either white blood cells or the rest of the plasma. In earlier stage patients or patients with certain cancer types, these fractions are even lower, from <0.01% AF to 0.5% AF. To address this challenge, a number of approaches have been put forward:
(A) Deep Sequencing: Increased Read-Depth
Increasing the depth of sequencing (or the number of sequence reads at a certain locus/location) provides the advantage of more accurately determining the percentage of mutant molecules present in the sample. This gives the ability to detect a greater number of reads derived from the mutated DNA. This in turn offers the possibility of detecting low AF variants. For example, a 0.1% AF variant requires 10,000 overall reads in order to yield, on average, 10 reads of the mutant molecule against 9,990 reads of the wild type molecule. Similarly, increasing depth even further may provide a more accurate representation of the true mutant percentage.
Despite these advantages, replication errors such as those produced in Polymerase Chain Reaction (PCR), and other errors that recur in the replication and/or sequencing processes, persist and cannot be eliminated by simply increasing the read-depth. The frequency of such errors may be reduced by using higher fidelity enzymes during the replication process. However, these errors can never be eliminated altogether and constitute a large background of falsely detected mutations at AFs below 0.5%.
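The read-depth arithmetic above can be sketched as follows, assuming reads are sampled independently so that the mutant read count is binomial. The per-base error rate used for the background is a hypothetical value chosen only to show the trend; it is not taken from any of the cited references.

```python
from math import comb


def expected_reads(depth, fraction):
    """Expected number of reads carrying an allele present at `fraction`."""
    return depth * fraction


def prob_at_least(k, depth, p):
    """P(X >= k) for X ~ Binomial(depth, p), via the complement of the lower tail."""
    below_k = sum(comb(depth, i) * p**i * (1.0 - p) ** (depth - i) for i in range(k))
    return 1.0 - below_k


TRUE_AF = 0.001               # 0.1% AF variant, as in the text
HYPOTHETICAL_ERROR = 0.0005   # assumed recurrent error rate at the same base

for depth in (1_000, 10_000, 100_000):
    mutant = expected_reads(depth, TRUE_AF)
    errors = expected_reads(depth, HYPOTHETICAL_ERROR)
    print(f"depth {depth:>7,}: ~{mutant:g} mutant reads, ~{errors:g} error reads, "
          f"P(>=10 mutant reads) = {prob_at_least(10, depth, TRUE_AF):.3f}")
```

In this sketch the expected mutant and error read counts grow in the same proportion as depth increases, which is why extra depth alone does not separate a true low-AF variant from a recurrent error.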
(B) Reference Sample and Background Error Rate
One approach for reducing the false positive rate is the use of a reference sample. This is typically a sample extracted from the same patient, but one that does not include tumor material. This is helpful in that any alterations present in the reference sample can be assumed to be due to inherited differences of the patient genome from the reference genome, i.e. germline mutations. If called during the somatic mutation testing, these can be eliminated as false positives (FP).
In cases where the matched normal is not available from the same clinical patient, a “normal” DNA control sample may be drawn from another matching donor with a healthy tissue or bodily fluid known to be devoid of somatic mutations. In this case, while the alterations in the reference sample are not due to the patient's inherited differences, they can still be eliminated as FPs because they belong to the donor who is known to be healthy. This can eliminate FPs in places where there are systematic errors. However, this process cannot eliminate errors due to the misdetection of germline mutations as somatic.
A benefit of this approach is that it can also detect abnormalities that are due to contamination if the contamination source happens to be the same for the reference and tumor samples. Similarly, this approach can also detect very high likelihood alterations due to replication and sequencing errors.
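The reference-sample filter described above amounts to a set subtraction over variant calls. The following is a minimal sketch under that reading; the `Variant` type, the function name and the coordinates are hypothetical and do not come from any cited pipeline.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str


def subtract_reference_calls(target_calls, reference_calls):
    """Drop every target call that is also seen in the reference sample.

    Calls shared with the reference are treated as germline variants,
    systematic errors or shared contamination rather than somatic mutations.
    """
    reference_set = set(reference_calls)
    return [v for v in target_calls if v not in reference_set]


# Hypothetical calls on made-up contigs.
tumor = [Variant("chrA", 100, "C", "T"), Variant("chrA", 250, "G", "A")]
normal = [Variant("chrA", 250, "G", "A")]  # also present in the matched normal

print(subtract_reference_calls(tumor, normal))
```

When the reference comes from an unrelated healthy donor, the patient's own germline variants are absent from `reference_calls` and therefore survive the filter, which is the limitation noted in the text.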
The reader is again referred to NPL reference “Development and validation of a clinical cancer genomic profiling test based on massively parallel DNA sequencing”, dated November, 2013 by Frampton et al., and appearing in Nature Biotechnology, Volume 31, Number 11.
For a related approach based on estimating background error rate, the reader is referred to NPL references “Ultrasensitive detection of rare mutations using next-generation targeted resequencing”, dated October 2011 by Flaherty et al., appearing in Nucleic Acids Research, Volume 40, and “Analytical and Clinical Validation of a Digital Sequencing Panel for Quantitative, Highly Accurate Evaluation of Cell-Free Circulating Tumor DNA” dated October 2015 by Lanman et al., appearing in Public Library of Science's PLOS ONE publication, Digital Object Identifier (DOI): 10.1371/journal.pone.0140712.
Specifically, in Flaherty et al., a single target sample sequence measurement is compared to the background distribution to generate a p-value using the beta-binomial distribution. The shortcoming of this approach, however, is that it requires very high read-depth (approx. 10E6 read-depth) and was only tested with 300 base-pairs (bp). Even at this very high read-depth, there is still generally a high FP rate, i.e. the described specificity of 0.99 at a detection floor of 0.1% AF means that there is still 1 FP per 100 bp at a reasonable sensitivity. This means that for a typical 40 kilo base-pair (kbp/kb) sized amplicon panel, one would still find a relatively high number of false positives, i.e. about 400. Additionally, Lanman et al. also use a high read-depth.
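The two quantitative ideas in this discussion, a beta-binomial tail probability and the false positive count implied by a per-position specificity, can be sketched as below. This is a generic illustration and not the implementation used by Flaherty et al.; the function names and the `alpha`/`beta` background parameters in the usage example are hypothetical.

```python
from math import exp, lgamma


def _log_beta(x, y):
    """Logarithm of the Beta function B(x, y)."""
    return lgamma(x) + lgamma(y) - lgamma(x + y)


def beta_binomial_pmf(k, n, alpha, beta):
    """P(X = k) for X ~ BetaBinomial(n, alpha, beta)."""
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_choose + _log_beta(k + alpha, n - k + beta) - _log_beta(alpha, beta))


def beta_binomial_p_value(k, n, alpha, beta):
    """P(X >= k): chance that background alone yields k or more non-reference reads."""
    lower_tail = sum(beta_binomial_pmf(i, n, alpha, beta) for i in range(k))
    return max(0.0, 1.0 - lower_tail)


def expected_false_positives(specificity, positions):
    """Expected FP calls when each position is wrongly called with rate 1 - specificity."""
    return (1.0 - specificity) * positions


# Hypothetical background with a mean error rate of 0.1% (alpha / (alpha + beta)).
p = beta_binomial_p_value(k=30, n=10_000, alpha=1.0, beta=999.0)
print(f"p-value for 30 non-reference reads out of 10,000: {p:.4f}")

# Figures from the text: specificity 0.99 across a 40 kb panel.
print(f"expected FPs: {expected_false_positives(0.99, 40_000):.0f}")
```

The second function reproduces the arithmetic in the text: a 1% per-position false call rate multiplied by 40,000 positions gives about 400 expected false positives.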
(C) Statistical Treatment of Sequencing Data
This set of approaches treats nucleic acid sequencing data with various statistical methods. For example, there are statistical tests for each sequence base read, typically reported as quality scores. These are then used for alignment quality scoring by either ruling in or ruling out each portion of the read.
In nucleic acid sequencing, replication of the target sample (or simply the target) assays has been used in gene expression studies. In such efforts, the quantity being measured is the copy number change for DNA or RNA molecules, which is typically related to the amount of gene expression, i.e. over-expression or under-expression. For one example, see NPL reference “A guide to the whole transcriptome and mRNA Sequencing Service”, dated October 2014 by Exiqon.
Similar suggestions can be found in NPL reference “Statistical Issues in Next-Generation Sequencing”, dated 2009 by Auer et al. and appearing in the proceedings of the 21st Annual Conference on Applied Statistics in Agriculture. This reference suggests the use of 4 sample replicates and two groups of samples, treated and untreated. It then uses Analysis of Variance (ANOVA) models to distinguish the true variance from noise, where the variance is determined as the change in the copy number for certain genes as compared to a normal sample.
These treatments of genetic data consider the number of observed copies for each gene in a specific state. Because of the high variability of gene expression data and the presence, for many genes, of a background expression level, multiple measurements are taken for each gene. The determination of the presence of a significant differential expression for a gene consists of comparing these measurements to a reference. Foreign patent references WO2011011426A2 to Shaffer and WO2007089583A2 to Akilesh, and U.S. Pat. No. 9,050,280 B2 to Vlassenbroeck also determine the expression or the numbers of copies of DNA or RNA/DNA.
The use of sequencing data in the above approaches is fundamentally different from determining genetic code alterations. Alterations are defined to include mutations, deletions, translocations and fusions, i.e. changes in the genetic code itself, measured with respect to a wild type background. In other words, the approaches provided by the prevailing art are concerned with detecting the number of copies of a gene, as opposed to the variants of the gene containing mutations to the genetic code itself. That is another shortcoming of the prior art that the instant invention addresses.
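For orientation, the kind of replicate comparison used in the expression studies cited above can be sketched as a one-way ANOVA F statistic over two groups of four replicates. This is a textbook formulation rather than the model of Auer et al., and the copy-number values are hypothetical. Note that the quantity compared is a per-gene count, not a change in the genetic code.

```python
from statistics import mean


def one_way_anova_f(*groups):
    """F statistic = between-group mean square / within-group mean square.

    Raises ZeroDivisionError if there is no within-group variation at all.
    """
    all_values = [x for group in groups for x in group]
    grand_mean = mean(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Hypothetical copy counts for one gene: 4 treated and 4 untreated replicates.
treated = [10, 12, 11, 13]
untreated = [5, 6, 7, 6]

print(f"F = {one_way_anova_f(treated, untreated):.2f}")
```

A large F value indicates that the difference between group means is large relative to the replicate-to-replicate noise within each group.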
(D) Deep Sequencing, and Reducing the Search Space for Alterations
Another approach that has been proposed is to only identify genetic alterations (or calls) at a small subset of sites within the sites covered by the amplicon panel. NPL reference “An ultrasensitive method for quantitating circulating tumor DNA with broad patient coverage”, dated April 2014 by Newman et al. and appearing in Nature Medicine, proposes performing deep sequencing and then searching for mutations only at positions where mutations are known to be present in the solid tumor. This approach is useful in monitoring response to a treatment, but because only a few (typically 2-4) alterations are monitored as a percentage of wild type DNA in blood plasma, the presence/emergence of new mutations is prone to be missed. The reference further expands the approach to looking at a few hundred positions where alterations are commonly found, but not across the whole amplicon space.
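The trade-off of a reduced search space can be sketched as follows. The call tuples, site lists and function names are hypothetical, and the 1% per-site false positive rate is borrowed from the specificity discussion earlier in this section purely for illustration; it is not a figure reported by Newman et al.

```python
def restrict_to_monitored(calls, monitored_sites):
    """Keep only calls whose (chrom, pos) is on the monitored list."""
    monitored = set(monitored_sites)
    return [call for call in calls if (call[0], call[1]) in monitored]


def expected_false_positives(per_site_fp_rate, num_sites):
    """Expected FP calls when each interrogated site has the same FP rate."""
    return per_site_fp_rate * num_sites


# Hypothetical calls as (chrom, pos, alt); only one site is being monitored.
calls = [("chrA", 100, "T"), ("chrB", 250, "G")]
monitored = [("chrA", 100)]

print(restrict_to_monitored(calls, monitored))  # the chrB call is discarded

for sites in (4, 300, 40_000):
    print(f"{sites:>6} sites -> ~{expected_false_positives(0.01, sites):g} expected FPs")
```

Interrogating fewer sites lowers the expected number of false positives in proportion, but any real variant outside the monitored list, such as a newly emerged mutation, is discarded along with the noise.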
(E) High Sensitivity Detection Via Molecular Barcoding
Molecular barcoding has been described as a technique particularly suited to reducing the errors, and by extension the false positive (FP) rate. Briefly, the technology consists of the molecular labeling of each starting molecule in the sample, before any amplification and sequencing takes place. The molecular label typically consists of a unique DNA sequence that is added onto the end of the DNA fragments present in the primary sample. All molecules are then amplified and sequenced.
During the analysis, a specialized informatics pipeline is designed to recognize reads that have been generated from the same molecule, and then to collapse all of these reads onto the same sequence by a consensus operation. By doing that, it is shown that the equivalent error rate (error bases/kb) is dramatically reduced with respect to traditional sequencing outputs. The false positive rate is also significantly reduced, resulting in the ability to call mutations in the range of 1-2% AF with relatively high sensitivity and specificity. The reader is referred to NPL references “Reducing amplification artifacts in high multiplex amplicon sequencing by using molecular barcodes”, dated August 2015 by Peng et al., and “Single molecule molecular inversion probes for targeted, high-accuracy detection of low-frequency variation”, dated February 2013 by Hiatt et al., as well as to Lanman et al., for further details.
Despite the above advantages, molecular barcoding methods have significant shortcomings that are barriers to their widespread adoption:
(i) The Need for a Specialized Chemistry and Bioinformatics Pipeline
Targeted sequencing and other sequencing approaches have been developed by the industry to a stage where there is a lot of content and clinical data available for certain libraries. Molecular barcoding requires different chemistries in the attachment of additional sequences (molecular barcodes) that preclude the use of certain amplicon libraries and make the development of new libraries more difficult.
Specialized bioinformatics are required for the specific barcoding method used. These routines/code are required to collapse the reads into individual molecule sequences. The related expertise adds cost to the commercial viability of these procedures.
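The read-collapsing step mentioned above can be sketched as a per-position majority vote within each barcode family. This is a deliberately simplified illustration, not any vendor's or cited author's pipeline: it assumes reads in a family are already aligned and of equal length, it ignores base qualities and barcode sequencing errors, and `min_family_size` is a hypothetical parameter.

```python
from collections import Counter, defaultdict


def collapse_by_barcode(reads, min_family_size=1):
    """Collapse (barcode, sequence) reads into one consensus per barcode.

    Families with fewer than `min_family_size` reads are dropped. On a tied
    vote the base that was seen first in the family is kept.
    """
    families = defaultdict(list)
    for barcode, sequence in reads:
        families[barcode].append(sequence)

    consensus = {}
    for barcode, sequences in families.items():
        if len(sequences) < min_family_size:
            continue
        # zip(*sequences) yields one column of bases per read position.
        consensus[barcode] = "".join(
            Counter(column).most_common(1)[0][0] for column in zip(*sequences)
        )
    return consensus


# Hypothetical reads: one read in family "AAGT" carries an amplification error.
reads = [
    ("AAGT", "ACGTAC"),
    ("AAGT", "ACGTAC"),
    ("AAGT", "ACTTAC"),  # G>T error at the third base, outvoted 2 to 1
    ("CCTA", "ACGAAC"),  # single-read family
]

print(collapse_by_barcode(reads, min_family_size=2))
```

An error that arises during amplification or sequencing appears in only some reads of a family and is outvoted, whereas a true variant is present in every read derived from the mutant starting molecule.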
(ii) The Need for Significantly Higher Read-Depths
Molecular barcoding works by obtaining a multitude of reads from a single molecule (or its complementary strand) and collapsing all of these reads into one that best represents the starting molecule. Because this process is imperfect, only a subset of molecules has the required reads per starting molecule to collapse onto the original sequence.
As a result, larger overall read-depths are required for resolving the same percentage mutant AF as compared to non-barcoded methods. For example, if 10 reads per unique molecular barcode are required, and the detection of a 0.1% AF variant (1/1000 molecules) is needed, at least a 10,000× read-depth will be required.
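The depth requirement in this example follows from multiplying the reads needed per barcode family by the number of starting molecules that must be sampled to include one mutant molecule. The sketch below restates that arithmetic; `mutant_molecules` and `usable_fraction` are hypothetical parameters added for illustration.

```python
def required_read_depth(reads_per_barcode, allele_fraction,
                        mutant_molecules=1, usable_fraction=1.0):
    """Raw read-depth needed to observe `mutant_molecules` mutant families.

    reads_per_barcode: reads needed to form one consensus sequence
    allele_fraction:   fraction of starting molecules that are mutant
    usable_fraction:   assumed share of families that reach the required
                       family size (1.0 means no loss)
    """
    if not 0.0 < allele_fraction <= 1.0 or not 0.0 < usable_fraction <= 1.0:
        raise ValueError("fractions must lie in (0, 1]")
    molecules_needed = mutant_molecules / allele_fraction
    return reads_per_barcode * molecules_needed / usable_fraction


# Figures from the text: 10 reads per barcode, 0.1% AF variant.
print(f"{required_read_depth(10, 0.001):,.0f}x")

# Hypothetical: only half of the families reach the required size.
print(f"{required_read_depth(10, 0.001, usable_fraction=0.5):,.0f}x")
```

This is a lower bound that yields a single expected mutant family; asking for several mutant families, or allowing for families that never reach the required size, scales the requirement up accordingly.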
(iii) Loss of Sample Diversity During Barcoding Operations
The relatively low efficiency of the barcode attachment operation reduces the biological diversity of the starting DNA molecules entering the reaction. This leads to reduced sensitivity, especially where the starting number of molecular copies is low. Any molecule that does not ligate to a barcode in that initial step is excluded from the analysis. For instance, for a specific molecular barcoding approach, fewer than 10% of the molecules present in the starting target samples are labeled with a molecular barcode.
This significantly reduces the biological complexity of samples that are addressable using this method. This also substantially reduces the sensitivity for low AF variant detection. For more details, the reader is referred to NPL reference “Detection of ultra-rare mutations by next-generation sequencing” by Schmitt et al., dated Sep. 4, 2012, and appearing in the Proceedings of the National Academy of Sciences (PNAS), volume 109. Even for other molecular barcoding approaches, the inclusion rate is typically 30-60% of the initial starting molecules, making this a difficult technique to implement where the starting numbers of copies are low.
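The effect of labeling efficiency on sensitivity can be illustrated with a simple probability sketch. It assumes each starting molecule is labeled independently with the same probability. The input of 3,000 copies is a hypothetical figure, while the efficiencies of 10%, 30% and 60% follow the ranges quoted above.

```python
def expected_labeled_mutants(input_copies, allele_fraction, labeling_efficiency):
    """Expected number of mutant molecules that receive a barcode."""
    return input_copies * allele_fraction * labeling_efficiency


def prob_any_mutant_labeled(mutant_molecules, labeling_efficiency):
    """P(at least one of the mutant molecules is labeled), assuming independence."""
    return 1.0 - (1.0 - labeling_efficiency) ** mutant_molecules


INPUT_COPIES = 3_000   # hypothetical number of starting molecules at the locus
AF = 0.001             # 0.1% AF, i.e. about 3 mutant molecules in the input
mutants = round(INPUT_COPIES * AF)

for efficiency in (0.10, 0.30, 0.60):
    print(f"efficiency {efficiency:.0%}: "
          f"~{expected_labeled_mutants(INPUT_COPIES, AF, efficiency):.1f} labeled mutants, "
          f"P(any labeled) = {prob_any_mutant_labeled(mutants, efficiency):.3f}")
```

With only a handful of mutant molecules in the input, a low labeling efficiency leaves a substantial chance that none of them enters the analysis, regardless of how deeply the labeled molecules are subsequently sequenced.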
Thus another shortcoming of the prior art is that it does not teach techniques for performing high-sensitivity, low FP rate detection of genetic mutations using samples where the AF percentage is low. For example, the prior art does not teach techniques for mutant detection with high sensitivity and specificity where AF ranges include 0.01% to 0.1%, 0.1% to 0.5% or 0.5% to 1% AF.
Another shortcoming of the prior art is that it does not teach statistically comparing sequencing data from multiple replicate target samples, or target replicates, with sequencing data from multiple replicate reference samples, or reference replicates, for the detection of genetic code mutations.
Similarly, the prior art does not teach how to achieve the above sensitivity and specificity without requiring a prohibitively high sequencing depth, and therefore without incurring a prohibitively high operational cost.