‘Molecular barcoding’ was developed to address problems generated by raw error rates intrinsic to DNA sequencing machines (synthetic accuracy), and also problems related to counting individual nucleic acid molecules within a sample (molecular counting).
Molecular barcoding generally involves attaching (for example, by ligation or by primer-extension) a unique nucleic acid label (a ‘barcode’) to each of many single target molecules (DNA or RNA) in a solution containing a large number of such molecules. These labelled molecules are then sequenced, which reveals, for each labelled molecule, both the sequence of the molecular barcode and at least part of the sequence of the labelled target molecule itself.
This barcoding is typically used towards two different ends. First, it can be used to enable ‘redundant sequencing’. For example, imagine a DNA sample containing 1000 copies of a particular gene; 999 of the copies hold sequences identical to each other, but a single copy has a particular single-nucleotide mutation.
Without barcoding, the sequencer will be unable to detect this mutated copy, since the sequencer makes random errors at a higher rate than 1:1000—i.e. the mutation is so rare in the population of sequenced molecules that it falls below the sequencer's intrinsic background noise threshold.
However, if each of the 1000 copies has been labelled with a unique molecular barcode, and each individual labelled molecule is sequenced several times by the sequencing machine (redundant sequencing), the mutation can be distinguished from sequencing noise. Every time the labelled mutated molecule is redundantly sequenced (i.e., every time the target gene sequence is observed to be labelled with the one particular unique barcode that was attached to the mutated starting molecule), the same apparent mutation is observed, or at least it is observed 99% of the time, equivalent to the raw accuracy of the sequencer. By contrast, that particular mutation would only be observed approximately 1% of the time (the raw error rate of the sequencer) when the labelled but non-mutated gene copies were redundantly sequenced, as per their respective alternative barcodes.
The barcode thus serves to identify individual input molecules across all their respective multiple copies within the sequencing reaction. This allows a sequence-detection algorithm to focus specifically on the reads belonging to each input molecule within a sequencing dataset, and so to avoid the large amount of stochastic sequence noise (in the form of sequence errors) that is present across the remainder of the dataset. Redundant sequencing therefore enables a ‘synthetic accuracy’ which is potentially much higher than the raw accuracy of the sequencer itself.
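The redundant-sequencing idea described above can be illustrated with a minimal sketch. It assumes that reads have already been parsed into (barcode, sequence) pairs, that all reads of one barcode are aligned and of equal length, and that barcodes themselves are read without error; the function name `consensus_by_barcode`, the `min_reads` parameter and the toy data are hypothetical and are not taken from any of the cited references.

```python
from collections import Counter, defaultdict


def consensus_by_barcode(reads, min_reads=3):
    """Group (barcode, sequence) reads by barcode and take a
    per-position majority vote within each group.

    Assumption: reads sharing a barcode are aligned and equal in length.
    """
    groups = defaultdict(list)
    for barcode, seq in reads:
        groups[barcode].append(seq)

    consensus = {}
    for barcode, seqs in groups.items():
        if len(seqs) < min_reads:
            continue  # too few redundant reads to outvote a sequencer error
        # zip(*seqs) walks the grouped reads one position (column) at a time
        consensus[barcode] = "".join(
            Counter(column).most_common(1)[0][0] for column in zip(*seqs)
        )
    return consensus


# Toy data: barcode "AAGT" labels a non-mutated molecule, "CCTA" a mutated one.
toy_reads = [
    ("AAGT", "ACGTAC"),
    ("AAGT", "ACGTAC"),
    ("AAGT", "ACTTAC"),  # one-off sequencer error, outvoted by the other reads
    ("CCTA", "ACGAAC"),  # true mutation: present in every read of this barcode
    ("CCTA", "ACGAAC"),
    ("CCTA", "ACGAAC"),
]
toy_consensus = consensus_by_barcode(toy_reads)
```

In this toy case the isolated error under barcode "AAGT" disappears from the consensus, while the change seen consistently under barcode "CCTA" is retained, which is the distinction between sequencer noise and a genuine mutation that the text describes.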
Barcoding can also be used to enable digital ‘molecular counting’ of input DNA or RNA molecules. In this process, a large number of unique barcodes are attached to input molecules, for example, cDNA copies that have been made from a particular mRNA species. Each input cDNA molecule is labelled (for example, by primer extension) with a single, unique barcode. The molecules are then sequenced, which, as with redundant sequencing, reveals the unique barcode and at least part of each associated labelled input molecule; these molecules are then also each sequenced more than once.
Instead of using this redundant sequencing to reduce sequencing errors, in molecular counting it is used to digitally quantify how many individual molecules of the given target molecule (cDNA in this case) were present in the original sample, by simply counting the total number of unique barcodes that were sequenced and found to be associated with the particular target. Barcode-directed redundant sequencing in this way reduces the chance that any input molecule is stochastically left unsequenced by the sequencing reaction (since each labelled molecule on average is sequenced several times), whilst retaining an accurate measure of input quantity (since redundantly sequenced starting molecules are only counted once, as discriminated by repeated copies of their unique barcode).
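Molecular counting as described above reduces to counting distinct barcodes per target. The following is a minimal sketch under the assumption that each read has already been assigned to a target and that barcodes are read without error (in practice, sequencing errors within the barcode itself would need handling); the function name `count_molecules` and the target names are illustrative only.

```python
from collections import defaultdict


def count_molecules(reads):
    """Count input molecules per target as the number of distinct barcodes.

    reads: iterable of (target, barcode) pairs, one pair per sequencing read.
    Repeated reads of the same labelled molecule share a barcode and are
    therefore counted only once.
    """
    barcodes_per_target = defaultdict(set)
    for target, barcode in reads:
        barcodes_per_target[target].add(barcode)
    return {target: len(codes) for target, codes in barcodes_per_target.items()}


# Six reads, but only three distinct labelled input molecules.
counting_reads = [
    ("target_A", "AAGT"),
    ("target_A", "AAGT"),
    ("target_A", "CCTA"),
    ("target_B", "GGAC"),
    ("target_B", "GGAC"),
    ("target_B", "GGAC"),
]
molecule_counts = count_molecules(counting_reads)
```

Here the read depth (six reads) differs from the molecule count (two for `target_A`, one for `target_B`), which is why the barcode count, rather than the read count, serves as the measure of input quantity.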
Examples of the use of molecular barcodes are provided in U.S. Pat. Nos. 8,728,766, 8,685,678, 8,722,368, Kinde et al., 2011 (PNAS, 108, 23, 9530-9535) and US 20140227705 A1.
A ‘synthetic long read’ is generated when a long, contiguous sequence of DNA (longer than the readlength attainable on a DNA sequencer) is converted into two or more shorter ‘sub-sequences’ that are short enough to be read by a DNA sequencer, and which are labelled such that it can be deduced (after sequencing) that the sub-sequences were generated from the same original long DNA sequence. For example, suppose you want to sequence a particular human gene which is 1000 nucleotides long, using a short-read DNA sequencer with a readlength of 100 nucleotides. You could separate the long sequence into 10 different sub-sequences, each 100 nucleotides long, and then label each of these 10 sub-sequences with a synthetic, informative ‘label’ DNA sequence that identifies it as coming from the same original 1000-nucleotide DNA molecule. High-throughput DNA sequencing of the 10 resulting DNA molecules then yields, for each of them, both the 100-nucleotide sub-sequence and the associated identifying DNA label. With this sequencing data, an algorithm can detect the identifying labels and use them to associate the 10 different 100-nucleotide sub-sequences with each other as a collective sub-sequence ‘grouping’. From this grouping it can estimate that the 10 sub-sequences came from a longer, 1000-nucleotide gene, and estimate the full genetic sequence by ‘stitching’ the 10 sub-sequences together in silico into a single 1000-nucleotide long gene.
Two main synthetic long read technologies have been described in the literature: a partitioning-based approach which is described in US 20130079231 A1; and a barcode-copying approach which is described in Casbon et al., 2013 (Nucleic Acids Research, 2013, 41, 10, e112), U.S. Pat. Nos. 8,679,756 and 8,563,274.
‘Spatial sequencing’ is considered to be the sequencing of nucleic acids with the inclusion of some information about where each sequenced nucleic acid is located within a particular space (for example, within a particular sample, or within a particular cell). However, very few spatial sequencing methods are known. The main known technology is the fluorescent in situ RNA sequencing (FISSEQ) technique. In FISSEQ, a sample of cells is crosslinked and, while the cells are still intact, RNA is reverse transcribed into cDNA and amplified whilst still in the crosslinked cells. Then, each amplified cDNA molecule is sequenced optically whilst still in the cells, with a high-powered and sensitive optical detection system. This method is described in Lee et al., 2014 (Science, 343, 6177, 1360-1363).
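The grouping and stitching step can be sketched as follows. The text does not say how the order of the sub-sequences within a grouping is recovered, so this sketch simply assumes that each record carries a known `offset` (for example, from alignment to a reference); that field, the function name `stitch_synthetic_long_reads` and the toy data are hypothetical, and the sub-sequences are assumed to be non-overlapping.

```python
from collections import defaultdict


def stitch_synthetic_long_reads(records):
    """Group sub-sequences by their identifying label and join each group.

    records: iterable of (label, offset, sub_sequence) tuples.
    `offset` is an assumed, externally supplied start position of the
    sub-sequence within the original long molecule.
    """
    groups = defaultdict(list)
    for label, offset, sub_sequence in records:
        groups[label].append((offset, sub_sequence))

    # Sort each grouping by offset, then concatenate its sub-sequences.
    return {
        label: "".join(sub for _, sub in sorted(parts))
        for label, parts in groups.items()
    }


# Short toy sub-sequences from two long molecules, arriving in mixed order.
toy_records = [
    ("label_1", 4, "GGCC"),
    ("label_2", 0, "TTTT"),
    ("label_1", 0, "ACGT"),
    ("label_1", 8, "AATT"),
    ("label_2", 4, "CCCC"),
]
long_reads = stitch_synthetic_long_reads(toy_records)
```

The label alone is enough to form the groupings; the ordering information is the additional assumption needed to turn a grouping into one contiguous sequence.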
The invention addresses two main types of problem in the sequencing field: 1) specific analytic limitations of DNA sequencing machines; and 2) biophysical challenges associated with common types of experimental DNA samples.
Current high-throughput DNA-sequencing machines are powerful platforms used to analyse large amounts of genetic material (from thousands to billions of DNA molecules) and function as systems for both basic research and applied medical applications. However, all current DNA sequencing machines are subject to certain analytic limitations which constrain the scientific and medical applications in which they can be effectively used. The chief such limitations include finite raw readlengths and finite raw accuracy, both of which are described below.
With regard to finite raw readlengths, each DNA sequencing platform is characterised by a typical ‘readlength’ that it can attain, which is the ‘length’ in nucleotides of DNA that it can ‘read’ of each sequenced molecule. For most sequencing machines, this ranges from 100 to ~500 nucleotides.
With regard to finite raw accuracy, each sequencing platform is also characterised by an attainable ‘raw accuracy’, typically defined as the likelihood that each given nucleotide it sequences has been determined correctly. Typical raw accuracies for the most popular sequencing platforms range between 98% and 99.5%. The related quantity, the ‘raw error’ rate, is essentially the complement of raw accuracy (one minus the raw accuracy), and is the per-nucleotide likelihood that the sequencer randomly reports an incorrect nucleotide in a particular sequenced DNA molecule.
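To make the relationship between raw accuracy and readlength concrete, the short sketch below computes the expected number of errors in a read and the probability that a read contains no errors at all. It assumes that errors occur independently at each nucleotide with the same probability, which is a simplification; the function names are illustrative.

```python
def expected_errors(read_length, raw_accuracy):
    """Expected number of miscalled nucleotides in one read."""
    return read_length * (1.0 - raw_accuracy)


def prob_error_free(read_length, raw_accuracy):
    """Probability that every nucleotide in one read is called correctly,
    assuming independent per-nucleotide errors."""
    return raw_accuracy ** read_length


# Compare the two ends of the accuracy range quoted above for a 100-nucleotide read.
for accuracy in (0.98, 0.995):
    print(
        f"accuracy={accuracy}: "
        f"expected errors={expected_errors(100, accuracy):.2f}, "
        f"P(error-free read)={prob_error_free(100, accuracy):.3f}"
    )
```

Under these assumptions, a 100-nucleotide read at 99% raw accuracy carries about one error on average, and only roughly 37% of such reads are entirely error-free.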
In addition, certain common experimental DNA samples pose biophysical challenges for sequencing. These challenges arise from the unique (and troublesome) molecular state of DNA in these samples, which makes it difficult to sequence them or to extract important pieces of genetic information therefrom, irrespective of the sequencing machine employed. For example, Formalin-Fixed Paraffin-Embedded (FFPE) samples are the standard experimental tool for performing molecular pathology from human biopsy specimens. However, the process of creating an FFPE sample, in which the biopsy specimen is fixed (crosslinked and kept physically together and stable at the molecular level) by a harsh chemical and then embedded in a wax, causes significant damage to the DNA and RNA contained therein. DNA and RNA from FFPE samples are thus heavily fragmented (generally into small fragments between 50 and 200 nucleotides), and also carry sporadic damage to individual nucleotides, which makes it essentially impossible to amplify or isolate long, contiguous sequences.