Advances in the study of biological molecules have been led, in part, by improvement in technologies used to characterise the molecules or their biological reactions. In particular, the study of the nucleic acids DNA and RNA has benefited from developing technologies used for sequence analysis.
Methods for sequencing a polynucleotide template can involve performing multiple extension reactions using a DNA polymerase or a DNA ligase to successively incorporate labelled nucleotides or labelled polynucleotides, respectively, complementary to a template strand. In such “sequencing by synthesis” reactions a new polynucleotide strand base-paired to the template strand is built up by successive incorporation of nucleotides complementary to the template strand. The substrate nucleoside triphosphates or oligonucleotides used in the sequencing reaction are typically blocked to prevent over-incorporation. The substrate nucleoside triphosphates or oligonucleotides can also be labelled, permitting determination of the identity of the incorporated nucleotide(s) as successive nucleotides are added.
In order to carry out accurate sequencing using nucleoside triphosphates, a reversible chain-terminating structural modification or “blocking moiety” may be added to the substrate nucleotides to ensure that nucleotides are incorporated one at a time in a controlled manner. As each single nucleotide is incorporated, the blocking moiety prevents any further nucleotide incorporation into the polynucleotide chain. Once the identity of the last-incorporated labelled nucleotide has been determined the label moiety and blocking moiety are removed, allowing the next blocked, labelled nucleotide to be incorporated in a subsequent round of sequencing.
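The cycle of incorporation, detection and deblocking described above can be illustrated with a toy model. The following Python sketch is for illustration only and models no real chemistry: the class name GrowingStrand, the function run_cycles and the convention that the template is supplied in the order in which it is copied are all assumptions introduced here.

```python
# Toy model of reversible-terminator sequencing by synthesis.
# All names here are hypothetical and chosen only for illustration.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


class GrowingStrand:
    """A strand extended one blocked, labelled nucleotide at a time."""

    def __init__(self, template):
        # Assumption: the template is given in the order it is copied.
        self.template = template
        self.bases = []
        self.blocked = False

    def incorporate(self):
        """Add one complementary nucleotide and return its label (the base)."""
        if self.blocked:
            # The blocking moiety prevents any further incorporation.
            raise RuntimeError("strand is blocked; deblock before extending")
        if len(self.bases) >= len(self.template):
            raise IndexError("end of template reached")
        base = COMPLEMENT[self.template[len(self.bases)]]
        self.bases.append(base)
        self.blocked = True
        return base

    def deblock(self):
        """Remove the label and blocking moiety so the next cycle can run."""
        self.blocked = False


def run_cycles(template, cycles):
    """Run up to `cycles` rounds of incorporate, read label, deblock."""
    strand = GrowingStrand(template)
    calls = []
    for _ in range(min(cycles, len(template))):
        calls.append(strand.incorporate())  # identity read from the label
        strand.deblock()
    return "".join(calls)
```

In this sketch a second call to incorporate() without an intervening deblock() raises an error, which mirrors the one-at-a-time control that the blocking moiety is intended to provide.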
In certain circumstances the amount of sequence data that can be reliably obtained with the use of sequencing-by-synthesis techniques, particularly when using blocked, labelled nucleotides, may be limited. In some circumstances the sequencing “run” may be limited to a number of bases that permits sequence realignment with the human genome, for example around 50-100 cycles of incorporation. Whilst sequencing runs of this length are extremely useful, particularly in applications such as, for example, SNP analysis and genotyping, it would be advantageous in many circumstances to be able to reliably obtain further sequence data for the same template molecule.
The technique of “paired-end” or “pairwise” sequencing is generally known in the art of molecular biology, particularly in the context of whole-genomic shotgun sequencing. Many applications in DNA sequencing use paired-end methods to obtain sequence information on a length scale longer than an individual read. Paired-end sequencing allows the determination of two “reads” of sequence from two places on a single polynucleotide duplex. The advantage of the paired-end approach is that there is significantly more information to be gained from sequencing two stretches each of “n” bases from a single template than from sequencing “n” bases from each of two independent templates in a random fashion. With the use of appropriate software tools for the assembly of sequence information it is possible to make use of the knowledge that the “paired-end” sequences are not completely random, but are known to occur on a single duplex, and are therefore linked or paired in the genome. This information has been shown to greatly aid the assembly of whole genome sequences into a consensus sequence. It is especially advantageous for the alignment and assembly of the genome sequences if each of the fragments is of a defined known length such that the distance between the two reads is accurately defined and controlled.
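The relationship between a duplex fragment and its two paired reads can be sketched as follows. This Python example is illustrative only: the function names, the toy sequence and the convention that the second read is reported 5' to 3' on the complementary strand are assumptions introduced here.

```python
# Illustrative sketch: two reads of n bases taken from opposite ends and
# opposite strands of one duplex fragment. Names are hypothetical.
from typing import Tuple

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the complementary strand, written 5' to 3'."""
    return seq.translate(_COMPLEMENT)[::-1]


def paired_end_reads(fragment: str, n: int) -> Tuple[str, str]:
    """Return (read1, read2), each n bases long.

    read1 is the first n bases of the given strand.
    read2 is the first n bases of the complementary strand, i.e. the
    reverse complement of the last n bases of the given strand.
    """
    if n <= 0 or n > len(fragment):
        raise ValueError("n must be between 1 and the fragment length")
    return fragment[:n], reverse_complement(fragment[-n:])


def unread_length(fragment_length: int, n: int) -> int:
    """Number of bases between the two reads (0 if the reads meet or overlap)."""
    return max(fragment_length - 2 * n, 0)
```

For a fragment of known length, the value returned by unread_length is the separation that an assembly tool can rely on when placing the two reads, which is why a defined fragment length is advantageous.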
Paired-end sequencing has typically been performed by making use of specialized circular shotgun cloning vectors known in the art. After cutting the vector at a specific single site, the template DNA to be sequenced (typically genomic DNA) is inserted into the vector and the ends resealed to form a new construct. The vector sequences flanking the insert DNA include binding sites for sequencing primers which permit sequencing of the insert DNA at each end and on opposite strands.
A disadvantage of this approach is that it requires time-consuming cloning of the DNA sequencing templates into an appropriate sequencing vector. Furthermore, there is little to no control of the length of the fragments inserted into the vector. Moreover, cloning the DNA template into a vector, although allowing binding sites for sequencing primers to be positioned at both ends of the template fragment, can be cumbersome and inefficient when used for array-based sequencing techniques. With array-based techniques a sequence is generally read from one end of a nucleotide template, this often being the end proximal to the point of attachment to the array. However, a variety of methods for double-ended sequencing of a polynucleotide template are known.
Also known are methods of nucleic acid amplification which generate amplification products immobilised on a solid support in order to form arrays comprised of clusters or “colonies” formed from a plurality of identical immobilised polynucleotide strands and a plurality of identical immobilised complementary strands. The nucleic acid molecules present in DNA colonies on the clustered arrays prepared according to these methods can provide templates for sequencing reactions but only a single sequencing read is typically obtained from one type of immobilised strand in each colony.
An exemplary method that is useful for paired end sequencing on clusters uses three grafted primers. This method is applicable to templates that can be amplified using bridge amplification, and the length of the templates used may be up to, for example, 1000 base pairs or so. For many DNA sequencing applications, however, it may be necessary to determine a sequence read from either end of a target fragment of greater length. The sample preparation methods described in the present invention allow the analysis of a pair of reads for the ends of a fragment of any length.
An alternative method for sequencing both ends of a cluster is the use of strand resynthesis. Once a first read is completed, the template can be copied using immobilised primers to generate a second template strand. The second template strand can be sequenced to give a second read. Thus the cluster is sequenced from both ends.
In preparing DNA for these applications, a narrow length distribution is desired because it offers increased bioinformatic power. In resequencing, for example, if it is known that a fragment is N bases +/−M bases, then it is possible to detect insertions and deletions of at least M bases in the non-sequenced part of the fragment. The smaller M can be, the more powerful the paired end data is. In the limit where M=0, i.e. where the fragment length is known exactly, it is possible to detect even single base insertions or deletions in the unread part of the fragment, by noting that the spacing of the reads, when mapped to the reference, is not as expected. Single base insertions and deletions are common in the human genome and are thus some of the most important to detect. By creating molecule sets which are exact in length, detection of single base insertions and deletions by paired-end sequencing becomes possible with much lower sequencing coverage and thus at much lower cost.
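The reasoning above can be illustrated with a simple threshold rule. The following Python sketch is an assumption-laden illustration, not a described method: the function name classify_pair, the use of the outer mapped span of the read pair and the rule that a deviation of more than M bases is flagged are all introduced here.

```python
# Illustrative threshold rule for flagging an indel in the unread part of a
# fragment from the mapped spacing of its paired reads. Names are hypothetical.


def classify_pair(observed_span, expected_length, tolerance):
    """Compare the span covered on the reference with the known fragment length.

    observed_span   -- distance on the reference between the outer ends of
                       the two mapped reads
    expected_length -- nominal fragment length N
    tolerance       -- length uncertainty M (0 if the length is exact)

    Returns (label, size): label is "consistent", "deletion" or "insertion"
    in the sample relative to the reference; size is the apparent indel
    size in bases, itself uncertain by up to M.
    """
    deviation = observed_span - expected_length
    if abs(deviation) <= tolerance:
        return ("consistent", 0)
    if deviation > 0:
        # Reads map farther apart than the fragment is long: the sample
        # lacks bases that are present in the reference.
        return ("deletion", deviation)
    # Reads map closer together: the sample carries extra bases.
    return ("insertion", -deviation)
```

With M=0 a span of 501 bases against an expected 500 is flagged as a single base deletion, whereas with M=20 the same observation is indistinguishable from ordinary length variation.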
A standard method used to obtain a specific narrow fragment length distribution has two steps. The first step fragments the source target DNA (which is originally very long, e.g. genomic DNA) into shorter pieces. Methods used include sonication, forcing through a nozzle that forms tiny droplets (nebulisation), heat and radiation. These methods typically result in random fragments (e.g. minimal base-composition bias) but wide length distributions. The second step is to use a separation technique, e.g. electrophoresis or HPLC, to resolve this size distribution so that a small fraction can be extracted that has a narrow size distribution. With manual electrophoresis, a slice or stab may be manually taken from a slab gel. With HPLC or automated capillary electrophoresis, an automated fraction collector could be used.
There are two problems with these approaches. First, the size selection process throws away most of the DNA (i.e. all the fragments which are the wrong length). This is a significant waste, particularly where the amount of source DNA can be very limited (e.g. as in a tumor biopsy). The narrower the size range selected, the lower the percentage utilization of the input DNA.
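The trade-off between the width of the selected size range and the utilization of the input DNA can be illustrated numerically. The following Python sketch is an illustration under stated assumptions: the function name utilisation is hypothetical, and the uniform length distribution between 100 and 1000 bases used in the demonstration is invented here, since no particular distribution is specified above.

```python
# Illustrative calculation of the fraction of fragments retained by a size
# selection window. Names and the demonstration distribution are assumptions.
import random


def utilisation(lengths, target, half_width):
    """Fraction of fragments whose length lies within target +/- half_width."""
    if not lengths:
        raise ValueError("no fragments supplied")
    kept = sum(1 for length in lengths if abs(length - target) <= half_width)
    return kept / len(lengths)


if __name__ == "__main__":
    random.seed(0)
    # Hypothetical wide distribution of fragment lengths after shearing.
    lengths = [random.randint(100, 1000) for _ in range(10000)]
    for half_width in (100, 25, 5, 0):
        fraction = utilisation(lengths, 500, half_width)
        print(f"+/-{half_width:>3} bases: {fraction:.2%} of fragments retained")
```

Under any wide distribution of this kind, narrowing the window can only lower the retained fraction, and a window of exactly one length retains only a very small part of the input.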
The second problem is that the narrowness of the size range is determined by the separation precision of the separation technique used. There are both theoretical limits to the precision and practical difficulties in achieving those theoretically possible levels. In any technique based on the separation of physical fragments it is very difficult to obtain a collection of fragments where each fragment is exactly the same length. The inventors herein have therefore developed a method of preparing fragments of defined length for the purposes of paired end sequencing.