Several publications and patent documents are referenced in this application in order to more fully describe the state of the art to which this invention pertains. The disclosure of each of these publications and documents is incorporated by reference herein.
Advances in the study of biological molecules have been led, in part, by improvement in technologies used to characterise the molecules or their biological reactions. In particular, the study of the nucleic acids DNA and RNA has benefited from developing technologies used for sequence analysis.
The study of complex genomes, and in particular the search for the genetic basis of disease in humans, requires genotyping on a massive scale. Screens for numerous genetic markers, performed on populations large enough to yield statistically significant data, are needed before associations can be made between a given genotype and a particular disease. However, large-scale genotyping is demanding in terms of the cost of both materials and labour, and the time taken to perform the study, especially if the methodology employed involves separate serial analysis of individual DNA samples. One shortcut is to pool DNA from many individuals and to determine parameters such as the ratio of changes at certain positions in the genome. Such measurements of ‘allele frequency’ in the pool of samples can be used to correlate changes in the genome sequence with the occurrence of a disease. Hence, an association study involving 1000 patients would in theory necessitate only a ‘one-pot’ reaction for each genetic change. Pooling therefore represents an effective technique for analysing large numbers of samples in a facile manner.
One disadvantage of pooling samples prior to analysis is that information pertaining to individual DNA samples is lost; only global information such as allele frequencies is gathered, as there is no easy method for discerning which individuals gave rise to a particular genotype. An ability to genotype large populations in a small number of reactions, while retaining the information relating to the source of the individual samples, would yield the information content of a full non-pooled population screen in the time and at the cost of a pooled reaction.
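The trade-off between pooled and individual genotyping can be illustrated with a minimal sketch. The individual names and genotypes below are hypothetical and serve only to show that pooling yields allele frequencies while discarding the link between each allele and its source.

```python
from collections import Counter

# Hypothetical genotypes at one position for four diploid individuals.
genotypes = {
    "individual_1": "AA",
    "individual_2": "AG",
    "individual_3": "AA",
    "individual_4": "AG",
}

def pool_alleles(genotypes_by_individual):
    """Mix all alleles together, discarding which individual each came from."""
    pooled = []
    for genotype in genotypes_by_individual.values():
        pooled.extend(genotype)
    return pooled

def allele_frequencies(alleles):
    """Return the fraction of the pool represented by each allele."""
    counts = Counter(alleles)
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}

pool = pool_alleles(genotypes)
frequencies = allele_frequencies(pool)
```

The pooled result reports only that allele A makes up three quarters of the pool; it cannot be used to recover which individuals were heterozygous.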
Several of the new methods employed for high throughput DNA sequencing (Nature. 437, 376-380 (2005); Science. 309, 5741, 1728-1732 (2005)) rely on a universal amplification reaction, whereby a DNA sample is randomly fragmented, then treated such that the ends of the different fragments all contain the same DNA sequence. Fragments with universal ends can be amplified in a single reaction with a single pair of amplification primers. Separation of the library of fragments to the single molecule level prior to amplification ensures that the amplified molecules form discrete populations that can then be further analysed. Such separations can be performed either in emulsions (Nature. 437, 376-380 (2005); Science. 309, 5741, 1728-1732 (2005)), or on a surface (Nucleic Acids Research 27, e34 (1999); Nucleic Acids Research 15, e87 (2000)).
WO 98/44151 and WO 00/18957 both describe methods of forming polynucleotide arrays based on ‘solid-phase’ nucleic acid amplification, which is a bridging amplification reaction wherein the amplification products are immobilised on a solid support in order to form arrays comprised of nucleic acid clusters or ‘colonies’. Each cluster or colony on such an array is formed from a plurality of identical immobilised polynucleotide strands and a plurality of identical immobilised complementary polynucleotide strands. The arrays so-formed are generally referred to herein as ‘clustered arrays’ and their general features will be further understood by reference to WO 98/44151 or WO 00/18957, the contents of both documents being incorporated herein in their entirety by reference.
In common with all amplification techniques, solid-phase bridging amplification requires the use of forward and reverse amplification primers which include ‘template-specific’ nucleotide sequences which are capable of annealing to sequences in the template to be amplified, or the complement thereof, under the conditions of the annealing steps of the amplification reaction. The sequences in the template to which the primers anneal under conditions of the amplification reaction may be referred to herein as ‘primer-binding’ sequences.
Certain embodiments of the methods described in WO 98/44151 and WO 00/18957 make use of ‘universal’ primers to amplify templates comprising a variable template portion, which it is desired to amplify, flanked 5′ and 3′ by common or ‘universal’ primer-binding sequences. The ‘universal’ forward and reverse primers include sequences capable of annealing to the ‘universal’ primer-binding sequences in the template construct. The variable template portion, or ‘target’, may itself be of known, unknown or partially known sequence. This approach has the advantage that it is not necessary to design a specific pair of primers for each target sequence to be amplified; the same primers can be used for amplification of different templates provided that each template is modified by addition of the same universal primer-binding sequences to its 5′ and 3′ ends. The variable target sequence can therefore be any DNA fragment of interest. An analogous approach can be used to amplify a mixture of templates (targets with known ends), such as a plurality or library of target nucleic acid molecules (e.g. genomic DNA fragments), using a single pair of universal forward and reverse primers, provided that each template molecule in the mixture is modified by the addition of the same universal primer-binding sequences.
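The universal-primer principle described above can be sketched as follows. The flanking sequences, target fragments and function names are hypothetical and chosen purely for illustration; they are not sequences taken from WO 98/44151 or WO 00/18957. The sketch models only sequence matching on one strand, not the amplification chemistry.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]

# Hypothetical universal sequences added to the 5' and 3' ends of every target.
UNIVERSAL_5P = "ACACGTCTGA"
UNIVERSAL_3P = "TGGAATTCTC"

# The forward primer matches the 5' universal end (it anneals to the complement
# strand); the reverse primer anneals to the 3' universal end of the template.
FORWARD_PRIMER = UNIVERSAL_5P
REVERSE_PRIMER = reverse_complement(UNIVERSAL_3P)

def add_universal_ends(target):
    """Flank a variable target with the same universal primer-binding sequences."""
    return UNIVERSAL_5P + target + UNIVERSAL_3P

def is_amplifiable(template):
    """Check that the single primer pair matches both ends of the template."""
    return (template.startswith(FORWARD_PRIMER)
            and template.endswith(reverse_complement(REVERSE_PRIMER)))

# Targets of differing, arbitrary sequence (hypothetical fragments).
targets = ["GGATCCAT", "TTAGGCATTACG", "CCGTA"]
templates = [add_universal_ends(t) for t in targets]
all_amplifiable = all(is_amplifiable(t) for t in templates)
```

Because every template carries the same ends, one primer pair suffices regardless of the target sequence; an unmodified fragment such as `"GGATCCAT"` would not be recognised.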
DNA from more than one source can be sequenced on an array if each DNA sample is first tagged to enable its identification after it has been sequenced. Many low-scale DNA-tag methodologies already exist, for example fluorescent labelling (Haughland, Handbook of Fluorescent Probes and Research Products, Invitrogen/Molecular Probes), but these are limited in scope to fewer than 10 or so reactions in parallel. DNA tags can be added to the ends of DNA fragments by cloning, for example as described in U.S. Pat. No. 5,604,097. The tags consist of eight four-base ‘words’, where each word uses only three bases (A, T and C) in various combinations, resulting in a total of 16,777,216 different tags that all have the same base pair composition and melting points. Such tags are used to label target molecules in a sample so that after an amplification reaction, each original molecule in the sample has a unique tag. The tags can then be used to ‘sort’ the sample onto beads containing sequences complementary to the tags such that each bead contains multiple copies of a single amplified target sequence (Brenner et al., (2000) Nature Biotechnology, 18, 630). In this application the tags are not sequenced, so the method does not allow targets from multiple samples to be analysed; rather, it sorts a mixture of amplified templates from a single sample. The problem with enabling the method for individual samples rather than individual molecules is that the tags are synthesised in a combinatorial manner, meaning that all 16,777,216 different sequences are obtained in a single mixture in the same tube. Whilst this is ideal for treating one sample such that each individual molecule in the sample carries a different tag, it does not permit attachment of the same tag to every molecule in the sample.
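The tag count quoted above can be checked with simple arithmetic. The sketch below assumes that each of the eight word positions is filled independently from a fixed vocabulary of eight four-base words; the vocabulary size is not stated in the text and is inferred here from the quoted total.

```python
WORDS_PER_TAG = 8      # from the text: eight words per tag
BASES_PER_WORD = 4     # from the text: four bases per word
VOCABULARY_SIZE = 8    # assumption: inferred from the stated total, not given in the text

def tag_count(words_per_tag, vocabulary_size):
    """Number of distinct tags when each position is chosen independently."""
    return vocabulary_size ** words_per_tag

total_tags = tag_count(WORDS_PER_TAG, VOCABULARY_SIZE)
tag_length = WORDS_PER_TAG * BASES_PER_WORD
```

Under this assumption the total is 8 to the power 8, which equals the 16,777,216 tags quoted above, and each tag is 32 bases long.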
DNA samples from multiple sources can, however, be tagged with different nucleic acid tags such that the source of the sample can be identified. Previous application WO05068656 describes the generic concept of indexing samples. In order to utilise this invention on arrays of amplified single molecule templates, for example as described in WO9844151, WO06099579 or WO04069849, it is advantageous to prepare the nucleic acids using the novel method described herein. The optimised DNA sample preparation techniques described herein are applicable to any method where the samples are amplified prior to sequencing. The DNA sample preparation techniques presented herein describe in detail the optimal placements of the sequencing primers and indexed tags within the DNA constructs to be sequenced.
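The generic indexing concept can be illustrated with a minimal sketch in which each sequenced read is assigned to its source sample by its tag. The tag sequences, sample names and the placement of the tag at the start of the read are all assumptions made for illustration; they do not represent the construct layout of the present method.

```python
from collections import defaultdict

# Hypothetical index tags, one per source sample.
SAMPLE_TAGS = {"ACGT": "sample_1", "TGCA": "sample_2"}
TAG_LENGTH = 4  # assumption: fixed-length tag read before the target sequence

def assign_reads(reads, sample_tags, tag_length):
    """Group target sequences by the sample whose tag precedes them."""
    by_sample = defaultdict(list)
    for read in reads:
        tag, target = read[:tag_length], read[tag_length:]
        # Reads whose tag matches no known sample are kept separately.
        source = sample_tags.get(tag, "unassigned")
        by_sample[source].append(target)
    return dict(by_sample)

# Hypothetical reads from a mixture of tagged samples.
reads = ["ACGTGGATCC", "TGCATTAGGC", "ACGTCCATGG", "NNNNAAAA"]
assigned = assign_reads(reads, SAMPLE_TAGS, TAG_LENGTH)
```

Because every molecule from a given sample carries the same tag, the mixed reads can be sequenced together and still traced back to their individual sources.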