Several publications and patent documents are referenced in this application in order to more fully describe the state of the art to which this invention pertains. The disclosure of each of these publications and documents is incorporated by reference herein.
Molecular biology and pharmaceutical drug development now make intensive use of nucleic acid analysis. The most challenging areas are whole genome sequencing, single nucleotide polymorphism detection, screening and gene expression monitoring.
In many eukaryotes, between 10% and 30% of cytosine bases are modified by the enzymatic addition of a methyl group to the 5 position of the base. Although this modification does not interfere with the fidelity of DNA replication processes, it enables modulation of diverse cellular processes through protein interactions with hypo- or hyper-methylated sequences. These methylated sequences are not randomly dispersed throughout a genome, but instead are almost exclusively found in repetitive CpG sequences in regulatory regions upstream of many genes. Methylation of these sequences is associated with repression of gene activity and can result in global changes to gene expression. For example, methylation plays a central role in the inactivation of one of the two X chromosomes in female cells, which is a prerequisite for ensuring that females do not produce twice the level of X-linked gene products that males do. Methylation also underlies the selective repression of either the maternally or paternally inherited copy of pairs of alleles in a process known as genetic imprinting. It also silences transposable elements whose expression would otherwise be deleterious to a genome.
Patterns of methylation in a genome are heritable because of the semi-conservative nature of DNA replication. During this process, the daughter strand, newly replicated on a methylated template strand, is not initially methylated, but the template strand directs methyltransferase enzymes to fully methylate both strands. Thus methylation patterns carry an extra level of genetic information down through the generations in addition to that information inherited in the primary sequence of the four nucleotides.
Aberrant patterns of genomic methylation also correlate with disease states and are among the earliest and most common alterations found in human malignancies.
Moreover, mistakes made during the establishment of methylation patterns during development underlie several specific inherited disorders. Consequently, there is a demand for high throughput approaches for profiling the methylation status of many genes in parallel both for research purposes and for clinical applications.
Many methods already exist for detecting the methylation of DNA and they can be broadly classified depending on the level of sequence-specific information they produce. On the simplest level, there are techniques that only yield information on overall levels of methylation within a genome. For example, methylated sequences can be separated from unmethylated sequences on reverse-phase HPLC due to the difference in hydrophobicity of DNase I-treated DNA. Such methods are simple but do not provide any information regarding the sequence context of the methylation sites. Alternatively, pairs of restriction endonucleases that recognize the same sequence but have different sensitivities to cytosine methylation at that sequence can be used. Methylation at this sequence will render it refractory to cleavage by one enzyme, while it remains sensitive to the other. If no cytosine bases are methylated in a sequence, both enzymes will produce identically sized restriction fragments. In contrast, if methylation is present, the enzymes will produce different sizes of fragments that can be distinguished by standard analytical techniques such as electrophoresis through agarose. If Southern blot analysis is subsequently performed and the bands probed with a labelled fragment from a gene of interest, then information on the sequence context of the methylation site can be obtained. These methods are limited because they are dependent on the availability of useful restriction enzymes and are confined to the study of methylation patterns among sequences that contain those restriction sites.
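By way of illustration only, the paired-enzyme comparison described above may be sketched in Python. The sketch assumes the well-known isoschizomer pair HpaII and MspI, both of which recognise CCGG, with HpaII blocked by methylation of the internal cytosine; this pair is not named in the text above and is chosen here as an example. The function names, the input format for methylated positions and the test sequence are hypothetical.

```python
# Illustrative sketch only. Assumption: the enzyme pair is HpaII / MspI, which
# both recognise CCGG and cut C^CGG; HpaII is blocked when the internal
# cytosine is methylated, MspI is not.

SITE = "CCGG"
CUT_OFFSET = 1  # cut falls after the first base of the site


def find_sites(seq):
    """Return the start position of every CCGG site in seq."""
    positions = []
    start = seq.find(SITE)
    while start != -1:
        positions.append(start)
        start = seq.find(SITE, start + 1)
    return positions


def digest(seq, methylated_cytosines, methylation_sensitive):
    """Return fragment lengths after a complete digest.

    methylated_cytosines is a set of 0-based positions of methylated C
    bases (a hypothetical input format chosen for this sketch).
    """
    cuts = []
    for site in find_sites(seq):
        internal_c = site + 1
        if methylation_sensitive and internal_c in methylated_cytosines:
            continue  # the sensitive enzyme cannot cleave this site
        cuts.append(site + CUT_OFFSET)
    bounds = [0] + cuts + [len(seq)]
    return [b - a for a, b in zip(bounds, bounds[1:])]


seq = "ATATCCGGTTTTTTCCGGAAAA"
methylated = {15}  # internal C of the second CCGG site
sensitive_fragments = digest(seq, methylated, methylation_sensitive=True)
insensitive_fragments = digest(seq, methylated, methylation_sensitive=False)
print(sensitive_fragments, insensitive_fragments)
```

A difference between the two fragment lists indicates methylation at one or more recognition sites, whereas identical lists indicate that no site is methylated.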
5-Methylcytosine (5mC) is a key epigenetic DNA modification in mammalian genomes. It occurs almost exclusively in the dinucleotide sequence mCpG and plays a central role in development and disease. Among the various methods for large-scale DNA-methylation analysis, only bisulfite sequencing affords single CpG resolution.
Bisulfite deaminates unmethylated cytosine to uracil, while 5mC is not affected. Sequencing PCR-amplified bisulfite-converted DNA thus displays C and 5mC as T and C, respectively.
Methods that do not rely on sequence context but which can detect methylation at any chosen sequence are mainly based on the sodium bisulfite reaction. Under controlled conditions, this reagent converts cytosine to uracil while methyl-cytosine remains unmodified. If the treated DNA is then sequenced, the detection of a cytosine indicates that the cytosine is methylated because it would have been otherwise converted to a uracil.
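The logic of bisulfite-based detection may be illustrated by the following minimal Python sketch. It assumes complete conversion of unmethylated cytosine, considers a single strand only, and represents uracil as T, as it would be read after PCR amplification. The function names and the example sequence are hypothetical.

```python
# Illustrative sketch only. Assumptions: conversion is complete, only one
# strand is considered, and uracil is read as T after PCR amplification.


def bisulfite_convert(seq, methylated_positions):
    """Return the sequence as read after bisulfite treatment and PCR.

    Unmethylated C becomes T; methylated C (0-based positions given in
    methylated_positions) is left unchanged.
    """
    converted = []
    for i, base in enumerate(seq):
        if base == "C" and i not in methylated_positions:
            converted.append("T")
        else:
            converted.append(base)
    return "".join(converted)


def call_methylation(reference, converted_read):
    """Return positions where a reference C is still read as C."""
    return {
        i
        for i, (ref_base, read_base) in enumerate(zip(reference, converted_read))
        if ref_base == "C" and read_base == "C"
    }


reference = "ACGTCGACCG"
read = bisulfite_convert(reference, methylated_positions={1, 8})
calls = call_methylation(reference, read)
print(read, sorted(calls))
```

Any cytosine that survives in the converted read is called as methylated, since an unmethylated cytosine at that position would have been read as thymine.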
Standard Sanger sequencing procedures have the disadvantage that only a limited number of sequencing reactions can be performed at the same time. Moreover, PCR amplification and sub-cloning may be necessary to produce sufficient quantities of DNA for sequencing, and both methods can introduce artifacts into the sequence, including changes in methylation.
Microarrays comprise molecular probes, such as nucleic acid molecules, arranged systematically onto a solid, generally flat surface or on a collection of beads or microspheres. Each probe site comprises a reagent such as a single stranded nucleic acid, whose molecular recognition of a complementary nucleic acid molecule leads to a detectable signal, often based on fluorescence. Microarrays comprising many thousands of probe sites can be used to monitor gene expression profiles for a large number of genes in a single experiment in a hybridisation-based format.
Nucleic acid probes on microarrays are generally made in two ways. A combination of photochemistry and DNA synthesis allows base-by-base synthesis of the probes in situ. This is the approach pioneered by Affymetrix for growing short strands of around 25 bases. Their ‘genechips’ are commercially available and widely used (e.g., Wodicka et al., 1997, Nature Biotechnology 15:1359-1367), despite the expense of making arrays designed for a particular experiment. Another method for preparing microarrays is to use a robot to spot small (nL) volumes of nucleic acid sequences onto discrete areas of the surface. Microarrays prepared in this manner have less dense features than Affymetrix arrays, but are more universal and cheaper to prepare (e.g., Schena et al., 1995, Science 270:467-470). The main drawback of all types of standard microarrays is the complex hardware required to achieve a spatial distribution of multiple copies of the same DNA sequence.
WO 98/44151 and WO 00/18957 both describe methods of forming polynucleotide arrays based on “solid-phase” nucleic acid amplification, which is analogous to a polymerase chain reaction wherein the amplification products are immobilised on a solid support in order to form arrays comprised of nucleic acid clusters or “colonies”. Each cluster or colony on such an array is formed from a plurality of identical immobilised polynucleotide strands and a plurality of identical immobilised complementary polynucleotide strands. The arrays so-formed are generally referred to herein as “clustered arrays” and their general features will be further understood by reference to WO 98/44151 or WO 00/18957, the contents of both documents being incorporated herein in their entirety by reference.
As aforesaid, the solid-phase amplification methods of WO 98/44151 and WO 00/18957 are essentially a form of the polymerase chain reaction (PCR) carried out on a solid support. Like any PCR, these methods require the use of forward and reverse amplification primers capable of annealing to a template to be amplified. In the methods of WO 98/44151 and WO 00/18957, both primers are immobilised on the solid support at the 5′ end. Other forms of solid-phase amplification are known in which only one primer is immobilised and the other is present in free solution (Mitra, R. D. and Church, G. M., Nucleic Acids Research, 1999, Vol. 27, No. 24).
In common with all PCR techniques, solid-phase PCR amplification requires the use of forward and reverse amplification primers which include “template-specific” nucleotide sequences which are capable of annealing to sequences in the template to be amplified, or the complement thereof, under the conditions of the annealing steps of the PCR reaction. The sequences in the template to which the primers anneal under conditions of the PCR reaction may be referred to herein as “primer-binding” sequences.
PCR amplification cannot occur in the absence of annealing of the forward and reverse primers to primer binding sequences in the template to be amplified under the conditions of the annealing steps of the PCR reaction, i.e. if there is insufficient complementarity between primers and template. Some prior knowledge of the sequence of the template is, therefore, required before one can carry out a PCR reaction to amplify a specific template. The user generally must know the sequence of at least the primer-binding sites in the template in advance so that appropriate primers can be designed, although the remaining sequence of the template may be unknown. The need for prior knowledge of the sequence of the template increases the complexity and cost of solid phase PCR of complex mixtures of templates, such as genomic DNA fragments.
Certain embodiments of the methods described in WO 98/44151 and WO 00/18957 make use of “universal” primers to amplify templates comprising a variable template portion that it is desired to amplify, flanked 5′ and 3′ by common or “universal” primer binding sequences. The “universal” forward and reverse primers include sequences capable of annealing to the “universal” primer binding sequences in the template construct. The variable template portion may itself be of known, unknown or partially known sequence. This approach has the advantage that it is not necessary to design a specific pair of primers for each template to be amplified; the same primers can be used for amplification of different templates provided that each template is modified by addition of the same universal primer-binding sequences to its 5′ and 3′ ends. The variable template sequence can therefore be any DNA fragment of interest. An analogous approach can be used to amplify a mixture of templates, such as a plurality or library of template nucleic acid molecules (e.g. genomic DNA fragments), using a single pair of universal forward and reverse primers, provided that each template molecule in the mixture is modified by the addition of the same universal primer-binding sequences.
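The universal primer approach may be illustrated by the following minimal Python sketch, in which the same flanking sequences are added to several different inserts so that one primer pair matches every resulting construct. The universal sequences shown are invented for illustration and do not correspond to any sequence in the cited documents. Exact string matching is used as a simplified stand-in for annealing, and all names are hypothetical.

```python
# Illustrative sketch only. Assumptions: the universal sequences below are
# invented, and exact matching stands in for primer annealing.

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


UNIVERSAL_5 = "ACACGTCTGAAC"  # hypothetical 5' universal sequence
UNIVERSAL_3 = "GGATCTCATGCT"  # hypothetical 3' universal sequence

# The forward primer matches the 5' end of the top strand; the reverse
# primer is complementary to the 3' end of the top strand.
FORWARD_PRIMER = UNIVERSAL_5
REVERSE_PRIMER = reverse_complement(UNIVERSAL_3)


def add_universal_ends(insert):
    """Flank a variable insert with the common primer-binding sequences."""
    return UNIVERSAL_5 + insert + UNIVERSAL_3


def can_amplify(template):
    """Return True if both universal primers have a binding site."""
    return template.startswith(FORWARD_PRIMER) and template.endswith(
        reverse_complement(REVERSE_PRIMER)
    )


inserts = ["ATGC", "GGGTTTAAACCC", "TTTTTTTT"]
library = [add_universal_ends(insert) for insert in inserts]
print([can_amplify(template) for template in library])
```

Each construct in the library is recognised by the single primer pair regardless of the insert sequence, whereas an unmodified insert is not.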
Such “universal primer” approaches to solid-phase amplification are advantageous since they enable multiple template molecules of the same or different, known or unknown sequence to be amplified in a single amplification reaction on a solid support bearing a single pair of “universal” primers. Simultaneous amplification of a mixture of templates of different sequences by PCR would otherwise require a plurality of primer pairs, each pair being complementary to each unique template in the mixture. The generation of a plurality of primer pairs for each individual template is not a viable option for complex mixtures of templates.
Adaptors that contain universal priming sequences can be ligated onto the ends of templates. The adaptors may be single-stranded or double-stranded. If double-stranded, they may have overhanging ends that are complementary to overhanging ends on the template molecules that have been generated with a restriction endonuclease. Alternatively, the double-stranded adaptors may be blunt ended, in which case the templates are also blunt ended. The blunt ends of the templates may have been formed during a process to shear the DNA into fragments, or they may have been formed by a “polishing” reaction, as would be well known to those skilled in the art, or may have been treated to give a single nucleotide overhang.
A single adaptor or two different adaptors may be used in a ligation reaction with templates. If a template has been manipulated such that its ends are the same, i.e. both are blunt ended or both have the same overhang, then ligation of a single compatible adaptor will generate a template with that adaptor on both ends. However, if two compatible adaptors, adaptor A and adaptor B, are used, then three permutations of ligated products are formed: template with adaptor A on both ends, template with adaptor B on both ends, and template with adaptor A on one end and adaptor B on the other end. This last product is, under some circumstances, the only desired product from the ligation reaction and consequently additional purification steps are necessary following the ligation reaction to purify it from the ligation products that have the same adaptor at both ends.
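The ligation outcomes described above may be enumerated with the following minimal Python sketch. The proportions it reports rest on an assumption not stated in the text above, namely that each template end is ligated independently and that every adaptor is equally likely to be ligated to it. The function names are hypothetical.

```python
# Illustrative sketch only. Assumption: each template end is ligated
# independently and every adaptor is equally likely at each end.
from collections import Counter
from itertools import product


def ligation_products(adaptors):
    """Enumerate (left end, right end) adaptor assignments for a template."""
    return list(product(adaptors, repeat=2))


def product_fractions(adaptors):
    """Return the expected fraction of each product, ignoring orientation."""
    counts = Counter(tuple(sorted(pair)) for pair in ligation_products(adaptors))
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}


single = product_fractions(["A"])
double = product_fractions(["A", "B"])
print(single)
print(double)
```

Under the stated assumption, only half of the ligated templates carry adaptor A on one end and adaptor B on the other, which is consistent with the need for a purification step when that product alone is desired.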
The above-mentioned prior art methods have inherent shortcomings that limit their utility with respect to a variety of genome wide analyses. Accordingly, the method of the present invention expands the spectrum of genome wide analyses that can be performed.