The present invention relates to a method for characterising DNA, especially cDNA, so that the DNA may be identified, for example, from a population of DNAs. The invention also relates to a method for assaying the DNA.
Analysis of complex nucleic acid populations is a common problem in many areas of molecular biology, nowhere more so than in the analysis of patterns of gene expression. Various methods have been developed to allow simultaneous analysis of entire mRNA populations, or their corresponding cDNA populations, to enable us to begin to understand patterns of gene expression in vivo.
Present methods, however, suffer from numerous drawbacks. The simplest methods such as ‘subtractive cloning’ allow only crude comparative information about differences in gene expression between related cell types to be derived, although these methods have had moderate success in isolating rare cDNAs. Other methods such as ‘differential display’ and related ‘molecular indexing’ methods allow broader comparisons of gene expression between cell types, but embodiments of these methods to date have been difficult to automate and are dependent on gel electrophoresis for analysis. Still more informative methods have arrived recently, such as SAGE (Serial Analysis of Gene Expression), which give quantitative data on gene expression without prior knowledge and can readily and specifically identify cDNAs expressed in a given cell type, but at the cost of excessive sequencing.
The method of “subtractive cloning” (Lee et al, Proc. Nat. Acad. Sci. USA 88, 2825-2829) allows identification of mRNAs, or rather, their corresponding cDNAs, that are differentially expressed in two related cell types. One can selectively eliminate cDNAs common to two related cell types by hybridising cDNAs from a library derived from one cell type to a large excess of mRNA from a related, but distinct cell type. mRNAs in the second cell type complementary to cDNAs from the first type will form double-stranded hybrids. Various enzymes exist which degrade such ds-hybrids, allowing these to be eliminated and thus enriching the remaining population in cDNAs unique to the first cell type.
The method of “differential display” (Liang and Pardee, Science 257, 967-971, 1992) sorts mRNAs using PCR primers to selectively amplify specific subsets of an mRNA population. An mRNA population is sub-divided into aliquots, each of which is primed with a series of “anchored” poly-T primers to effect reverse transcription with normalisation of the length of the poly-A tail. A set of redundant gene-specific primers of about 10 nucleotides is then used to amplify the reverse strand. Typically a set of 30 such primers is used. In this way mRNAs are characterised by the length of their amplification products. The resultant amplified sub-populations can then be cloned for screening or sequencing, or the fragments can simply be separated on a sequencing gel. Low copy number mRNAs are less likely to be lost in this sort of scheme than in subtractive cloning, for example, and it is probably marginally more reproducible. Whilst this method is more general than subtractive cloning, time-consuming analysis is required. Unfortunately, with these methods each cDNA may have multiple amplification products. Furthermore, the methods are not quantitative and comparative information can only be determined for relatively closely related cell types, e.g. diseased and normal forms of a particular tissue from the same organism.
The method of serial analysis of gene expression (Velculescu et al., Science 270, 484-487, 1995) allows identification of mRNAs, or rather, their corresponding cDNAs, that are expressed in a given cell type. It gives quantitative information about the levels of those cDNAs as well. The process involves isolating a signature ‘tag’ from every cDNA in a population using adaptors and type IIs restriction endonucleases. A tag is a sample of a cDNA sequence of a fixed number of nucleotides sufficient to uniquely identify that cDNA in the population. Tags are then ligated together and sequenced. The method gives quantitative data on gene expression and will readily identify novel cDNAs.
Methods involving hybridisation grids, chips and arrays are advantageous in that they avoid gel methods for sequencing and are quantitative. They can be performed entirely in solution, thus are readily automatable. Such arrays of oligonucleotides are a relatively novel approach to nucleic acid analysis, allowing mutation analysis, sequencing by hybridisation and mRNA expression analysis. For gene expression analysis oligonucleotides complementary to and unique to known RNAs can be arrayed on a solid phase support such as a glass slide or membrane. Labelled cDNAs or mRNA are hybridised to the array. The appearance of labelled nucleic acid immobilised at a specific locus on the array is indicative of the presence of the corresponding mRNA to which the oligonucleotide at that locus is complementary. Methods of construction of such arrays have been developed, (see for example: A. C. Pease et al. Proc. Natl. Acad. Sci. USA. 91, 5022-5026, 1994; U. Maskos and E. M. Southern, Nucleic Acids Research 21, 2269-2270, 1993; E. M. Southern et al., Nucleic Acids Research 22, 1368-1373, 1994) and further methods are envisaged. Unfortunately, these methods require that the sequence of RNAs be known prior to construction of the array. This means that this approach is not applicable to organisms for which little or no information is known.
Immobilisation can be followed by partial sequencing of fragments by a single base method, e.g. using type IIs restriction endonucleases and adaptors. This particular approach is advocated by Brenner in PCT/US95/12678.
Arrays of oligonucleotides of N bp length can be employed. The array carries all 4^N possible oligonucleotides at specific points on the grid. Nucleic acids are hybridised as single strands to the array. Detection of hybridisation is achieved by fluorescently labelling each nucleic acid and determining from where on the grid the fluorescence arises, which determines the oligonucleotide to which the nucleic acid has bound. The fluorescent labels also give quantitative information about how much nucleic acid has hybridised to a given oligonucleotide. This information and knowledge of the relative quantities of individual nucleic acids should be sufficient to reconstruct the sequences and quantities of the hybridising population. This approach is advocated by Lehrach in numerous papers and Nucleic Acids Research 22, 3423 contains a recent discussion. A disadvantage of this approach is that the construction of large arrays of oligonucleotides is extremely technically demanding and expensive.
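The size of such an array grows rapidly with N, which is one reason its construction is demanding. The following is a minimal sketch, not taken from any of the cited papers, that enumerates every oligonucleotide of a given length; the function name `all_oligos` is illustrative only.

```python
from itertools import product

def all_oligos(n):
    """Return every DNA oligonucleotide of length n, in lexicographic order."""
    return ["".join(bases) for bases in product("ACGT", repeat=n)]

# A complete array of 4-mers needs 4**4 positions, one of 8-mers needs 4**8.
print(len(all_oligos(4)))
print(len(all_oligos(8)))
```

An array of 4-mers therefore carries 256 distinct probes, whereas an array of 8-mers already requires 65536.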
The method of “molecular indexing” (PCT/GB93/01452) uses populations of adaptor molecules to hybridise to the ambiguous sticky-ends generated by cleavage of a nucleic acid with a type IIs restriction endonuclease to categorise the cleavage fragments. Using specifically engineered adaptors one can specifically immobilise or amplify or clone specific subsets of fragments in a manner similar to differential display but achieving a greater degree of sorting and control. However, time-consuming analysis is required and the methods disclosed in this patent application are difficult and expensive to automate.
The method of Kato (Nucleic Acids Research 23, 3685-3690, 1995) exemplifies the above molecular indexing approach and effects cDNA population analysis by sorting terminal cDNA fragments into sub-populations followed by selective amplification of specific subsets of cDNA fragments. Sorting is effected by using type IIs restriction endonucleases and adaptors. The adaptors also carry primer sites which, in conjunction with general poly-T primers, allow selective amplification of terminal cDNA fragments as in differential display. It is possibly more precise than differential display in that it effects greater sorting: only about 100 cDNAs will be present in a given subset and sorting can be related to specific sequence features rather than using primers chosen by trial and error. The subsets can then be analysed by gel electrophoresis to separate the fragments by length and generate a profile of mRNAs in a tissue. This method is dependent on PCR amplification, which distorts the frequencies of each cDNA present. Furthermore, the methods of analysis used so far have been dependent on gel electrophoresis.
The Gene Profiling technology described in patent PCT/GB97/02403 provides a further method of molecular indexing for the analysis of patterns of gene expression in a cell by sampling each cDNA within the population of that cell. In one embodiment, the sampling system takes two samples of 4 bp from each cDNA in a population and determines their sequence with respect to a defined reference point. The methods of this invention are amenable to automation but require many steps to derive signature information.
EP-A-735144 discloses a method for characterising cDNA. An array of adaptors is used to identify a short sequence sample at the 5′ terminus of a 3′ terminal restriction fragment of a cDNA. The cDNA is generated by cleavage with a type IIS restriction endonuclease. The adaptors introduce a primer sequence into the terminus of the fragments. The sequence identifies the sticky end generated by the cleavage. The adaptored fragments are cleaved with a further type IIS restriction endonuclease and fragments are selectively amplified using the adaptor primer sequence and a poly-T primer. This process is used to resolve a population of terminal restriction fragments. The method allows further resolution by separating the fragments according to sequence length.
U.S. Pat. No. 5,508,169 discloses a population of adaptor molecules which may be hybridised and ligated to ambiguous sticky ends generated by cleavage of a nucleic acid with a type IIS restriction endonuclease. The adaptors are disclosed in relation to the construction of a “universal endonuclease”.
All of the above methods are relatively laborious and rely upon sequencing by traditional gel methods. Moreover, the methods require amplification by PCR, which is prone to produce artefacts.
It is an object of this invention to provide a method of gene expression profiling that is amenable to high throughput and automation and that has great sensitivity. In this way it should be possible to avoid the need for exponential amplification of cDNAs. Such amplification distorts the frequencies of the cDNAs, and these frequencies are essential information in interpreting changes in gene expression patterns between different states of a given tissue and between different tissues of the same organism which have differentiated differently. This invention provides methods to derive a signature for each cDNA in a library which require fewer steps, hence reducing sample loss and distortion of the quantities of each mRNA, by exploiting restriction fragment length polymorphisms to provide information about cDNAs.
Accordingly, the present invention provides a method for characterising cDNA, which comprises:
(a) exposing a sample comprising a population of one or more cDNAs or fragments thereof to a cleavage agent which recognises a predetermined sequence and cuts at a reference site at a known displacement from the predetermined sequence proximal to an end of each cDNA or fragment thereof so as to generate a population of terminal fragments;
(b) ligating to each reference site an adaptor oligonucleotide which comprises a recognition site for a sampling cleavage agent;
(c) exposing the population of terminal fragments to a sampling cleavage agent which binds to the recognition site and cuts at a sampling site of known displacement from the recognition site so as to generate in each terminal fragment a sticky end sequence of a predetermined length of up to 6 bases, preferably 3 to 5 bases, and of unknown sequence;
(d) separating the population of terminal fragments into sub-populations according to sequence length; and
(e) determining each sticky end sequence.
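Steps (a) to (e) can be modelled in silico to show what information the method yields for each cDNA. The following is a simplified sketch only: the reference site `GATC`, the `offset` between the reference site and the sampling site, and the sticky end length `k` are illustrative assumptions, and the cDNA is treated as a single sense strand ending in its poly-A tail.

```python
def signature(cdna, ref_site="GATC", offset=0, k=4):
    """Return (terminal fragment length, sticky end) for one cDNA sense strand.

    ref_site, offset and k are hypothetical parameters chosen for illustration.
    Returns None if the cDNA carries no reference site or the sampling site
    would run past the end of the molecule.
    """
    # Step (a): the reference site nearest the 3' (poly-A) end defines the terminal fragment.
    i = cdna.rfind(ref_site)
    if i < 0:
        return None
    # Steps (b) and (c): the adaptor places the sampling cut at a known displacement.
    start = i + len(ref_site) + offset
    sticky = cdna[start:start + k]
    if len(sticky) < k:
        return None
    # Steps (d) and (e): fragment length and sticky end sequence form the signature.
    return (len(cdna) - start, sticky)

example = "ATGGCC" + "GATC" + "TTGACCG" + "GATC" + "ACGT" + "TAGC" + "A" * 8
print(signature(example))
```

For the example sequence the second `GATC` is used, since it lies nearest the poly-A tail, and the signature is a fragment length of 16 with sticky end `ACGT`.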
It is not necessary to sequence an entire cDNA to identify uniquely its presence; only a short ‘signature’ of a few base pairs should be sufficient to identify uniquely all cDNAs, given, for example, a total cDNA population of about 80 000 in the human genome. Given also that in the next few years the entire human genome will have been sequenced, it should be possible to use such signatures derived by this process to acquire the entire sequence of the original cDNAs from a sequence database. With the incomplete database that already exists, signatures that return no sequence from the database will probably be novel and this process will readily allow them to be isolated for complete sequencing.
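A rough capacity estimate indicates how much the fragment length must contribute to the signature. The sketch below assumes, purely for illustration, that signatures are spread evenly over all combinations of sticky end and fragment length; real cDNA populations will not be evenly spread, so the result is a lower bound only.

```python
import math

def min_length_bins(n_cdnas, sticky_len):
    """Fewest distinguishable fragment lengths needed so that
    (number of sticky ends) x (number of lengths) >= n_cdnas."""
    return math.ceil(n_cdnas / 4 ** sticky_len)

# Lower bounds for a population of about 80 000 cDNAs.
for sticky_len in (3, 4, 5):
    print(sticky_len, min_length_bins(80_000, sticky_len))
```

With a 4 base sticky end (256 possibilities), at least 313 distinguishable fragment lengths would be needed under this idealised assumption; with 5 bases the figure falls to 79.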
The cleavage agent is preferably a type II restriction endonuclease. In this case the reference site will contain the predetermined sequence (i.e. the known displacement will be zero). Alternatively, a type IIs restriction endonuclease or a chemical agent coupled to an oligonucleotide may be used. A sticky end or a blunt end may be generated, although a sticky end is preferred.
Preferably each terminal fragment has a poly-A tail. This provides a useful method for identifying the terminal fragment using a poly-T primer for reverse transcription. Alternatively, the 5′ cap of the cDNA may be targeted.
In more detail, the first aspect of the present invention is a method which comprises the steps of:
1) generating ‘anchored’ cDNA captured on a solid phase support at the poly-T terminus. The cDNA is preferably methylated;
2) cleaving the cDNA fragments with a type II restriction endonuclease, and washing away cleaved fragments. Preferably the type II restriction endonuclease generates a known sticky-end;
3) ligating double stranded adaptors to the restricted cDNAs. Preferably the adapters bear a single stranded overlap complementary to a known sticky end generated by the restriction endonuclease from step (2) above. The double stranded region of the adapter bears a recognition sequence for a type IIs restriction endonuclease;
4) contacting the adaptored cDNAs with a type IIs restriction endonuclease to cleave the adapters from the cDNAs leaving an ambiguous sticky end of a predetermined length;
5) ligating a set of double stranded adaptors to the restricted cDNAs. The set of adaptors preferably comprises adapters bearing all possible single base extensions complementary to the ambiguous sticky-end of predetermined length generated in step (4). The adapters further comprise a mass label, cleavably linked to the adapter at the 5′ end distal from the ligation site, that uniquely identifies the sequence of the overlap of each adapter in the set when analysed by mass spectrometry. Optionally, each adapter may additionally comprise a primer sequence, such that each adapter has a unique primer sequence which corresponds to its overlapping sticky-end;
6) preferably conditioning the captured cDNAs for mass spectrometry;
7) denaturing the free strand from the captured strand releasing it into solution. This strand should bear the mass-label;
8) analysing the mass labelled cDNA terminal restriction fragments by Capillary Electrophoresis Mass Spectrometry.
The second aspect of the present invention is a method which comprises the steps of:
1) generating ‘anchored’ cDNA captured on a solid phase support at the poly-T terminus. The cDNA is preferably methylated;
2) cleaving the cDNA fragments with a type II restriction endonuclease, and washing away cleaved fragments. Preferably the type II restriction endonuclease generates a known sticky-end;
3) ligating double stranded adaptors to the restricted cDNAs. Preferably the adapters bear a single stranded overlap complementary to a known sticky end generated by the restriction endonuclease from step (2) above. The double stranded region of the adapter bears a recognition sequence for a type IIs restriction endonuclease;
4) contacting the adaptored cDNAs with a type IIs restriction endonuclease to cleave the adapters from the cDNAs leaving an ambiguous sticky end of a predetermined length;
5) ligating a set of double stranded adaptors to the restricted cDNAs. The set of adaptors preferably comprises adapters bearing all possible single base extensions complementary to the ambiguous sticky-end of predetermined length generated in step (4). The adapters further comprise a mass label, cleavably linked to the adapter at the 5′ end distal from the ligation site, that uniquely identifies the sequence of the overlap of each adapter in the set when analysed by mass spectrometry. Optionally, each adapter may additionally comprise a primer sequence, such that each adapter has a unique primer sequence which corresponds to its overlapping sticky-end;
6) denaturing the free strand from the captured strand releasing it into solution. This strand should bear the mass label. The captured strands are thus rendered single stranded;
7) contacting the captured single-stranded cDNAs with mass labelled primers complementary to the primer sequence provided by the adapters. The mass label attached to each primer identifies the sticky-end of the adapter to which the primer is complementary. Primers are preferably non-complementary to one another and have equalised melting temperatures, and can thus be added simultaneously. Optionally a second primer or set of primers may be used. These may be the anchored primers used in the synthesis of cDNA or may be a primer complementary to a site provided 5′ of the anchored poly-T sequence;
8) extending primers in correctly hybridised duplexes with a DNA polymerase in the presence of nucleotide triphosphates. This may be an exponential amplification if a second primer or set of primers is used;
9) melting the extended labelled strands off the immobilised template;
10) preferably conditioning the captured cDNAs for mass spectrometry;
11) determining the length of each of the amplified fragments and determining the identity of each of the amplified fragments by detection of the label incorporated with its primer. This detection is preferably performed by capillary electrophoresis mass spectrometry.
PCT/GB98/00127 describes nucleic acid probes labelled with markers that are resolvable by mass spectrometry. Such mass labelled probes would permit the analysis described here to be performed very rapidly as a captured library of restriction fragments can be probed with a number of uniquely mass labelled primers simultaneously.
The construction of adaptor oligonucleotides is well known and details and reviews are available in numerous texts, including: Gait, M. J., editor, ‘Oligonucleotide Synthesis: A Practical Approach’, IRL Press, Oxford, 1990; Eckstein, editor, ‘Oligonucleotides and Analogues: A Practical Approach’, IRL Press, Oxford, 1991; Kricka, editor, ‘Nonisotopic DNA Probe Techniques’, Academic Press, San Diego, 1992; Haugland, ‘Handbook of Fluorescent Probes and Research Chemicals’, Molecular Probes, Inc., Eugene, 1992; Keller and Manak, ‘DNA Probes, 2nd Edition’, Stockton Press, New York, 1993; and Kessler, editor, ‘Nonradioactive Labeling and Detection of Biomolecules’, Springer-Verlag, Berlin, 1992.
Conditions for using such adaptors are also well known. Details on the effects of hybridisation conditions for nucleic acid probes are available, for example, in any one of the following texts: Wetmur, Critical Reviews in Biochemistry and Molecular Biology, 26, 227-259, 1991; Sambrook et al, ‘Molecular Cloning: A Laboratory Manual, 2nd Edition’, Cold Spring Harbor Laboratory, New York, 1989; and Hames, B. D., Higgins, S. J., ‘Nucleic Acid Hybridisation: A Practical Approach’, IRL Press, Oxford, 1988.
Likewise, ligation of adaptors is well known and chemical methods of ligation are discussed, for example, in Ferris et al, Nucleosides and Nucleotides 8, 407-414, 1989; and Shabarova et al, Nucleic Acids Research 19, 4247-4251, 1991.
Preferably, enzymatic ligation would be used and preferred ligases are T4 DNA ligase, T7 DNA ligase, E. coli DNA ligase, Taq ligase, Pfu ligase, and Tth ligase. Details of such ligases are found, for example, in: Lehman, Science 186, 790-797, 1974; and Engler et al, ‘DNA Ligases’, pg 3-30 in Boyer, editor, ‘The Enzymes, Vol 15B’, Academic Press, New York, 1982. Protocols for the use of such ligases can be found in: Sambrook et al, cited above; Barany, PCR Methods and Applications, 1: 5-16, 1991; and Marsh et al, Strategies 5, 73-76, 1992.
One potential problem with the use of adaptors is to ensure that hybridisation of probes is accurate. There are major differences between the stability of short oligonucleotide duplexes containing all Watson-Crick base pairs. For example, duplexes comprising only adenine and thymine are unstable relative to duplexes of guanine and cytosine only. These differences in stability can present problems when trying to hybridise mixtures of short oligonucleotides (e.g. 4 mers) to complementary target DNA. Low temperatures are needed to hybridise A-T rich sequences but at these temperatures G-C rich sequences will hybridise to sequences that are not fully complementary. This means that some mismatches may happen and specificity can be lost for the G-C rich sequences. At higher temperatures G-C rich sequences will hybridise specifically but A-T rich sequences will not hybridise.
In order to normalise these effects modifications can be made to the Watson-Crick bases. The following are examples but they are not limiting:
The adenine analogue 2,6-diaminopurine forms three hydrogen bonds to thymine rather than two and therefore forms more stable base pairs.
The thymine analogue 5-propynyl dU forms more stable base pairs with adenine.
The guanine analogue hypoxanthine forms two hydrogen bonds with cytosine rather than three and therefore forms less stable base pairs.
These and other possible modifications should make it possible to compress the temperature range at which random mixtures of short nucleotides can hybridise specifically to their complementary sequences.
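The stability gap between A-T rich and G-C rich short duplexes described above can be illustrated with the Wallace rule, a widely used rule of thumb that assigns about 2 degrees C per A or T and 4 degrees C per G or C. This is a crude sketch and an assumption of this illustration, not a method of the invention: the rule was derived for longer oligonucleotides at defined salt concentrations, so for 4-mers only the trend, not the absolute values, is meaningful, and the modified bases listed above are not modelled.

```python
def wallace_tm(oligo):
    """Rough duplex melting temperature in degrees C: 2*(A+T) + 4*(G+C)."""
    at = sum(oligo.count(base) for base in "AT")
    gc = sum(oligo.count(base) for base in "GC")
    return 2 * at + 4 * gc

# Same length, different composition: the estimate doubles from all A-T to all G-C.
for probe in ("AATT", "ACGT", "GGCC"):
    print(probe, wallace_tm(probe))
```

Under this rule an all A-T 4-mer scores 8, a mixed 4-mer 12 and an all G-C 4-mer 16, which is the spread that base analogues are intended to compress.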
Preferably, the sampling cleavage agent comprises a type IIs restriction endonuclease. Type IIs restriction endonucleases, the ‘sampling endonucleases’, have the property that they recognise and bind to a specific sequence within a target DNA molecule, but they cut at a defined distance away from that sequence, generating single-stranded sticky-ends of known length but unknown sequence at the cleavage termini of the restriction products.
For example, the enzyme FokI generates an ambiguous (i.e. unknown) sticky-end of 4 bp, 9 bp downstream of its recognition sequence. This ambiguous sticky-end could thus be one of 256 possible 4 bp oligonucleotides (see FIG. 1). Numerous other type IIs restriction endonucleases exist and could be used for this process, as discussed below in the section on restriction endonucleases. Their binding site can be provided by the adaptors used, as shown in FIG. 2, for example.
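The sampling behaviour of such an enzyme can be sketched as a simple string operation. The example below assumes the FokI recognition sequence GGATG and reads the exposed bases on the top strand; only sites in the forward orientation are considered, and the function name and parameters are illustrative.

```python
def foki_sticky_ends(top_strand, site="GGATG", displacement=9, overhang=4):
    """Return the 4 base sticky ends exposed downstream of each forward-oriented site."""
    ends = []
    pos = top_strand.find(site)
    while pos != -1:
        start = pos + len(site) + displacement   # first base of the sticky end
        end = start + overhang
        if end <= len(top_strand):               # skip sites too close to the 3' end
            ends.append(top_strand[start:end])
        pos = top_strand.find(site, pos + 1)
    return ends

# Recognition site, nine intervening bases, then the sampled bases.
print(foki_sticky_ends("GGATG" + "ACGTACGTA" + "TTGC" + "AAAA"))
```

In this example the sampled sticky end is `TTGC`, one of the 256 possible 4 base sequences.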
Numerous type IIs restriction endonucleases exist and could be used as sampling enzymes for this process. Table 1 below gives a list of examples but is by no means comprehensive. A literature review of restriction endonucleases can be found in Roberts, R. J., Nucl. Acids Res. 18, 2351-2365, 1988. New enzymes are discovered at an increasing rate and more up to date listings are recorded in specialist databases such as REBase, which is readily accessible on the internet using software packages such as Netscape or Mosaic. REBase lists all restriction enzymes as they are discovered and is updated regularly. Moreover, it lists recognition sequences and isoschizomers of each enzyme, together with manufacturers and suppliers. The spacing of recognition sites for a given enzyme within an adaptor can be tailored according to requirements and the enzyme's cutting behaviour. (See FIG. 2 above).
The requirement of the process is the generation of ambiguous sticky-ends at the termini of the nucleic acids being analysed. This could also be achieved by controlled use of 5′ to 3′ exonucleases. Clearly any method that achieves the creation of such sticky-ends will suffice for the process.
Similarly the low stringency restriction endonuclease is necessary only to cleave each cDNA once, preferably leaving sticky-ends. Any means, however, of cleaving the immobilised nucleic acid would suffice for this invention. Site specific chemical cleavage has been reported in Chu, B. C. F. and Orgel, L. E., Proc. Natl. Acad. Sci. USA, 1985, 963-967. Use of a non-specific nuclease to generate blunt ended fragments might also be used. Preferably, though, a type II restriction endonuclease would be used, chosen for accuracy of recognition of its site, maximal processivity and cheap and ready availability.
Step (d) of separating the population of terminal fragments may be achieved by capillary electrophoresis, HPLC or gel electrophoresis. Capillary electrophoresis is preferred, particularly because this can be coupled directly to a mass spectrometer.
In step (e), each unknown sticky end sequence may be determined by:
(i) probing with an array of labelled hybridisation probes, the array containing all possible base sequences of the predetermined length;
(ii) ligating those probes which hybridised to the sticky end sequences; and
(iii) determining which probes have been ligated by identification and preferably quantification of the labels.
In one embodiment the array comprises a plurality of sub-arrays which together contain all the possible base sequences, and wherein each sub-array is contacted with the sticky end sequences. Unligated probes are removed and these steps are repeated in a cycle so that all of the sub-arrays contact the sticky end sequences. In this way, the array of hybridisation probes is presented to the sticky end sequences in stages. For example, where the predetermined length of base sequence is 4 and the total number of possible base sequences is 256 (4^4), cross-hybridisation between complementary 4-mers in the array can be avoided by contacting the population of sticky end sequences with a first sub-array of 128 probes and, after removing all unligated probes, contacting with a second sub-array of 128 probes.
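One way to arrive at two sub-arrays of 128 probes is to place each 4-mer and its reverse complement in different sub-arrays. The sketch below is one possible partition, offered as an assumption rather than as the scheme actually used. The 16 self-complementary 4-mers (for example ACGT) are their own reverse complements and cannot be separated from themselves by any partition; here they are simply shared evenly between the two sub-arrays.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def split_subarrays(n=4):
    """Split all n-mers into two sub-arrays so that no probe shares a
    sub-array with its reverse complement (self-complementary probes excepted)."""
    first, second, self_comp = [], [], []
    for probe in ("".join(p) for p in product("ACGT", repeat=n)):
        rc = revcomp(probe)
        if probe == rc:
            self_comp.append(probe)
        elif probe < rc:
            first.append(probe)      # the lexicographically smaller of each pair
        else:
            second.append(probe)     # its partner goes to the other sub-array
    for i, probe in enumerate(self_comp):
        (first if i % 2 == 0 else second).append(probe)
    return first, second

a, b = split_subarrays(4)
print(len(a), len(b))
```

For 4-mers this gives 120 complementary pairs split across the two sub-arrays plus 8 self-complementary probes in each, that is 128 probes per sub-array.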
The labels are preferably mass labels such as those in accordance with GB 9700746.2 filed on Jan. 15th 1997.
Preferably, the present invention uses an array of hybridisation probes, each of which comprises a mass label linked to a known base sequence of predetermined length, wherein each mass label of the array, optionally together with the known base sequence, is relatable to that base sequence by mass spectrometry. Preferably, each of the hybridisation probes comprises a mass label cleavably linked to a known base sequence of predetermined length, wherein each mass label of the array, when released from its respective base sequence, is relatable to that base sequence by mass spectrometry, typically by its mass/charge ratio which is preferably uniquely identifiable in relation to every other mass label in the array.
In a further aspect, the present invention provides a method for identifying cDNA in a sample. The method comprises characterising cDNA as described above so as to obtain the fragment lengths, the sequences and relative positions of the reference site and sticky-ends and comparing those fragment lengths, sequences and relative positions with the sequences and relative positions of the reference site and sticky-ends of known cDNAs, such as those available from DNA databases, in order to identify the or each cDNA in the sample. This method can be used to identify a single cDNA or a population of cDNAs.
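The comparison against known cDNAs amounts to a lookup keyed on the signature. The following is a minimal sketch in which a signature is taken to be a pair of fragment length and sticky end sequence; the database contents and the names `cDNA_A`, `cDNA_B` and `cDNA_C` are hypothetical and stand in for entries derived from a real sequence database.

```python
def identify(observed, database):
    """Map each observed (fragment length, sticky end) signature to the
    names of all database cDNAs sharing that signature."""
    index = {}
    for name, sig in database.items():
        index.setdefault(sig, []).append(name)
    return {sig: index.get(sig, []) for sig in observed}

# Hypothetical reference signatures, for illustration only.
database = {
    "cDNA_A": (16, "ACGT"),
    "cDNA_B": (42, "TTGC"),
    "cDNA_C": (16, "ACGT"),
}
print(identify([(16, "ACGT"), (99, "GGGG")], database))
```

A signature that returns an empty list has no match among the known cDNAs and may mark a novel cDNA, while a signature that returns several names is not by itself sufficient to distinguish them.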
In a further aspect, the present invention provides a method for assaying for one or more specific cDNAs in a sample. This assay method comprises performing a method of characterising cDNA as described above, wherein the reference site and fragment lengths are predetermined, and each sticky-end sequence is determined by assay of a predetermined sticky-end sequence.