The analysis of large nucleic acid molecules, whether entire genomes or single restriction fragments, usually involves characterization by size, location relative to other fragments of nucleic acid, identification of restriction endonuclease cleavage sites, and nucleotide sequence. These analyses typically require a recursive process of division or cleavage into smaller fragments, separation of these into smaller subsets or individual pieces, and finally the identification of the fragment of interest, typically by hybridization with a specific nucleic acid probe or by direct determination of the nucleotide sequence of the fragment. In practice, the analysis frequently begins with the preparation of a library of cloned nucleic acid fragments constituting an entire genome, or a particular subset of fragments, selected by some convenient criteria.
This identification process is time consuming and expensive because the available means of characterization and selection of nucleic acid fragments are too general (for example by restriction fragment length or by the presence of poly-A in mRNA) or too specific (for example the detection of unique sequences and sequences poorly represented in cloned libraries), or may require prior knowledge of some of the nucleic acid sequence (for example flanking sequences must be known in order to specify the nucleic acid primers needed for amplification of the intervening "target" sequence by the Polymerase Chain Reaction).
It would be useful to have some additional means of characterizing or "indexing" nucleic acid fragments which would permit manipulation and identification to be carried out more efficiently and at lower cost. Such a development would make a significant contribution to a wide variety of molecular biology projects, including such major tasks as the sequencing of the human genome.
Three important techniques have been developed for nucleic acid manipulation and analysis.
The first of these is molecular cloning. (See for example S. N. Cohen et al. (1973) Proc. Nat'l. Acad. Sci. U.S.A. 70:3240-3244, "Construction of Biologically Functional Bacterial Plasmids in vitro"). In its simplest form, this involves first cutting or breaking the target nucleic acid, i.e. DNA, into smaller fragments (typically by restriction endonuclease digestion) and inserting the fragments into a biological vector. The assortment of DNA fragments is then maintained and amplified by the replication of the vector DNA in vivo. Separation of the copies of cloned DNA in this "library" is accomplished by dilution and subsequent growth of bacterial colonies or phage plaques from single organisms bearing copies of only one of the original DNA fragments. Identification of the clones of interest is done by hybridization of a specific labelled probe with the DNA released from each colony or plaque.
More recently, a second technique was developed called the Polymerase Chain Reaction or PCR. (See for example Canadian Patent No. 1,237,685 of K. B. Mullis for "Process for Amplifying Nucleic Acid Sequences"). This technique can be used to isolate and amplify sequences of interest. The technique allows the definition of any "target" portion of a nucleic acid sequence by the sequences which lie adjacent to it. Consequently, hybridization of nucleic acid primers at these adjacent sites permits the replication of only the intervening target sequence and the adjacent primer sites. The selective amplification by repeated replication in this way results directly in the separation of the desired fragment (or subset of sequences) by effective dilution of all other unwanted sequences by replicated copies of the target sequence. Identification is then carried out by hybridization against a known probe, or more frequently, by simple size analysis by agarose or polyacrylamide gel electrophoresis to confirm that the desired target sequence has been amplified.
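The dependence of PCR on known flanking sequences can be illustrated with a minimal in silico sketch. The function name `in_silico_pcr`, the toy template and the primer sequences below are hypothetical and chosen only for illustration. Exact string matching stands in for primer hybridization, which in practice tolerates some mismatch and depends on reaction conditions.

```python
# Minimal sketch: a target is defined only by the primer sites that flank it.
# All names and sequences here are illustrative assumptions.

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def in_silico_pcr(template, forward_primer, reverse_primer):
    """Return the amplified product (primer sites included), or None.

    The forward primer matches the top strand directly. The reverse primer
    anneals to the top strand, so its site appears in the top strand as the
    reverse complement of the primer.
    """
    start = template.find(forward_primer)
    if start == -1:
        return None  # forward primer site unknown or absent: no product
    reverse_site = reverse_complement(reverse_primer)
    end = template.find(reverse_site, start + len(forward_primer))
    if end == -1:
        return None  # reverse primer site absent downstream: no product
    return template[start:end + len(reverse_site)]


template = "TTTT" + "ACGTAC" + "GGGGGG" + "TTAGCA" + "CCCC"
product = in_silico_pcr(template, "ACGTAC", "TGCTAA")
print(product)  # ACGTACGGGGGGTTAGCA
```

Without both flanking sequences the function has nothing to search for, which mirrors the requirement for prior sequence knowledge.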
A third major technique used for comparative genomic analysis is called Restriction Fragment Length Polymorphism, or RFLP, analysis. See for example: D. N. Cooper and J. Schmidtke (1984) Human Genetics 66:1-16, "DNA Restriction Fragment Length Polymorphisms and Heterozygosity in the Human Genome". Insertions, deletions, and some types of single base substitutions can be detected and their inheritance (and the inheritance of other mutations known to be closely linked) determined. Specific individuals can be uniquely identified by a modification of this technique known popularly as "DNA Fingerprinting". See for example: A. J. Jeffreys et al. (1985) Nature 314:67-73, "Hypervariable 'Minisatellite' Regions in Human DNA". This third technique also begins with restriction endonuclease cleavage of genomic, cloned or PCR-amplified DNA into fragments. The resulting fragments are separated according to size by gel electrophoresis, and certain target fragments or groups of fragments are identified by hybridization with a specific probe. In this case, the sizes of fragments identified by hybridization with the probe provide a measure of whether the target sequence complementary to the probe is part of an identical or analogous fragment from other individuals.
While each of these three techniques, and the many specific variations which have evolved from them, are extremely valuable in investigating various aspects of structure and organization of particular genes, they each suffer from disadvantages.
Molecular cloning of a mixture of all the fragments from a restriction endonuclease digest of genomic DNA may provide a library, which on statistical grounds, should contain representatives of all fragments. In fact there may be a selective bias against some sequences due to the spacing of the restriction sites, or the propensity of some sequences to mutate, rearrange or fail to replicate in vivo. See for example: U. Gubler and B. J. Hoffman (1983) Gene 25:263-269, "A Simple and Very Efficient Method for Generating cDNA Libraries", and T. Maniatis et al (1978) Cell 15:687-701, "The Isolation of Structural Genes from Libraries of Eukaryotic DNA", and K. Kaiser and N. Murray (1985) DNA Cloning, Vol. 1: A Practical Approach, "The Use of Phage Lambda Replacement Vectors in the Construction of Representative Genomic DNA Libraries". As a consequence, any sequence which is present in the library at low frequency may be very difficult to detect, requiring screening of large numbers of colonies or plaques. Another disadvantage is that subsequent manipulation to ensure the purity and identity of clones or to isolate smaller fragments of the target clone also contributes to significant delay and expense. That is, it is necessary either to undertake the expense of screening large numbers of clones to detect a low probability event directly, or to undertake the extra procedures of attempting to enrich the population of clones screened for the target of interest.
The major disadvantage of the PCR technique is the requirement for prior knowledge of the nucleic acid sequences flanking the region of interest which permits specification of the primers required to amplify that intervening sequence. Where applicable, this technique offers extremely high precision at relatively low cost, but is limited to targets which have already been the subject of investigation at least to the extent of obtaining the necessary flanking sequence information. A second disadvantage is that for non-repetitive analyses of large numbers of different targets, the cost of two unique primers required per target may become prohibitive. This technique is currently limited to a maximum distance between primer sites of only a few kilobases of DNA. This appears to be a minor limitation, but reduces the possibilities for investigation of larger structural and functional units in a genome.
In contrast to the PCR technique, RFLP analysis does permit comparisons to be made among fragments of unknown sequences. However it is limited to detections of certain types of mutations or variations, namely to significant insertions and deletions large enough to change the sizes of fragments, or to insertions, deletions or base changes within restriction endonuclease recognition sequences which prevent cleavage between two fragments or which generate new recognition sequences. In the absence of such polymorphic or polyallelic genetic markers closely linked to the loci of interest for genetic diagnosis, RFLP analysis is unable to provide the desired inferences about the inheritance or identity of these loci; see: D. N. Cooper and J. Schmidtke [supra]. The search for suitable probes and the characterization of such variant sites is thus a relatively inefficient trial and error process.
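The kind of variation that RFLP analysis can detect may be sketched as follows. The example assumes the enzyme EcoRI (recognition sequence GAATTC, cutting after the first G), which is not named in the text above and is used purely for illustration. The function names and toy alleles are likewise hypothetical.

```python
# Minimal sketch: a base change inside a recognition site alters the
# fragment-length pattern. Enzyme choice and sequences are assumptions.


def digest(seq, site="GAATTC", cut_offset=1):
    """Cut a linear top-strand sequence at every occurrence of `site`.

    `cut_offset` is the number of bases from the start of the site to the
    cut point (1 for the assumed EcoRI pattern G^AATTC).
    """
    fragments = []
    previous_cut = 0
    position = seq.find(site)
    while position != -1:
        cut = position + cut_offset
        fragments.append(seq[previous_cut:cut])
        previous_cut = cut
        position = seq.find(site, position + 1)
    fragments.append(seq[previous_cut:])
    return fragments


def fragment_lengths(seq):
    """Return the sorted fragment lengths, as a gel would resolve them."""
    return sorted(len(fragment) for fragment in digest(seq))


# Allele A carries two sites; allele B has a single base change in the second.
allele_a = "A" * 10 + "GAATTC" + "C" * 20 + "GAATTC" + "T" * 5
allele_b = "A" * 10 + "GAATTC" + "C" * 20 + "GAGTTC" + "T" * 5

print(fragment_lengths(allele_a))  # [10, 11, 26]
print(fragment_lengths(allele_b))  # [11, 36]
```

A base change outside a recognition site would leave both length patterns identical, which is the limitation described above.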
Each of these three important techniques for nucleic acid analysis has been outstandingly successful in those areas for which it is applicable. Each in turn has limitations based upon the limited information available about the fragments or groups of fragments, or conversely, based upon the limited ability to reduce the complexity of a mixture (i.e. to select and isolate a suitable subset) to permit more detailed investigation. It would be a powerful extension of all of these techniques to provide a method for subsetting complex mixtures of nucleic acids in a consistent and efficient manner, to provide a level of information intermediate between the relatively crude measure of sizes of restriction fragments, and the precision of partial or complete sequence determination. Three aspects of recombinant DNA technology are important in achieving such a method.
First, certain types of restriction endonucleases cleave DNA to reveal cohesive ends which may be non-identical and unrelated to the recognition sequence of the enzyme used. One group is called Type IIS or "shift" restriction endonucleases. See for example: W. Szybalski (1985) Gene 40:169-173, "Universal Restriction Endonucleases: Designing Novel Cleavage Specificities by Combining Adapter Oligodeoxyribonucleotide and Enzyme Moieties"; Kessler et al (1985) Gene 33:1-102, "Recognition Sequences of Restriction Endonucleases and Methylases--A Review". These Type IIS endonucleases cut DNA at sites removed by one or more bases from the recognition sequence (which is usually non-palindromic). A second group of restriction endonucleases have interrupted palindromic recognition sequences and they cut irrespective of the nature of the intervening sequences, provided that the intervening sequence is of the appropriate length.
The cohesive ends of the resulting DNA fragments may contain all possible permutations and combinations of nucleotides. For example the restriction endonuclease FokI (see Kessler et al [supra]) generates 4 base, 5'-cohesive ends, by cutting one strand 9 nucleotides 3'- to the recognition sequence GGATG and correspondingly on the opposite strand 13 nucleotides 5'- to the complementary CATCC sequence:
______________________________________
5'-...GGATG(N)9 ↓ ...-3'
3'-...CCTAC(N)13 ↑ ...-5'
______________________________________
If a large genome were cut with this enzyme, then all 4^4, or 256, possible tetranucleotide ends should be represented in the resulting mixture of DNA fragments.
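The cut geometry described for FokI, and the count of possible ends, can be checked with a short sketch. The function names and the toy sequence are hypothetical. Positions are 0-based indices on the top strand, and only a linear sequence with exact recognition-site matches is considered.

```python
# Minimal sketch of FokI cleavage as described in the text: the top strand is
# cut 9 nucleotides 3' of GGATG and the bottom strand 13 nucleotides 5' of
# CATCC, leaving a 4-base 5' overhang. Names and sequences are illustrative.
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def foki_overhangs(seq):
    """Return sorted (position, top-strand overhang bases) for each FokI site.

    `position` is the index of the first overhang base on the top strand.
    Sites in either orientation are reported; sites too close to an end of
    the sequence to yield a full overhang are skipped.
    """
    seq = seq.upper()
    results = []
    # Forward orientation: GGATG on the top strand, cuts lie downstream.
    start = seq.find("GGATG")
    while start != -1:
        low = start + 5 + 9
        high = low + 4
        if high <= len(seq):
            results.append((low, seq[low:high]))
        start = seq.find("GGATG", start + 1)
    # Reverse orientation: CATCC on the top strand, cuts lie upstream.
    start = seq.find("CATCC")
    while start != -1:
        low = start - 13
        if low >= 0:
            results.append((low, seq[low:low + 4]))
        start = seq.find("CATCC", start + 1)
    return sorted(results)


def all_tetranucleotides():
    """Enumerate every possible 4-base sequence."""
    return ["".join(bases) for bases in product("ACGT", repeat=4)]


example = "AAGGATG" + "CCCCCCCCC" + "ACGT" + "TTTT"
print(foki_overhangs(example))      # [(16, 'ACGT')]
print(len(all_tetranucleotides()))  # 256
```

Because the overhang lies outside the recognition sequence, its four bases are set by the neighbouring sequence rather than by the enzyme, which is why any of the 256 tetranucleotides may appear.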
A universal restriction endonuclease is described in published European patent application No. 234,781 (Szybalski). The universal endonuclease utilizes a tailored oligodeoxynucleotide adaptor in conjunction with a Class IIS endonuclease. The adaptor consists of a single stranded region complementary to a single stranded region of target DNA at the desired cleavage site. Adjacent to the single stranded region of the adaptor is a hairpin region containing the recognition sequence of the endonuclease. The adaptor is constructed so that the endonuclease will bind to a recognition sequence in the double stranded portion and will cleave the single stranded target region at the desired site, once the adaptor has been hybridized to the complementary region of the target DNA. The described adaptor is intended to be bound to target DNA only by means of base pairing. Consequently, the single stranded region of the adaptor, complementary to the target DNA, must be of sufficient length to anchor the adaptor throughout the cleavage process. The adaptor disclosed has a recognition sequence for FokI and a single stranded region of 14 nucleotides in which cleavage will occur. This teaching provides highly specific cleavage of single stranded DNA at any desired sequence.
Brenner, S. and Livak, K. J. (1989) Proc. Natl. Acad. Sciences USA 86:8902-8906 provide a method for characterizing DNA fragments by both size and terminal sequence. In their method, fragments are produced by cleavage with a Type IIS endonuclease. In the case of the endonuclease FokI, 4 base, 5'-single stranded overhangs having non-identical sequences were generated. DNA polymerase was used to attach fluorescently labelled nucleotides complementary to the bases in the 5'-cohesive ends. Cleavage with a different endonuclease was then carried out, so that some of the resulting fragments carried fluorescently labelled ends. Analysis of the fluorescently labelled DNA by gel electrophoresis in an automated DNA sequencing apparatus provided the sequence of the fragment cohesive ends produced by the Type IIS endonuclease cleavage and the length of the fragments.
The second aspect is similar to a prior invention: DNA adaptors. See for example: R. J. Wu et al, U.S. Pat. No. 4,321,365, "Oligonucleotides useful as Adaptors in DNA Cloning, Adapted Molecules, and Methods of Preparing Adaptors and Adapted Molecules". Adaptors are short double stranded DNA molecules with either one or both ends having protruding single stranded regions which are recognition sites of restriction endonucleases. They can be covalently attached to other DNA fragments bearing the complementary base pairs for the same restriction endonuclease recognition sequence (e.g. fragments generated as a result of cleavage of a larger fragment with the same endonuclease) using a polynucleotide ligase. This provides a tool for molecular cloning as the same adaptor molecule may be used to introduce any double stranded DNA into cloning vehicles at specific sites.
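The base-pairing requirement for attaching an adaptor to a fragment can be sketched as a simple compatibility check. The function names and the example overhangs are hypothetical. Both overhangs are written 5' to 3' and are assumed to be 5'-protruding ends of equal length; ligation efficiency and phosphorylation state are ignored.

```python
# Minimal sketch: two 5' overhangs can anneal (and so be joined by a ligase)
# when one is the reverse complement of the other. Illustrative names only.

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def overhangs_compatible(adaptor_overhang, fragment_overhang):
    """Return True if the two 5' overhangs (written 5'->3') can base pair."""
    return adaptor_overhang.upper() == reverse_complement(fragment_overhang.upper())


# A self-complementary overhang pairs with an identical copy of itself.
print(overhangs_compatible("AATT", "AATT"))  # True
# A non-palindromic overhang needs its reverse complement as partner.
print(overhangs_compatible("ACCA", "TGGT"))  # True
print(overhangs_compatible("ACCA", "ACCA"))  # False
```

Ends produced by the same enzyme at a palindromic site are self-complementary, which is why a single adaptor suffices for all fragments from such a digest.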
Finally, the third aspect is the ability to synthesize chemically any desired nucleic acid sequence. The phosphotriester method (see S. A. Narang et al. (1979) Methods Enzymol. 68:90-98, "Improved Phosphotriester Method for Synthesis of Gene Fragments") and the phosphoramidite method (see for example: M. D. Matteucci and M. H. Caruthers (1981) J. Amer. Chem. Soc. 103:3186-3191, "Synthesis of Deoxyoligonucleotides on a Polymer Support") offer the practical means to design and prepare oligonucleotide primers, probes, adaptors and linkers at will to suit any desired application.