A common practice in molecular biology is to create "gene libraries," which are collections of cloned fragments of DNA that represent genetic information in an organism, tissue or cell type. To construct a library, desired DNA fragments are prepared and inserted by molecular techniques into self-replicating units generally called cloning vectors. Each DNA fragment is therefore represented as part of an individual molecule, which can be reproduced in a single bacterial colony or bacteriophage plaque. Individual clones of interest can be identified by various screening methods, and then grown and purified in large quantities to allow study of gene organization, structure and function.
Only a small fraction of the genetic information for an organism is actually used in an individual cell or tissue at a particular time. A cDNA library is a type of gene library in which only DNA for actively expressed genes is cloned. These active genes can be selectively cloned over silent genes because the DNA for active genes is transcribed into messenger RNA (mRNA) as part of the pathway by which proteins are made. RNA molecules are polar in nature, i.e. the constituent nucleoside bases are linked via phosphodiester bonds between the 3' ribosyl position of one nucleoside and the 5' ribosyl position on the following nucleoside. RNA is synthesized in the 5'→3' direction, and mRNAs are read by ribosomes in the same direction, such that proteins are synthesized from N-terminus to C-terminus. Over the past decade, cDNA libraries have become the standard source from which thousands of genes have been isolated for further study.
The first step in preparing a cDNA library is to purify the mRNA, which usually represents only about 1-3% of the total RNA of the cell, the remainder being ribosomal RNA, transfer RNA, and several other RNA species. Many mRNAs from eukaryotic organisms have a poly(A) "tail," a tract of 50-150 adenosine residues at their 3' ends. A general practice for purifying mRNA from total cellular RNA involves specifically annealing, or binding, the poly(A) tail to oligo(dT), a single stranded DNA molecule of between about 12 and 30 consecutive dT residues (Jacobson, A. (1987) Meth. Enzymol. 152, 254). Total cellular RNA can be incubated with a matrix to which oligo(dT) has been immobilized. Only RNA molecules containing poly(A) tails selectively anneal to the matrix.
Upon purification of poly(A)⁺ RNA, a double-stranded complementary DNA (cDNA) copy of this active RNA can be synthesized in vitro by two sequential enzymatic steps. An RNA-dependent DNA polymerase, known as a reverse transcriptase, is used to synthesize the first strand cDNA, using the RNA as a template. Then, a DNA-dependent DNA polymerase, typically E. coli DNA polymerase I, copies the newly synthesized first cDNA strand to form a complementary second cDNA strand. A popular method of second strand synthesis utilizes the enzyme RNase H to create "nicks" in the mRNA strand. The resulting short mRNA fragments serve as primers for second strand synthesis by the DNA polymerase (Gubler, U. (1987) Meth. Enzymol. 152, 330). Both polymerases synthesize DNA in the 5'→3' direction, reading the template strand in the 3'→5' direction.
Double-stranded cDNA thus prepared is inserted into a prepared cloning vector. To efficiently insert the cDNA into a cloning vector, the ends of the insert cDNA and the vector DNA molecules must be prepared such that they are compatible. For example, specialized linkers can be added to the cDNA ends, followed by digestion with the relevant restriction enzyme to create single stranded protrusions that will anneal to corresponding ends in the vector. The insert and vector molecules are ligated together with T4 DNA ligase. The ligated vectors carrying their cDNA molecule inserts are then introduced into E. coli and screened. Various approaches have been used to prepare the cDNA ends for vector insertion (Kimmel, A. R. and Berger, S. L. (1987) Meth. Enzymol. 152, 307). Most have used the "linker" or "adapter" method described above. All methods using linkers require an additional step to protect the cDNA from being cleaved at adventitious restriction sites during digestion to create the cohesive ends (Wu, R., Wu, T. and Ray, A. (1987) Meth. Enzymol. 152, 343). This protection is accomplished either by treating the cDNA with site-specific methylases or by substituting a methylated dCTP analog for unmodified dCTP in the synthesis reactions.
In spite of the success of cDNA libraries as a resource, several technical difficulties have limited their wider application or have necessitated a large amount of effort to obtain complete gene sequences. One difficulty concerns the under-representation of the 5' ends of gene sequences obtained from cDNA libraries. As noted, first strand synthesis uses an RNA-dependent DNA polymerase. No DNA polymerase can start cDNA synthesis de novo. DNA polymerases require a short primer as a starting material upon which to add bases to the 3' end of a nascent cDNA first strand. The simplest primer is an oligo(dT) primer that can anneal specifically to the 3' poly(A) tail found in most mRNA molecules. All cDNAs synthesized with an oligo(dT) primer thus start at the 3' end of the mRNA and share a common 3' sequence (i.e. the d(Aₙ:Tₙ) tail). The major pitfall of oligo(dT)-primed synthesis is that RNA-dependent DNA polymerases tend to become disengaged from the mRNA template before traversing its entire length. It is thought that this is primarily due to random failure in the elongation process and to specific areas of RNA secondary structure at which the enzyme may pause or stop altogether. In oligo(dT)-primed libraries, the 3' ends of mRNAs are, therefore, statistically more likely to be copied than the sequences closer to the 5' end because reverse transcription always commences from the point at which the primer anneals. The resulting cDNA population is therefore biased toward the 3' ends of RNA strands. As might be expected, the effect is particularly noticeable with long mRNAs and results in few or no complete cDNA clones for certain genes in the library. Good quality oligo(dT)-primed cDNA libraries contain some inserts from 4 to 8 kbp, but even inserts of this length may not cover the 5' end of a desired gene.
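The 3' bias described above can be illustrated with a deliberately simple model. The sketch below assumes that the reverse transcriptase has a constant, independent chance of disengaging at each nucleotide, which ignores the pausing at secondary structure mentioned in the text. The function name and the per-nucleotide drop-off value are hypothetical and chosen only for illustration; they are not measured figures.

```python
def fraction_reaching(distance_nt, dropoff_per_nt):
    """Fraction of first strands that extend at least distance_nt past the primer.

    Assumes an independent, constant probability of the enzyme disengaging
    at every nucleotide (a simplifying assumption, not a measured behavior).
    """
    if distance_nt < 0:
        raise ValueError("distance_nt must be non-negative")
    if not 0.0 <= dropoff_per_nt <= 1.0:
        raise ValueError("dropoff_per_nt must lie between 0 and 1")
    return (1.0 - dropoff_per_nt) ** distance_nt


# Hypothetical drop-off probability per nucleotide, for illustration only.
ASSUMED_DROPOFF = 0.0005

for distance in (500, 2000, 4000, 8000):
    share = fraction_reaching(distance, ASSUMED_DROPOFF)
    print(f"{distance:>5} nt from the oligo(dT) primer: {share:.3f} of first strands")
```

Under this assumption the covered fraction falls geometrically with distance from the poly(A) tail, which is consistent with the observation that long mRNAs yield few or no complete clones.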
In addition, some mRNAs have a poly(A) tail that is too short to anneal to the oligo(dT) primer or have no poly(A) tail at all (Greenberg, Biochemistry 15:3516-3522 (1976); Adesnik and Darnell, J. Mol. Biol. 67:397-406 (1982); Houdebine, FEBS Lett. 66:110-118 (1976)). Estimates of the percentage of mRNA that is non-polyadenylated in different species range from 30% (Milcarek et al., Cell 3:1-10 (1974)) to 80% (Miller, Dev. Biol. 64:118-129 (1978)). In a comparison of poly(A)⁺ and poly(A)⁻ mRNA isolated from mouse brain, Van Ness et al., Cell 18:1341-1349 (1979) found that a substantial proportion of non-polyadenylated mRNA contains unique protein-encoding sequences. Therefore, many potentially important genes might be unrepresented in oligo(dT)-primed cDNA libraries.
Both of the above-identified problems can be overcome using an alternate type of cDNA primer known as a random primer to produce so-called "random primed libraries." Rather than being a single species, a random primer is, in actuality, a collection or set of primers of a certain length, usually hexameric, wherein the set includes all possible arrangements of the 4 DNA nucleoside bases over the length of the primer. Thus, a random hexamer is actually a collection of 4⁶, or 4096, different primer sequences, each of which is capable of annealing specifically with its complementary sequence in mRNA. Since every possible 6-base long portion of the mRNA has a complement in the set of random hexamer primers, the population of cDNA first strands generated using random primers shares neither a common origin on the mRNA nor a common 3' sequence. The bias for 3' ends is not a problem in random primed libraries because the primer mix of all possible hexamers promotes initiation of cDNA synthesis at any point on the mRNA. No portion of the mRNA molecule is better represented than any other in the population of cDNA first strands.
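The counting argument above can be checked directly. The following sketch enumerates all hexamers and confirms that every 6-base window of an mRNA has a complementary primer in the set. The mRNA sequence and the function names are hypothetical, and the check considers sequence complementarity only; it says nothing about annealing efficiency under real reaction conditions.

```python
from itertools import product

DNA_BASES = "ACGT"
# DNA base that pairs with each RNA base.
RNA_TO_DNA_COMPLEMENT = {"A": "T", "U": "A", "G": "C", "C": "G"}


def random_primer_set(length=6):
    """Enumerate every DNA sequence of the given length, written 5' to 3'."""
    return ["".join(bases) for bases in product(DNA_BASES, repeat=length)]


def complementary_primer(rna_window):
    """DNA primer (5' to 3') that anneals to an RNA window given 5' to 3'."""
    return "".join(RNA_TO_DNA_COMPLEMENT[base] for base in reversed(rna_window))


def priming_sites(mrna, primers, length=6):
    """Start positions of mRNA windows that have a complementary primer."""
    primer_set = set(primers)
    return [
        start
        for start in range(len(mrna) - length + 1)
        if complementary_primer(mrna[start:start + length]) in primer_set
    ]


hexamers = random_primer_set(6)
example_mrna = "AUGGCUAGCUUAGGCAUCGA"  # hypothetical sequence, illustration only
sites = priming_sites(example_mrna, hexamers)
print(len(hexamers), "hexamers;", len(sites), "of",
      len(example_mrna) - 5, "windows can be primed")
```

Because the set is exhaustive, every window is a potential initiation point, which is why first strands from random priming share no common origin.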
A common practice in the field is to supplement screening of oligo(dT)-primed libraries with random primed libraries to obtain full-length clones. Random-primed libraries have also been used for intentionally cloning cDNA fragments as a means to obtain gene regions encoding DNA binding proteins (Singh et al., Cell 52:415 (1988); Vinson et al., Genes Dev. 2:801 (1988)). The inability of some mRNAs to be primed with oligo(dT) makes it essential to construct random primed libraries when the mRNA is non-polyadenylated.
A popular modification of the standard oligo(dT) priming strategy takes advantage of the common 3' ends of the resulting cDNA to allow the cloning of cDNA molecules in a defined orientation (directional cloning) (Ausubel, et al. (eds) in Current Protocols in Molecular Biology, John Wiley & Sons (1995) Supplement 29). Directional cDNA cloning has two major benefits. First, it reduces the amount of work required to retrieve a clone of interest when using any detection scheme based on protein or peptide expression, such as antibody screening. Expression of the desired protein or peptide requires not only that the DNA fragment containing the gene of interest be present, but also that the fragment is provided in the proper orientation and in the correct reading frame to direct the synthesis of that protein. In a non-directional library, statistically only 1 clone in 6 will meet this requirement, since there are two possible orientations and three possible reading frames for every clone. In contrast, directionally cloned cDNA libraries eliminate the orientation variable, thereby doubling the likelihood of successfully expressing a protein from a given clone and effectively reducing by a factor of two the number of clones that must be screened. The immediate result is diminished labor costs.
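The "1 clone in 6" figure follows from multiplying the number of possible orientations by the number of reading frames. A minimal sketch of that arithmetic, assuming each orientation and frame is equally likely (the function name is illustrative only):

```python
from fractions import Fraction


def expressible_fraction(orientations, reading_frames=3):
    """Fraction of clones in the correct orientation and reading frame,
    assuming all combinations are equally likely."""
    return Fraction(1, orientations * reading_frames)


non_directional = expressible_fraction(orientations=2)  # two orientations possible
directional = expressible_fraction(orientations=1)      # orientation is fixed

print("non-directional:", non_directional)
print("directional:", directional)
print("fold improvement:", directional / non_directional)
```

Fixing the orientation leaves only the reading frame to chance, so the expected number of clones to screen per expressible clone drops from six to three.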
The second, and perhaps more important, advantage of directional cloning arises in connection with the construction of subtractive cDNA libraries. Subtractive cDNA libraries are collections of cDNA clones from genes expressed in one tissue or during one developmental state, but not in another. Subtractive cDNA libraries are used to rapidly identify genes important in development or progression of a disease, even in the absence of prior information about the genes. For example, a subtractive cDNA library can identify genes that are specifically active in cancer cells (Scott et al., Cell 34:557-567 (1983); Krady et al., Mol. Brain Res. 7:287-297 (1990)).
Whereas many strategies have been used to create subtractive libraries, one of the most successful is based on the use of directionally cloned cDNA libraries as starting material (Palazzolo and Meyerowitz (1987) Gene 52, 197; Palazzolo et al. (1989) Neuron 3, 527; Palazzolo et al. (1990) Gene 88, 25). In this approach, cDNAs prepared from a first source tissue are directionally inserted immediately downstream of a bacteriophage T7 promoter in the vector. Total library DNA is prepared and transcribed in vitro with T7 RNA polymerase to produce large amounts of RNA that correspond to the original mRNA from the first source tissue. Sequences present in both the source tissue and another tissue are subtracted as follows. The in vitro transcribed RNA prepared from the first source is allowed to hybridize with cDNA prepared from either native mRNA or library RNA from the second source tissue. The complementarity of the cDNA to the RNA makes it possible to remove common sequences as they anneal to each other, allowing the subsequent isolation of unhybridized, presumably tissue-specific, cDNA. This approach is only possible using directional cDNA libraries, since any cDNA sequence in a non-directional library is as likely to be in the "sense" orientation as in the "antisense" orientation (sense and antisense are complementary to each other). A cDNA sequence unique to a tissue would be completely removed during the hybridization procedure if both sense and antisense copies were present.
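The dependence of subtraction on directionality can be shown with a toy model. In the sketch below, a strand is treated as removed whenever its exact reverse complement is present in the mixture. All sequences and names are hypothetical, RNA is written in the DNA alphabet for simplicity, and real hybridization kinetics, partial matches, and abundance differences are ignored.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def antisense(seq):
    """Reverse complement of a strand written 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def unhybridized(pool):
    """Strands whose complementary strand is absent from the pool."""
    pool = set(pool)
    return {strand for strand in pool if antisense(strand) not in pool}


# Hypothetical sense-strand sequences standing in for expressed genes.
common = "ATGGCCAAAT"       # expressed in both tissues
first_only = "ATGCGTTTGA"   # expressed only in the first tissue
second_only = "ATGTCACCGG"  # expressed only in the second tissue

# Directional case: transcripts are all sense, cDNA is all antisense.
rna_first = {common, first_only}
cdna_second = {antisense(common), antisense(second_only)}
kept_directional = unhybridized(rna_first | cdna_second) & cdna_second

# Non-directional case: both orientations of every sequence are present.
rna_first_nd = rna_first | {antisense(s) for s in rna_first}
cdna_second_nd = cdna_second | {antisense(s) for s in cdna_second}
kept_non_directional = unhybridized(rna_first_nd | cdna_second_nd) & cdna_second_nd

print("directional:", kept_directional)
print("non-directional:", kept_non_directional)
```

In the directional case only the cDNA for the tissue-specific sequence survives; in the non-directional case that sequence anneals to its own opposite-orientation copy and nothing is recovered.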
In one directional cloning strategy, a DNA sequence encoding a specific restriction endonuclease recognition site (usually 6-10 bases) is provided at the 5' end of the oligo(dT) primer (Palazzolo and Meyerowitz 1987). This relatively short recognition sequence does not affect the annealing of the 12-20 base oligo(dT) primer to the mRNA, so the cDNA second strand synthesized from the first strand template includes the new recognition site added to the original 3' end of the coding sequence. After second strand cDNA synthesis, a blunt ended linker molecule containing a second restriction site (or a partially double stranded linker adapter containing a protruding end compatible with a second restriction site) is ligated to both ends of the cDNA. The site encoded by the linker is now on both ends of the cDNA molecule, but only the 3' end of the cDNA has the site introduced by the modified primer. Following the linker ligation step, the product is digested with both restriction enzymes (or, if a partially double stranded linker adapter was ligated onto the cDNA, with only the enzyme that recognizes the modified primer sequence). A population of cDNA molecules results which all have one defined sequence on their 5' end and a different defined sequence on their 3' end.
A related directional cloning strategy developed by Meissner et al. ((1987) Proc. Natl. Acad. Sci. USA 84, 4171) requires no sequence-specific modified primer. Meissner et al. describe a double stranded palindromic BamHI/HindIII directional linker having the sequence d(GCTTGGATCCAAGC) (SEQ ID NO:1), which is ligated to a population of oligo(dT)-primed cDNAs, followed by digestion of the ligation products with BamHI and HindIII. This palindromic linker, when annealed to double stranded form, includes an internal BamHI site (GGATCC) flanked by 4 of the 6 bases that define a HindIII site (AAGCTT). The missing bases needed to complete a HindIII site are d(AA) on the 5' end or d(TT) on the 3' end. Regardless of the sequence to which this directional linker ligates, the internal BamHI site will be present. However, HindIII can only cut the linker if it ligates next to a d(AA):d(TT) dinucleotide base pair. In an oligo(dT)-primed strategy, a HindIII site is always generated at the 3' end of the cDNA after ligation to this directional linker. For cDNAs having the sequence d(TT) at their 5' ends (statistically 1 in 16 molecules), linker addition will also yield a HindIII site at the 5' end. However, because the 5' ends of cDNA are heterogeneous due to the lack of processivity of reverse transcriptases, cDNA products from every gene segment will be represented in the library.
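The linker logic can be verified on the sequence level. The sketch below checks that the linker is its own reverse complement, that it contains the BamHI site, and that a HindIII site forms at a junction only when the adjoining cDNA supplies the missing two bases. The cDNA sequences and function names are hypothetical; only the linker and the two recognition sequences come from the text, and the model looks at the top (sense) strand alone.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

LINKER = "GCTTGGATCCAAGC"  # linker sequence given in the text
BAMHI_SITE = "GGATCC"
HINDIII_SITE = "AAGCTT"


def reverse_complement(seq):
    """Reverse complement of a DNA strand written 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def hindiii_at_junctions(cdna_sense):
    """Return (site at 5' junction, site at 3' junction) for the sense strand
    of a cDNA (at least 2 bases long) with a linker ligated to each end."""
    ligated = LINKER + cdna_sense + LINKER
    n = len(LINKER)
    # Last 4 linker bases plus the first 2 cDNA bases.
    five_prime_junction = ligated[n - 4:n + 2]
    # Last 2 cDNA bases plus the first 4 linker bases.
    three_prime_junction = ligated[-n - 2:-n + 4]
    return five_prime_junction == HINDIII_SITE, three_prime_junction == HINDIII_SITE


# Hypothetical oligo(dT)-primed cDNAs (sense strand ends in a poly(A) tract).
typical_cdna = "GACCATGGCTAAAAAAAAAA"
tt_start_cdna = "TTCCATGGCTAAAAAAAAAA"

print("linker is palindromic:", reverse_complement(LINKER) == LINKER)
print("linker contains BamHI site:", BAMHI_SITE in LINKER)
print("typical cDNA:", hindiii_at_junctions(typical_cdna))
print("cDNA starting with TT:", hindiii_at_junctions(tt_start_cdna))
```

The poly(A) end always completes the HindIII site, while the 5' end does so only when the cDNA happens to begin with d(TT), which is the 1-in-16 case noted above.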
As described above, a major limitation on cDNA cloning technology is imposed by the available priming strategies. Oligo(dT)-primed libraries require poly(A)⁺ RNA and generally are deficient in 5' sequences. Random primed cDNA libraries have not found general application, partly due to technical difficulties in their construction, and more recently due to the increasing use of incompatible directional cloning strategies. An ideal strategy would combine the directionality of oligo(dT) priming with the sequence independence of random priming. Despite the identified advantages of both random priming and directional cloning, no operative method exists for forming cDNA libraries by directionally cloning random primed cDNAs.
Others have tried, with limited success, to combine random priming and directional cloning. A "5' stretch" technique used in some laboratories employs both an oligo(dT) primer and random hexamers for priming two separate first strand cDNA reactions. The discontinuous cDNA fragments are spliced together during second strand synthesis when the two reactions are combined. After second strand synthesis, linkers of the type described above are added, to facilitate directional cloning. The shortcoming of this strategy is that any spliced cDNA molecule that fails to incorporate oligo(dT) at its 3' end is lost from the library because it cannot regenerate the 3' enzyme recognition sequence that must be present to generate a proper end for ligation. This strategy also does not address the inherent problems attributable to the secondary structure of RNA or to the lack of an adequate poly(A) tail.
Still others have attempted to use a set of random hexameric primers engineered to also include a common restriction site of six or more bases at one end of each primer. These primers have not been successfully used to prime first strand synthesis. The failure has been attributed to the formation of unstable RNA-primer hybrids. Because the length of the engineered restriction site equals or exceeds the length of the random hexamers, proper hybridization of the random portion of the primers may be energetically unfavorable. Moreover, the presence of six defined bases as part of every primer might bias hybridization toward corresponding complementary portions of the RNA templates.