Proteins and peptides are synthesized in almost endless variety by living organisms. Many have proven to have medical, agricultural or industrial utility. Some proteins are enzymes, useful as specific catalysts for complex chemical reactions. Others function as hormones, which act to affect the growth or development of an organism or to affect the function of specific tissues in medically significant ways. Specific binding proteins may have commercial significance for the isolation and purification of trace substances and for the removal of contaminating substances. Both proteins and peptides are composed of linear chains of amino acids, the latter term being applied to short, single-chain sequences, the former referring to long-chain and multi-chain substances. The principles of the present invention apply equally to both proteins and peptides.
Proteins and peptides are generally high molecular weight substances, each having a specific sequence of amino acids. Except for the smaller peptides, chemical synthesis of peptides and proteins is frequently impractical, costly and time consuming, if not impossible. In the majority of instances, in order to make practical use of a desired protein, it must first be isolated from the organism which makes it. Frequently, the desired protein is present only in minuscule amounts. Often, the source organism cannot be obtained in quantities sufficient to provide an adequate amount of the desired protein. Consequently, many potential agricultural, industrial and medical applications for specific proteins are known, but remain undeveloped simply because an adequate supply of the desired protein or peptide does not exist.
Recently developed techniques have made it possible to employ microorganisms, capable of rapid and abundant growth, for the synthesis of commercially useful proteins and peptides, regardless of their source in nature. These techniques make it possible to genetically endow a suitable microorganism with the ability to synthesize a protein or peptide normally made by another organism. The techniques make use of a fundamental relationship which exists in all living organisms between the genetic material, usually DNA, and the proteins synthesized by the organism. This relationship is such that the amino acid sequence of the protein is reflected in the nucleotide sequence of the DNA. There are one or more trinucleotide sequence groups specifically related to each of the twenty amino acids most commonly occurring in proteins. The specific relationship between each given trinucleotide sequence and its corresponding amino acid constitutes the genetic code. The genetic code is believed to be the same or similar for all living organisms. As a consequence, the amino acid sequence of every protein or peptide is reflected by a corresponding nucleotide sequence, according to a well-understood relationship. Furthermore, this sequence of nucleotides can, in principle, be translated by any living organism.
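The trinucleotide-to-amino-acid relationship described above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the table is the standard genetic code written in the conventional T, C, A, G ordering with one-letter amino acid symbols, "*" marks a stop signal, and the names CODON_TABLE and translate, as well as the example sequence, are hypothetical and do not appear in the text.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G order.
# One-letter amino acid symbols; "*" marks a stop signal.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(coding_dna):
    """Read a coding sequence three bases at a time until a stop codon."""
    peptide = []
    for i in range(0, len(coding_dna) - 2, 3):
        amino_acid = CODON_TABLE[coding_dna[i:i + 3]]
        if amino_acid == "*":
            break
        peptide.append(amino_acid)
    return "".join(peptide)

# Hypothetical example sequence: Met-Ala-Lys followed by a stop codon.
print(translate("ATGGCCAAATGA"))

# "One or more" triplets per amino acid: count the codons for leucine (L)
# and for methionine (M).
print(sum(1 for aa in CODON_TABLE.values() if aa == "L"),
      sum(1 for aa in CODON_TABLE.values() if aa == "M"))
```

The 64 possible triplets map onto twenty amino acids plus stop signals, which is why some amino acids correspond to several trinucleotide groups and others to only one.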
In its basic outline, a method of endowing a microorganism with the ability to synthesize a new protein involves three general steps: (1) isolation and purification of the specific gene or nucleotide sequence containing the genetically coded information for the amino acid sequence of the desired protein, (2) recombination of the isolated nucleotide sequence with an appropriate transfer vector, typically the DNA of a bacteriophage or plasmid, and (3) transfer of the vector to the appropriate microorganism and selection of a strain of the recipient microorganism containing the desired genetic information.
A fundamental difficulty encountered in attempts to commercially exploit the above-described general process lies in the first step, the isolation and purification of the desired specific genetic information. DNA exists in all living cells in the form of extremely high molecular weight chains of nucleotides. A cell may contain more than 10,000 structural genes, coding for the amino acid sequences of over 10,000 specific proteins, each gene having a sequence many hundreds of nucleotides in length. For the most part, four different nucleotide bases make up all the existing sequences. These are adenine (A), guanine (G), cytosine (C), and thymine (T). The long sequences comprising the structural genes of specific proteins are consequently very similar in overall chemical composition and physical properties. The separation of one such sequence from the plethora of other sequences present in isolated DNA cannot ordinarily be accomplished by conventional physical and chemical preparative methods.
Two general methods have been used in the prior art to accomplish step (1) in the above-described general procedure. The first method is sometimes referred to as the shotgun technique. The DNA of an organism is fragmented into segments generally longer than the desired nucleotide sequence. Step (1) of the above-described process is essentially by-passed. The DNA fragments are immediately recombined with the desired vector, without prior purification of specific sequences. Optionally, a crude fractionation step may be interposed. The selection techniques of microbial genetics are relied upon to select, from among all the possibilities, a strain of microorganisms containing the desired genetic information. The shotgun procedure suffers from two major disadvantages. Most importantly, the procedure can result in the transfer of hundreds of unknown genes into recipient microorganisms, so that during the experiment, new strains are created, having unknown genetic capabilities. Therefore, the use of such a procedure could create a hazard for laboratory workers and for the environment. A second disadvantage of the shotgun method is that it is extremely inefficient for the production of the desired strain, and is dependent upon the use of a selection technique having sufficient resolution to compensate for the lack of fractionation in the first step.
The second general method takes advantage of the fact that the total genetic information in a cell is seldom, if ever, expressed at any given time. In particular, the differentiated tissues of higher organisms may be synthesizing only a minor proportion of the proteins which the organism is capable of making. In extreme cases, such cells may be synthesizing predominantly one protein. In such extreme cases, it has been possible to isolate the nucleotide sequence coding for the protein in question by isolating the corresponding messenger RNA from the appropriate cells.
Messenger RNA functions in the process of converting the nucleotide sequence information of DNA into the amino acid sequence structure of a protein. In the first step of this process, termed transcription, a local segment of DNA having a nucleotide sequence which specifies a protein to be made is copied into RNA. RNA is a polynucleotide similar to DNA except that ribose is substituted for deoxyribose and uracil is used in place of thymine. The nucleotide bases in RNA are capable of entering into the same kind of base pairing relationships that are known to exist between the complementary strands of DNA. A is complementary to U (in RNA) or T (in DNA), and G and C are complementary. The RNA transcript of a DNA nucleotide sequence will be complementary to the copied sequence. Such RNA is termed messenger RNA (mRNA) because of its status as intermediary between the genetic apparatus of the cell and its protein synthesizing apparatus. Generally, the only mRNA sequences present in the cell at any given time are those which correspond to proteins being actively synthesized at that time. Therefore, a differentiated cell whose function is devoted primarily to the synthesis of a single protein will contain primarily the RNA species corresponding to that protein. In those instances where it is feasible, the isolation and purification of the appropriate nucleotide sequence coding for a given protein can be accomplished by taking advantage of the specialized synthesis of such protein in differentiated cells.
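The base pairing rules that govern transcription can be sketched in a few lines of Python. This is a toy model, not a description of the enzymology: it assumes the copied (template) DNA strand is supplied in the 5'-to-3' direction and returns the transcript in the same 5'-to-3' convention, so the template is read in reverse. The names DNA_TO_RNA and transcribe and the example sequence are hypothetical.

```python
# Pairing of each DNA template base with the RNA base placed opposite it.
DNA_TO_RNA = {"A": "U", "T": "A", "G": "C", "C": "G"}

def transcribe(template_5to3):
    """Return the RNA complementary to a DNA template strand.

    Both the input and the output are written 5'->3'. Because paired
    strands run in opposite directions, the template is read in reverse.
    """
    return "".join(DNA_TO_RNA[base] for base in reversed(template_5to3))

# Hypothetical template strand.
print(transcribe("TCATTTGGCCAT"))
```

The transcript contains U wherever the template contains A, and it is complementary, position by position, to the strand that was copied.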
A major disadvantage of the foregoing procedure is that it is applicable only in the relatively rare instances where cells can be found engaged in synthesizing primarily a single protein. The majority of proteins of commercial interest are not synthesized in such a specialized way. The desired protein may be one of a hundred or so different proteins being produced by the cells of a tissue or organism at a given time. Nevertheless, the mRNA isolation technique is potentially useful since the set of RNA species present in the cell usually represents only a fraction of the total sequences existing in the DNA, and thus provides an initial purification. In order to take advantage of such purification, however, a method is needed whereby sequences present in low frequencies, such as a few percent, can be isolated in high purity.
The present invention provides a process whereby nucleotide sequences can be isolated and purified even when present at a frequency as low as 2% of a heterogeneous population of mRNA sequences. Furthermore, the method may be combined with known methods of fractionating mRNA to isolate and purify sequences present in even lower frequency in the total RNA population as initially isolated. The method is generally applicable to mRNA species extracted from virtually any organism and is therefore expected to provide a powerful basic tool for the ultimate production of proteins of commercial and research interest, in useful quantities.
The process of the present invention takes advantage of certain structural features of mRNA and DNA, and makes use of certain enzyme catalyzed reactions. The nature of these reactions and structural details as they are understood in the prior art are described herewith. The symbols and abbreviations used herein are set forth in the following table:
______________________________________
DNA - deoxyribonucleic acid
RNA - ribonucleic acid
cDNA - complementary DNA (enzymatically synthesized from an mRNA sequence)
mRNA - messenger RNA
A - Adenine
T - Thymine
G - Guanine
C - Cytosine
U - Uracil
ATP - adenosine triphosphate
dATP - deoxyadenosine triphosphate
dGTP - deoxyguanosine triphosphate
dCTP - deoxycytidine triphosphate
dTTP - thymidine triphosphate
Tris - 2-Amino-2-hydroxymethyl-1,3-propanediol
EDTA - ethylenediamine tetraacetic acid
TCA - Trichloroacetic acid
HCS - Human Chorionic Somatomammotropin
______________________________________
In its native configuration, DNA exists in the form of paired linear polynucleotide strands. The complementary base pairing relationships described above exist between the paired strands such that each nucleotide base of one strand exists opposite its complement on the other strand. The entire sequence of one strand is mirrored by a complementary sequence on the other strand. If the strands are separated, it is possible to synthesize a new partner strand, starting from the appropriate precursor monomers. The sequence of addition of the monomers starting from one end is determined by, and complementary to, the sequence of the original intact polynucleotide strand, which thus serves as a template for the synthesis of its complementary partner. The synthesis of mRNA corresponding to a specific nucleotide sequence of DNA is understood to follow the same basic principle. Therefore a specific mRNA molecule will have a sequence complementary to one strand of DNA and identical to the sequence of the opposite DNA strand, in the region transcribed. Enzymic mechanisms exist within living cells which permit the selective transcription of a particular DNA segment containing the nucleotide sequence for a particular protein. Consequently, isolating the mRNA which contains the nucleotide sequence coding for the amino acid sequence of a particular protein is equivalent to the isolation of the same sequence, or gene, from the DNA itself. If the mRNA is retranscribed to form DNA complementary thereto (cDNA), the exact DNA sequence is thereby reconstituted and can, by appropriate techniques, be inserted into the genetic material of another organism. The two complementary versions of a given sequence are therefore inter-convertible, and functionally equivalent to each other.
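The inter-convertibility of the mRNA and DNA versions of a sequence can be checked with a small Python sketch. It models only the base pairing logic, assuming all sequences are written 5'-to-3'; the function names first_strand_cdna and second_strand and the example mRNA are hypothetical and chosen for illustration.

```python
RNA_TO_DNA = {"A": "T", "U": "A", "G": "C", "C": "G"}
DNA_TO_DNA = {"A": "T", "T": "A", "G": "C", "C": "G"}

def first_strand_cdna(mrna):
    """DNA strand complementary to the mRNA, written 5'->3'."""
    return "".join(RNA_TO_DNA[base] for base in reversed(mrna))

def second_strand(dna):
    """Partner strand of a DNA strand, written 5'->3'."""
    return "".join(DNA_TO_DNA[base] for base in reversed(dna))

mrna = "AUGGCCAAAUGA"            # hypothetical mRNA
cdna = first_strand_cdna(mrna)   # complementary to the mRNA
partner = second_strand(cdna)    # partner strand of the cDNA

# The partner strand carries the same sequence as the mRNA, with T for U.
print(cdna, partner, partner == mrna.replace("U", "T"))
```

Copying the mRNA into cDNA and then synthesizing the partner strand yields a double-stranded DNA whose two strands correspond to the two strands of the original gene in the transcribed region.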
The nucleotide subunits of DNA and RNA are linked together by phosphodiester bonds between the 5' position of one nucleotide sugar and the 3' position of its next neighbor. Reiteration of such linkages produces a linear polynucleotide which has polarity in the sense that one end can be distinguished from the other. The 3' end may have a free 3'-hydroxyl, or the hydroxyl may be substituted with a phosphate or a more complex structure. The same is true of the 5' end. In eucaryotic organisms, i.e., those having a defined nucleus and mitotic apparatus, the synthesis of functional mRNA usually includes the addition of polyadenylic acid to the 3' end of the mRNA. Messenger RNA can therefore be separated from other classes of RNA isolated from a eucaryotic organism by column chromatography on cellulose to which is attached polythymidylic acid. See Aviv, H., and Leder, P., Proc. Natl. Acad. Sci. USA 69, 1408 (1972). Other chromatographic methods, exploiting the base-pairing affinity of poly A for chromatographic packing materials containing oligo dT, poly U, or combinations of poly T and poly U, for example, poly U-Sepharose, are likewise suitable.
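The selection principle behind such chromatography, retaining only molecules that carry a poly A run at the 3' end, can be mimicked in Python. This is a deliberately crude sketch: real binding depends on hybridization conditions rather than an exact length cutoff, and the threshold min_len, the function names, and the example sequences are all assumptions made for illustration.

```python
def has_poly_a_tail(rna, min_len=10):
    """True if the 3' end of the RNA (the right-hand end) is a run of A's."""
    return rna.endswith("A" * min_len)

def select_mrna(rnas, min_len=10):
    """Keep only the RNAs that would be retained through a 3' poly A tail."""
    return [rna for rna in rnas if has_poly_a_tail(rna, min_len)]

# Hypothetical RNA population, each written 5'->3'.
population = [
    "GGCAUCG" + "A" * 12,    # polyadenylated at the 3' end: retained
    "GGCAUCGUUCG",           # no poly A tract: flows through
    "A" * 12 + "GGCU",       # A-rich 5' end only: flows through
]
print(select_mrna(population))
```

Only the first molecule is kept, because the poly A tract must sit at the 3' end for the molecule to count as polyadenylated in this model.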
Reverse transcriptase catalyzes the synthesis of DNA complementary to an RNA template strand in the presence of the RNA template, a primer which may be any complementary oligo or polynucleotide having a 3'-hydroxyl, and the four deoxynucleotide triphosphates, dATP, dGTP, dCTP, and dTTP. The reaction is initiated by the non-covalent association of the oligodeoxynucleotide primer near the 3' end of mRNA followed by stepwise addition of the appropriate deoxynucleotides, as determined by base pairing relationships with the mRNA nucleotide sequence, to the 3' end of the growing chain. The product molecule may be described as a hairpin structure in which the original RNA is paired by hydrogen bonding with a complementary strand of DNA partly folded back upon itself at one end. The DNA and RNA strands are not covalently joined to each other. Reverse transcriptase is also capable of catalyzing a similar reaction using a single-stranded DNA template, in which case the resulting product is a double-stranded DNA hairpin having a loop of single-stranded DNA joining one set of ends. See Aviv, H. and Leder, P., Proc. Natl. Acad. Sci. USA 69, 1408 (1972) and Efstratiadis, A., Kafatos, F. C., Maxam, A.M., and Maniatis, T., Cell 7, 279 (1976).
Restriction endonucleases are enzymes capable of hydrolyzing phosphodiester bonds in DNA, thereby creating a break in the continuity of the DNA strand. If the DNA is in the form of a closed loop, the loop is converted to a linear structure. The principal feature of a restriction enzyme is that its hydrolytic action is exerted only at a point where a specific nucleotide sequence occurs. Such a sequence is termed the restriction site for the restriction endonuclease. Restriction endonucleases from a variety of sources have been isolated and characterized in terms of the nucleotide sequence of their restriction sites. When acting on double-stranded DNA, some restriction endonucleases hydrolyze the phosphodiester bonds on both strands at the same point, producing blunt ends. Others catalyze hydrolysis of bonds separated by a few nucleotides from each other, producing free single-stranded regions at each end of the cleaved molecule. Such single-stranded ends are self-complementary, hence cohesive, and may be used to rejoin the hydrolyzed DNA. Since any DNA susceptible to cleavage by such an enzyme must contain the same recognition site, the same cohesive ends will be produced, so that it is possible to join heterogeneous sequences of DNA which have been treated with restriction endonuclease to other sequences similarly treated. See Roberts, R. J. Crit. Rev. Biochem. 4, 123 (1976).
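Site-specific cleavage and the resulting cohesive ends can be sketched in Python. The two enzymes used here are illustrative choices not named in the text: EcoRI, which recognizes GAATTC and cuts after the first G on each strand, and HaeIII, which recognizes GGCC and cuts in the middle of the site. The functions digest and overhang are hypothetical helpers; the sketch tracks only the top strand of a linear molecule and assumes a palindromic site cut in its first half, giving blunt ends or 5' single-stranded ends.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Sequence of the opposite strand, written 5'->3'."""
    return seq.translate(COMPLEMENT)[::-1]

def digest(seq, site, cut_offset):
    """Cut the top strand of linear DNA at every occurrence of `site`.

    `cut_offset` is the number of site bases left on the upstream fragment.
    """
    fragments, start = [], 0
    pos = seq.find(site)
    while pos != -1:
        fragments.append(seq[start:pos + cut_offset])
        start = pos + cut_offset
        pos = seq.find(site, pos + 1)
    fragments.append(seq[start:])
    return fragments

def overhang(site, cut_offset):
    """Single-stranded end left by the staggered cuts ('' for a blunt cut)."""
    return site[cut_offset:len(site) - cut_offset]

ECORI = ("GAATTC", 1)    # G^AATTC, staggered cut
HAEIII = ("GGCC", 2)     # GG^CC, blunt cut

dna = "TTGAATTCAAGGAATTCTT"    # hypothetical sequence with two EcoRI sites
print(digest(dna, *ECORI))
print(repr(overhang(*ECORI)), repr(overhang(*HAEIII)))

# The staggered end equals its own reverse complement, so any two ends
# produced by the same enzyme can pair with each other.
print(overhang(*ECORI) == reverse_complement(overhang(*ECORI)))
```

Because every fragment produced by the same staggered-cutting enzyme carries the same self-complementary end, a fragment from one DNA source can pair with a fragment from an unrelated source that was cut with that enzyme.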
It has been observed that restriction sites for a given enzyme are relatively rare and are nonuniformly distributed. Whether a specific restriction site exists within a given segment is a matter which must be empirically determined. However, there is a large and growing number of restriction endonucleases, isolated from a variety of sources with varied site specificity, so that there is a reasonable probability that a given segment of a thousand nucleotides will contain one or more restriction sites.
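A rough feel for this probability can be obtained from an idealized calculation in Python. The model is an assumption, not a statement about real DNA: it treats the four bases as equally frequent and each position as independent, whereas actual sites are nonuniformly distributed and must be located empirically. The function name, the six-base site length, and the choice of ten enzymes are hypothetical.

```python
def p_at_least_one_site(segment_len, site_len, n_enzymes=1):
    """Chance that a random segment contains at least one restriction site.

    Idealized model: equal base frequencies, independent positions, and
    enzymes with distinct recognition sequences of the same length.
    """
    positions = segment_len - site_len + 1   # possible start positions
    p_match = 0.25 ** site_len               # chance a given position matches
    return 1.0 - (1.0 - p_match) ** (positions * n_enzymes)

# One enzyme with a six-base site acting on a 1000-nucleotide segment.
print(round(p_at_least_one_site(1000, 6), 2))

# Ten enzymes with different six-base sites acting on the same segment.
print(round(p_at_least_one_site(1000, 6, n_enzymes=10), 2))
```

Under these assumptions a single six-base site is found in only about one segment in five, while a panel of ten such enzymes finds at least one site in roughly nine segments out of ten, which is consistent with the qualitative argument above.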
For general background see Watson, J. D., The Molecular Biology of the Gene, 3d Ed., Benjamin, Menlo Park, California, (1976); Davidson, J. N., The Biochemistry of the Nucleic Acids, 8th Ed., Revised by Adams, R. L. P., Burdon, R. H., Campbell, A. M. and Smellie, R. M. S., Academic Press, New York, (1976); and Hayes, W., "The Genetics of Bacteria and Their Viruses", Studies in Basic Genetics and Molecular Biology, 2d Ed., Blackwell Scientific Publ., Oxford (1968).