The invention herein provides the isolated and purified (hereinafter "cloned") human gene coding for proinsulin and the human gene coding for pre-proinsulin, methods for isolating and purifying the genes and a method for transferring the genes to and replicating the genes in a microorganism. The cloned genes are expressed by a host microorganism when fused with a host-expressable procaryotic gene. Both genes are useful in the production of human insulin for therapeutic purposes.
Insulin is a hormone produced primarily by the B cells of the pancreas. The use of this hormone in the treatment of diabetes is well known. Although slaughterhouses provide beef and pig pancreases as insulin sources, a shortage of this hormone is developing as the number of diabetics increases worldwide. Moreover, some diabetics develop an allergic reaction to beef and pig insulin, with deleterious effects. The ability to produce human insulin in quantities sufficient to satisfy world needs is therefore highly desirable. The present invention provides genes that can be inserted into microorganisms and that are useful in the production of human insulin.
Insulin consists of two polypeptide chains, known as the A and B chains, linked together by disulfide bridges. The A chain consists of 21 amino acids and the B chain consists of 30 amino acids. These chains are not synthesized independently in vivo but are derived from an immediate precursor, termed proinsulin. Proinsulin is a single polypeptide chain that contains a peptide, termed the C-peptide, which connects the A and B chains. See Steiner, D. F. et al., Science 157, 697 (1967). This C-peptide is excised during the packaging of insulin into the secretory granules of pancreatic B cells prior to secretion. See Tager, H. S. et al., Ann. Rev. Biochem. 43, 509 (1974). The current view is that the C-peptide functions only in forming the three-dimensional structure of the molecule. The amino acid sequence for human proinsulin, determined by conventional techniques, is given in Table 1. In this table the B chain is amino acids 1-30, the C-peptide is amino acids 31-65 and the A chain is amino acids 66-86.
TABLE 1 ______________________________________ ##STR1## ##STR2## ##STR3## ##STR4## ##STR5## ##STR6## ##STR7## ##STR8## ______________________________________
Chemical synthesis of this sequence of 86 amino acids, though feasible, is difficult using conventional techniques.
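The residue numbering stated for Table 1 can be checked with a short sketch. The ranges below are taken from the text; the names `SEGMENTS`, `segment_length` and `check_contiguous` are illustrative assumptions, not part of the disclosure.

```python
# Residue ranges (1-based, inclusive) as stated in the text for Table 1.
SEGMENTS = {
    "B chain": (1, 30),
    "C-peptide": (31, 65),
    "A chain": (66, 86),
}


def segment_length(first, last):
    """Number of residues in an inclusive 1-based range."""
    return last - first + 1


def check_contiguous(segments, total):
    """Return True if the ranges tile residues 1..total with no gaps or overlaps."""
    expected = 1
    for first, last in sorted(segments.values()):
        if first != expected:
            return False
        expected = last + 1
    return expected == total + 1


lengths = {name: segment_length(*rng) for name, rng in SEGMENTS.items()}
print(lengths)
print(check_contiguous(SEGMENTS, 86))
```

Under these ranges the B chain has 30 residues and the A chain 21, in agreement with the chain lengths given for insulin, and the C-peptide accounts for the remaining 35 of the 86 residues.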
In the pancreatic B cells, the initial translation product is not proinsulin itself, but a pre-proinsulin that contains more than 20 additional amino acids on the amino terminus of proinsulin. See Chan, S. J. et al., Proc. Nat. Acad. Sci. U.S.A. 73, 1964 (1976) and Lomedico, P. T. et al., Nucl. Acids Res. 3, 381 (1976). The additional amino acid sequence is termed the signal peptide. In human pre-proinsulin (see FIG. 2), the signal peptide has twenty-four amino acids and the sequence is ##STR9## The twenty-four amino acid sequence is thought to be a specific signal for the vectorial transport of the synthesized polypeptide into the endoplasmic reticulum of the B cell, and is cleaved away from proinsulin during this phase. See Blobel, G. et al., J. Cell. Biol. 67, 835 (1975).
Several instances of signal peptides are known for eucaryotic proteins to be transported across membrane barriers. A specific cleavage enzyme has been observed in a cell-free system which hydrolyzes the peptide bond between the signal peptide and the active protein concomitant with passage through a cell membrane. (See, Blobel, G. et al., Proc. Nat. Acad. Sci U.S.A. 75, 361 (1978)).
Recent advances in biochemistry and in recombinant DNA technology have made it possible to achieve the synthesis of specific proteins under controlled conditions independent of the higher organism from which they are normally isolated. Such biochemical synthetic methods employ enzymes and subcellular components of the protein synthesizing machinery of living cells, either in vitro, in cell-free systems, or in vivo, in microorganisms. In either case, the key element is provision of a deoxyribonucleic acid (DNA) of specific sequence which contains the information necessary to specify the desired amino acid sequence. Such a specific DNA is herein termed a gene. The coding relationship whereby a deoxynucleotide sequence is used to specify the amino acid sequence of a protein is described briefly, infra, and operates according to a fundamental set of principles that obtain throughout the whole of the known realm of living organisms.
A cloned gene may be used to specify the amino acid sequence of proteins synthesized by in vitro systems. DNA-directed protein synthesizing systems are well-known in the art, see, e.g., Zubay, G., Ann. Rev. Genetics 7, 267 (1973). In addition, single-stranded DNA can be induced to act as messenger RNA in vitro, resulting in high fidelity translation of the DNA sequence (Salas, J. et al., J. Biol. Chem. 243, 1012 (1968)). Other techniques well known in the art may be used in combination with the above procedures to enhance yields.
Developments in recombinant DNA technology have made it possible to isolate specific genes or portions thereof from higher organisms, such as man and other mammals, and to transfer the genes or fragments to microorganisms, such as bacteria or yeast. The transferred gene is replicated and propagated as the transformed microorganism replicates. As a result, the transformed microorganism may become endowed with the capacity to make whatever protein the gene or fragment encodes, whether it be an enzyme, a hormone, an antigen or an antibody, or a portion thereof. The microorganism passes on this capability to its progeny, so that in effect, the transfer has resulted in a new strain, having the described capability. See, for example, Ullrich, A. et al., Science 196, 1313 (1977), and Seeburg, P. H., et al., Nature 270, 486 (1977). A basic fact underlying the application of this technology for practical purposes is that DNA of all living organisms, from microbes to man, is chemically similar, being composed of the same four nucleotides. The significant differences lie in the sequences of these nucleotides in the polymeric DNA molecule. The nucleotide sequences are used to specify the amino acid sequences of proteins that comprise the organism. Although most of the proteins of different organisms differ from each other, the coding relationship between nucleotide sequence and amino acid sequence is fundamentally the same for all organisms. For example, the same nucleotide sequence which codes for the amino acid sequence of HGH in human pituitary cells, will, when transferred to a microorganism, be recognized as coding for the same amino acid sequence.
Abbreviations used herein are given in Table 2.
TABLE 2
______________________________________
DNA--deoxyribonucleic acid           A--Adenine
RNA--ribonucleic acid                T--Thymine
cDNA--complementary DNA              G--Guanine
  (enzymatically synthesized         C--Cytosine
  from an mRNA sequence)             U--Uracil
mRNA--messenger RNA                  ATP--adenosine triphosphate
dATP--deoxyadenosine triphosphate    TTP--Thymidine triphosphate
dGTP--deoxyguanosine triphosphate    EDTA--Ethylenediaminetetraacetic acid
dCTP--deoxycytidine triphosphate
______________________________________
The coding relationships between nucleotide sequence in DNA and amino acid sequence in protein are collectively known as the genetic code, shown in Table 3.
TABLE 3
______________________________________
Genetic Code
______________________________________
Phenylalanine(Phe)   TTK    Histidine(His)       CAK
Leucine(Leu)         XTY    Glutamine(Gln)       CAJ
Isoleucine(Ile)      ATM    Asparagine(Asn)      AAK
Methionine(Met)      ATG    Lysine(Lys)          AAJ
Valine(Val)          GTL    Aspartic acid(Asp)   GAK
Serine(Ser)          QRS    Glutamic acid(Glu)   GAJ
Proline(Pro)         CCL    Cysteine(Cys)        TGK
Threonine(Thr)       ACL    Tryptophan(Try)      TGG
Alanine(Ala)         GCL    Arginine(Arg)        WGZ
Tyrosine(Tyr)        TAK    Glycine(Gly)         GGL
Termination signal   TAJ
Termination signal   TGA
______________________________________
Key: Each 3-letter deoxynucleotide triplet corresponds to a trinucleotide of mRNA, having a 5'-end on the left and 3'-end on the right. All DNA sequences given herein are those of the strand whose sequence corresponds to the mRNA sequence, with thymine substituted for uracil. The letters stand for the purine or pyrimidine bases forming this deoxynucleotide sequence.
A=adenine
G=guanine
C=cytosine
T=thymine
X=T or C if Y is A or G
X=C if Y is C or T
Y=A, G, C or T if X is C
Y=A or G if X is T
W=C or A if Z is A or G
W=C if Z is C or T
Z=A, G, C or T if W is C
Z=A or G if W is A
QR=TC if S is A, G, C or T
QR=AG if S is T or C
S=A, G, C or T if QR is TC
S=T or C if QR is AG
J=A or G
K=T or C
L=A, T, C or G
M=A, C or T
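The compact notation of Table 3 can be expanded mechanically into concrete triplets. The following is a minimal sketch, assuming the symbol definitions of the key above; the names `SIMPLE`, `CONDITIONAL`, `TABLE3`, `expand` and `all_codons` are hypothetical and chosen only for illustration. The conditional symbols (X/Y, W/Z, QR/S) are written out by hand from their paired rules rather than derived programmatically.

```python
from itertools import product

# Unconditional one-letter symbols from the key to Table 3.
SIMPLE = {
    "A": "A", "G": "G", "C": "C", "T": "T",
    "J": "AG",    # J = A or G
    "K": "TC",    # K = T or C
    "L": "ATCG",  # L = A, T, C or G
    "M": "ACT",   # M = A, C or T
}

# Conditional triplets, written out from the paired rules in the key.
CONDITIONAL = {
    "XTY": ["TTA", "TTG", "CTA", "CTG", "CTC", "CTT"],  # Leucine
    "WGZ": ["AGA", "AGG", "CGA", "CGG", "CGC", "CGT"],  # Arginine
    "QRS": ["TCA", "TCG", "TCC", "TCT", "AGT", "AGC"],  # Serine
}

# Entries of Table 3, keyed by the abbreviations used there.
TABLE3 = {
    "Phe": ["TTK"], "Leu": ["XTY"], "Ile": ["ATM"], "Met": ["ATG"],
    "Val": ["GTL"], "Ser": ["QRS"], "Pro": ["CCL"], "Thr": ["ACL"],
    "Ala": ["GCL"], "Tyr": ["TAK"], "His": ["CAK"], "Gln": ["CAJ"],
    "Asn": ["AAK"], "Lys": ["AAJ"], "Asp": ["GAK"], "Glu": ["GAJ"],
    "Cys": ["TGK"], "Try": ["TGG"], "Arg": ["WGZ"], "Gly": ["GGL"],
    "Termination": ["TAJ", "TGA"],
}


def expand(triplet):
    """Return the sorted list of concrete codons for one Table 3 entry."""
    if triplet in CONDITIONAL:
        return sorted(CONDITIONAL[triplet])
    return sorted("".join(p) for p in product(*(SIMPLE[b] for b in triplet)))


def all_codons(table):
    """Flatten every entry of the table into its concrete codons."""
    return [c for entries in table.values() for t in entries for c in expand(t)]


print(expand("TTK"))            # codons for phenylalanine
print(len(all_codons(TABLE3)))  # total number of triplets covered
```

Expanding every entry in this way yields each of the 64 possible triplets exactly once, which is a convenient consistency check on the table and its key.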
An important feature of the code, for present purposes, is the fact that each amino acid is specified by a trinucleotide sequence, also known as a nucleotide triplet. The phosphodiester bonds joining adjacent triplets are chemically indistinguishable from all other internucleotide bonds in DNA. Therefore the nucleotide sequence cannot be read to code for a unique amino acid sequence without additional information to determine the reading frame, which is the term used to denote the grouping of triplets used by the cell in decoding the genetic message.
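The dependence of the decoded message on reading frame can be illustrated with a short sketch. The function name `triplets` and the 12-nucleotide sequence are hypothetical and serve only as an example; they are not taken from the sequences disclosed herein.

```python
def triplets(seq, frame=0):
    """Group seq into consecutive triplets starting at offset frame (0, 1 or 2).

    Trailing nucleotides that do not fill a complete triplet are dropped.
    """
    if frame not in (0, 1, 2):
        raise ValueError("frame must be 0, 1 or 2")
    usable = len(seq) - frame
    end = frame + usable - usable % 3
    return [seq[i:i + 3] for i in range(frame, end, 3)]


example = "ATGGCCTTTAAG"  # hypothetical sequence, for illustration only
for frame in (0, 1, 2):
    print(frame, triplets(example, frame))
```

The same twelve nucleotides group as ATG GCC TTT AAG in the first frame, TGG CCT TTA in the second, and GGC CTT TAA in the third. According to Table 3 the third grouping ends in a termination signal (TAJ), so the three frames would specify entirely different products.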
Many recombinant DNA techniques employ two classes of compounds, transfer vectors and restriction enzymes, to be discussed in turn. A transfer vector is a DNA molecule which contains, inter alia, genetic information which ensures its own replication when transferred to a host microorganism strain. Examples of transfer vectors commonly used in bacterial genetics are plasmids and the DNA of certain bacteriophages. Although plasmids have been used as the transfer vectors for the work described herein, it will be understood that other types of transfer vectors may be employed. Plasmid is the term applied to any autonomously replicating DNA unit which might be found in a microbial cell, other than the genome of the host cell itself. A plasmid is not genetically linked to the chromosome of the host cell. Plasmid DNAs exist as double-stranded ring structures generally on the order of a few million daltons molecular weight, although some are greater than 10^8 daltons in molecular weight. They usually represent only a small percent of the total DNA of the cell. Transfer vector DNA is usually separable from host cell DNA by virtue of the great difference in size between them. Transfer vectors carry genetic information enabling them to replicate within the host cell, in most cases independently of the rate of host cell division. Some plasmids have the property that their replication rate can be controlled by the investigator by variations in the growth conditions. By appropriate techniques, the plasmid DNA ring may be opened, a fragment of heterologous DNA inserted, and the ring reclosed, forming an enlarged molecule comprising the inserted DNA segment. Bacteriophage DNA may carry a segment of heterologous DNA inserted in place of certain non-essential phage genes. Either way, the transfer vector serves as a carrier or vector for an inserted fragment of heterologous DNA.
Transfer is accomplished by a process known as transformation. During transformation, bacterial cells mixed with plasmid DNA incorporate entire plasmid molecules into the cells. Although the mechanics of the process remain obscure, it is possible to maximize the proportion of bacterial cells capable of taking up plasmid DNA and hence of being transformed, by certain empirically determined treatments. Once a cell has incorporated a plasmid, the latter is replicated within the cell and the plasmid replicas are distributed to the daughter cells when the cell divides. Any genetic information contained in the nucleotide sequence of the plasmid DNA can, in principle, be expressed in the host cell. Typically, a transformed host cell is recognized by its acquisition of traits carried on the plasmid, such as resistance to certain antibiotics. Different plasmids are recognizable by the different capabilities or combination of capabilities which they confer upon the host cell containing them. Any given plasmid may be made in quantity by growing a pure culture of cells containing the plasmid and isolating the plasmid DNA therefrom.
Restriction endonucleases are hydrolytic enzymes capable of catalyzing site-specific cleavage of DNA molecules. The locus of restriction endonuclease action is determined by the existence of a specific nucleotide sequence. Such a sequence is termed the recognition site for the restriction endonuclease. Restriction endonucleases from a variety of sources have been isolated and characterized in terms of the nucleotide sequence of their recognition sites. Some restriction endonucleases hydrolyze the phosphodiester bonds on both strands at the same point, producing blunt ends. Others catalyze hydrolysis of bonds separated by a few nucleotides from each other, producing free single stranded regions at each end of the cleaved molecule. Such single stranded ends are self-complementary, hence cohesive, and may be used to rejoin the hydrolyzed DNA. Since any DNA susceptible to cleavage by such an enzyme must contain the same recognition site, the same cohesive ends will be produced, so that it is possible to join heterologous sequences of DNA which have been treated with a restriction endonuclease to other sequences similarly treated. See Roberts, R. J., Crit. Rev. Biochem. 4, 123 (1976). Restriction sites are relatively rare; however, the general utility of restriction endonucleases has been greatly amplified by the chemical synthesis of double stranded oligonucleotides bearing the restriction site sequence. Therefore virtually any segment of DNA can be coupled to any other segment simply by attaching the appropriate restriction oligonucleotide to the ends of the molecule, and subjecting the product to the hydrolytic action of the appropriate restriction endonuclease, thereby producing the requisite cohesive ends. See Heyneker, H. L., et al., Nature 263, 748 (1976) and Scheller, R. H. et al., Science 196, 177 (1977).
An important feature of the distribution of restriction endonuclease recognition sites is the fact that they are randomly distributed with respect to reading frame. Consequently, cleavage by restriction endonuclease may occur between adjacent codons or it may occur within a codon.
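A short sketch can show how a cleavage point falls relative to the reading frame. As an assumption for illustration only, it uses the recognition site GAATTC with cleavage after the first G on the strand shown (the site of the enzyme EcoRI, which is not named in the text above); the sequence and the names `cut_positions` and `cut_phase` are likewise hypothetical.

```python
def cut_positions(seq, site, cut_offset):
    """Return the indices in seq at which the strand shown is cleaved.

    cut_offset is the number of nucleotides of the recognition site that
    lie to the left of the cleaved bond.
    """
    positions = []
    start = seq.find(site)
    while start != -1:
        positions.append(start + cut_offset)
        start = seq.find(site, start + 1)
    return positions


def cut_phase(position, frame=0):
    """Return 0 if the cut lies between codons, or 1 or 2 if it lies within one."""
    return (position - frame) % 3


example = "ATGCCGAATTCAAGAATTCGG"  # hypothetical sequence with two sites
cuts = cut_positions(example, "GAATTC", 1)
print(cuts)
print([cut_phase(p) for p in cuts])
```

In this example the first cut falls between adjacent codons of the frame beginning at the first nucleotide, while the second falls within a codon, so a fragment joined at the second site would require an adjustment to restore the reading frame.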
More general methods of DNA cleavage or for end sequence modification are available. A variety of nonspecific endonucleases may be used to cleave DNA randomly, as discussed infra. End sequences may be modified by creation of oligonucleotide tails of dA on one end and dT at the other, or of dG and dC, to create sites for joining without the need for specific linker sequences.
The term "expression" is used in recognition of the fact that an organism seldom if ever makes use of all its genetically endowed capabilities at any given time. Even in relatively simple organisms such as bacteria, many proteins which the cell is capable of synthesizing are not synthesized, although they may be synthesized under appropriate environmental conditions. When the protein product, coded by a given gene, is synthesized by the organism, the gene is said to be expressed. If the protein product is not made, the gene is not expressed. Normally, the expression of genes in E. coli is regulated as described generally, infra, in such manner that proteins whose function is not useful in a given environment are not synthesized and metabolic energy is conserved.
The means by which gene expression is controlled in E. coli is well understood, as the result of extensive studies over the past twenty years. See, generally, Hayes, W., The Genetics of Bacteria And Their Viruses, 2d edition, John Wiley & Sons, Inc., New York (1968), and Watson, J. D., The Molecular Biology of the Gene 3d edition, Benjamin, Menlo Park, Calif. (1976). These studies have revealed that several genes, usually those coding for proteins carrying out related functions in the cell, are found clustered together in continuous sequence. The cluster is called an operon. All genes in the operon are transcribed in the same direction, beginning with the codons coding for the N-terminal amino acid of the first protein in the sequence and continuing through to the C-terminal end of the last protein in the operon. At the beginning of the operon, proximal to the N-terminal amino acid codon, there exists a region of the DNA, termed the control region, which includes a variety of controlling elements including the operator, promoter and sequences for the ribosomal binding sites. The function of these sites is to permit the expression of those genes under their control to be responsive to the needs of the organism. For example, those genes coding for enzymes required exclusively for utilization of lactose are normally not appreciably expressed unless lactose or an analog thereof is actually present in the medium. The control region functions that must be present for expression to occur are the initiation of transcription and the initiation of translation. Expression of the first gene in the sequence is initiated by the initiation of transcription and translation at the position coding for the N-terminal amino acid of the first protein of the operon. 
The expression of each gene downstream from that point is also initiated in turn, at least until a termination signal or another operon is encountered with its own control region, keyed to respond to a different set of environmental cues. While there are many variations in detail on this general scheme, the important fact is that, to be expressed in a procaryote such as E. coli, a gene must be properly located with respect to a control region having initiator of transcription and initiator of translation functions.
It has been demonstrated that genes not normally part of a given operon can be inserted within the operon and controlled by it. The classic demonstration was made by Jacob, F., et al., J. Mol. Biol. 13, 704 (1965). In that experiment, genes coding for enzymes involved in a purine biosynthesis pathway were transferred to a region controlled by the lactose operon. The expression of the purine biosynthetic enzyme was then observed to be repressed in the absence of lactose or a lactose analog, and was rendered unresponsive to the environmental cues normally regulating its expression.
In addition to the operator region regulating the initiation of transcription of genes downstream from it, there are known to exist codons which function as stop signals, indicating the C-terminal end of a given protein. See Table 3. Such codons are known as termination signals and also as nonsense codons, since they do not normally code for any amino acid. Deletion of a termination signal between structural genes of an operon creates a fused gene which could result in the synthesis of a chimeric protein consisting of two amino acid sequences coded by adjacent genes, joined by a peptide bond. That such chimeric proteins are synthesized when genes are fused was demonstrated by Benzer, S., and Champe, S. P., Proc. Nat. Acad. Sci U.S.A. 48, 114 (1962).
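The effect of deleting a termination signal can be sketched in a few lines. This is a simplified illustration that assumes the two coding sequences are directly adjacent and in the same reading frame, and it ignores any intervening control sequences. The two short gene sequences and the names `CODON_TABLE` and `translate` are hypothetical; the table is the standard genetic code written with one-letter amino acid symbols, with `*` marking a termination signal.

```python
BASES = "TCAG"
# Standard genetic code in TCAG order; '*' marks a termination signal.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(seq):
    """Translate seq from its first nucleotide until a termination signal."""
    protein = []
    for i in range(0, len(seq) - 2, 3):
        amino_acid = CODON_TABLE[seq[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


gene_1 = "ATGAAATGA"  # hypothetical: Met-Lys followed by termination signal TGA
gene_2 = "TTTGGCTAA"  # hypothetical: Phe-Gly followed by termination signal TAA

print(translate(gene_1 + gene_2))       # translation ends at the first signal
print(translate(gene_1[:-3] + gene_2))  # signal deleted: read-through occurs
```

With the termination signal in place, translation yields only the first product. With the signal removed, read-through continues into the second sequence and a single chimeric chain results.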
Once a given gene has been isolated, purified and inserted in a transfer vector, the over-all result of which is termed the cloning of the gene, its availability in substantial quantity is assured. The cloned gene is transferred to a suitable microorganism, wherein the gene replicates as the microorganism proliferates and from which the gene may be reisolated by conventional means. Thus is provided a continuously renewable source of the gene for further manipulations, modifications and transfers to other vectors or other loci within the same vector.
Expression is obtained by transferring the cloned gene, in proper orientation and reading frame, into a control region such that read-through from the procaryotic gene results in synthesis of a chimeric protein comprising the amino acid sequence coded by the cloned gene. A variety of specific protein cleavage techniques may be used to cleave the chimeric protein at a desired point so as to release the desired amino acid sequence, which may then be purified by conventional means. Techniques for constructing an expression transfer vector having the cloned gene in proper juxtaposition with a control region are described in Polisky, B., et al., Proc. Nat. Acad. Sci U.S.A. 73, 3900 (1976); Itakura, K., et al., Science 198, 1056 (1977); Villa-Komaroff, L., et al., Proc. Nat. Acad. Sci U.S.A. 75, 3727 (1978); Mercereau-Puijalon, O., et al., Nature 275, 505 (1978); Chang, A. C. Y., et al., Nature 275, 617 (1978), and in U.S. Application Ser. No. 933,035 by Rutter, et al., said application incorporated herein by reference as though set forth in full.
In summary, the process whereby a mammalian protein, such as human pre-proinsulin or proinsulin, is produced with the aid of recombinant DNA technology first requires the cloning of the mammalian gene. Once cloned, the gene may be produced in quantity, further modified by chemical or enzymic means and transferred to an expression plasmid. The cloned gene is also useful for isolating related genes, or, where a fragment is cloned, for isolating the entire gene, by using the cloned gene as a hybridization probe. Further, the cloned gene is useful in proving, by hybridization, the identity or homology of independent isolates of the same or related genes. Because of the nature of the genetic code, the cloned gene, when translated in the proper reading frame, will direct the production only of the amino acid sequence for which it codes and no other.
Some work has been performed on the isolation and purification of rat proinsulin. Ullrich, A. et al., supra, and Villa-Komaroff, L. et al., supra, describe the isolation and purification of the rat proinsulin gene and a method for transferring this gene to and replicating this gene in a microorganism. Ullrich et al. recovered several recombinant plasmids which contained the coding sequence for proinsulin, the 3' untranslated region and a part of the prepeptide. Expression of the rat DNA containing the insulin coding sequence was disclosed in application No. 933,035. Villa-Komaroff et al. recovered one recombinant plasmid which contained the coding sequence for amino acids 4-86 of proinsulin. This proinsulin sequence was separated from amino acids 24-182 of penicillinase (β-lactamase) by the coding sequence for six glycines. This penicillinase-proinsulin coding sequence was expressed to produce a fused protein. These articles describe some of the basic procedures utilized in recombinant DNA technology. However, they do not describe the isolation and purification of the human pre-proinsulin gene or human proinsulin gene.
A different gene approach to obtain human insulin has been taken by Crea, R. et al., Proc. Nat. Acad. Sci U.S.A. 75, 5765 (1978). This approach is to chemically synthesize coding sequences for (1) the A chain and (2) the B chain of human insulin, using codons favored by E. coli. These two sequences can then be inserted into plasmids which can be expressed to produce the A and B chains. Human insulin could then be generated by formation of the correct disulfide bonds between the two protein chains.
The cloned gene for human pre-proinsulin is useful in a variety of ways. Transposition to an expression transfer vector will permit the synthesis of pre-proinsulin by a host microorganism transformed with the vector carrying the cloned gene. Growth of the transformed host will result in synthesis of pre-proinsulin as part of a chimeric protein. If the procaryotic portion of the fusion protein is the signal portion of an excreted or otherwise compartmentalized host protein, excretion or compartmentalization can occur, greatly enhancing the stability and ease of purification of the pre-proinsulin fusion protein. Additionally, where the procaryotic portion is short, excretion from the procaryotic host may be facilitated by the prepeptide itself, if the pre-sequence functions in the procaryotic host as it does in the eucaryotic cell. The pre-proinsulin gene may also be used to obtain the proinsulin gene using techniques as described below.
The cloned pre-proinsulin gene can be used in a variety of techniques for the production of pre-proinsulin. Pre-proinsulin itself is useful because it can be converted to proinsulin by known enzymatic and chemical techniques. For example, the prepeptide can be removed by a soluble enzymatic preparation, as described by Blobel, G. et al., supra, specific for removal of signal peptides. The cloned proinsulin gene can be used in a variety of techniques for the production of proinsulin. The proinsulin, produced from either gene, is itself useful because it can be converted to insulin by known enzymatic and chemical techniques. See Kemmler, W., et al., J. Biol. Chem. 242, 6786 (1971).