The invention relates to methods for identifying genes encoding novel proteins.
There is considerable medical interest in secreted and membrane-associated mammalian proteins. Many such proteins, for example, cytokines, are important for inducing the growth or differentiation of cells with which they interact or for triggering one or more specific cellular responses.
An important goal in the design and development of new therapies is the identification and characterization of secreted proteins and the genes which encode them. Traditionally, this goal has been pursued by identifying a particular response of a particular cell type and attempting to isolate and purify a secreted protein capable of eliciting the response. This approach is limited by a number of factors. First, certain secreted proteins will not be identified because the responses they evoke may not be recognizable or measurable. Second, because in vitro assays must be used to isolate and purify secreted proteins, somewhat artificial systems must be used. This raises the possibility that certain important secreted proteins will not be identified unless the features of the in vitro system (e.g., cell line, culture medium, or growth conditions) accurately reflect the in vivo milieu. Third, the complexity of the effects of secreted proteins on the cells with which they interact vastly complicates the task of isolating important secreted proteins. Any given cell can be simultaneously subject to the effects of two or more secreted proteins. Because any two secreted proteins will not have the same effect on a given cell and because the effect of a first secreted protein on a given cell can alter the effect of a second secreted protein on the same cell, it can be difficult to isolate the secreted protein or proteins responsible for a given physiological response. In addition, certain secreted and membrane-associated proteins may be expressed at levels that are too low to detect by biological assay or protein purification.
In another approach, genes encoding secreted proteins have been isolated using DNA probes or PCR oligonucleotides which recognize sequence motifs present in genes encoding known secreted proteins. In addition, homology-directed searching of Expressed Sequence Tag (EST) sequences derived by high-throughput sequencing of specific cDNA libraries has been used to identify genes encoding secreted proteins. These approaches depend for their success on a high degree of similarity between the DNA sequences used as probes and the unknown genes or EST sequences.
More recently, methods have been developed that permit the identification of cDNAs encoding a signal sequence capable of directing the secretion of a particular protein from certain cell types. Both Honjo, U.S. Pat. No. 5,525,486, and Jacobs, U.S. Pat. No. 5,536,637, describe such methods. These methods are said to be capable of identifying secreted proteins.
The demonstrated clinical utility of several secreted proteins in the treatment of human disease, for example, erythropoietin, granulocyte-macrophage colony stimulating factor (GM-CSF), human growth hormone, and various interleukins, has generated considerable interest in the identification of novel secreted proteins. The method of the invention can be employed as a tool in the discovery of such novel proteins.
The invention features a method for identifying and isolating cDNAs that encode secreted or membrane-associated (e.g., transmembrane) mammalian proteins. The method of the invention relies upon the observation that the majority of secreted and membrane-associated proteins possess at their amino termini a stretch of hydrophobic amino acid residues referred to as the "signal sequence." The signal sequence directs secreted and membrane-associated proteins to a sub-cellular membrane compartment termed the endoplasmic reticulum, from which these proteins are dispatched for secretion or presentation on the cell surface.
The invention describes a method in which cDNAs that encode signal sequences for secreted or membrane-associated proteins are isolated by virtue of their abilities to direct the export of the reporter protein, alkaline phosphatase (AP), from mammalian cells. The present method has major advantages over other signal peptide trapping approaches. It is highly sensitive, which facilitates the isolation of signal peptide-associated proteins that may be difficult to isolate with other techniques. Moreover, the present method is amenable to high-throughput screening techniques and automation. Combined with a novel method for cDNA library construction in which directional random primed cDNA libraries are prepared, the invention comprises a powerful approach to the large scale isolation of novel secreted proteins.
The invention features a method for identifying a cDNA nucleic acid encoding a mammalian protein having a signal sequence, which method includes the following steps:
a) providing a library of mammalian cDNA;
b) ligating the library of mammalian cDNA to DNA encoding alkaline phosphatase lacking both a signal sequence and a membrane anchor sequence to form ligated DNA;
c) transforming bacterial cells with the ligated DNA to create a bacterial cell clone library;
d) isolating DNA comprising the mammalian cDNA from at least one clone in the bacterial cell clone library;
e) separately transfecting DNA isolated from clones in step (d) into mammalian cells which do not express alkaline phosphatase to create a mammalian cell clone library wherein each clone in the mammalian cell clone library corresponds to a clone in the bacterial cell clone library;
f) identifying a clone in the mammalian cell clone library which expresses alkaline phosphatase;
g) identifying the clone in the bacterial cell clone library corresponding to the clone in the mammalian cell clone library identified in step (f); and
h) isolating and sequencing a portion of the mammalian cDNA present in the bacterial cell library clone identified in step (g) to identify a mammalian cDNA encoding a mammalian protein having a signal sequence.
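The bookkeeping implied by steps (c) through (g) can be sketched in a few lines: each mammalian well is keyed by the bacterial clone that supplied its DNA, so an AP-positive well points directly back to its source clone. This is a minimal illustration only. The record type, the clone identifiers, the readings, and the fold-over-background cutoff are all hypothetical and are not taken from the specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WellReading:
    """AP activity measured for one transfected mammalian well (hypothetical record)."""
    bacterial_clone_id: str  # ID of the bacterial clone whose DNA was transfected
    ap_signal: float         # reporter reading, arbitrary units


def ap_positive_clones(readings, background, fold_over_background=3.0):
    """Return bacterial clone IDs whose matching mammalian well scores AP-positive.

    The fold-over-background cutoff is an illustrative assumption, not a value
    taken from the text.
    """
    cutoff = background * fold_over_background
    return [r.bacterial_clone_id for r in readings if r.ap_signal >= cutoff]


# Made-up readings for three wells, each transfected with DNA from one clone.
readings = [
    WellReading("plate1_A1", 0.05),
    WellReading("plate1_A2", 0.62),
    WellReading("plate1_A3", 0.04),
]
hits = ap_positive_clones(readings, background=0.05)
print(hits)
```

Because every well carries the identifier of a single bacterial clone, the clones returned here are the ones whose cDNA inserts would be sequenced in step (h).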
A cDNA library is a collection of nucleic acid molecules that are a cDNA copy of a sample of mRNA.
In another aspect, the invention features the ptrAP3 expression vector.
In another aspect, the invention features a substantially pure preparation of ethb0018f2 protein. Preferably, the ethb0018f2 protein includes an amino acid sequence substantially identical to the amino acid sequence shown in FIG. 5 (SEQ ID NO: 5) and is derived from a mammal, for example, a human.
The invention also features purified DNA (for example, cDNA) which includes a sequence encoding a ethb0018f2 protein, preferably encoding a human ethb0018f2 protein (for example, the ethb0018f2 protein of FIG. 5; SEQ ID NO:5); a vector and a cell which includes a purified DNA of the invention; and a method of producing a recombinant ethb0018f2 protein involving providing a cell transformed with DNA encoding ethb0018f2 protein positioned for expression in the cell, culturing the transformed cell under conditions for expressing the DNA, and isolating the recombinant ethb0018f2 protein. The invention further features recombinant ethb0018f2 protein produced by such expression of a purified DNA of the invention.
By "ethb0018f2 protein" is meant a polypeptide which has a biological activity possessed by naturally occurring ethb0018f2 protein. Preferably, such a polypeptide has an amino acid sequence which is at least 85%, preferably 90%, and most preferably 95% or even 99% identical to the amino acid sequence of the ethb0018f2 protein of FIG. 5 (SEQ ID NO: 5).
By "substantially identical" is meant a polypeptide or nucleic acid having a sequence that is at least 85%, preferably 90%, and more preferably 95% or more identical to the sequence of the reference amino acid or nucleic acid sequence. For polypeptides, the length of the reference polypeptide sequence will generally be at least 16 amino acids, preferably at least 20 amino acids, more preferably at least 25 amino acids, and most preferably 35 amino acids. For nucleic acids, the length of the reference nucleic acid sequence will generally be at least 50 nucleotides, preferably at least 60 nucleotides, more preferably at least 75 nucleotides, and most preferably 110 nucleotides.
Sequence identity can be measured using sequence analysis software (e.g., Sequence Analysis Software Package of the Genetics Computer Group, University of Wisconsin Biotechnology Center, 1710 University Avenue, Madison, Wis. 53705).
In the case of polypeptide sequences which are less than 100% identical to a reference sequence, the non-identical positions are preferably, but not necessarily, conservative substitutions for the reference sequence. Conservative substitutions typically include substitutions within the following groups: glycine and alanine; valine, isoleucine, and leucine; aspartic acid and glutamic acid; asparagine and glutamine; serine and threonine; lysine and arginine; and phenylalanine and tyrosine.
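The substitution groups listed above can be written as a small lookup that classifies each position of two equal-length, already-aligned sequences. This is only a sketch: the function names and the example peptides are invented for illustration, and one-letter amino acid codes are assumed.

```python
# Conservative substitution groups as listed in the text, in one-letter codes.
CONSERVATIVE_GROUPS = [
    {"G", "A"},       # glycine and alanine
    {"V", "I", "L"},  # valine, isoleucine, and leucine
    {"D", "E"},       # aspartic acid and glutamic acid
    {"N", "Q"},       # asparagine and glutamine
    {"S", "T"},       # serine and threonine
    {"K", "R"},       # lysine and arginine
    {"F", "Y"},       # phenylalanine and tyrosine
]


def is_conservative(a, b):
    """True if a and b are different residues belonging to the same group."""
    a, b = a.upper(), b.upper()
    if a == b:
        return False  # identical residues are not substitutions
    return any(a in group and b in group for group in CONSERVATIVE_GROUPS)


def classify_positions(reference, candidate):
    """Count identical, conservative, and non-conservative positions."""
    if len(reference) != len(candidate):
        raise ValueError("sequences must be aligned to the same length")
    counts = {"identical": 0, "conservative": 0, "non_conservative": 0}
    for ref_res, cand_res in zip(reference.upper(), candidate.upper()):
        if ref_res == cand_res:
            counts["identical"] += 1
        elif is_conservative(ref_res, cand_res):
            counts["conservative"] += 1
        else:
            counts["non_conservative"] += 1
    return counts


# Hypothetical five-residue peptides: G->A and D->E are conservative, F->W is not.
counts = classify_positions("GVDKF", "AVEKW")
print(counts)
```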
Where a particular polypeptide is said to have a specific percent identity to a reference polypeptide of a defined length, the percent identity is relative to the reference polypeptide. Thus, a peptide that is 50% identical to a reference polypeptide that is 100 amino acids long can be a 50 amino acid polypeptide that is completely identical to a 50 amino acid long portion of the reference polypeptide. It might also be a 100 amino acid long polypeptide which is 50% identical to the reference polypeptide over its entire length. Of course, many other polypeptides will meet the same criteria.
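The two cases in this paragraph can be checked with a short calculation in which matches are counted over an existing alignment and divided by the length of the reference. The sketch assumes the sequences are already aligned to equal length with "-" marking gaps. The function name and the toy sequences are hypothetical, and no alignment algorithm is implemented here.

```python
def percent_identity_to_reference(aligned_reference, aligned_candidate):
    """Percent identity computed relative to the ungapped reference length."""
    if len(aligned_reference) != len(aligned_candidate):
        raise ValueError("aligned sequences must have the same length")
    reference_length = sum(1 for r in aligned_reference if r != "-")
    if reference_length == 0:
        raise ValueError("reference contains no residues")
    matches = sum(
        1
        for r, c in zip(aligned_reference, aligned_candidate)
        if r == c and r != "-"
    )
    return 100.0 * matches / reference_length


# Toy 100-residue reference (contains no tryptophan, W).
reference = "ACDEFGHIKL" * 10

# Case 1: a 50-residue fragment identical to the first half of the reference,
# padded with gaps over the portion of the reference it does not cover.
fragment = reference[:50] + "-" * 50
case_fragment = percent_identity_to_reference(reference, fragment)

# Case 2: a 100-residue polypeptide in which every second residue is replaced by W.
full_length = "".join(r if i % 2 == 0 else "W" for i, r in enumerate(reference))
case_full_length = percent_identity_to_reference(reference, full_length)

print(case_fragment, case_full_length)
```

Both candidates score 50% against the 100-residue reference, which is the point of the paragraph: the denominator is the reference length, not the candidate length.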
By "protein" and "polypeptide" is meant any chain of amino acids, regardless of length or post-translational modification (e.g., glycosylation or phosphorylation).
By "substantially pure" is meant a preparation which is at least 60% by weight (dry weight) the compound of interest, i.e., a ethb0018f2 protein. Preferably the preparation is at least 75%, more preferably at least 90%, and most preferably at least 99%, by weight the compound of interest. Purity can be measured by any appropriate method, e.g., column chromatography, polyacrylamide gel electrophoresis, or HPLC analysis.
By "purified DNA" is meant DNA that is not immediately contiguous with both of the coding sequences with which it is immediately contiguous (one on the 5′ end and one on the 3′ end) in the naturally occurring genome of the organism from which it is derived. The term therefore includes, for example, a recombinant DNA which is incorporated into a vector; into an autonomously replicating plasmid or virus; or into the genomic DNA of a prokaryote or eukaryote, or which exists as a separate molecule (e.g., a cDNA or a genomic DNA fragment produced by PCR or restriction endonuclease treatment) independent of other sequences. It also includes a recombinant DNA which is part of a hybrid gene encoding additional polypeptide sequence.
By "substantially identical" is meant an amino acid sequence which differs only by conservative amino acid substitutions, for example, substitution of one amino acid for another of the same class (e.g., valine for glycine, arginine for lysine, etc.) or by one or more non-conservative substitutions, deletions, or insertions located at positions of the amino acid sequence which do not destroy the function of the protein (assayed, e.g., as described herein). Preferably, such a sequence is at least 85%, more preferably 90%, and most preferably 95% identical at the amino acid level to the sequence of FIG. 5 (SEQ ID NO: 5). For nucleic acids, the length of comparison sequences will generally be at least 50 nucleotides, preferably at least 60 nucleotides, more preferably at least 75 nucleotides, and most preferably 110 nucleotides. A "substantially identical" nucleic acid sequence codes for a substantially identical amino acid sequence as defined above.
By "transformed cell" is meant a cell into which (or into an ancestor of which) has been introduced, by means of recombinant DNA techniques, a DNA molecule encoding (as used herein) ethb0018f2 protein.
By "positioned for expression" is meant that the DNA molecule is positioned adjacent to a DNA sequence which directs transcription and translation of the sequence (i.e., facilitates the production of ethb0018f2 protein).
By "purified antibody" is meant antibody which is at least 60%, by weight, free from the proteins and naturally-occurring organic molecules with which it is naturally associated. Preferably, the preparation is at least 75%, more preferably at least 90%, and most preferably at least 99%, by weight, antibody.
By "specifically binds" is meant an antibody which recognizes and binds ethb0018f2 protein but which does not substantially recognize and bind other molecules in a sample, e.g., a biological sample, which naturally includes ethb0018f2 protein.
Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. Although methods and materials similar or equivalent to those described herein can be used in the practice or testing of the present invention, the preferred methods and materials are described below. All publications, patent applications, patents, and other references mentioned herein are incorporated by reference in their entirety. In case of conflict, the present specification, including definitions, will control. In addition, the materials, methods, and examples are illustrative only and not intended to be limiting.
Other features and advantages of the invention will be apparent from the following detailed description, and from the claims.