This invention relates to the fields of Molecular Biology and Molecular Genetics with specific reference to the identification and isolation of proteins and of the genes and transcripts that encode them.
The primary area of the invention, the identification and tagging of genes and proteins, has received a great deal of attention, and many successful methods have been devised. None of these methods, however, has the feature of tagging gene, transcript and protein in a single event.
Linkage Analysis. Genes have traditionally been identified by identifying mutations and then mapping them with respect to one another by means of genetic crosses. This kind of mapping, or linkage analysis, does not serve to isolate the genes themselves nor does it indicate anything about the genes' molecular structure or function. In recent years, a form of linkage analysis using restriction fragment length polymorphisms (RFLPs) has come into use. This method serves to identify DNA sequences that are linked to a gene of interest, and, having identified such a DNA sequence, it is possible in principle, and sometimes in practice, to identify and clone the gene itself by performing chromosome walks or jumps. It should be stressed that, even when successful, this strategy identifies the gene, not the protein encoded by the gene.
Transposon Tagging. Another technique for cloning genes that has been developed relatively recently goes by the name transposon tagging. In this technique, mutations due to the insertion of transposable elements into new sites in the genome are identified, and the genes in which the transposons lie can then be cloned using transposon DNA as a molecular probe. Transposon tagging, like RFLP/linkage analysis, identifies genes, not proteins.
Enhancer Trapping. Another method for identifying genes, enhancer trapping, involves the random insertion into a eucaryotic genome of a promoter-less foreign gene (the reporter) whose expression can be detected at the cellular level. Expression of the reporter gene indicates that it has been fused to an active transcription unit or that it has inserted into the genome in proximity to cis-acting elements that promote transcription. This approach has been important in identifying genes that are expressed in a cell type-specific or developmental stage-specific manner. Enhancer trapping, like RFLP/linkage analysis and transposon tagging, identifies genes, not proteins, and it does not directly reveal anything about the nature of the protein product of a gene.
Guest Peptides and Epitope Tagging. A number of studies have been performed in which new peptides have been inserted into proteins at a variety of positions by modifying the genes encoding the proteins using recombinant DNA technology. The term 'guest peptide' has been used to describe the foreign peptides in these cases. It is clear that in many cases the presence of such peptides is relatively innocuous and does not substantially compromise protein function, especially in those cases where the peptide is on the surface of the protein rather than in its hydrophobic core.
Epitope tagging is a method that utilizes antibodies against guest peptides to study protein localization at the cellular and subcellular levels. Epitope tagging begins with a cloned gene and an antibody that recognizes a known peptide (the epitope). Using recombinant DNA technology, a sequence of nucleotides encoding the epitope is inserted into the coding region of the cloned gene, and the hybrid gene is introduced into a cell by a method such as transformation. When the hybrid gene is expressed, the result is a chimeric protein containing the epitope as a guest peptide. If the epitope is exposed on the surface of the protein, it is available for recognition by the epitope-specific antibody, allowing the investigator to observe the protein within the cell using immunofluorescence or other immunolocalization techniques. Epitope tagging serves to mark proteins of already-cloned genes but does not serve to identify genes.
Isolating Genes Beginning with the Proteins they Encode. A number of procedures have been developed for isolating genes beginning with the proteins that they encode. Some, such as expression library screening, involve the use of specific antibodies that react to the protein of interest. Others involve sequencing all or part of the protein and designing oligonucleotide probes that can be used to identify the gene by DNA/DNA hybridization. In all of these cases, one must have specific knowledge about a protein before it is possible to take steps to clone and characterize the gene that encodes it.
cDNA Cloning and Sequencing. A method of gene identification that has received a great deal of attention in the recent past is the cloning (and in many instances, sequencing) of so-called expressed sequence tags (ESTs) from cDNA libraries made from mRNA extracted from a given tissue or cell type. Information about the proteins encoded by the mRNAs can be derived from the cDNA sequences by identifying and analyzing their open reading frames. In many cases, such cDNAs are not full length, however, and so information about the amino-terminal portion of the protein is lacking. And, more significantly, the method tags transcript sequences and not the proteins that the transcripts encode.
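By way of illustration only, the derivation of protein information from a cDNA sequence by analysis of its open reading frames can be sketched as follows. The sketch assumes the standard genetic code and scans only the three forward reading frames. The function names and the example sequence are hypothetical and are not taken from any actual EST. Because a cDNA may lack its 5′ end, the longest open stretch is reported whether or not it begins with an initiation codon.

```python
# Standard genetic code, built from the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(seq):
    """Translate every complete codon of seq; stop codons appear as '*'."""
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq) - 2, 3))


def longest_orf(cdna):
    """Return (frame, peptide) for the longest stop-free stretch
    found in the three forward reading frames of a cDNA sequence."""
    best = (0, "")
    for frame in range(3):
        for segment in translate(cdna[frame:]).split("*"):
            if len(segment) > len(best[1]):
                best = (frame, segment)
    return best


# Hypothetical cDNA fragment used only to exercise the function.
example_cdna = "CATGCTAAAAGGTGAACTGTAA"
frame, peptide = longest_orf(example_cdna)
print(frame, peptide)
```

A truncated cDNA analyzed in this way yields only the carboxy-terminal portion of the encoded protein, which is the limitation noted above.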
RNA Splicing. RNA splicing is the natural phenomenon, characteristic of all eucaryotic cells, whereby introns are removed from primary RNA transcripts. A large body of research has revealed that an intron is functionally defined by three components: a 5′ donor site, a branch site and a 3′ acceptor site. If these sites are present, and if the intron is not too large (it can be at least as large as 2 kb in many organisms), and if the distance between the branch and 3′ acceptor sites is appropriate, the cellular splicing machinery is activated, and the intron is removed from the transcript. Many different natural DNA sequences are known to have splice site function; consensus sites for mammalian splicing are indicated in FIG. 1. Thus, not only have many active splice sites been cloned, but there is a large database that can be used to design synthetic functional splice site sequences.
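The three components that define an intron can be expressed as a simple sequence check, offered here only as a rough sketch. The consensus sites of FIG. 1 are not reproduced; instead the sketch assumes the minimal GT and AG dinucleotides at the intron boundaries and, purely for illustration, the well-characterized yeast branch sequence TACTAAC (mammalian branch sites are more degenerate). The function name, the distance window and the test sequences are hypothetical, and intron size limits are not modeled.

```python
import re


def has_splice_elements(intron, branch_motif="TACTAAC", window=(10, 50)):
    """Crude test for the three elements named in the text.

    intron       -- DNA sequence of the candidate intron (sense strand)
    branch_motif -- regular expression for the branch site (assumed value)
    window       -- (near, far): the branch motif must lie entirely between
                    `far` and `near` nucleotides from the 3' end (assumed values)
    """
    seq = intron.upper()
    # 5' donor site: the intron begins with GT.
    if not seq.startswith("GT"):
        return False
    # 3' acceptor site: the intron ends with AG.
    if not seq.endswith("AG"):
        return False
    # Branch site: must sit at an appropriate distance from the acceptor.
    near, far = window
    region = seq[-far:-near]
    return re.search(branch_motif, region) is not None


# Hypothetical 60-nucleotide intron with all three elements present.
candidate = "GTAAGT" + "C" * 30 + "TACTAAC" + "T" * 15 + "AG"
print(has_splice_elements(candidate))
```

A check of this kind only screens for the presence and spacing of the elements; it does not predict whether a given sequence is spliced in a living cell.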
Gene Trapping. Gene trapping is a method used to identify transcribed genes. Gene trapping vectors carry splice acceptor sites directly upstream of the coding sequence for a reporter protein, such as β-galactosidase. When the vector inserts into an intron of an actively transcribed gene, the result is a protein fusion between an N-terminal fragment of the target gene-product and the reporter protein, the activity of which is used as an indicator that integration into an active gene has occurred. Gene trapping seeks to identify transcribed genes, not to tag proteins, and to inactivate genes, not to produce an active tagged gene product.
"CD-DNA" and "CD-Tagging". The so-called central dogma of genetics states that information flows from DNA to RNA to protein. The method of this invention tags each of the classes of macromolecule included in the central dogma. Accordingly, the method is referred to herein as "CD-tagging". Likewise, the term "CD-DNA" is used herein to refer to a DNA molecule that is inserted into the genome using the method of this invention.
Identifying and Isolating Proteins, RNAs and Genes. A method that allows one to readily identify genes by observing tagged proteins ought to be of great advantage relative to the prior art. CD-tagging has just this feature. In particular, when the protein tag is an epitope that is recognized by a particular antibody, cells can be treated with a CD-DNA, or with DNA constructs containing a CD-DNA, and then subjected to immunological screens or selections to identify the epitope tag. Many different screens or selections are possible, each of which has its own particular advantages. These include direct or indirect immunofluorescence by which tagged proteins can be localized to particular regions or subcellular structures within a cell, immunoblot analysis by which the abundance, molecular weight and isoelectric points of tagged proteins can be determined, enzyme-linked immunosorbent assays (ELISAs) by which internal or secreted tagged proteins can be distinguished, and fluorescence-activated cell sorting (FACS) by which living cells with tagged proteins at their surfaces can be obtained.
Once proteins and genes of interest have been identified, they can be efficiently purified using standard hybridization and/or affinity-purification methods that take advantage of their specific tags.
Large Target Size in the Genome. CD-tagging depends on the insertion of a CD-DNA into an intron. Since higher eucaryotic genes contain much more intron than exon sequence, the target size is large relative to any other tagging method in which the DNA must insert into an exon. Further, since the typical gene contains numerous introns, the boundaries of which determine the sites at which amino acid insertions in the protein can be produced by CD-tagging, it is likely that for a given protein there exist multiple sites at which peptide tags produced by CD-DNA insertions would not seriously compromise protein function. Indeed, there is some evidence that the sites in many proteins that are determined by the exon/intron boundaries are particularly likely to be on the surface of the protein, at an ideal location to accept a guest peptide and to allow recognition of that peptide by an antibody.
Hybrid Proteins Are Expressed in Backgrounds where Normal Genes Are Also Present. As discussed earlier, experience has shown that in many, and perhaps most, cases epitope fusion proteins have normal, or nearly normal, activity. But even this is not a requirement in order for CD-tagging to be useful in identifying proteins and their genes because in many applications one or more copies of the normal gene will be present in addition to the tag-containing gene (e.g., when diploid cells are tagged); here the tagged protein need not be fully functional as long as it can, for example, co-assemble at its normal location along with the protein encoded by the unaltered gene.
Obtaining Sequence Data. Once an organism or cell line expressing a protein of interest has been identified using the method of the invention, a DNA representing a portion of the mRNA encoding the protein can be obtained by standard techniques such as plasmid rescue or amplifying the sequence of interest from cDNA by means of the polymerase chain reaction (PCR) using poly-dT as one primer and a DNA complementary to the tag-encoding sequence as the other. The amplified DNA can then be sequenced by standard methods. Knowledge of the sequence can then be used to design primers for amplification from genomic DNA in order to obtain genomic sequence information.
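The amplification step can be pictured with a small in-silico sketch, given here under stated assumptions rather than as a laboratory protocol. The transcript is written as sense-strand DNA, the tag-encoding sequence is a hypothetical 12-nucleotide placeholder, and the poly-dT primer is assumed to anneal at the first run of adenines downstream of the tag. The function names and the primer length of 18 are likewise assumptions made only for illustration.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def amplify_from_tag(transcript, tag_primer, dt_len=18):
    """Return the sense-strand sequence of the expected product, bounded by
    the tag primer on one side and a poly-dT primer of dt_len on the other.
    Returns None if either priming site is absent."""
    start = transcript.find(tag_primer)
    if start < 0:
        return None
    polya = transcript.find("A" * dt_len, start + len(tag_primer))
    if polya < 0:
        return None
    return transcript[start:polya + dt_len]


# Hypothetical tagged transcript: upstream exon, tag, downstream exon, poly(A).
TAG = "GACTACAAGGAC"  # placeholder tag-encoding sequence (assumption)
transcript = "GGCC" + TAG + "TTGACCGT" + "A" * 25

product = amplify_from_tag(transcript, TAG)
print(product)
# Sequence of the tag region as read on the complementary strand.
print(reverse_complement(TAG))
```

The product in this sketch spans the tag and everything between it and the poly(A) tract, that is, the portion of the transcript lying 3′ of the insertion site.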
Application to Analysis of Subcellular Structures. One important application for CD-tagging is to identify proteins, and the genes encoding them, that are present in particular subcellular structures. This can be done by screening CD-DNA recipients for those that express the protein tag in the structure of interest. A significant advantage of this approach is that it does not depend upon the purification of the structure of interest, or even on the prior existence of a method for such purification, as traditional methods for characterizing subcellular structures do.
In addition to identifying proteins in known structures, CD-tagging holds the promise of identifying new structures, and the proteins they contain, that have not been explicitly recognized before.
Application to the Analysis of Cellular Responses. CD-tagging can be used to identify proteins, and the genes encoding them, whose synthesis is stimulated by a particular treatment, such as the administration of a particular hormone or growth factor to a particular cell type. This can be accomplished by comparing treated and untreated cells to identify proteins whose levels change in response to the treatment. And, using standard immunocytochemical methods, one can discriminate among such proteins to identify those that are secreted, localized to the cell surface, or present in particular subcellular compartments.
Application to Virology. Viral infection often leads to specific changes in cellular gene expression. Using CD-tagging, cellular genes whose expression is up- or down-regulated can be identified by comparing the levels of tagged proteins in infected versus uninfected cells. Likewise, if the viral genome is tagged, the expression of viral proteins during the viral life cycle can be observed.
Application to Analysis of Transcriptional Regulation. Much genetic regulation occurs at the level of transcription. Because CD-tagging puts a unique tag into mRNA species derived from a tagged gene, the tag can be used to investigate mRNA synthesis and stability.
Application to the Analysis of the Human Genome. Because most cellular functions are mediated by proteins, it is of particular interest in the context of the comprehensive analysis of the human genome to identify those parts of the genome that are expressed in the form of proteins. CD-tagging provides an efficient general method to directly identify new genes on the basis of their expression as proteins and on the basis of the location of those proteins in particular cellular or extracellular structures. In addition, CD-tagging provides a method for efficient physical and/or RFLP mapping of genes, as well as a method for the isolation of genes and transcripts via their nucleic acid tags and for the efficient purification of proteins via their epitope tags.
CD-tagging has specific advantages over the prior art method for identifying and mapping genes using expressed sequence tags (ESTs). ESTs are cDNA sequences, not genomic sequences. Thus an EST probe will hybridize not only to the true gene but to any pseudogenes that are present in the genome, thereby limiting its usefulness for mapping and cloning the true gene. Likewise, an EST probe may hybridize with closely related members of a gene family, again limiting its usefulness as a probe for a unique sequence. These limitations do not apply if a gene is identified by CD-tagging, since the method provides direct access, through the CD-DNA tag, to the true gene.
Applications to Medicine. CD-tagging has broad application to the analysis and diagnosis of disease. With regard to analysis, CD-tagging makes it possible to demonstrate, through linkage analysis, that a defect with respect to a given protein represents the primary defect for a given genetic disease or cancer. The function of the protein can then be examined in detail to gain new understanding of the biology of the disease.
With regard to diagnosis, genes that are isolated using CD-tagging can provide probes to identify disease-associated restriction fragment length polymorphisms, and they can provide primers by which mutations responsible for genetic diseases could be precisely identified. Once such polymorphisms or mutations have been identified, diagnostic tests for the presence of mutant alleles in homozygous or heterozygous individuals can be developed using standard approaches. Likewise, proteins that are isolated using the invention can be used as antigens to develop antibodies that can be used to make molecular diagnoses for particular genetic diseases. With regard to therapy, genes or proteins that are identified using CD-tagging could be used to treat a wide variety of infectious and non-infectious diseases.
The invention utilizes a "CD-DNA" molecule that contains acceptor and donor sites for RNA splicing. Between the acceptor and donor sites is a sequence of nucleotides that encodes a particular peptide (or set of three peptides, one for each possible reading frame). When the CD-DNA is inserted into an existing intron, it creates a new peptide-encoding exon surrounded by two hybrid, but functional, introns. The result is that, after transcription, RNA splicing and translation, a protein is produced that contains the peptide located precisely between the amino acids encoded by the exons that surrounded the target intron. Thus, in a single recombination event at the DNA level, 1) the gene encoding the protein is tagged by the CD-DNA sequence for recognition by a DNA probe or primer, 2) the RNA transcript encoding the protein is tagged by the peptide-encoding sequence for recognition by a DNA probe or primer, and 3) the protein is tagged by the peptide for recognition by a specific antibody or other reagent.
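The sequence of events just described can be followed in a toy model, presented only as a sketch under simplifying assumptions. Exons are written in uppercase and introns in lowercase, so that splicing reduces to discarding lowercase characters; actual splice site recognition is not modeled. All sequences, including the 12-nucleotide guest exon, are hypothetical, and the target intron is assumed to lie between two complete codons, so that a single reading frame of the guest exon suffices.

```python
# Standard genetic code, built from the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def splice(gene):
    """Remove introns (lowercase) and join exons (uppercase)."""
    return "".join(ch for ch in gene if ch.isupper())


def translate(mrna):
    """Translate from the first base up to the first stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        aa = CODON_TABLE[mrna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def insert_cd_dna(gene, cd_dna, position):
    """Insert the CD-DNA at the given coordinate of the gene."""
    return gene[:position] + cd_dna + gene[position:]


# Hypothetical two-exon gene with one intron.
exon1 = "ATGGCTAAA"                       # encodes M A K
intron = "gtaagtcccctactaactttttttttcag"  # donor ... branch ... acceptor
exon2 = "GGTGAACTGTAA"                    # encodes G E L, then stop
gene = exon1 + intron + exon2

# Hypothetical CD-DNA: acceptor side, guest exon, donor side.
guest_exon = "GACTACAAGGAC"               # encodes D Y K D
cd_dna = "tactaactttttttttcag" + guest_exon + "gtaagt"

# One insertion event, 10 nucleotides into the target intron.
tagged_gene = insert_cd_dna(gene, cd_dna, len(exon1) + 10)

print(translate(splice(gene)))         # protein from the unaltered gene
print(translate(splice(tagged_gene)))  # protein carrying the guest peptide
print(guest_exon in tagged_gene, guest_exon in splice(tagged_gene))
```

In this model the single insertion places the guest sequence in the gene and in the spliced transcript, and places the encoded peptide between the residues contributed by the two flanking exons. An insertion into an intron that interrupts a codon would require one of the other two reading frames of the guest exon, which the sketch does not cover.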