1. Field of the Invention
The present invention relates to processes of using microorganisms to measure or test interaction between nucleic acids and protein. The present invention more specifically relates to an improved method for the in vivo identification and optional characterization of genomic DNA sequences that interact with DNA-binding proteins. The present invention further relates to a kit useful for carrying out the method of the invention. The present invention further provides vectors and vector components configured for expression of fusion proteins in yeast and bacteria, or for cloning of genomic DNA. The present invention also provides vectors and vector components that allow inserted nucleic acid sequences that are deleterious to a host cell to be cloned successfully.
2. Description of Related Art
Numerous biologically important functions involve transient interactions between DNA molecules and proteins, RNA molecules and proteins, two or more proteins or RNA molecules, or ligands and receptors. Recognition and binding of sequence-specific DNA-binding proteins (e.g., transcription factors) to regulatory elements within the genome—which often lie outside the regions of the genome that are contained within cDNA libraries—is a critical component of the spatio-temporal control of gene expression, directing epigenetic controls important for proper cellular function in all organisms. Conservation of these control mechanisms ensures proper replication and cell division. Conversely, their alteration (e.g., modifications causing changes in the expression or binding capacity of transcription factors) is often implicated in a cell's transition to a malignant state.
For example, alveolar rhabdomyosarcoma (ARMS) is a form of cancer characterized by a t(2;13)(q35;q14) chromosomal translocation that results in the fusion of two myogenic transcription factors: Pax3 and FKHR (FOXO1a). The term “transcription factor” describes any protein required to initiate or regulate DNA transcription in eukaryotes. ARMS is an aggressive solid muscle tumor occurring predominantly in children. It has a poor prognosis, with an event-free four-year survival rate of only approximately 17%. Despite the identification and characterization of the oncogenic fusion protein Pax3-FKHR, little is known about the genes directly regulated by Pax3 or FKHR, or how their expression may be altered by the Pax3-FKHR fusion protein.
While many techniques exist to investigate the possible gene targets and binding specificities of different transcription factors, they either are too labor-intensive to be useful in a genomic screen, fail to use and cannot be adapted to use genomic DNA, or are subject to such levels of inherent inefficiency as to be inadequate.
Many genes of higher eukaryotes are transcribed into mRNA only in specific cell-types. For example, reticulocytes (immature red blood cells) contain mRNA for hemoglobin (the iron-containing oxygen-transport metalloprotein in red blood cells), while nerve cells do not. The particular DNA sequences that encode the mRNA in a cell can be cloned by using retroviral reverse transcriptase to make DNA copies of the mRNA isolated from the cell (the copies are called “complementary DNA,” or cDNA clones). These single-stranded cDNA clones are converted into double-stranded DNAs and cloned into plasmid vectors, creating a cDNA library for that particular cell-type. cDNA libraries contain only sequences expressed as mRNA in the particular cell-type used to generate the library, and they lack the intronic (intragenic), non-coding sequences of genomic DNA, which were spliced out of the transcribed RNA sequences by posttranscriptional modification. cDNA libraries also contain 5′ and 3′ untranslated regions (5′-UTR and 3′-UTR), which are non-coding nucleotide regions at either end of each mRNA molecule, and derive from DNA adjacent to the gene. The 5′- and 3′-UTRs may contain protein binding sites, and can be involved in regulating expression of the adjacent gene.
In many eukaryotes, a large percentage of the total genome consists of non-coding DNA that does not lie near any gene. It is also clear, however, that gene transcription is often stimulated by DNA regions called “enhancers,” which contain protein binding sites and may be located in non-coding regions tens of thousands of base pairs upstream or downstream from the transcriptional start site. Many mammalian genes are regulated by more than one enhancer region, and their identification and characterization represent a difficult problem. While a cDNA library can help identify the chromosomal location of a gene, it cannot reveal the locations of enhancers. A cDNA library is also of limited use in identifying promoter-proximal elements, which are non-coding regions that lie much closer to transcriptional start sites (e.g., 100-200 base pairs upstream) and also provide protein binding sites, but which are not contained within mRNA, and so are not contained in cDNA libraries. Still, the relative proximity of promoter elements makes them easier to find than enhancers. Because enhancer and promoter elements are so fundamental to the regulation of transcription, and because the dysregulation of transcription can lead to disease, methods of identifying and characterizing enhancer and promoter elements have generated tremendous interest.
Study of DNA outside the immediate vicinity of genes (that is, outside the regions covered by cDNA libraries) necessitates the use of genomic DNA libraries. Genomic DNA is all the DNA sequences comprising the genome (the total genetic information carried) of a cell or organism, and a genomic DNA library is a collection of clones that contains the entire genome. Like cDNA libraries, genomic DNA libraries are often contained within plasmid vectors. However, genomic DNA libraries are derived directly from genomic DNA, not mRNA, and so contain non-coding DNA (including introns) as well as coding DNA (exons). Creating genomic DNA libraries is difficult, however, because of the relatively low efficiency of E. coli transformation and the limited number of colonies that can be grown on a culture plate. A genomic DNA library must contain a sufficient number of independently-derived clones that the probability is high (≥95%) that every DNA sequence of the organism is contained within the library. The difficulty of creating such libraries is compounded by the effects of some cloned genomic DNA fragments, which may contain promoter or enhancer elements, sequences that encode toxic peptides, or other unstable elements. For example, a clone containing a promoter or enhancer may drive transcription into the plasmid vector, thus interfering with the vector's replication or expression of drug resistance. The resulting library would lack genomic DNA clones bearing those sequences because bacteria bearing those clones would die, yet those are some of the very sequences that are the object of study by the methods of this invention.
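By way of illustration only, the coverage requirement stated above can be related to library size using the widely used Clarke and Carbon estimate, N = ln(1 − P) / ln(1 − f), where P is the desired probability and f is the fraction of the genome carried by a single insert. This formula is not recited in the present description and is introduced here as an assumption; it further assumes random, equal-sized inserts. The function name, genome size, and insert size in the sketch below are hypothetical.

```python
import math


def clones_needed(genome_bp: float, insert_bp: float, probability: float = 0.95) -> int:
    """Estimate how many independent clones give the requested coverage.

    Uses the Clarke-Carbon estimate N = ln(1 - P) / ln(1 - f), where
    f = insert_bp / genome_bp. Assumes random, equal-sized inserts.
    """
    if not 0.0 < probability < 1.0:
        raise ValueError("probability must lie strictly between 0 and 1")
    if not 0.0 < insert_bp < genome_bp:
        raise ValueError("insert must be shorter than the genome")
    fraction = insert_bp / genome_bp
    # log1p(-x) computes ln(1 - x) accurately when x is very small.
    return math.ceil(math.log1p(-probability) / math.log1p(-fraction))


if __name__ == "__main__":
    # Hypothetical values: a 3e9 bp genome and 10 kb average inserts.
    print(clones_needed(3.0e9, 1.0e4, 0.95))
```

With these hypothetical inputs the estimate is on the order of 9 × 10^5 independent clones, which suggests why transformation efficiency and plating capacity become limiting.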
Mutation of either a DNA-binding protein or a genomic regulatory element may disrupt their ability to interact, thereby producing dire consequences by altering the biological processes under their control. Such mutations can form the basis of congenital diseases, or of certain cancers. While many DNA-binding proteins and the nucleic acid sequences they recognize have been identified, there remains a need for improved methods to investigate and identify the manner in which they interact, the genomic contexts of these sequences, the downstream genes they in turn control, and the biological processes they regulate.
Therefore, identifying the regulatory elements in a genomic DNA context is critical not only for understanding their role in normal biological activities but also for determining the underlying molecular mechanisms that contribute to genetic disorders and the diseased state.
Classical methods for identifying interactions between nucleic acids and proteins (e.g., co-immunoprecipitation, cross-linking, or gel mobility shift assay) are not available for all proteins, and may not be sufficiently sensitive. Furthermore, these methods are difficult, time-consuming, involve hazardous materials, and are not amenable to screening large populations of potentially interacting partners. The yeast two-hybrid (Y2H) system (Fields and Song 1989; see also U.S. Pat. No. 5,955,280) represented a ground-breaking development in the identification of novel protein-protein interactions, and pointed the way to methods for identifying interactions between nucleic acids and proteins.
The Y2H system allows rapid demonstration of in vivo interactions between proteins, along with easy isolation of the nucleic acid sequences that encode the interacting proteins. The Y2H system exploits one of the features shared by many eukaryotic transcription factors that carry two separable, functional domains: a first domain serves to recognize and bind to specific DNA sequences (the DNA binding domain, or “DB”); and a second domain activates the RNA-polymerase complex (the activation domain, or “AD”). In a typical Y2H screening paradigm, a “bait” protein is expressed in yeast cells as a fusion protein comprising a DNA binding domain (e.g., the GAL4 DB) and a protein of interest (“X”). Concurrently, the same yeast cell expresses a “fish” protein as a fusion protein comprising an activation domain (e.g., the GAL4 AD) and another protein of interest (“Y”). Any interaction between the X and Y moieties of the bait and fish fusion proteins, respectively, also brings the DNA binding and activation domains of the fusion proteins into close proximity. The result is a protein complex comprising X, Y, a DNA binding domain, and an activation domain. The DNA binding domain of the complex binds a cognate DNA sequence, while the activation domain of that complex triggers expression of a reporter gene (e.g., HIS3 or lacZ).
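By way of illustration only, the reporter logic described above can be sketched as a toy model: the reporter is expressed only when a DB-bearing bait and an AD-bearing fish carry moieties that interact. The class name, function name, and the set of interacting pairs are hypothetical and serve only to restate the logic; they do not model binding affinities or any biochemical detail.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fusion:
    """A fusion protein: a functional domain joined to a protein of interest."""
    domain: str  # "DB" (DNA binding domain) or "AD" (activation domain)
    moiety: str  # label of the protein of interest, e.g. "X" or "Y"


def reporter_on(bait: Fusion, fish: Fusion, interacting_pairs: set) -> bool:
    """Return True when the bait/fish pair would reconstitute an activator.

    interacting_pairs is a hypothetical lookup of moiety pairs assumed to
    interact, stored as frozensets so that order does not matter.
    """
    if bait.domain != "DB" or fish.domain != "AD":
        return False  # both functional domains are required
    return frozenset({bait.moiety, fish.moiety}) in interacting_pairs


if __name__ == "__main__":
    pairs = {frozenset({"X", "Y"})}  # assumed: X interacts with Y
    print(reporter_on(Fusion("DB", "X"), Fusion("AD", "Y"), pairs))
    print(reporter_on(Fusion("DB", "X"), Fusion("AD", "Z"), pairs))
```

In this sketch the first call reports reporter expression and the second does not, mirroring the requirement that X and Y interact before the DNA binding and activation domains are brought together.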
Expression of the reporter gene allows identification and selection of yeast cells containing interacting proteins X and Y. For example, when yeast that are auxotrophic for histidine are cultured on media lacking histidine, only yeast cells bearing interacting X and Y proteins will grow and form colonies, because only those cells will express the HIS3 reporter gene and thereby synthesize histidine. Such colonies can be identified visually on solid media, isolated, and subjected to further analysis. For example, the genetic sequence corresponding to protein X may be determined by isolating the corresponding plasmid DNA and subjecting it to sequence analysis.
Many variants of the Y2H system exist (see, e.g., U.S. Pat. No. 5,955,280). For example, a “reverse two-hybrid” (R2H) system permits identification of interaction between proteins (just as with the traditional Y2H system), but through counterselection techniques also allows testing of the relative strength of that interaction. For example, expression of the URA3 gene, which encodes orotidine-5′-phosphate decarboxylase, is lethal to yeast grown on medium containing 5-fluoroorotic acid (5-FOA). Yeast expressing URA3 can also be identified by growing them on media lacking uracil. Thus, depending on growth medium composition, URA3 can be used either for positive or negative selection; that is, it is a selectable/counterselectable reporter gene.
Furthermore, expression of a counterselectable reporter gene is useful in identifying mutations that disrupt interactions between proteins. For example, if the interaction of X and Y moieties (on bait and fish fusion proteins, respectively) triggers expression of the URA3 gene, yeast expressing X and Y will not grow on media containing 5-FOA. However, if X and Y can no longer interact (e.g., because of a fortuitous or an intentional mutation in either moiety), yeast expressing the disruptive mutation(s) will now be able to grow on media containing 5-FOA but will not be able to grow on media lacking uracil. Thus, these techniques enable not just identification of interacting proteins, but also the analysis of points of contact between partners.
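By way of illustration only, the selection and counterselection behavior of the URA3 reporter described above can be summarized in a small truth table. The sketch assumes a host strain that cannot make uracil unless the reporter is expressed, and assumes that the 5-FOA medium is supplemented with uracil; the function names and medium labels are hypothetical.

```python
def grows(ura3_expressed: bool, medium: str) -> bool:
    """Predict growth of a reporter strain on a given medium.

    Assumes the strain is otherwise auxotrophic for uracil and that the
    5-FOA medium contains uracil (both are illustrative assumptions).
    """
    if medium == "-Ura":
        return ura3_expressed       # positive selection: URA3 is required
    if medium == "+5-FOA":
        return not ura3_expressed   # counterselection: URA3 is lethal
    raise ValueError(f"unknown medium: {medium!r}")


def phenotype(x_and_y_interact: bool) -> dict:
    """Growth on both media when X-Y interaction drives URA3 expression."""
    return {m: grows(x_and_y_interact, m) for m in ("-Ura", "+5-FOA")}


if __name__ == "__main__":
    print("interacting pair:  ", phenotype(True))
    print("disruptive mutant: ", phenotype(False))
```

A disruptive mutation thus converts the phenotype from growth on medium lacking uracil to growth on medium containing 5-FOA, which is the basis for selecting such mutants.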
Although eukaryotic protein-protein interactions can be studied with relative ease using Y2H systems, identifying interactions between genomic DNA and proteins remains difficult. While many DNA-binding proteins and their cognate nucleic acid sequences are known, the genomic context of these sequences, the genes they regulate, and the biological processes they control remain unknown. Furthermore, screening of genomic libraries for sequences recognized by DNA-binding proteins using conventional techniques is simply too expensive, cumbersome, time-consuming, and unreliable.
The yeast one-hybrid (Y1H) system (Li and Herskowitz, 1993), derived from the Y2H system for detecting protein-protein interactions, provided the first in vivo method to isolate and identify a protein that interacts with a known DNA sequence. Briefly, a library of genomic yeast DNA sequences was cloned into an expression vector upstream of and in frame with a GAL4 activation domain sequence, producing protein coding sequences fused to the GAL4 AD—an expression library. The expression library was transformed into a yeast reporter strain containing a lacZ reporter gene under the control of four copies of a yeast autonomous replicating sequence (ARS) consensus sequence (ACS). Hybrid proteins that recognized the ACS binding site activated transcription of lacZ, turning the cell blue in a β-galactosidase assay.
The methods of the present invention bear similarities to the yeast one-hybrid system (Li & Herskowitz, 1993). The yeast one-hybrid system uses an oligonucleotide, containing a known DNA recognition site, as “bait” for unknown DNA-binding proteins. In contrast, the methods of the present invention employ known or putative DNA-binding proteins as “bait” to screen a stable genomic DNA library containing all DNA recognition sites within the genome, both known and unknown. The yeast one-hybrid system described above uses a genomic DNA library contained in an expression vector, a system that inherently introduces bias to the screening method. In contrast, the methods of the present invention use a stable genomic library designed to eliminate such bias.
While it is theoretically possible to reverse the standard Y1H screen, using unknown genomic DNA fragments to identify promoter elements directly bound by a known DNA-binding protein (e.g., a transcription factor), all prior reports of Y1H screens have failed to appreciate or anticipate that the expression library used is biased because the plasmid vector itself can drive transcription and translation of the inserted DNA, resulting in sequence rearrangements, small deletions in the insert, or outright loss of the insert. Additionally, the DNA-binding protein expressed from the inserted DNA may be toxic to the host cell. Furthermore, fusion of the yeast transcriptional activation domain to the carboxyl terminus of the DNA-binding protein expressed from the DNA inserted into a vector may inhibit the ability of the DNA-binding protein to interact with its recognition sequence, while its fusion to the amino terminus of that DNA-binding protein may be toxic to host cells. Alternatively, if genomic DNA inserted into a vector itself contains a promoter or enhancer sequence, it too may drive transcription and result in unintended or toxic effects. Therefore, such genomic DNA sequences will not be matched to any DNA-binding protein, because the deleterious effects they produce in conventional Y1H systems eliminate them from the genomic library. Unfortunately, such missing sequences are likely the very objects of a Y1H screen. Thus, the prior art fails to recognize that potentially meaningful and important interaction candidates are eliminated from most Y2H and Y1H library screens, for numerous reasons, and fails to teach methods of overcoming this limitation.
Another conventional method of identifying genomic regulatory elements that are recognized and bound by specific DNA-binding proteins is chromatin immunoprecipitation (ChIP), together with its variants, ChIP paired-end diTag (ChIP-PET) sequencing and ChIP microarray (ChIP-chip). ChIP (Orlando et al., 1997) is a procedure used to determine whether a known protein binds to or is localized to a specific genomic DNA sequence in vivo (e.g., in mammalian cells). DNA-binding proteins are crosslinked to DNA in vivo by treating host cells with formaldehyde, a process known as “fixation.” Chromatin from the cells is isolated, and the DNA is sheared or restriction-digested into small fragments (some of which remain crosslinked to protein). Crosslinked DNA-binding proteins are immunoprecipitated using protein-specific antibodies, thereby co-immunoprecipitating any DNA attached to the proteins. The crosslinking is reversed, and polymerase chain reaction (PCR) is used to amplify specific DNA sequences to identify those that were bound to the protein and co-immunoprecipitated with the antibody. Alternatively, the isolated fragments can be cloned into a plasmid vector for subsequent sequence analysis. Either method provides a population of DNA fragments that are able to interact with the particular DNA-binding protein used. ChIP-PET (Wei et al., 2006) is an enhanced ChIP technique whereby two 18 base-pair sequence tags, one from each end of a DNA fragment isolated by ChIP, are extracted and joined together. The joined tags are then sequenced to identify transcription factor binding sites. Finally, ChIP and ChIP-PET techniques may be enhanced further by hybridizing the extracted sequences to a microarray chip (ChIP-chip) (Ren et al., 2000).
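By way of illustration only, the paired-end tagging step described above can be sketched on sequence strings: one 18 base-pair tag is taken from each end of a fragment, and the two tags are joined. The function name and the example fragment are hypothetical, and the sketch does not model tag orientation, the enzymatic steps, or any other detail of an actual ChIP-PET protocol.

```python
TAG_LENGTH = 18  # tag length in base pairs, as stated in the description


def paired_end_ditag(fragment: str, tag_length: int = TAG_LENGTH) -> str:
    """Join the first and last tag_length bases of a fragment into one ditag.

    Simplified string model: requires non-overlapping end tags and keeps
    both tags in the orientation in which they occur in the fragment.
    """
    sequence = fragment.upper()
    if len(sequence) < 2 * tag_length:
        raise ValueError("fragment too short for two non-overlapping tags")
    five_prime_tag = sequence[:tag_length]
    three_prime_tag = sequence[-tag_length:]
    return five_prime_tag + three_prime_tag


if __name__ == "__main__":
    # Hypothetical fragment: 18 A's, a 10-base interior, then 18 G's.
    example = "A" * 18 + "CCCCCTTTTT" + "G" * 18
    print(paired_end_ditag(example))
```

Because only the two end tags are retained, the interior of each fragment is not sequenced in this sketch; in practice the tag pair would be mapped back to a reference genome to infer the span of the fragment.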
While ChIP and its variants can provide valuable information regarding binding sites for DNA-binding proteins—transcription factors in particular—the methods suffer significant limitations. ChIP analysis requires extensive cellular manipulations with multiple steps that must be optimized for each individual DNA-binding protein to be analyzed. ChIP analysis is also dependent on the ability to express the desired DNA-binding protein in a suitable cell type. The major disadvantage of ChIP techniques is the requirement for highly specific antibodies for each protein to be tested. The immunoprecipitation steps of ChIP analysis can be limited severely by the lack of suitable antibodies specific for the DNA-binding protein, and so may require the creation of an epitope-tagged protein (e.g., incorporating an HA or c-Myc moiety at the C- or N-terminus of the DNA-binding protein). In the absence of an antibody specific for the protein tested, any epitope tag added may be masked when the DNA-binding protein is bound to the DNA, severely inhibiting the ability of the epitope-specific antibody to immunoprecipitate the DNA-binding protein. Because ChIP is performed in a cellular context, the analysis is limited to identifying regulatory elements active only in that particular cell type. In the ChIP-chip procedure, analysis is limited to the regions of genomic DNA present on the microarray chips. Finally, ChIP-chip analysis requires the purchase and maintenance of expensive microarray systems, in addition to experienced personnel to assist in analyzing the results.
Therefore, although certain elements of the present invention bear similarities to existing methods, the methods of the present invention are distinct from other methods in that they involve a stable genomic library present in a plasmid vector and are directed at identifying DNA regulatory elements, not just at identifying a synthetic DNA recognition sequence homolog or an unknown DNA-binding protein.
The technical problem underlying the present invention was therefore to overcome these prior art difficulties, furnishing a system that reliably produces clones bearing interacting DNA-binding proteins and their cognate DNA binding sites, and is suitable for large-scale protein-versus-library screens.
The solution to the technical problem above is provided by the embodiments characterized in the claims.