The present disclosure relates to the fields of bioinformatics, gene regulation, gene regulatory sequences, gene regulatory proteins and methods of determining gene regulatory pathways.
Worldwide genome sequencing efforts are providing a wealth of information on the sequence and structure of various genomes, and on the locations of thousands of genes. In addition, genome research is yielding a considerable amount of information on gene products and their functions. The next challenges will be in the understanding and interpretation of genomic information. A major limitation in the analysis of genome sequence information to date is the lack of information that has been extracted from genome sequences on the location, extent, nature and function of sequences that regulate gene expression, i.e., gene regulatory sequences.
The cis-acting sequence elements that participate in the regulation of a single metazoan gene can be distributed over 100 kilobase pairs or more. Combinatorial utilization of regulatory elements allows considerable flexibility in the timing, extent and location of gene expression. The separation of regulatory elements by large linear distances of DNA sequence facilitates separation of functions, allowing each element to act individually or in combination with other regulatory elements. Non-contiguous regulatory elements can act in concert by, for example, looping out of intervening chromatin, to bring them into proximity, or by recruitment of enzymatic complexes that translocate along chromatin from one element to another. Determining the sequence content of these cis-acting regulatory elements offers tremendous insight into the nature and actions of the trans-acting factors which control gene expression, but is made difficult by the large distances by which they are separated from each other and from the genes which they regulate.
In order to address the problems associated with collecting, processing and analyzing the vast amounts of sequence data being generated by, e.g., genome sequencing projects, various bioinformatic techniques have been developed. In general, bioinformatics refers to the systematic development and application of information technologies and data processing techniques for collecting, searching, analyzing and displaying data obtained by experiments to make observations concerning biological processes.
One example of such an analysis involves the determination of sequences corresponding to expressed genes (expressed sequence tags, or ESTs) and computerized analysis of genome sequences by comparison to databases of expressed sequence tags. However, this type of analysis provides information on coding regions only and thus does not assist in the identification of regulatory sequences. Mapping of a particular EST onto a genome sequence and searching the region upstream of the EST for potential regulatory sequences is also ineffective, for several reasons. First, large introns and/or 5′ untranslated regions can separate an EST sequence from its upstream regulatory regions; therefore the genomic region to be searched for regulatory sequences is not clearly defined. Second, searches of a given region of a genome for sequences homologous to transcription factor binding sites will yield numerous “hits” (representing potential regulatory sequences), some of which are functional in a given cell and some of which are not. Thus, such searches will fail to provide unambiguous information as to which of several potential regulatory sequences are active in the regulation of expression of a given gene in a particular cell. Furthermore, it is likely that, with respect to a particular gene, different regulatory regions are functional in different cell types. Therefore, the problem of identifying regulatory sequences for a gene is specific to each cell type in which the gene is (or is not) expressed. Indeed, different regulatory sequences will often be responsible for regulating the expression of a particular gene in different cells.
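The second difficulty noted above, that a sequence-only search returns many potential binding sites with no indication of which are functional, can be illustrated with a minimal sketch. The region sequence, the degenerate consensus WGATAR and the helper name find_motif_hits are hypothetical choices made for illustration and are not taken from the disclosure. The scan covers one strand only.

```python
import re

# IUPAC nucleotide codes mapped to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "N": "[ACGT]",
}


def find_motif_hits(sequence, consensus):
    """Return 0-based start positions of every match on the given strand.

    Overlapping matches are reported because the pattern is wrapped
    in a zero-width lookahead.
    """
    pattern = "".join(IUPAC[base] for base in consensus.upper())
    return [m.start() for m in re.finditer(f"(?=({pattern}))", sequence.upper())]


# Hypothetical toy region and hypothetical degenerate consensus.
region = "TTGATAAGCCAGATAGGTTTGATATCCAGATAAAGATTA"
hits = find_motif_hits(region, "WGATAR")
print(f"{len(hits)} potential sites at positions {hits}")
```

Even in this short toy region the consensus matches more than once, and nothing in the sequence itself distinguishes a site that is occupied in a given cell from one that is not.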
Thus, the informational content of a gene does not depend solely on its coding sequence (a portion of which is represented in an EST), but also on cis-acting regulatory elements, present both within and flanking the coding sequences. These include promoters, enhancers, silencers, locus control regions, boundary elements and matrix attachment regions, all of which contribute to the quantitative level of expression, as well as the tissue- and developmental-specificity of expression of a gene. Furthermore, the aforementioned regulatory elements can also influence selection of transcription start sites, splice sites and termination sites.
Identification of cis-acting regulatory elements has traditionally been carried out by identifying a gene of interest, then conducting an analysis of the gene and its flanking sequences. Typically, one obtains a clone of the gene and its flanking regions, and performs assays for production of a gene product (either the natural product or the product of a reporter gene whose expression is presumably under the control of the regulatory sequences of the gene of interest). Here again, one encounters the problem that the extent of sequences to be analyzed for regulatory content is not concretely defined, since sequences involved in the regulation of metazoan genes can occupy up to 100 kb of DNA. Furthermore, assays for gene products are often tedious, and reporter gene assays are often unable to distinguish transcriptional from translational regulation and can therefore be misleading.
Pelling et al. (2000) Genome Res. 10:874-886 disclose a library of transcriptionally active sequences, derived by cloning chromosomal sequences that are immunoprecipitated by antibodies to hyperacetylated histone H4. This library comprises primarily coding sequences and sequences proximal to the transcription start site. The reference does not disclose methods for identifying regulatory sequences, databases of regulatory sequences or uses for databases of regulatory sequences.
It can thus be seen that a major limitation of current comparative genomics and bioinformatic analyses is that they are unable to identify cell-specific regulatory sequences. In light of these limitations, methods for identifying regulatory DNA sequences (particularly in a high-throughput fashion), libraries of regulatory sequences, and databases of regulatory sequences would considerably advance the fields of genomics and bioinformatics.
Disclosed herein are compositions and methods useful for designing exogenous regulatory molecules for regulating a gene of interest. These compositions and methods are useful for facilitating processes that depend upon gene expression. Regulatory molecules are any molecules that facilitate expression or repression of the gene of interest.
Accordingly, in one aspect, methods for designing one or more exogenous regulatory molecules for regulating a gene of interest are described. In certain embodiments, the methods comprise (a) providing polynucleotide sequences (or collections of polynucleotide sequences), each sequence (or collection comprising a plurality of polynucleotide sequences) corresponding to accessible regions of cellular chromatin in a sample; (b) identifying one or more sequence elements in the polynucleotides or collection of polynucleotides, wherein the one or more sequence elements are potential regulatory sequences for the gene of interest; and (c) preparing an exogenous regulatory molecule that comprises a DNA binding domain and a functional domain that activates or represses transcription of the gene of interest, wherein said preparing comprises selecting the DNA binding domain, the functional domain or both the DNA binding domain and the functional domain based upon the identified sequence elements. In embodiments in which collections of polynucleotides (e.g., libraries) are used, the collections can be stored on a computer-readable medium and the identifying can be performed with a computer.
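Where the collection is stored on a computer-readable medium, the identifying of step (b) might be sketched as follows: candidate sequence elements are kept only if they lie entirely within a region recorded as accessible in the sample. This is a minimal sketch under stated assumptions, not the disclosed implementation. The half-open interval representation, the function name hits_in_accessible_regions and all coordinates are hypothetical.

```python
from bisect import bisect_right


def hits_in_accessible_regions(hit_positions, element_length, accessible_regions):
    """Keep candidate elements that fall entirely inside an accessible region.

    accessible_regions is assumed to be a list of non-overlapping,
    half-open (start, end) intervals sorted by start coordinate.
    """
    starts = [start for start, _ in accessible_regions]
    kept = []
    for pos in hit_positions:
        # Locate the last region whose start is at or before this position.
        idx = bisect_right(starts, pos) - 1
        if idx < 0:
            continue
        start, end = accessible_regions[idx]
        if start <= pos and pos + element_length <= end:
            kept.append(pos)
    return kept


# Hypothetical candidate element positions and accessible intervals.
candidate_hits = [1, 10, 27]
accessible = [(8, 20), (30, 39)]
print(hits_in_accessible_regions(candidate_hits, 6, accessible))
```

Because accessibility is measured in a particular sample, the same candidate list can yield a different filtered set for each cell type examined.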
In one embodiment, the identifying of potential regulatory sequences comprises identifying a gateway accessible region; and the selecting comprises choosing the DNA binding domain (e.g., a zinc finger DNA-binding domain) of the exogenous regulatory molecule to specifically bind to a segment of the gateway accessible region. In other embodiments, the identifying of potential regulatory sequences comprises identifying a functional accessible region and determining whether the functional accessible region comprises a binding site for a transcription factor (e.g., a zinc finger binding site); and the selecting comprises choosing the functional domain of the exogenous regulatory molecule to be the same as the functional domain of the transcription factor.
The polynucleotides of the methods described herein (or collection of polynucleotide sequences) can be obtained in a variety of ways, for example, by (a) treating cellular chromatin with a chemical or enzymatic probe wherein the probe reacts with accessible polynucleotide sequences; (b) fragmenting the treated chromatin to produce polynucleotide fragments (or a collection of polynucleotide fragments), wherein the polynucleotide fragments (or collection) comprise marked polynucleotides and unmarked polynucleotides, and wherein each marked polynucleotide contains one or more sites of probe reaction; (c) collecting marked polynucleotides, wherein the marked polynucleotides comprise polynucleotide sequences present in accessible regions of cellular chromatin; and (d) determining the nucleotide sequences of the marked polynucleotides to obtain the polynucleotide sequences (or collection of sequences) corresponding to accessible regions related to the gene of interest.
In other embodiments, the polynucleotide sequences (or collection of polynucleotide sequences) are obtained by a method that comprises (a) treating cellular chromatin with a methylase to generate methylated chromatin; (b) deproteinizing the methylated chromatin to form deproteinized chromatin; (c) digesting the deproteinized chromatin with a methylation-dependent restriction enzyme to produce restriction fragments (or a collection of restriction fragments), wherein the restriction fragments (or collection) comprise methylated polynucleotides and non-methylated polynucleotides; (d) collecting non-methylated polynucleotides, wherein the termini of the non-methylated polynucleotides correspond to accessible regions of cellular chromatin; and (e) determining the nucleotide sequences of the termini of the non-methylated polynucleotides to obtain the polynucleotide sequences (or collection of polynucleotide sequences).
In yet other embodiments, the polynucleotide sequences (or collection of polynucleotide sequences) are obtained by a method that comprises: (a) treating cellular chromatin with a methylase to generate methylated chromatin; (b) deproteinizing the methylated chromatin to form deproteinized chromatin; (c) digesting the deproteinized chromatin with a methylation-dependent restriction enzyme to produce restriction fragments (or a collection of restriction fragments), wherein the fragments (or collection) comprise methylated polynucleotides and non-methylated polynucleotides; (d) collecting methylated polynucleotides, wherein the methylated polynucleotides correspond to accessible regions of cellular chromatin; and (e) determining the nucleotide sequences of the methylated polynucleotides to obtain the polynucleotide sequences (or collections of polynucleotides).
In still further embodiments, the polynucleotide sequences (or collection of polynucleotide sequences) are obtained by a method that comprises: (a) treating cellular chromatin with a nuclease; (b) collecting polynucleotide fragments released by nuclease treatment, wherein the released polynucleotide fragments are derived from accessible regions of cellular chromatin; and (c) determining the nucleotide sequences of the released polynucleotide fragments to obtain the polynucleotide sequences (or collection of polynucleotide sequences).
In other embodiments, the polynucleotide sequences (or collection of polynucleotide sequences) are obtained by a method that comprises: (a) treating cellular chromatin with a methylation-sensitive enzyme that cleaves at unmethylated CpG sequences; (b) collecting short polynucleotide fragments released by enzyme treatment, wherein the polynucleotide fragments are derived from regulatory regions of cellular chromatin; and (c) determining the nucleotide sequences of the released polynucleotide fragments to obtain the polynucleotide sequences (or collection of polynucleotide sequences).
In other embodiments, the polynucleotide sequences (or collection of polynucleotide sequences) are obtained by a method that comprises: (a) treating cellular DNA with an agent that selectively cleaves AT-rich sequences; (b) collecting large polynucleotide fragments released by the treatment, wherein the large polynucleotide fragments comprise regulatory regions; and (c) determining the nucleotide sequences of the large polynucleotide fragments to obtain the polynucleotide sequences (or collection of polynucleotide sequences).
In other embodiments, the polynucleotide sequences (or collection of polynucleotide sequences) are obtained by a method that comprises: (a) treating cellular DNA with an agent that selectively cleaves AT-rich sequences to form a mixture of methylated and unmethylated fragments enriched in CpG islands; (b) separating the unmethylated fragments from the methylated fragments to obtain a collection of unmethylated fragments enriched in CpG islands, wherein the unmethylated fragments are derived from regulatory regions of cellular chromatin; and (c) determining the nucleotide sequences of the unmethylated fragments to obtain the polynucleotide sequences (or collection of polynucleotide sequences).
In other embodiments, the polynucleotide sequences (or collection of polynucleotide sequences) are obtained by a method that comprises: (a) fragmenting chromatin; (b) contacting the fragments with an antibody that specifically binds to acetylated histones, thereby forming an immunoprecipitate enriched in polynucleotides corresponding to accessible regions; (c) collecting the polynucleotides from the immunoprecipitate; and (d) determining the nucleotide sequences of the collected polynucleotides to obtain the polynucleotide sequences (or collection of polynucleotide sequences).
In still further embodiments, the polynucleotide sequences (or collection of polynucleotide sequences) are obtained by a method that comprises: (a) reacting cellular chromatin with a chemical or enzymatic probe to generate chromatin-associated DNA fragments, wherein the DNA fragments comprise, at their termini, sites of probe reaction which identify accessible regions of cellular chromatin; (b) attaching an adapter polynucleotide to the termini generated by the probe to generate adapter-ligated fragments; (c) amplifying the adapter-ligated fragments in the presence of a first primer that is complementary to the adapter and a second primer that is complementary to a segment of a gene of interest to form one or more amplified products; and (d) determining the nucleotide sequences of the amplified products to obtain the polynucleotide sequences (or collection of polynucleotide sequences). In some instances, a plurality of second primers, each complementary to a segment of a different gene of interest, is used to generate a plurality of amplification products.
In other embodiments, the polynucleotide sequences (or collection of polynucleotide sequences) are obtained by a method that comprises: (a) reacting cellular chromatin with a chemical or enzymatic probe to generate chromatin-associated DNA fragments, wherein the DNA fragments comprise, at their termini, sites of probe reaction which identify accessible regions of cellular chromatin; (b) attaching a first adapter polynucleotide to the termini generated by the probe to generate adapter-ligated fragments; (c) digesting the adapter-ligated fragments with a restriction enzyme to generate a population of digested fragments, wherein the population comprises digested fragments having a first end that comprises the first adapter and a second end formed via the activity of the restriction enzyme; (d) contacting the digested fragments with a primer complementary to the first adapter under conditions wherein the primer is extended to generate a plurality of extension products, each comprising a first end that comprises the first adapter and a second end that can be attached to a second adapter polynucleotide; (e) joining the second adapter to the second end of each of the plurality of extension products to form a plurality of modified fragments, each of which comprises the first and second adapters at its first and second end, respectively; (f) amplifying the plurality of modified fragments in the presence of primers complementary to the sequences of the first and second adapters to generate a population of amplified products comprising sequences corresponding to accessible regions of cellular chromatin; and (g) determining the nucleotide sequences of the amplified products to obtain the polynucleotide sequences (or collection of polynucleotide sequences).