Worldwide genome sequencing efforts are providing a wealth of information on the sequence and structure of various genomes, and on the locations of thousands of genes. In addition, genome research is yielding a considerable amount of information on gene products and their functions. The next challenges will be in the understanding and interpretation of genomic information. A major limitation in the analysis of genome sequence information to date is that little information has been extracted from genome sequences on the location, extent, nature and function of sequences that regulate gene expression, i.e., gene regulatory sequences.
The cis-acting sequence elements that participate in the regulation of a single metazoan gene can be distributed over 100 kilobase pairs or more. Combinatorial utilization of regulatory elements allows considerable flexibility in the timing, extent and location of gene expression. The separation of regulatory elements by large linear distances of DNA sequence facilitates separation of functions, allowing each element to act individually or in combination with other regulatory elements. Non-contiguous regulatory elements can act in concert by, for example, looping out of intervening chromatin to bring the elements into proximity, or by recruitment of enzymatic complexes that translocate along chromatin from one element to another. Determining the sequence content of these cis-acting regulatory elements offers tremendous insight into the nature and actions of the trans-acting factors that control gene expression, but is made difficult by the large distances that separate the elements from each other and from the genes they regulate.
In order to address the problems associated with collecting, processing and analyzing the vast amounts of sequence data being generated by, e.g., genome sequencing projects, various bioinformatic techniques have been developed. In general, bioinformatics refers to the systematic development and application of information technologies and data processing techniques for collecting, searching, analyzing and displaying data obtained by experiments to make observations concerning biological processes.
One example of such an analysis involves the determination of sequences corresponding to expressed genes (expressed sequence tags, or ESTs) and computerized analysis of a genome sequence by comparison to databases of expressed sequence tags. However, this type of analysis provides information on coding regions only and thus does not assist in the identification of regulatory sequences. Mapping of a particular EST onto a genome sequence and searching the region upstream of the EST for potential regulatory sequences is also ineffective, for at least two reasons. First, large introns and/or 5′ untranslated regions can separate an EST sequence from its upstream regulatory regions; therefore the genomic region to be searched for regulatory sequences is not clearly defined. Second, searches of a given region of a genome for sequences homologous to transcription factor binding sites will yield numerous “hits” (representing potential regulatory sequences), some of which are functional in a given cell and some of which are not. Thus, such searches will fail to provide unambiguous information as to which of several potential regulatory sequences are active in the regulation of expression of a given gene in a particular cell. Furthermore, with respect to a particular gene, different regulatory regions are likely to be functional in different cell types. Therefore, the problem of identifying regulatory sequences for a gene is specific to each cell type in which the gene is (or is not) expressed.
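The ambiguity of binding-site searches can be illustrated with a minimal sketch. The code below is not a method taken from any cited work; it is a simple consensus scan written for illustration. The function names (`scan`, `expected_chance_hits`), the use of IUPAC degenerate codes, and the assumption of a uniform, independent base composition are all assumptions made here. The motif in the usage note, WGATAR (a commonly cited GATA-factor consensus), is used purely as an example.

```python
import re

# IUPAC nucleotide codes mapped to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "W": "[AT]", "S": "[CG]",
    "K": "[GT]", "M": "[AC]", "N": "[ACGT]",
}
# Complement table that also handles the degenerate codes above.
COMPLEMENT = str.maketrans("ACGTRYWSKMN", "TGCAYRWSMKN")


def reverse_complement(seq):
    """Return the reverse complement of a (possibly degenerate) sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def motif_to_regex(motif):
    """Compile a consensus motif; the lookahead reports overlapping matches."""
    body = "".join(IUPAC[base] for base in motif.upper())
    return re.compile("(?=(" + body + "))")


def scan(sequence, motif):
    """List (position, strand, matched text) for every consensus match.

    Positions are 0-based on the given strand. A palindromic motif is
    reported once per strand, i.e. twice at the same position.
    """
    sequence = sequence.upper()
    hits = []
    for strand, pattern in (("+", motif), ("-", reverse_complement(motif))):
        for match in motif_to_regex(pattern).finditer(sequence):
            hits.append((match.start(), strand, match.group(1)))
    return sorted(hits)


def expected_chance_hits(motif, length):
    """Expected number of matches on both strands of a random sequence.

    Assumes each base occurs independently with probability 1/4, which
    real genomes only approximate.
    """
    probability = 1.0
    for base in motif.upper():
        probability *= len(IUPAC[base].strip("[]")) / 4
    return 2 * (length - len(motif) + 1) * probability
```

Under these assumptions, `expected_chance_hits("WGATAR", 100000)` evaluates to about 195, so a 100 kb region would be expected to contain on the order of two hundred matches to this one short motif by chance alone. The scan reports where a match occurs, but nothing in its output indicates whether a given match is bound or functional in a particular cell.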
Thus, the informational content of a gene does not depend solely on its coding sequence (a portion of which is represented in an EST), but also on cis-acting regulatory elements, present both within and flanking the coding sequences. These include promoters, enhancers, silencers, locus control regions, boundary elements and matrix attachment regions, all of which contribute to the quantitative level of expression, as well as the tissue- and developmental-specificity of expression of a gene. Furthermore, the aforementioned regulatory elements can also influence selection of transcription start sites, splice sites and termination sites.
Identification of cis-acting regulatory elements has traditionally been carried out by identifying a gene of interest, then conducting an analysis of the gene and its flanking sequences. Typically, one obtains a clone of the gene and its flanking regions, and performs assays for production of a gene product (either the natural product or the product of a reporter gene whose expression is presumably under the control of the regulatory sequences of the gene of interest). Here again, one encounters the problem that the extent of sequences to be analyzed for regulatory content is not concretely defined, since sequences involved in the regulation of metazoan genes can occupy up to 100 kb of DNA. Furthermore, assays for gene products are often tedious, and reporter gene assays are often unable to distinguish transcriptional from translational regulation and can therefore be misleading.
Pelling et al. (2000) Genome Res. 10:874-886 disclose a library of transcriptionally active sequences, derived by cloning chromosomal sequences that are immunoprecipitated by antibodies to hyperacetylated histone H4. This library comprises primarily coding sequences and sequences proximal to the transcription start site. The reference does not disclose methods for identifying regulatory sequences, databases of regulatory sequences or uses for databases of regulatory sequences.
It can thus be seen that a major limitation of current comparative genomics and bioinformatic analyses is that they are unable to identify cell-specific regulatory sequences. In light of these limitations, methods for identifying regulatory DNA sequences (particularly in a high-throughput fashion), libraries of regulatory sequences, and databases of regulatory sequences would considerably advance the fields of genomics and bioinformatics.