There have been efforts to describe the regulatory landscape of the human genome using high-throughput or computational methods. Individual labs as well as the ENCODE project, for example, have provided a genome-wide catalogue of DNA elements in the human genome. Shown in FIG. 1 are certain examples of these efforts. Regions that have been identified by high-throughput methods such as ChIP-Seq are sometimes called DNA elements because these sequences have been shown to bind a protein but their effect on gene expression is unknown. For example, as shown in FIG. 1, data type A (reference 152 in FIG. 1) are regulatory regions identified by low-throughput experiments and data type B (154) are DNA elements identified by high-throughput studies.
Among other things, computational analyses of these data have identified patterns of chromatin modification that mark transcriptionally active regions, providing a global view of putative regulatory elements in the human genome. Recent efforts have included the Genotype-Tissue Expression (GTEx; www.commonfund.nih.gov/GTEx/) program to identify eQTLs, variations that are associated with changes in gene expression (see FIG. 1, association 118-12 between data types C (polymorphisms 156) and F (expression 162)). Although high-throughput experiments provide broad coverage, the direct mechanism of regulation between these nucleotides and their target gene may not have been identified. The availability of genome sequences of other mammals has made it possible to identify regions of high conservation in non-protein coding regions, leading to the identification of regulatory regions essential for development (see FIG. 1, data type E (conservation blocks and computational predictions 160)). Additionally, computational algorithms can predict regions that are required for gene regulation, such as regions in the 3′ UTR that are essential for regulation by microRNAs (see FIG. 1, data type E (conservation blocks and computational predictions 160)). These computational datasets can provide hypotheses of functional nucleotides, but their biological significance must still be evaluated manually. Shown in FIG. 1 are archetypes of biological data and the current associations between them.
Sequence information, associations between two data types, and a catalog of DNA elements in the human genome alone offer little to scientists and clinicians unless they are associated with functional information. Much of the current knowledge about the role of nucleotides in intergenic and non-coding regions in transcriptional and translational regulation comes from directed experimental studies that are published in peer-reviewed journals. Regions that have been demonstrated, using mutagenesis and reporter experiments, to have an effect on protein-nucleic acid interactions, nucleic acid-nucleic acid interactions, or gene expression will be referred to as regulatory or functional regions even though their effect on gene function may be limited.
Problematically, it is not easy to identify the relevant literature by searching databases such as PubMed. For example, it is not possible with a single PubMed query to find all the papers that identify regulatory regions for the beta-globin locus, which contains several developmentally regulated hemoglobin genes, or for the transcription factor STAT3. In February 2011, a search of “beta-globin (with all symbols, names, and aliases) and regulation” of all PubMed records indexed with a “humans” MeSH term retrieved 1334 publications. Only 13% of these papers (177) contain information providing nucleotides or coordinates for regions necessary for repression and activation at the beta-globin locus; the rest discuss post-translational regulation of the proteins required for beta-globin expression. A similar search for STAT3 found 167 out of 1722 papers (9%) that contain information identifying specific nucleotides in STAT3 binding sites or regions that regulate STAT3. Finding and reading these papers on intergenic and non-coding regions is not a feasible task for scientists or clinicians who wish to identify functional nucleotides in hundreds if not thousands of non-coding regions. Even if the papers can be identified, the data cannot easily be integrated into an analysis pipeline.
As sequencing costs drop, full genome sequencing has become possible. Genome sequencing centers predict 30,000 human genome sequences will be available by the end of 2011. But non-coding regions represent 99% of the entire human genome, and little is known about many variants already identified in GWAS studies (see FIG. 1, connection 168-3 between data types D (gene function 158) and G (diseases and phenotypes 164)). Analysis and understanding of variation in the human genome has largely focused on protein-coding regions because existing tools and databases have annotated the biological function of genes and their role in biological pathways and disease processes. The Swiss-Prot records at UniProtKB contain annotations, drawn from the experimental literature, of amino acids that are located in the active site or are involved in interactions with other proteins or ligands. PolyPhen, a widely used prediction algorithm that identifies deleterious non-synonymous SNPs (nsSNPs), uses these literature-curated data to aid the prediction. Such resources have allowed exome sequencing to become successful as a diagnostic tool to identify genetic causes of rare diseases.
Increased availability of regulatory nucleotides from directed experimental investigations can directly annotate variants identified in GWAS studies and provide biological context to high-throughput and computational datasets, and it can also provide additional information about variants that are in linkage disequilibrium. Therefore, even if the specific SNP identified in a GWAS study has not been studied in the biomedical literature, annotations for regulatory nucleotides in linkage disequilibrium may implicate genes and pathways that contribute to the pathophysiology of disease. In order to accomplish this, regulatory elements in intergenic and non-coding regions must be integrated with high-throughput datasets that describe DNA elements and regions of sequence variation throughout the genome, as well as with annotations that provide functional information and clinical relevance of the genes that are being regulated. In addition, tools and analysis pipelines will need to be developed in order to facilitate the annotation of affected SNPs as well as the identification of relevant SNPs, biological processes, and diseases for results of GWAS studies and whole-genome sequencing.
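By way of illustration only, the kind of annotation described above can be sketched as follows. This is a minimal sketch under stated assumptions, not a description of any existing tool: the record types, the function names, the coordinates, the SNP identifiers, and the linkage disequilibrium table are all hypothetical and are chosen only to show how a GWAS SNP that lies outside any curated region may still be annotated through a variant in linkage disequilibrium with it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """A curated regulatory region (hypothetical record layout)."""
    chrom: str
    start: int  # 1-based, inclusive
    end: int    # inclusive
    label: str


@dataclass(frozen=True)
class Snp:
    """A single-nucleotide variant (hypothetical record layout)."""
    snp_id: str
    chrom: str
    pos: int


def overlapping_labels(snp, regions):
    """Return labels of all regions that contain the SNP position."""
    return [
        r.label
        for r in regions
        if r.chrom == snp.chrom and r.start <= snp.pos <= r.end
    ]


def annotate_with_ld(gwas_snp_id, snps, ld_proxies, regions):
    """Annotate a GWAS SNP and the SNPs in linkage disequilibrium with it.

    Returns a dict mapping each SNP id that falls inside at least one
    regulatory region to the labels of those regions.
    """
    candidates = [gwas_snp_id] + ld_proxies.get(gwas_snp_id, [])
    annotations = {}
    for snp_id in candidates:
        labels = overlapping_labels(snps[snp_id], regions)
        if labels:
            annotations[snp_id] = labels
    return annotations


# Entirely made-up example data (assumptions for illustration only).
regions = [Region("chr11", 5000, 5200, "hypothetical enhancer of gene X")]
snps = {
    "snpA": Snp("snpA", "chr11", 9000),  # GWAS hit, outside any region
    "snpB": Snp("snpB", "chr11", 5100),  # LD proxy, inside the enhancer
}
ld_proxies = {"snpA": ["snpB"]}

annotations = annotate_with_ld("snpA", snps, ld_proxies, regions)
print(annotations)
```

In this made-up example the GWAS SNP itself receives no annotation, but its proxy falls inside the hypothetical enhancer, which is how annotations for regulatory nucleotides in linkage disequilibrium could implicate a regulated gene. A linear scan over regions is used for brevity; a genome-scale pipeline would likely need an interval index.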
With the costs of DNA sequencing decreasing, the number of genomes from both healthy and disease tissues is rapidly increasing, from the reference genome in 2003, to 16 in 2009, to 50-300 in 2010, to an estimated 30,000 in 2011. A major challenge ahead is to interpret genome sequences and to identify variants responsible for normal and disease phenotypes. At present, most efforts have focused on the identification of changes in protein-coding genes and microRNAs (miRNAs), where deleterious alterations can sometimes be deduced. For example, analysis of the Quake genome and, more recently, those of ten other healthy individuals has revealed numerous changes in the protein-coding genes. But most variations from genome sequences, as well as markers from genome-wide association studies (GWAS), are nucleotide and structural variants that lie outside of coding sequences, and generally these variants are not interpreted. An analysis of dbSNP in March 2011 revealed that approximately 95% of currently known variants are located in non-protein coding regions but fewer than 0.1% have been associated with a publication.
In the last several decades a great deal of information has been generated to analyze regulatory and non-coding sequences in the genome. Initially this information involved analysis of individual genes through mutagenesis and analysis of elements in reporter and/or biochemical assays such as “gel shift” or Chromatin Immunoprecipitation (ChIP). With the advent of genomic approaches in the past decade, high-throughput studies have been implemented to map regulatory elements on a global scale. These include ChIP-chip (chromatin immunoprecipitation followed by microarray analyses) or ChIP-Seq (chromatin immunoprecipitation followed by DNA sequencing) to identify targets throughout the genome, and expression quantitative trait loci (eQTL) studies to map potentially regulatory or associated SNPs using changes in expression in a cell or tissue. Currently, systematic efforts to collect such information through the ENCODE project have generated approximately 500 ChIP-Seq datasets, and this count does not include the large number of datasets generated by individual laboratories that are not part of the ENCODE project. Presently there is no single resource that houses both the low-throughput data from individual labs and the global data from individual labs and consortia. Such a resource would be valuable for the interpretation of variants from large-scale projects such as the HapMap project and the Cancer Genome Atlas (www.cancergenome.nih.gov/), as well as the personal genome sequencing efforts going on all around the world, for example.
Therefore, there is a need in the art to associate functional information with non-protein coding variants, so that variants from personal and disease genome sequences as well as GWAS studies can be evaluated by researchers for phenotypic and disease potential.