The disclosed subject matter relates to a technique for quantifying the sequence specificity of nucleotide binding factors.
The sequence-specific regulatory activity of a nucleic acid binding trans-acting factor (“TF”), such as a DNA-binding protein, an RNA-binding protein, a non-coding RNA molecule (for example, a micro-RNA), an RNA-protein complex, or a synthetic molecule such as a polyamide, is the result of energetically favorable interactions between amino acids exposed in the DNA binding domain of the TF and portions of nucleobases exposed in the major and/or minor groove of the DNA, or exposed by RNA. A computational method for discovering the sequence specificity of a TF cannot provide a quantitative description of TF binding unless it considers the physical underpinnings of the TF-DNA (or TF-RNA) interaction.
Current computational methods typically discover over-represented patterns in a set of nucleotide sequences considered to be bound by the TF. Such techniques use the information content of nucleotide patterns as a proxy for the free energy contributions of the bases found in the TF binding site. Other computational methods infer physically-based TF binding specificities from measured TF binding affinities for a small set of oligonucleotides or from structural modeling of protein-DNA interaction.
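By way of illustration only, the information-content proxy described above can be sketched as follows. This is a minimal sketch and not the disclosed technique: the function names, the pseudocount, the uniform background frequency of 0.25, and the example sites are all assumptions introduced here, and the log-ratio "pseudo-energy" is meaningful only under the assumed analogy between base frequencies and Boltzmann weights.

```python
import math
from collections import Counter

BASES = "ACGT"


def position_frequencies(sites, pseudocount=1.0):
    """Per-position base frequencies from equal-length, aligned sites.

    The pseudocount (an assumption of this sketch) avoids zero frequencies.
    """
    length = len(sites[0])
    total = len(sites) + pseudocount * len(BASES)
    freqs = []
    for i in range(length):
        counts = Counter(site[i] for site in sites)
        freqs.append({b: (counts[b] + pseudocount) / total for b in BASES})
    return freqs


def information_content(freqs, background=0.25):
    """Relative entropy, in bits, of each position against a uniform background."""
    return [
        sum(p * math.log2(p / background) for p in col.values())
        for col in freqs
    ]


def pseudo_energies(freqs):
    """Per-base proxy for free energy, -ln(f_base / f_max), per position.

    Under the assumed frequency/Boltzmann analogy the values are in units
    of kT, with the most frequent base at each position set to zero.
    """
    return [
        {b: -math.log(col[b] / max(col.values())) for b in BASES}
        for col in freqs
    ]


# Hypothetical aligned sites, used only to exercise the functions.
example_sites = ["ACGT", "ACGA", "ACGT", "TCGT"]
example_freqs = position_frequencies(example_sites)
example_ic = information_content(example_freqs)
example_energies = pseudo_energies(example_freqs)
```

In this sketch, a position at which every site carries the same base receives a higher information content than a position with mixed bases, and rarer bases receive larger pseudo-energies. The values depend on how the input sites were selected, which is one reason such a proxy is not a direct measurement of binding free energy.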
Genome-scale, quantitative measurements of TF occupancy are now available due to the advent of in vivo methods such as chromatin immunoprecipitation microarrays (“ChIP-chip”) or DNA adenine methylase profiling (“DamID”), or in vitro methods such as protein binding microarrays (“PBM”) and DNA immunoprecipitation microarrays (“DIP-ChIP”). Thus, it is no longer necessary to rely on small data sets, availability of protein-DNA structures, or the analogy between information content and statistical mechanics to infer free energy representations of transcription factor binding sites.
Unfortunately, current techniques are limited in their ability to quantify the affinity of binding factors for nucleotide sequences in a sequence-specific manner. Such limitations arise from an inability to process large nucleotide sequences or from inadequate structural modeling of the interactions between binding factors and nucleotide sequences. Accordingly, there exists a need for a technique for predicting binding affinities for nucleotide sequences.