Eukaryotic genes are regulated by sequence-specific DNA-binding proteins. These factors interact with other proteins to modify and remodel resident chromatin, and ultimately promote the assembly of the transcription pre-initiation complex (PIC) that includes RNA polymerase II and its associated elongation factors. In Saccharomyces cerevisiae, over 400 proteins are involved in some aspect of gene regulation. In vertebrates, this number increases by an order of magnitude.
Each of the hundreds of gene regulatory proteins has a function, and that function depends on its precise location in the genome. For example, nucleosomes package DNA and control its accessibility. A TATA box, which binds TBP and promotes PIC assembly, may be located just inside the nucleosome border, where it is sequestered and thus unable to promote PIC assembly. Movement of the nucleosome by as little as 5 base pairs (bp) has been shown to promote TATA box utilization in yeast. In higher eukaryotes, the stereospecific arrangement of proteins in an enhanceosome is critical for its function. Repositioning its DNA-binding components by a few bp changes the rotational setting of the components on the DNA helical surface, resulting in nonfunctional complexes. Knowing the exact position of proteins to near single-bp resolution provides mechanistic information on how proteins work together to regulate gene expression.
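As a rough illustration of why a shift of only a few bp matters, the sketch below converts a linear shift along the DNA into a change in rotational setting. It assumes an average B-form helical repeat of about 10.5 bp per turn, a textbook value that is not stated above, and the function name is purely illustrative.

```python
# Assumed average helical repeat of B-form DNA (bp per full turn).
# This is a textbook approximation, not a value taken from the text.
HELICAL_REPEAT_BP = 10.5


def rotational_shift_degrees(shift_bp, helical_repeat=HELICAL_REPEAT_BP):
    """Return the change in rotational setting (0 to 360 degrees)
    caused by moving a binding site shift_bp along the helix."""
    return (shift_bp * 360.0 / helical_repeat) % 360.0


if __name__ == "__main__":
    for shift in (1, 3, 5, 10):
        angle = rotational_shift_degrees(shift)
        print(f"shift of {shift} bp -> {angle:.1f} degrees around the helix")
```

Under this assumption, a 5 bp shift moves a site by about 171 degrees, which places it on nearly the opposite face of the helix, whereas a shift of about one full helical turn leaves the rotational setting almost unchanged.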
Other types of genomic location information are also quite valuable, and perhaps of more widespread interest. For example, knowing which gene (or promoter region) a regulatory protein binds to provides information about the repertoire of genes that a protein might regulate. Knowing which transcription factors bind to which genes and exactly where they bind is critical in understanding how genes are mis-regulated in human diseases. To this day, it remains largely unknown precisely where gene regulatory proteins are bound throughout a genome.
Chromatin immunoprecipitation (ChIP) has been one of the most widely used assays to measure the occupancy of a protein at a particular genomic location in vivo. This method, however, is associated with several drawbacks. For example, the experimenter is required to know, within a few hundred bp, where the protein is bound in the genome, because specific primers need to be designed to interrogate that position. In 2000, a technological advance was made with the development of ChIP-chip. Rather than having to know a priori where a factor might bind, DNA microarray technology allowed virtually an entire genome to be queried (reviewed by Pugh and Gilmour, Genome vol. 2 REVIEWS 1013.Epub, 2001). In essence, the bound DNA could be identified by its ability to hybridize to an array of pre-defined genomic sequences or probes. Early microarrays consisted of long PCR-generated probes that spanned intergenic regions (in compact genomes like yeast). Later microarrays consisted of oligonucleotides uniformly tiled across an entire genome. Microarray-based ChIP-chip procedures, however, have no mechanism to map individual molecules, and instead measure the bulk behavior of populations of DNA molecules.
ChIP-chip is now being replaced by ChIP-seq, where instead of using microarrays to identify the bound DNA, each bound DNA molecule is sequenced. This provides a much more stringent sequence identification than the DNA hybridization approach of microarrays, which can be confounded by cross-hybridization of similar sequences. Early ChIP-seq experiments were limited by the number of sequencing reactions that could be performed, and only achieved a statistical representation of genomic binding, rather than complete saturation coverage. Improvements included concatenations of single “tags” (ChIP-SAGE), or Paired-End-diTags of ChIP DNA fragments (ChIP-PET) (Kim, J. et al., Nat. Methods. 2(1): 47-53, 2005; Chen, J., and Sadowski, I., Proc. Natl. Acad. Sci. USA. 102: 4813-4818, 2005; Bhinge, A. A., et al., Genome Res. 17: 910-916, 2007).
The most significant breakthrough in ChIP-seq technology came with the development of massively parallel DNA sequencing. The Roche/454 genome sequencer was the first, achieving 300,000 sequencing reads in a single run. Next came Illumina/Solexa, which produced up to 80 million short sequencing reads or tags per run, and Applied Biosystems SOLiD, which has a similar short-read capability. The short 25-35 bp reads were ideal because 25-35 bp of sequence is sufficient to uniquely identify the vast majority of genomic sites. Longer sequencing reads only increase cost and provide diminishing marginal returns. Typically, 10 million tags are considered to be sufficient.
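The claim that 25-35 bp is enough to identify most sites can be checked with a back-of-the-envelope estimate. The sketch below assumes an idealized genome of independent, equally frequent bases, and the genome sizes (about 12 Mb for yeast, about 3 Gb for human) are approximate round figures supplied here for illustration. Real genomes contain repeats, which is why short reads identify the vast majority of sites rather than all of them.

```python
def expected_chance_matches(read_length, genome_size_bp):
    """Expected number of chance occurrences of one specific read
    in a random genome, counting both strands."""
    positions = 2 * genome_size_bp
    return positions / 4 ** read_length


def min_unique_read_length(genome_size_bp):
    """Smallest read length whose expected chance matches fall below 1
    under the random-sequence assumption."""
    positions = 2 * genome_size_bp
    length = 1
    while 4 ** length <= positions:
        length += 1
    return length


if __name__ == "__main__":
    # Approximate genome sizes, given here only as illustrative assumptions.
    genomes = {"yeast": 12_000_000, "human": 3_000_000_000}
    for name, size in genomes.items():
        print(name, "minimum length:", min_unique_read_length(size))
        for read_length in (25, 35):
            chance = expected_chance_matches(read_length, size)
            print(f"  {read_length} bp read: {chance:.2e} chance matches")
```

Under these assumptions, reads of 25 bp already sit well above the theoretical minimum for either genome, which is consistent with longer reads offering diminishing returns for the purpose of locating a tag.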
As powerful as ChIP-seq is, there are technical limitations that preclude its full potential from being realized. A major limitation of ChIP technology is the fragment size and heterogeneity achieved after sonication. Proteins are mapped to their bound DNA fragments, so the larger and more heterogeneous in size the DNA fragments are, the larger and more dispersed the region that is assigned to the bound protein, making interpolation of the binding location less precise. Thus, one may be able to locate a protein only to within ±300 bp of its actual bound site, whereas ±1 bp is ideal.
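The effect of fragment size on mapping precision can be illustrated with a toy simulation. This is only a sketch under simplifying assumptions that are not part of the description above: fragment lengths are drawn uniformly from a range, the bound site (placed at coordinate 0) is equally likely to lie anywhere within a fragment, and each fragment contributes one tag from a randomly chosen end. All names and parameter values are hypothetical.

```python
import random
import statistics


def simulate_tag_offsets(min_len, max_len, n_fragments=10000, seed=0):
    """Return tag positions relative to a bound site at coordinate 0."""
    rng = random.Random(seed)
    offsets = []
    for _ in range(n_fragments):
        length = rng.randint(min_len, max_len)
        # The site is assumed to lie anywhere within the fragment.
        left_end = -rng.randint(0, length)
        right_end = left_end + length
        # One end of each fragment is sequenced to produce a tag.
        offsets.append(rng.choice((left_end, right_end)))
    return offsets


def spread(offsets):
    """Population standard deviation of tag positions, in bp."""
    return statistics.pstdev(offsets)


if __name__ == "__main__":
    short = simulate_tag_offsets(100, 200)
    long_ = simulate_tag_offsets(200, 600)
    print(f"100-200 bp fragments: spread {spread(short):.0f} bp")
    print(f"200-600 bp fragments: spread {spread(long_):.0f} bp")
```

In this toy model the tags from the larger, more heterogeneous fragments scatter over a wider window around the site, mirroring the loss of precision described above.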
A second limitation of ChIP is noise. Noise arises from DNA in the input sample that nonspecifically adheres to the immunoprecipitating resin, and then leaches off during the elution step. Since only a small fraction of a genomic locus is actually crosslinked to the target protein, and since most genomic loci are not bound by the factor, even a small amount of contaminating nonspecific genomic DNA represents a large amount of background. In ChIP-chip experiments, this background decreases the dynamic range of the binding assay, and results in false positives and negatives. The higher resolution inherent in ChIP-seq alleviates this somewhat, but such issues remain. In particular, noise in ChIP-seq requires more sequencing tags to be obtained, which increases costs.