High-throughput genome sequencing and bioinformatics technologies have dramatically eased the task of genomic annotation, producing parts lists for living organisms as simple as Mycoplasmas and as complex as mammals. What once took decades of work can now be completed in a few months1. Further progress in understanding an organism's biology requires the development and refinement of techniques to determine the dynamic interactions among its molecular parts2. A major difficulty of this task is the context-specific nature of gene regulation. The total space of possible transcriptional regulatory interactions for an organism is the product of the number of transcription factors, the number of genes, and the number of environmental contexts in which the cell might find itself. Methods to identify regulatory interactions must therefore efficiently distinguish the thousands of true regulatory interactions from the billions of possible ones.
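To give a sense of scale, the size of this search space can be sketched with a short calculation. The counts below are hypothetical, illustrative values chosen only to show the order of magnitude (they are not taken from this work), and the function name `regulatory_search_space` is likewise an assumption introduced for the example.

```python
def regulatory_search_space(n_tfs: int, n_genes: int, n_contexts: int) -> int:
    """Number of possible (transcription factor, gene, context) triples."""
    return n_tfs * n_genes * n_contexts


# Hypothetical counts for a bacterium-sized genome (assumptions, not measured data).
n_tfs = 300         # assumed number of transcription factors
n_genes = 4_500     # assumed number of genes
n_contexts = 1_000  # assumed number of distinct environmental contexts

total = regulatory_search_space(n_tfs, n_genes, n_contexts)
print(f"{total:,} candidate regulatory interactions")
```

Under these assumed counts the space already exceeds one billion candidate interactions, so a few thousand true interactions would make up only a tiny fraction of the possibilities a method has to consider.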
Pioneering efforts to identify regulatory interactions on a genome scale have used machine-learning algorithms to identify cis-regulatory motifs or transcription factor target genes from large sets of expression arrays3-18, genome-wide location analysis (ChIP-Chip)19,20, or a combination of these and other high-throughput methods21-26. In general, the accuracy of these methods has been evaluated by testing for functional enrichment of co-regulated genes, by experimental confirmation of selected regulatory relationships, or by cross-validation within the training data set. However, rigorous validation of their accuracy at the genome scale has remained elusive because no model organism has offered both a known regulatory structure and compatible experimental data. Therefore, the relative merits and broader utility of these approaches remain difficult to judge.