1. Field of the Invention
This invention falls within the general areas of bioinformatics, microarray analysis and DNA-protein binding assays.
2. Description of Related Art
It is known that all cells have genes that are basically on all of the time and others that are inducible or repressible via a number of mechanisms. Virtually all known mechanisms for gene regulation depend upon proteins known as transcription factors that attach to specific sequences of DNA in a region of the gene on the genome. Once these factors are attached to the DNA, they aid the cell in expressing the gene they are near. These regions of DNA are called regulatory regions. The most commonly referenced regulatory region is called a promoter because its existence in the presence of transcription factors promotes expression of the gene.
While some commonly occurring regulatory sequences have been identified (such as the well-known TATA box), promoter regions for a given gene are, in general, difficult to determine.
Since transcription factors are themselves proteins, they too were at some point the result of gene expression. (Transcription factors can also undergo changes after they are synthesized by the cell, sometimes in response to factors external to the cell. Sometimes the transcription factor is initially inert, and it is these changes that make it active.) In general, genes can form regulatory networks, with one gene's protein product acting as a transcription factor for another gene, whose product in turn acts as a transcription factor for yet another, and so on. Transcription factors may also act on more than one other gene, and may act to positively or negatively regulate genes that they are near (usually downstream). Finally, transcription factors frequently operate only when bound to other identical or different transcription factors; that is, it may take several transcription factors (which may be the result of different activation events) to turn on or off the expression of a gene.
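By way of illustration only, the cascading and combinatorial behavior described above can be sketched as a small set of rules in which a target gene is turned on when every factor of some activating rule is present and no repressing rule applies. The gene names (A, B, C, D, R), the `Rule` type, and the functions below are hypothetical and are not taken from any figure or method of this document. The sketch also assumes, as a simplification, that a gene's product is immediately active, and it ignores timing, expression levels, and post-synthesis changes.

```python
from typing import FrozenSet, Iterable, List, NamedTuple, Set


class Rule(NamedTuple):
    """One hypothetical regulatory interaction."""
    factors: FrozenSet[str]  # factors that must ALL be present to act
    target: str              # gene whose expression is affected
    activates: bool          # True = positive regulation, False = repression


def targets_on(rules: List[Rule], present: Set[str]) -> Set[str]:
    """Genes turned on by the factors currently present."""
    activated = {r.target for r in rules if r.activates and r.factors <= present}
    repressed = {r.target for r in rules if not r.activates and r.factors <= present}
    return activated - repressed  # in this sketch, repression overrides activation


def cascade(rules: List[Rule], stimulus: Iterable[str], max_rounds: int = 10) -> Set[str]:
    """Apply the rules repeatedly until the set of present products stops changing."""
    stimulus = set(stimulus)
    present = set(stimulus)
    for _ in range(max_rounds):  # bounded, since repression can cause oscillation
        updated = stimulus | targets_on(rules, present)
        if updated == present:
            break
        present = updated
    return present


# Hypothetical network: A -> B -> C, A and C together -> D, R represses D.
RULES = [
    Rule(frozenset({"A"}), "B", True),
    Rule(frozenset({"B"}), "C", True),
    Rule(frozenset({"A", "C"}), "D", True),
    Rule(frozenset({"R"}), "D", False),
]

print(sorted(cascade(RULES, {"A"})))
print(sorted(cascade(RULES, {"A", "R"})))
```

With only A supplied, the cascade reaches A, B, C, and D. When the repressing factor R is also present, D stays off even though both of its activating factors are available.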
It is the discovery and understanding of these networks and other control mechanisms of the cell that allows us to rationally design drugs, and to understand how diseased cells differ from normal cells.
To gain a deeper understanding of cellular mechanisms, researchers are trying to measure which of the genes of a given cell are activated. Current efforts using microarray technology attempt to measure mRNA expression levels in cells by hybridization with thousands of different DNA probes, each of which is created before the analysis to match a specific mRNA transcript from a living cell. Since mRNA is much more uniform than the folded protein products (the result of translation), it is much easier to estimate protein expression by measuring mRNA levels than by trying to measure the resulting protein levels.
While this is an incredibly powerful technique, no one experimental method can tell the whole story. In this case, while we can finally see that different cells express mRNA (and consequently proteins) in different quantities and at different times, we still can't see exactly why. In other words, we must still search for causality. In many cases, an experimental condition will trigger a different pattern of mRNA expression, but cells are so complicated that the change is rarely limited to one newly expressed gene. More likely, there is experimental noise that interferes with an already complex response to the stimulus: many genes are frequently expressed in response to even simple stimuli.
While we know that transcription factors are the intermediaries in the process of gene expression, we can't directly tell which ones are responsible for the observed expression pattern. In other words, we still can't get very precise cause and effect data for each gene which would allow us to build up a network of interactions between one transcription factor and genes expressed.
Many bioinformatic approaches are being taken to try to guess what the promoter sequences are for a given set of genes. The premise of these methods is that if we can find regulatory DNA sequences in common between several expressed genes, we can then try to deduce the transcription factors. Suppose that three hypothetical genes shown in FIG. 11B are all found to be expressed under the same conditions. Their genomic DNA (including the regulatory and the expressed regions of the genes) shows that the region marked “promoter region” (120) is the same in all three genes.
Such a determination may have been made, for example, by a bioinformatic computer program which searched for this particular sequence, or for sequences in common. This determination requires knowing the DNA sequence around the expressed region of the gene so that such analysis can be performed. Many organisms that are subjected to microarray analysis have full genome sequences available (yeast, Arabidopsis, fruit fly, etc.); however, this is far from a solved problem. While the above example may have been solvable by eye, reality is never this simple, for the following reasons among others:
    Regulatory sequences can be thousands of bases long, and the problem of determining a common sequence (consensus sequence) may not have a unique solution.
    Genes that are expressed together may share a transcription factor, but they may not. More importantly, there are so many genes being expressed in a cell at any given time that, even if we have a snapshot of these genes, it is not clear which ones are related. There are literally thousands of genes expressed at any given time in some cells, all of which might be related.
    Those genes that are not expressed might also be related, but some other transcription factor acting in a repressive manner could be counteracting the effects of the transcription factor that would otherwise help to promote the expression.
    Genes are expressed by a complicated set of transcription factors, sometimes in an all-or-nothing kind of relationship. It is virtually impossible to reliably determine all of the factors, especially when the relationship is not known a priori. (see Arnone and Davidson, Development 124:1851-1864)
    Many promoter sequences can vary slightly by one or more bases and still be identified by the same transcription factor, making the similarity determination much more complicated than the above example.
    We are not just trying to determine the transcription factors; we are trying to determine the relationship between the transcription factors as well. Either one would be easier to determine if the other were known.
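By way of illustration only, a search for sequences in common among the upstream regions of co-expressed genes might be sketched as follows. The sequences, gene labels, and function names are made up for this example and are not taken from FIG. 11B. The sketch compares fixed-length windows and optionally tolerates a few mismatched bases; it is not a statistical motif model such as practical tools generally use.

```python
# Illustrative sketch only: sequences and function names are hypothetical.

def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def occurs(motif: str, seq: str, max_mismatches: int) -> bool:
    """True if some window of seq matches motif within max_mismatches."""
    k = len(motif)
    return any(
        hamming(motif, seq[i:i + k]) <= max_mismatches
        for i in range(len(seq) - k + 1)
    )


def shared_motifs(seqs, k, max_mismatches=0):
    """Return, sorted, every k-mer drawn from any sequence that occurs in all of them."""
    candidates = {s[i:i + k] for s in seqs for i in range(len(s) - k + 1)}
    return sorted(
        m for m in candidates
        if all(occurs(m, s, max_mismatches) for s in seqs)
    )


# Made-up upstream regions for three genes expressed under the same conditions.
GENE_1 = "GCGTATAAAGGC"
GENE_2 = "CCATATAAACTG"
GENE_3 = "TGTATAAATCCA"
GENE_3_VARIANT = "TGTATATATCCA"  # the same site with one base changed

# Exact matching on the three original regions.
print(shared_motifs([GENE_1, GENE_2, GENE_3], k=6))
# Exact matching fails once one region carries a single-base variant.
print(shared_motifs([GENE_1, GENE_2, GENE_3_VARIANT], k=6))
# Allowing one mismatch recovers the site, along with any other near matches.
print(shared_motifs([GENE_1, GENE_2, GENE_3_VARIANT], k=6, max_mismatches=1))
```

On these made-up regions, exact matching finds the single shared 6-mer TATAAA, and finds nothing once one region carries a one-base variant. Allowing one mismatch recovers TATAAA but may also admit additional candidates, which reflects the point above that a common sequence need not be unique.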
Deducing a transcription factor from a regulatory sequence is probably an even more difficult problem than finding a consensus sequence. Even if you have the crystal structure of the transcription factor you are looking for (obtaining one is a costly and time-consuming endeavor which isn't performed for all proteins), the DNA sequence that it binds to is difficult to determine. This is like trying to tell what a key looks like by looking at a lock: while it is possible, it is still very difficult.
Current experimental methods use a technique known as footprint analysis to determine the interaction between transcription factors and particular DNA sequences (see Ausubel, et al. (eds.) Short Protocols in Molecular Biology (John Wiley & Sons 1999)). This technique is also slow and tedious, and generally focuses on only one factor at a time. It also requires significantly narrowing the sequence a priori and is therefore more useful as a hypothesis confirmation method than as a discovery method.
Microarray analyzers are now commercially available, are covered by many existing patents, and are described in the scientific literature (for example, Shalon et al. 1996 Genome Res. 6:639-45; Schena et al. 1995 Science 270:467-470; and U.S. Pat. No. 5,143,854). DNA-protein binding detection assays using radioactive and fluorescent labeling are available as common molecular biology techniques, and are also widely described in the literature.
Practitioners of the instant invention will, generally, have an advanced degree in one or more disciplines including, but not limited to: molecular biology, biochemistry, biomathematics, bioinformatics, biostatistics, genetics, genetic engineering, computer science, etc. Alternatively, a practitioner will have an advanced degree in a science or engineering discipline related to the design, construction, or utilization of, or interpreting results from, systems or devices (or programming of the control software further comprising such devices or systems) utilized in such disciplines, including but not limited to: gene sequencers, microarray analyzers, and the like.
It is within the ken of those skilled in the art to design, construct, utilize, or interpret results from systems or devices (or to program the control software further comprising such devices or systems) utilized in practicing the instant invention.
Rather than repeat details of that which is known in the art, the instant application will focus on the novel arrangements and combinations of known processes and devices, and the wholly novel processes and devices, employed to practice the instant invention.