Gene expression microarrays are becoming increasingly important devices because they allow researchers to test large quantities of genetic material. In particular, they allow researchers to analyze cells, and the genetic information of those cells, to determine whether the cells belong to particular phenotypes. A phenotype is the set of observable characteristics of an organism, as determined by its genes and the interaction of those genes with the environment. Researchers are currently examining phenotypes such as cancer, diabetes and other diseases, and gene expression microarrays are increasing in importance in this research.
There are different gene expression microarray technologies. In typical implementations, gene expression microarrays contain thousands of DeoxyriboNucleic Acid (DNA) molecules that represent many genes. These DNA molecules are placed on discrete spots on the microarray. Each of these DNA molecules may be thought of as part of an “unzipped” piece of a gene that is waiting for a complement to which it will be “zipped.” The DNA molecules attached to the microarrays are commonly called probes.
The complements, also called targets, generally come from messenger RiboNucleic Acid (mRNA) molecules, which are basically the working copies of genes within cells. When testing a cell from an organism, the mRNA from a particular sample is purified and a marker is attached to it. The mRNA is added to the gene expression microarray and the mRNA “hybridizes” with the DNA to which it is a complement. Thus, some of the discrete spots will contain mRNA hybridized to DNA, and other spots will not contain mRNA hybridized to DNA. For clarity, “targets” will be referred to as mRNA herein. In some technologies, however, the cellular mRNA is reverse-transcribed into complementary DNA (cDNA) molecules, which are complementary copies of the otherwise fragile mRNA. The cDNA is linearly amplified and is subsequently used to hybridize with the probes.
The marker attached to the mRNA is used to determine which spots contain mRNA hybridized to DNA. Usually, the markers are fluorescent molecules that fluoresce when a laser light of an appropriate frequency and power shines on them. This fluorescence can be measured.
The fluorescence is a measure of how much a gene “expresses” itself. If there is a high fluorescence for a particular gene, this means that the gene is very active. Conversely, if there is low or no fluorescence for a gene, this means that the gene is inactive. Thus, by examining the fluorescence of the microarray, researchers can determine the degree of activity of the different genes.
This method is advantageous in that it is possible to determine gene function for tissue affected by a disease as compared to tissue not affected by such disease. By comparing the two phenotypes, researchers can in principle determine which genes contribute to certain diseases and how they contribute.
When making these determinations, it is helpful to examine genes from diverse groups of both people who have a disease and people who do not have this disease (hereafter called “healthy,” even though they may be affected by other conditions). Because people are different, there will be differences in the expression level of genes between subjects in the group. These differences occur in both healthy and sick individuals, and they will be apparent during microarray analysis of samples from the various people in the group. For instance, one person could have an enzymatic deficiency (unrelated to the disease under study) that causes a set of genes to be expressed at a lower level. Another person may not have such a deficiency, and his or her corresponding genes are expressed at a much higher level. Therefore, even for healthy people, there is a variation in gene expression.
Because of these variations, which are further compounded by errors in the experimental measurements, many microarray samples are usually taken and analyzed. The resulting microarray data can be used in statistical analyses with the aim of comparing sick cells and their genes with healthy cells and their genes. From these analyses, researchers attempt to determine which genes actually relate to the disease.
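One simple form such a statistical comparison could take is sketched below for a single gene. The sketch uses Welch's t statistic, which is chosen here purely for illustration and is not a method named in this disclosure; the function name welch_t and the fluorescence values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sick, healthy):
    """Welch's t statistic for one gene measured in two groups of samples.

    A large absolute value suggests that the difference between the group
    means is large relative to the sample-to-sample variation.
    """
    # Standard error of the difference between the two group means,
    # using the sample variance of each group.
    standard_error = sqrt(
        variance(sick) / len(sick) + variance(healthy) / len(healthy)
    )
    return (mean(sick) - mean(healthy)) / standard_error


# Hypothetical fluorescence readings for one gene (arbitrary units).
sick_samples = [410.0, 390.0, 430.0, 400.0]
healthy_samples = [210.0, 250.0, 190.0, 230.0]

t_statistic = welch_t(sick_samples, healthy_samples)
print(f"t statistic: {t_statistic:.2f}")
```

With only a few samples per group, as in this toy data, such a statistic is unstable, which is one reason many microarray samples are usually collected.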
One method of determining this is to look for patterns in the data. For example, perhaps one particular gene is turned on in a sick cell, while another gene is turned off. This “pattern” can be detected because the expression of one gene will be high, while the expression of the other gene will be low. Moreover, as previously discussed, researchers generally compare the expression levels of genes from the unhealthy phenotype with the expression levels of genes from the healthy phenotype. This comparison is helpful because the healthy cells provide a baseline. For instance, perhaps a certain gene is almost always turned on in normal cells. Even though a cell exhibiting an unhealthy phenotype might also have this gene turned on, because normal cells also have this gene turned on, it is likely that this gene does not relate to the disease being researched. However, if the unhealthy cell has this gene turned off and healthy cells generally have this gene turned on, then it could be that this gene does relate to the disease being researched.
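The baseline comparison described above can be sketched as follows. This is a minimal illustration only: the threshold separating “on” from “off,” the minimum gap between groups, the gene names and the fluorescence values are all assumptions made for the example and do not come from this disclosure.

```python
def fraction_on(values, threshold):
    """Fraction of samples in which the fluorescence exceeds the threshold."""
    return sum(v > threshold for v in values) / len(values)


def candidate_genes(healthy, unhealthy, threshold=100.0, min_gap=0.5):
    """Flag genes whose on/off frequency differs between the two groups.

    `healthy` and `unhealthy` map a gene name to a list of fluorescence
    readings, one per sample. The returned dict maps each flagged gene to
    (fraction on in unhealthy) - (fraction on in healthy).
    """
    flagged = {}
    for gene, healthy_values in healthy.items():
        gap = (fraction_on(unhealthy[gene], threshold)
               - fraction_on(healthy_values, threshold))
        if abs(gap) >= min_gap:
            flagged[gene] = gap
    return flagged


# Hypothetical data: gene_a is on in both groups, so it is not flagged;
# gene_b is usually on in healthy samples but off in unhealthy samples.
healthy = {
    "gene_a": [250.0, 310.0, 280.0, 295.0],
    "gene_b": [240.0, 260.0, 30.0, 270.0],
}
unhealthy = {
    "gene_a": [265.0, 300.0, 290.0, 275.0],
    "gene_b": [20.0, 35.0, 15.0, 40.0],
}

result = candidate_genes(healthy, unhealthy)
print(result)
```

A gene that is on in both groups, such as gene_a here, yields a gap near zero and is not flagged, which mirrors the reasoning that such a gene is unlikely to relate to the disease.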
There are various ways of analyzing gene expression data. For a recent review of these methods, see “Genetic Network Inference: From Co-expression Clustering to Reverse Engineering,” Bioinformatics, August 2000, 16(8), 707-26, the disclosure of which is incorporated by reference herein. Most of these methods analyze the fluorescent outputs for a number of samples of a gene to determine an “average” fluorescence for the gene. These average values are indicative of the expression level of the gene. With a number of different genes, a “pattern” of these expression level averages can be made. Typically, however, the fluorescence of a given gene across several experiments can vary tremendously. In this situation, the determined average is meaningless. Some researchers attempt to alleviate this problem by determining the average fluorescence value for a particular gene, evaluating the standard deviation of the fluorescence values across the samples for the gene and, from this data, determining a normalized distribution in which the standard deviation is one. The limitation of this method is that it treats each gene independently, neglecting the natural correlations between genes.
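The per-gene normalization described above, and the information it leaves out, can be sketched as follows. The sketch assumes that the normalization also subtracts the mean and that the population standard deviation is used; neither detail is specified above. The Pearson correlation is included only to illustrate the kind of gene-gene relationship that a gene-by-gene normalization does not capture; it is not presented as the method of this disclosure. All names and values are hypothetical.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale one gene's readings to zero mean and unit standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        # A gene with constant readings carries no variation to rescale.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def pearson(xs, ys):
    """Pearson correlation of two genes measured on the same samples."""
    zx, zy = standardize(xs), standardize(ys)
    return mean([a * b for a, b in zip(zx, zy)])


# Hypothetical fluorescence readings for two genes across four samples.
gene_x = [100.0, 200.0, 300.0, 400.0]
gene_y = [1000.0, 2100.0, 2900.0, 4000.0]

z_x = standardize(gene_x)
z_y = standardize(gene_y)
r = pearson(gene_x, gene_y)
print(f"correlation between gene_x and gene_y: {r:.3f}")
```

Each gene is standardized using only its own readings, so the strong co-variation of gene_x and gene_y across samples becomes visible only when the two genes are examined jointly, as the correlation does here.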
Thus, what is needed is a way of comparing expressions from samples being examined with expression levels from control samples and a better way of detecting gene-gene correlations when searching for patterns.