Cancer, the abnormal and uncontrolled division of cells causing benign or malignant growth, is the second leading cause of death in the U.S., exceeded only by heart disease. Early detection, classification, and prognosis of cancer remain the best way to ensure patient survival and quality of life.
Many current techniques to predict and classify cancer use deoxyribonucleic acid (DNA) microarray data, and, in particular, gene expression data, which allow large quantities of genetic material to be tested. Generally, using the data, cells can be analyzed, genetic information can be extracted, and phenotypes can be determined, where a phenotype is the set of characteristics of an organism as determined by its genes and the relationship of the genes to the environment.
Typically, DNA molecules are attached as probes at discrete spots on a microarray. Each of these DNA molecules may have a complement, generally referred to as a target or messenger ribonucleic acid (mRNA), which is, generally speaking, the working copy of genes within cells. When testing a cell from an organism, the mRNA is added to the microarray, where the mRNA “hybridizes” with the DNA to which it is a complement. Thus, some of the discrete spots will contain mRNA hybridized to DNA, while other spots will not. In some technologies, the cellular mRNA is reverse-transcribed into complementary DNA (cDNA), which is a complementary copy of the mRNA. The cDNA is linearly amplified and is subsequently used to hybridize with the probes.
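The complementarity that drives hybridization can be illustrated with a minimal sketch. It assumes perfect, full-length base pairing between a DNA probe and an mRNA target, which is a simplification: real hybridization tolerates some mismatches and depends on conditions such as temperature. The function names, probe sequences, and spot labels are hypothetical and chosen only for illustration.

```python
# Base pairing between a DNA probe base and the mRNA base it binds.
DNA_TO_RNA_PAIR = {"A": "U", "T": "A", "G": "C", "C": "G"}


def complementary_mrna(probe_dna):
    """Return the mRNA (written 5'->3') that pairs perfectly with a DNA probe
    (also written 5'->3'). Strands pair antiparallel, hence the reversal."""
    return "".join(DNA_TO_RNA_PAIR[base] for base in reversed(probe_dna))


def hybridizes(probe_dna, target_mrna):
    """Simplifying assumption: a target binds only if it is an exact complement."""
    return target_mrna == complementary_mrna(probe_dna)


# Hypothetical microarray: each spot holds one probe sequence.
probes = {"spot_1": "ATGC", "spot_2": "GGTA", "spot_3": "TTCA"}

# Hypothetical mRNA molecules extracted from a cell.
sample = ["GCAU", "UGAA"]

# A spot is "occupied" if any mRNA in the sample hybridizes to its probe.
occupied = {
    spot: any(hybridizes(seq, mrna) for mrna in sample)
    for spot, seq in probes.items()
}

if __name__ == "__main__":
    for spot, hit in occupied.items():
        print(spot, "hybridized" if hit else "empty")
```

In this toy sample, only some spots end up holding hybridized mRNA, which mirrors the situation described above.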
A marker attached to the mRNA is used to determine which spots contain mRNA hybridized to DNA. Typically, the markers are fluorescent molecules that fluoresce when exposed to laser light of an appropriate frequency and power. This fluorescence can be measured to determine the degree to which a gene is expressed. For example, high fluorescence detected for a particular gene indicates that the gene is very active. Conversely, low or no fluorescence for a gene indicates that the gene is inactive. Thus, by examining the fluorescence of the microarray, the degree of activity of the different genes can be determined.
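As a rough sketch of how a fluorescence reading might be turned into a statement about gene activity, the following subtracts a local background reading from each spot, takes a base-2 logarithm, and bins the result. Background subtraction and log transformation are common practice but are not described in the text above, so they are assumptions here. The thresholds, gene labels, and intensity values are arbitrary and purely illustrative.

```python
import math


def expression_level(foreground, background, floor=1.0):
    """Background-corrected log2 intensity for one spot.
    Values at or below background are clamped to `floor` (assumed convention)."""
    corrected = max(foreground - background, floor)
    return math.log2(corrected)


def classify_activity(log2_intensity, low=6.0, high=10.0):
    """Bin a log2 intensity; the cut-offs are arbitrary illustrative values."""
    if log2_intensity >= high:
        return "high"
    if log2_intensity <= low:
        return "low or none"
    return "intermediate"


# Hypothetical (foreground, background) fluorescence readings per gene.
spots = {
    "GENE_A": (5200.0, 150.0),
    "GENE_B": (180.0, 160.0),
    "GENE_C": (700.0, 140.0),
}

activity = {
    gene: classify_activity(expression_level(fg, bg))
    for gene, (fg, bg) in spots.items()
}

if __name__ == "__main__":
    for gene, label in activity.items():
        print(gene, label)
```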
Using current techniques, it is possible to determine gene function for tissues affected and unaffected by a disease. By comparing the two phenotypes, it can be determined which genes contribute to certain diseases and how they contribute. However, current techniques are limited. In particular, the detection and classification of gene expression data are based on only a few observations. In many cases, only a few underlying gene components account for much of the data variation. For example, only a few linear combinations of a subset of genes may account for nearly all of the response variation. Unfortunately, it is exceedingly difficult to determine which genes are members of the subset given the number of genes in the microarray and the small number of observations. As such, given the number of genes on a single microarray (usually in the thousands), a relatively small number of observations cannot provide accurate statistical data.
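The idea that a few linear combinations of genes can account for most of the variation can be illustrated with a small simulation. The sketch below is not a method described above: it builds synthetic data in which a single hidden factor drives a small subset of genes, then estimates the share of total variance captured by the first principal component. Because there are far fewer observations than genes, it works with the small sample-by-sample Gram matrix and uses power iteration. All names, sizes, and noise levels are assumptions made for illustration.

```python
import random


def simulate(n_samples=6, n_genes=200, n_active=10, noise=0.05, seed=0):
    """Synthetic expression matrix: rows are observations, columns are genes.
    Only the first n_active genes respond to one hidden factor."""
    rng = random.Random(seed)
    factor = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    loadings = [rng.uniform(1.0, 2.0) if j < n_active else 0.0
                for j in range(n_genes)]
    return [[factor[i] * loadings[j] + rng.gauss(0.0, noise)
             for j in range(n_genes)] for i in range(n_samples)]


def center_columns(X):
    """Subtract each gene's mean across observations."""
    n, p = len(X), len(X[0])
    means = [sum(X[i][j] for i in range(n)) / n for j in range(p)]
    return [[X[i][j] - means[j] for j in range(p)] for i in range(n)]


def gram_matrix(Xc):
    """n x n matrix of inner products between observations."""
    n = len(Xc)
    return [[sum(a * b for a, b in zip(Xc[i], Xc[k])) for k in range(n)]
            for i in range(n)]


def top_eigenvalue(G, iters=200):
    """Largest eigenvalue of a symmetric positive semidefinite matrix,
    estimated by power iteration."""
    n = len(G)
    # Non-constant start: the all-ones vector lies in the null space of the
    # Gram matrix of column-centered data.
    v = [1.0 + 0.1 * i for i in range(n)]
    lam = 0.0
    for _ in range(iters):
        w = [sum(G[i][k] * v[k] for k in range(n)) for i in range(n)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0.0:
            return 0.0
        v = [x / norm for x in w]
        lam = sum(v[i] * sum(G[i][k] * v[k] for k in range(n))
                  for i in range(n))
    return lam


def first_component_share(X):
    """Fraction of total variance captured by the first principal component."""
    G = gram_matrix(center_columns(X))
    total = sum(G[i][i] for i in range(len(G)))
    return top_eigenvalue(G) / total if total > 0 else 0.0


if __name__ == "__main__":
    X = simulate()
    print("share of variance in first component:", first_component_share(X))
```

The simulation also hints at the limitation described above: after centering, data from n observations can have at most n - 1 components with nonzero variance, however many genes are measured, so a handful of observations constrains what can be estimated about thousands of genes.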