DNA sequence data has been rapidly accruing. Still, completion of the Human Genome Project and other genome sequencing projects has shown that DNA sequence data provides only a partial picture of gene function. What is missing is a full understanding of the factors that trigger gene expression, the temporal profile of gene expression, and how particular genes may interact with each other.
Recently, DNA microarrays have become a valuable tool for exploring gene expression. DNA microarrays may be formed using fragments of genomic DNA, DNA pools generated by cloning or other amplification and/or selection techniques, or even short stretches of DNA known as oligonucleotides.
A single microarray chip may yield expression profiles for thousands of genes. Still, the data provided by such arrays, while fairly straightforward to generate, may present significant challenges with respect to analysis. Often, for this type of research to be most productive, thousands of data points need to be directly compared in a single experiment. For example, in some cases it may be important to compare gene expression profiles over time, thus multiplying the number of data points generated from a single array by the number of time points measured.
Also, for many studies, the goal is to determine the cause-and-effect relationship by which particular genes are expressed. For example, expression of one gene may inhibit, enhance, or have no effect on the expression of a second gene. Alternatively, expression of one gene may influence another gene, but there may be a lag period. Also, the expression profile for a particular gene may be modified by changes in the environment, such as physical changes in the cell, or as a result of chemical signals inducing or inhibiting expression. Extracting this type of information from array data can be a challenging task, especially when the identity and function of most of the genes under study are unknown.
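The lag period described above can be illustrated with a lagged correlation between two expression time series. The following is a minimal sketch only, not a method taken from this document: the function name `lagged_correlation`, the profiles `gene_a` and `gene_b`, and the two-time-point delay are all hypothetical, and a high correlation at some lag does not by itself establish cause and effect.

```python
import numpy as np

def lagged_correlation(x, y, max_lag):
    """Pearson correlation between x[t] and y[t + lag] for lag = 0..max_lag."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    result = {}
    for lag in range(max_lag + 1):
        a = x[: len(x) - lag]   # earlier values of the first gene
        b = y[lag:]             # later values of the second gene
        result[lag] = float(np.corrcoef(a, b)[0, 1])
    return result

# Hypothetical expression levels at ten time points (illustrative values only).
gene_a = np.array([0.0, 1.0, 3.0, 2.0, 0.5, 0.0, 1.5, 4.0, 2.5, 1.0])
# Assumption: gene_b repeats gene_a's profile two time points later.
gene_b = np.concatenate(([0.0, 0.0], gene_a[:-2]))

corr = lagged_correlation(gene_a, gene_b, max_lag=4)
best_lag = max(corr, key=corr.get)  # lag with the highest correlation
```

In this constructed example the correlation peaks at the lag that was built into the data. With real array data the peak would generally be weaker and would need to be assessed against noise.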
To date, several methods have been developed to analyze array data, including K-Means, principal component analysis (PCA), and self-organizing maps (SOM). None of the techniques currently being used is completely optimized for this type of analysis, however.
A main objective of microarray data analysis is to identify “independent” clusters of genes, such that the genes belonging to a given cluster have similar expression patterns and may be involved in, or required for, a specific physiological response. For example, it is expected that there may be one subset of genes required for cholesterol metabolism, a second subset involved in the immune response, and yet another subset of genes involved in the development of cancer. It would be of interest to identify these pools of genes to develop a better understanding of these processes and to identify targets for potential therapeutic agents.
In most existing microarray processing technologies, the process of selecting the number of gene groups that describes the number of independent pathways is left up to the user. Incorrectly selecting the number of independent groups can skew the analysis such that vastly different results are generated depending upon how many independent gene groups are assumed.
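The sensitivity to the assumed number of groups can be seen in a small K-Means example. The following is a rough sketch under stated assumptions, not an implementation from this document: the `kmeans` helper is a plain Lloyd's iteration with a deterministic farthest-point initialization chosen for reproducibility, and the three expression patterns and the noise level are made up for illustration.

```python
import numpy as np

def kmeans(data, k, n_iter=50):
    """Plain Lloyd's k-means with deterministic farthest-point initialization."""
    data = np.asarray(data, dtype=float)
    # Start from the first profile, then repeatedly add the profile that is
    # farthest from all centers chosen so far.
    centers = [data[0]]
    for _ in range(1, k):
        d = np.min([np.sum((data - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(data[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(n_iter):
        # Assign each profile to its nearest center.
        dist = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        # Recompute each center; keep the old one if a cluster is empty.
        new_centers = np.array([
            data[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels

# Hypothetical data: three underlying patterns over four time points,
# five noisy gene profiles per pattern (15 profiles in total).
rng = np.random.default_rng(0)
patterns = np.array([[0, 1, 2, 3], [3, 2, 1, 0], [0, 3, 0, 3]], dtype=float)
profiles = np.vstack([p + 0.1 * rng.standard_normal((5, 4)) for p in patterns])

labels_k2 = kmeans(profiles, 2)  # user assumes two groups
labels_k3 = kmeans(profiles, 3)  # user assumes three groups
```

With three groups assumed, each underlying pattern gets its own cluster. With two groups assumed, the first and third patterns are merged into a single cluster, so the apparent grouping of genes depends directly on the number the user supplies.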
Thus, what are needed are methods and systems to analyze genomic expression data in an effective manner. Such systems and methods preferably will comprise computerized statistical techniques. In this way, the data may be analyzed in a way that provides meaningful results. Also, what is needed is a way to describe the interrelationship between genes in a group.