The present invention relates to the field of computer systems. Specifically, the present invention relates to computer systems for the analysis and manipulation of gene expression data. Advances in the genomics area, specifically in the development of the microarray (Schena et al., Science 270: 467-470 (1995)) and GeneChip® (Lockhart et al., Nature Biotech. 14: 1675-1680 (1996)) technologies, require new bioinformatics tools for the manipulation, analysis and processing of gene expression data. Many disease states and related conditions are characterized by differences in the expression levels of various genes. These differences may occur through changes in the copy number of DNA or through changes in levels of transcription of the genes. Indeed, the control of the cell cycle and cell development, as well as diseases, may be characterized by variation in the transcription levels of genes.
Of particular interest to those in the bioinformatics area are systems for identifying the biological functions of genes based on their temporal pattern of expression. One system, known as clustering analysis, clusters genes according to the shape similarity of their temporal pattern of expression, with clusters related to specific biological functions. This approach has been applied to identify genes in the yeast genome that are involved in a metabolic shift (DeRisi et al., Science 278: 680-686 (1997)), and genes involved in central nervous system development in rats (Wen et al., Proc. Natl. Acad. Sci. USA 95: 334-339 (1998)). A second approach is reverse engineering, which assumes that the genes dynamically interact with one another as a genetic network (Liang et al., Proceedings of the Pacific Symposium on Biocomputing, Maui, Hi., 1998). The reverse engineering approach has the potential to decipher systematically the complex circuitry of the genetic network from the temporal gene expression pattern.
While such clustering analysis and reverse engineering systems are useful, it is desirable to have available a general and flexible system for the visualization, manipulation, and analysis of gene expression data. Such a system preferably includes a graphical user interface for browsing and navigating through the expression data, allowing a user to selectively view and highlight the genes of interest. The system also preferably includes sort and search functions and is preferably available for general users with PC, Mac or Unix workstations. Also preferably included in the system are clustering algorithms that are qualitatively more efficient than existing ones. The accuracy of such algorithms is preferably hierarchically adjustable so that the level of detail of clustering can be systematically refined as desired.
A preferred algorithm for such a system is a clustering algorithm for, e.g., identifying functionally related genes with different time curves. In particular, the clustering algorithm may be used for clustering genes whose functional correlation involves a scale change, a time delay, a vertical flip or any combination of the three. The system preferably also includes a time-curve representation that is both literal and numerical. Literal representations assist in making SQL (Structured Query Language) type database queries. Numerical representations assist in allowing for the arithmetical transformation of curves. Such transformations are useful in differentiating tissue and disease specificity of gene expression. In addition, clustering algorithms and mathematical calculations preferably are tightly integrated with a graphical user presentation interface. Finally, graphics preferably are included to assist in navigation and analysis of the expression data in an intuitive, interactive, and iterative fashion.
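As an illustration only, one way a similarity measure could tolerate a scale change, a time delay, and a vertical flip is sketched below. It is not the clustering algorithm of the present invention. It takes the largest absolute Pearson correlation between two time curves over a small range of time shifts: Pearson correlation is unaffected by a positive scale change, the absolute value makes a vertical flip count as a match, and the search over shifts accommodates a time delay. The class name CurveSimilarity, the method names, the maxLag parameter, and the three-point minimum overlap are all assumptions made for this sketch.

```java
/**
 * Hypothetical helper for comparing two gene expression time curves.
 * Illustrative sketch only; not the algorithm described in the text.
 */
final class CurveSimilarity {

    /**
     * Pearson correlation of x[xs .. xs+n) against y[ys .. ys+n).
     * Returns 0 when either window is constant (correlation undefined).
     */
    static double pearson(double[] x, int xs, double[] y, int ys, int n) {
        double meanX = 0, meanY = 0;
        for (int i = 0; i < n; i++) {
            meanX += x[xs + i];
            meanY += y[ys + i];
        }
        meanX /= n;
        meanY /= n;

        double sxy = 0, sxx = 0, syy = 0;
        for (int i = 0; i < n; i++) {
            double dx = x[xs + i] - meanX;
            double dy = y[ys + i] - meanY;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        if (sxx == 0 || syy == 0) {
            return 0;
        }
        return sxy / Math.sqrt(sxx * syy);
    }

    /**
     * Largest |Pearson correlation| over time shifts in [-maxLag, maxLag].
     * A positive lag compares a[lag + i] with b[i]; a negative lag
     * compares a[i] with b[-lag + i]. Shifts leaving fewer than three
     * overlapping time points are skipped (an assumed cutoff).
     */
    static double similarity(double[] a, double[] b, int maxLag) {
        double best = 0;
        for (int lag = -maxLag; lag <= maxLag; lag++) {
            int aStart = Math.max(0, lag);
            int bStart = Math.max(0, -lag);
            int n = Math.min(a.length - aStart, b.length - bStart);
            if (n < 3) {
                continue;
            }
            double r = pearson(a, aStart, b, bStart, n);
            best = Math.max(best, Math.abs(r));
        }
        return best;
    }

    public static void main(String[] args) {
        double[] geneA = {0, 1, 4, 9, 4, 1, 0, 0};
        // geneA delayed by one time point, doubled in scale, and flipped.
        double[] geneB = {0, 0, -2, -8, -18, -8, -2, 0};
        System.out.println("no shift allowed: " + similarity(geneA, geneB, 0));
        System.out.println("shift up to 2:    " + similarity(geneA, geneB, 2));
    }
}
```

A score near 1 under this measure suggests that two curves differ only by some combination of the three transformations, so 1 minus the score could serve as a distance in a standard clustering procedure. Searching more shifts makes spurious matches more likely on short time series, so maxLag would need to be chosen with the sampling density in mind.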
Indeed, there is a need for improved computer-aided techniques for the analysis and manipulation of gene expression data. The present invention reflects the preceding attributes and relates to systems and computer programs used for the analysis and manipulation of gene expression data. In a specific embodiment, the systems of the present invention comprise two new clustering algorithms, a presentation interface, and a set of graphical display tools. The system is preferably written in the Java™ programming language (e.g., 100% JDK 1.1, Sun Microsystems, Inc., Palo Alto, Calif.), and is thus platform independent.