Within the past decade, several technologies have made it possible to monitor the expression levels of large numbers of transcripts at any one time (see, e.g., Schena et al., 1995, Science 270:467-470; Lockhart et al., 1996, Nature Biotechnology 14:1675-1680; Blanchard et al., 1996, Nature Biotechnology 14:1649; Ashby et al., U.S. Pat. No. 5,569,588, issued Oct. 29, 1996). In organisms for which the complete genome is known, it is possible to analyze the transcripts for all genes within the cell. Even with other organisms, including mammalian organisms such as humans, for which there is an increasing knowledge of the genome, it is possible to simultaneously monitor large numbers of the genes within a cell.
Such monitoring technologies have been applied, for example, to identify genes that are up-regulated or down-regulated in various diseased or physiological states, to analyze members of signaling cellular states, and to identify targets for various drugs. See, e.g., Friend and Hartwell, U.S. Provisional Patent Application Ser. No. 60/039,134, filed on Feb. 28, 1997; Stoughton, U.S. Pat. No. 6,132,969; Stoughton and Friend, U.S. Pat. No. 5,965,352; Friend and Hartwell, U.S. Provisional Application Ser. No. 60/056,109, filed on Aug. 20, 1997; Friend and Hartwell, U.S. Pat. No. 6,165,709; Friend and Stoughton, U.S. Provisional Application Ser. Nos. 60/084,742 (filed on May 8, 1998), 60/090,004 (filed on Jun. 19, 1998) and 60/090,046 (filed on Jun. 19, 1998).
Other methods have been described in the art for analyzing the large numbers of biological responses that can be measured using current array technology. In particular, methods are known in the art for “clustering” cellular constituents, such as gene transcripts (i.e., mRNAs) and gene products, according to their response to different “perturbations” (see, for example, Michaels et al., 1998, Pac. Symp. Biocomput.:42-53; Wen et al., 1998, Proc. Natl. Acad. Sci. U.S.A. 95:334-339; DeRisi et al., 1997, Science 278:680-686; Bryant et al., 1998, Pacific Symposium on Biocomputing 3:3-5; Carr et al., 1997, Statistical Computing & Statistical Graphics Newsletter pp. 20-29; D'haeseleer et al., 1998, “Mining the Gene Expression Matrix: Inferring Gene Relationships From Large Scale Gene Expression Data”).
Such analytical techniques include, for example, “clustering” cellular constituents according to the similarity of their responses to different perturbations, as well as clustering perturbations (e.g., genetic mutations, drug treatments, etc.) that similarly affect different cellular constituents, and/or two-dimensional clustering of both cellular constituents and perturbations (see, for example, U.S. Pat. Nos. 6,203,987, 6,801,859, 6,950,752 and 6,468,476; and PCT International Publication WO 00/24936 published May 4, 2000).
To date, most expression profiling studies have focused on particular genes that respond to certain conditions or treatments of interest. For example, Chu et al. (1998, Science 282:699-705) have shown that several previously uncharacterized genes that are induced upon yeast sporulation are required for completion of the sporulation program. However, the idea that the global transcription response itself can be used to characterize cells has also received attention (see, for example, DeRisi et al., 1997, Science 278:680-686; Gray et al., 1998, Science 281:533-538; Holstege et al., 1998, Cell 95:717-728; Marton et al., 1998, Nat. Med. 4:1293-1301; Roberts et al., 2000, Science 287:873-880). For example, tumors have been classified by their expression profiles (Perou et al., 1999, Proc. Natl. Acad. Sci. U.S.A. 96:9212-9217; Golub et al., 1999, Science 286:531-537; Alizadeh et al., 2000, Nature 403:503-511).
There remain many genes that have been fully sequenced, but that have not been fully characterized and for which there is no known biological function. For example, although the genome of the yeast Saccharomyces cerevisiae has been fully sequenced, of the 6275 open reading frames (ORFs) identified in that organism's genome, approximately one-third have no known biological function. In higher organisms, the fraction of genes with unknown biological function is much higher.
As ongoing sequencing efforts such as the Human Genome Project near completion and whole genome sequences for many organisms become known, there is an increasing need for high throughput methods for determining biological functions for such uncharacterized genes. Further, there remains a need for more robust high throughput data analysis techniques, particularly robust methods for clustering expression profiles, that can be used in such high throughput analytical methods. The methods and compositions of the present invention therefore solve these and other problems in the prior art.
Discussion or citation of a reference herein shall not be construed as an admission that such reference is prior art to the present invention.