1. FIELD OF THE INVENTION
2. BACKGROUND OF THE INVENTION
3. SUMMARY OF THE INVENTION
4. BRIEF DESCRIPTION OF THE DRAWINGS
5. DETAILED DESCRIPTION
5.1. Introduction
5.1.1. Definition of Biological State
5.1.2. Representation of Biological Responses
5.1.3. Overview of the Invention
5.2. Specific Embodiment: Defining Basis Genesets
5.2.1. Co-regulated Genes and Genesets
5.2.2. Geneset Classification by Cluster Analysis
5.2.3. Geneset Classification Based upon Mechanisms of Regulation
5.2.4. Refinement of Geneset and Geneset Definition Database
5.3. Representation of Gene Expression Profiles Based upon Basis Genesets
5.4. Application of Projected Profiles
5.4.1. Advantage of the Projected Profile
5.4.2. Profile Comparison and Classification
5.4.3. Illustrative Drug Discovery Applications
5.4.4. Illustrative Diagnostic Applications
5.4.5. Response Profile Classification by Cluster Analysis
5.4.6. Removal of Profile Artifacts
5.4.7. Projected Titration Curves
5.4.8. Use of Genesets in Microarrays
5.5. Computer Implementation
5.6. Analytic Kit Implementation
5.7. Methods for Determining Biological Response
5.7.1. Transcript Assay Using DNA Array
5.7.1.1. Preparing Nucleic Acids for Microarrays
5.7.1.2. Attaching Nucleic Acids to the Solid Surface
5.7.1.3. Target Polynucleotide Molecules
5.7.1.4. Hybridization to Microarrays
5.7.1.5. Signal Detection and Data Analysis
5.7.2. Pathway Response and Genesets
5.7.3. Measurement of Graded Perturbation Response Data
5.7.4. Other Methods of Transcriptional State Measurement
5.7.5. Measurement of other Aspects of Biological State
5.7.5.1. Embodiments Based on Translational State Measurements
5.7.5.2. Embodiments Based on other Aspects of the Biological State
5.8. Method for Probing Cellular States
5.8.1. Titratable Expression Systems
5.8.2. Transfection Systems for Mammalian Cells
5.8.3. Methods of Modifying RNA Abundances or Activities
5.8.4. Methods of Modifying Protein Abundances
5.8.5. Methods of Modifying Protein Activities
6. EXAMPLES
6.1. Example 1
Clustering Genesets by Coregulation
6.1.1. Materials and Methods
6.1.2. Results and Discussion
6.2. Example 2
Enhancing Detection of Response Pattern Using Geneset Average Response
6.3. Example 3
Improved Classification of Drug Activity
6.4. Example 4
Improved Classification of Biological Response Profiles
6.5. Example 5
Projecting out Profile Artifacts
7. REFERENCES CITED
The field of this invention relates to methods for enhanced detection of biological responses to perturbations. In particular, it relates to methods for analyzing structure in biological expression patterns for the purposes of improving the ability to detect certain specific gene regulations and to classify more accurately the actions of compounds that produce complex patterns of gene regulation in the cell.
Within the past decade, several technologies have made it possible to monitor the expression level of a large number of transcripts at any one time (see, e.g., Schena et al., 1995, Quantitative monitoring of gene expression patterns with a complementary DNA micro-array, Science 270:467-470; Lockhart et al., 1996, Expression monitoring by hybridization to high-density oligonucleotide arrays, Nature Biotechnology 14:1675-1680; Blanchard et al., 1996, Sequence to array: Probing the genome's secrets, Nature Biotechnology 14, 1649; U.S. Pat. No. 5,569,588, issued Oct. 29, 1996 to Ashby et al. entitled "Methods for Drug Screening"). In organisms for which the complete genome is known, it is possible to analyze the transcripts of all genes within the cell. With other organisms, such as humans, for which there is an increasing knowledge of the genome, it is possible to simultaneously monitor large numbers of the genes within the cell.
Such monitoring technologies have been applied to the identification of genes that are up-regulated or down-regulated in various diseased or physiological states, the analyses of members of signaling cellular states, and the identification of targets for various drugs. See, e.g., Friend and Hartwell, U.S. Provisional Patent Application Serial No. 60/039,134, filed on Feb. 28, 1997; Stoughton, U.S. patent application Ser. No. 09/099,722, filed on Jun. 19, 1998, now U.S. Pat. No. 6,132,969; Stoughton and Friend, U.S. patent application Ser. No. 09/074,983, filed on May 8, 1998, now U.S. Pat. No. 5,965,352; Friend and Hartwell, U.S. Provisional Application Serial No. 60/056,109, filed on Aug. 20, 1997; Friend and Hartwell, U.S. application Ser. No. 09/031,216, filed on Feb. 26, 1998, now U.S. Pat. No. 6,165,709; Friend and Stoughton, U.S. Provisional Application Serial Nos. 60/084,742 (filed on May 8, 1998), No. 60/090,004 (filed on Jun. 19, 1998) and No. 60/090,046 (filed on Jun. 19, 1998), all incorporated herein by reference for all purposes.
Levels of various constituents of a cell are known to change in response to drug treatments and other perturbations of the cell's biological state. Measurements of a plurality of such "cellular constituents" therefore contain a wealth of information about perturbations and their effect on the cell's biological state. Such measurements typically comprise measurements of gene expression levels of the type discussed above, but may also include levels of other cellular components such as, but by no means limited to, levels of protein abundances, or protein activity levels. The collection of such measurements is generally referred to as the "profile" of the cell's biological state.
The number of cellular constituents is typically on the order of a hundred thousand for mammalian cells. The profile of a particular cell is therefore typically of high complexity. Any one perturbing agent may cause a small or a large number of cellular constituents to change their abundances or activity levels. Without knowing what to expect in response to a given perturbation, the responses of all of these approximately 10^5 constituents must be measured independently if the action of the perturbation is to be completely or at least mostly characterized. The complexity of the biological response data, coupled with measurement errors, makes such an analysis a challenging task.
Current techniques for quantifying profile changes suffer from high rates of measurement error such as false detection, failures to detect, or inaccurate quantitative determinations. Therefore, there is a great demand in the art for methods to enhance the detection of structure in biological expression patterns. In particular, there is a need to find groups and structure in sets of measurements of cellular constituents, e.g., in the profile of a cell's biological state. Examples of such structure include associations between the regulation of the expression levels of different genes, associations between different drugs or drug candidates, and associations between drugs and the regulation of sets of genes.
Discussion or citation of a reference herein shall not be construed as an admission that such reference is prior art to the present invention.
This invention provides methods for enhancing detection of structures in the response of biological systems to various perturbations, such as the response to a drug, a drug candidate or an experimental condition designed to probe biological pathways as well as changes in biological systems that correspond to a particular disease or disease state, or to a treatment of a particular disease or disease state. The methods of this invention have extensive applications in the areas of drug discovery, drug therapy monitoring, genetic analysis, and clinical diagnosis. This invention also provides apparatus and computer instructions for performing the enhanced detection of biological response patterns, drug discovery, monitoring of drug therapies, genetic analysis, and clinical diagnosis.
One aspect of the invention provides methods for classifying cellular constituents (measurable biological variables, such as gene transcripts and protein activities) into groups based upon the co-variation among those cellular constituents. Each of the groups has cellular constituents that co-vary in response to perturbations. Those groups are termed cellular constituent sets.
In some specific embodiments, genes are grouped according to the degree of co-variation of their transcription, presumably co-regulation. Groups of genes that have co-varying transcripts are termed genesets. Cluster analysis or other statistical classification methods are used to analyze the co-variation of transcription of genes in response to a variety of perturbations. In preferred embodiments, the cluster analysis or other statistical classification methods use a novel "distance" or "similarity" metric to evaluate the similarity (i.e., the co-variance) of two or more genes (or other cellular constituents) in response to the variety of perturbations. In one specific embodiment, clustering algorithms are applied to expression profiles (e.g., a collection of transcription rates of a number of genes) obtained under a variety of cellular perturbations to construct a "similarity tree" or "clustering tree" which relates cellular constituents by the amount of co-regulation exhibited. Genesets are defined on the branches of a clustering tree by cutting across the clustering tree at different levels in the branching hierarchy. In some embodiments, the cutting level is chosen based upon the number of distinct response pathways expected for the genes measured. In some other embodiments, the tree is divided into as many branches as are truly distinct in terms of the minimal distance value between the individual branches.
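The clustering-and-cutting step described above can be illustrated with a minimal sketch. It is not the metric or algorithm of the invention: it assumes a distance of one minus the Pearson correlation, average-linkage agglomeration, and a fixed cut distance, and the names `correlation_distance`, `cluster_genesets`, and the gene labels are hypothetical.

```python
import math

def correlation_distance(x, y):
    """Assumed metric: 1 - Pearson correlation of two response vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 1.0  # a flat response is treated as uncorrelated
    return 1.0 - sxy / math.sqrt(sxx * syy)

def cluster_genesets(profiles, cut_distance):
    """Merge the closest pair of clusters (average linkage) until the
    closest pair is farther apart than cut_distance."""
    clusters = [[gene] for gene in profiles]

    def linkage(a, b):
        total = sum(correlation_distance(profiles[i], profiles[j])
                    for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        dist, i, j = min(
            (linkage(a, b), i, j)
            for i, a in enumerate(clusters)
            for j, b in enumerate(clusters)
            if i < j
        )
        if dist > cut_distance:
            break  # remaining branches are treated as distinct genesets
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]

# Hypothetical responses of four genes to four perturbations.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.1, 5.9, 8.0],
    "geneC": [4.0, 1.0, 3.0, 0.5],
    "geneD": [8.0, 2.5, 6.0, 1.0],
}
genesets = cluster_genesets(profiles, cut_distance=0.5)
print(genesets)
```

Lowering `cut_distance` corresponds to cutting the tree nearer its leaves, which yields more and smaller genesets.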
In some preferred embodiments, objective statistical tests are employed to define truly distinct branches. One exemplary embodiment of such a statistical test employs Monte Carlo randomization of the perturbation index for each gene's responses across all perturbations tested. In some preferred embodiments, the cut off level is set so that branching is significant at the 95% confidence level. In preferred embodiments, clusters with one or two genes are discarded. In some other embodiments, however, small clusters with one or two genes are included in genesets. In more detail, the preferred statistical tests of the invention comprise (a) obtaining a measure of the "compactness" of clusters (i.e., cellular constituent sets such as gene sets) determined by the above mentioned cluster analysis or other statistical techniques, and (b) comparing the thus obtained measure of compactness to a hypothetical measure of compactness of cellular constituents regrouped in an increased number of clusters. Such a comparison typically comprises determining the difference in the compactness of the two sets of clusters. Further, by employing Monte Carlo randomization of the perturbation index for each gene's responses across all perturbations tested, a statistical distribution of the difference in the compactness is thus generated. The statistical significance of the actual difference in compactness can then be determined by comparing this actual difference in compactness to the statistical distribution of the differences in compactness from the Monte Carlo randomizations.
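The Monte Carlo test outlined above can be sketched as follows. This is an illustration under stated assumptions rather than the test of the invention: compactness is taken to be the sum of squared distances to the cluster centroid, the "increased number of clusters" is a split of one cluster into two by a simple deterministic 2-means (a stand-in for re-cutting the clustering tree), and all function names and data are hypothetical.

```python
import random

def sq_dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def centroid(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]

def scatter(rows):
    """Assumed compactness measure: squared distances to the centroid."""
    if not rows:
        return 0.0
    c = centroid(rows)
    return sum(sq_dist(r, c) for r in rows)

def split_in_two(rows, iterations=20):
    """Deterministic 2-means seeded with the first row and the row
    farthest from it."""
    centers = [rows[0], max(rows, key=lambda r: sq_dist(r, rows[0]))]
    groups = [list(rows), []]
    for _ in range(iterations):
        groups = [[], []]
        for r in rows:
            nearer = 0 if sq_dist(r, centers[0]) <= sq_dist(r, centers[1]) else 1
            groups[nearer].append(r)
        # Keep the old center if a group happens to be empty.
        centers = [centroid(g) if g else c for g, c in zip(groups, centers)]
    return groups

def compactness_gain(rows):
    """Difference in compactness between one cluster and two clusters."""
    a, b = split_in_two(rows)
    return scatter(rows) - (scatter(a) + scatter(b))

def monte_carlo_p_value(rows, trials=200, seed=0):
    """Shuffle the perturbation index of each gene independently and count
    how often the randomized gain reaches the observed gain."""
    rng = random.Random(seed)
    observed = compactness_gain(rows)
    exceed = 0
    for _ in range(trials):
        shuffled = [rng.sample(r, len(r)) for r in rows]
        if compactness_gain(shuffled) >= observed:
            exceed += 1
    return (exceed + 1) / (trials + 1)

# Hypothetical data: six genes (rows) measured under six perturbations.
responses = [
    [2.0, 2.1, 1.9, 0.0, 0.1, -0.1],
    [1.8, 2.0, 2.2, 0.1, 0.0, 0.0],
    [2.1, 1.9, 2.0, -0.1, 0.0, 0.1],
    [0.0, 0.1, -0.1, 2.0, 1.9, 2.1],
    [0.1, 0.0, 0.0, 2.2, 2.0, 1.8],
    [0.0, -0.1, 0.1, 1.9, 2.1, 2.0],
]
p_value = monte_carlo_p_value(responses)
print("split significant at the 95% level:", p_value < 0.05)
```

Under these assumptions, a p-value below 0.05 corresponds to a branching that is significant at the 95% confidence level.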
As the diversity of perturbations in the clustering set becomes very large, the genesets which are clearly distinguishable get smaller and more numerous. However, it is a discovery of the inventors that even over very large experiment sets, there are a number of genesets that retain their coherence. These genesets are termed irreducible genesets. In some embodiments of the invention, a large number of diverse perturbations are applied to obtain these irreducible genesets.
Statistically derived genesets may be refined using regulatory sequence information to confirm members that are co-regulated, or to identify more tightly co-regulated subgroups. In such embodiments, genesets may be defined by their response pattern to individual biological experimental perturbations such as specific mutations, or specific growth conditions, or specific compounds. The statistically derived genesets may be further refined based upon biological understanding of gene regulation. In another preferred embodiment, classification of genes into genesets is based first upon the known regulatory mechanisms of genes. Sequence homology of regulatory regions is used to define the genesets. In some embodiments, genes with common promoter sequences are grouped into one geneset.
In preferred embodiments, the cluster analysis and statistical classification methods of this invention analyze co-variation, e.g., of transcription levels of individual genes, by means of an objective, quantitative "similarity" or "distance" function which provides a useful measurement of the similarity of expression levels for two or more cellular constituents (e.g., for two or more genes). Accordingly, the present invention provides novel similarity or distance functions that are particularly useful for analyzing the co-variation of cellular constituents, including the co-variation of gene transcript levels. The invention also provides objective statistical tests, in particular Monte Carlo procedures, for assessing the significance of the cellular constituent sets or genesets obtained by the methods of this invention. Finally, the clustering methods of this invention are equally applicable to the clustering of both cellular constituents and biological profiles according to their similarities. Thus, in another aspect, the present invention provides methods for simultaneous clustering in both dimensions of a tabular data set. In preferred embodiments, the data set is a table of numbers representing the levels, or changes in level, of a plurality of cellular constituents in response to different conditions, perturbations, or condition pairs.
Another aspect of the invention provides methods for expressing the state (or biological responses) of a biological sample on the basis of co-varying cellular constituent sets. In some embodiments, a profile containing a plurality of measurements of cellular constituents in a biological sample is converted into a projected profile containing a plurality of cellular constituent set values according to a definition of co-varying basis cellular constituent sets. In some preferred embodiments, the cellular constituent set values are the average of the cellular constituent values within a cellular constituent set. In some other embodiments, the cellular constituent set values are derived from a linear projection process. The projection operation expresses the profile on a smaller and biologically more meaningful set of coordinates, reducing the effects of measurement errors by averaging them over each cellular constituent set, and aiding biological interpretation of the profile.
The method of the invention is particularly useful for the analysis of gene expression profiles. In some embodiments, a gene expression profile, such as a collection of transcription rates of a number of genes, is converted to a projected gene expression profile. The projected gene expression profile is a collection of geneset expression values. The conversion is achieved, in some embodiments, by averaging the transcription rate of the genes within each geneset. In some other embodiments, other linear projection processes may be used.
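The averaging form of the projection can be sketched as follows. The geneset definition, the gene labels, and the function name `project_profile` are hypothetical and serve only to illustrate the conversion of a gene expression profile into geneset expression values.

```python
def project_profile(profile, genesets):
    """Replace per-gene values with one average value per geneset."""
    return {
        name: sum(profile[gene] for gene in genes) / len(genes)
        for name, genes in genesets.items()
    }

# Hypothetical measured profile (e.g., log expression ratios per gene).
profile = {"geneA": 1.0, "geneB": 3.0, "geneC": -2.0, "geneD": -4.0}

# Hypothetical basis geneset definition.
genesets = {"set1": ["geneA", "geneB"], "set2": ["geneC", "geneD"]}

projected = project_profile(profile, genesets)
print(projected)  # {'set1': 2.0, 'set2': -3.0}
```

Because each geneset value is a mean over its member genes, independent measurement errors on individual genes tend to average out in the projected profile.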
In yet another aspect of the invention, methods for comparing cellular constituent set values, particularly geneset expression values, are provided. In some embodiments, the expression of at least 10, preferably more than 100, more preferably more than 1,000 genes of a biological system is monitored. A known drug is applied to the system to generate a known drug response profile in terms of genesets. A drug candidate is also applied to the biological system to obtain a drug candidate response profile in terms of genesets. The drug candidate's response profile is then compared with the known drug response profile to determine whether the drug candidate induces a response similar to the response to a known drug.
In some other embodiments, the comparison of projected profiles is achieved by using an objective measure of similarity. In some preferred embodiments, the objective measure is the generalized angle between the vectors representing the projections of the two profiles being compared (the 'normalized dot product'). In some other embodiments, the projected profiles are analyzed by applying a threshold to the amplitude associated with each geneset for the projected profile. If the change of a geneset is above the threshold, it is declared that a change is present in the geneset.
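Both comparison approaches can be sketched briefly. The normalized dot product below is the cosine of the angle between two projected profiles; the thresholding function compares the absolute amplitude of each geneset with the threshold, which is an assumption made here so that up- and down-regulation are both detected. The function names and values are hypothetical.

```python
import math

def normalized_dot_product(p, q):
    """Cosine of the angle between two projected profiles
    (1 = same direction, 0 = orthogonal, -1 = opposite)."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    if norm == 0:
        raise ValueError("a profile with zero amplitude has no direction")
    return dot / norm

def changed_genesets(projected, threshold):
    """Names of genesets whose absolute amplitude exceeds the threshold."""
    return [name for name, value in projected.items() if abs(value) > threshold]

# Hypothetical projected profiles over the same three genesets.
known_drug = [2.0, -1.0, 0.5]
candidate = [4.0, -2.0, 1.0]   # same pattern at twice the amplitude
unrelated = [1.0, 2.0, 0.0]    # orthogonal to the known drug profile

similarity_candidate = normalized_dot_product(known_drug, candidate)
similarity_unrelated = normalized_dot_product(known_drug, unrelated)

flagged = changed_genesets({"set1": 2.0, "set2": -0.1, "set3": -1.5}, threshold=1.0)
print(similarity_candidate, similarity_unrelated, flagged)
```

Because the measure is normalized, two profiles with the same pattern but different overall amplitudes (for example, the same drug at different doses) score as similar.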
The methods of the present invention may also be used to group biological response profiles according to the similarity of the responses of measured cellular constituents. Accordingly, in alternative embodiments, the present invention provides methods for grouping biological responses (i.e., response profiles) according to the degree of similarity of the cellular constituents' responses by means of the cluster analysis or other statistical classification methods described supra for classification of cellular constituents (e.g., genes) into co-varying sets (e.g., genesets). Such methods may also be used, e.g., for enhancing detection of structures in the responses of biological systems to various perturbations. Still further, the present invention also provides "two-dimensional" methods of analyzing biological response profile data. Such methods simply comprise (1) grouping cellular constituents (e.g., genes) according to their degree of co-variation in the response profile data, and (2) grouping response profiles according to the similarity of their cellular constituents' responses.
The clustering methods of the invention are particularly useful, e.g., for identifying and/or characterizing perturbations (for example, drugs, drug candidates or genetic mutations) affecting particular cellular constituents or particular groups of cellular constituents. For example, the clustering methods can be used to identify cellular constituents (e.g., genes and proteins) and/or sets of co-varying cellular constituents such as genesets whose changes in expression or abundance are associated with a particular biological effect such as a particular disease state or the effect of one or more drugs. Further, the clustering methods of the invention are also useful, e.g., for identifying cellular constituents, such as genes or gene transcripts, involved in a particular biological response or pathway. Thus, the invention further provides methods for identifying cellular constituents, such as genes or gene transcripts, associated with a particular biological response or pathway by means of the cluster analysis methods described supra. The invention still further provides methods for identifying biological "perturbations", for example drugs, drug candidates, or genetic mutations which "perturb" a biological system, affecting particular cellular constituents or particular groups of cellular constituents by means of the cluster analysis methods described supra. The cellular constituents and perturbations identified by the methods of the invention may be known or previously unknown. Thus, the invention provides methods for identifying, e.g., novel genes and drugs or drug candidates as well as previously known genes and drugs/drug candidates which were not previously known to be associated with a particular biological effect of interest.
The methods of the present invention may also be used to remove one or more artifacts from a measured biological profile (i.e., from a measured profile comprising a plurality of measurements of cellular constituents). Thus, the invention provides methods for removing such artifacts from a measured biological profile by subtracting one or more artifact patterns from the measured biological profile, wherein each artifact pattern corresponds to a particular artifact.
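Subtraction of a single artifact pattern can be sketched as follows. The text above does not state how much of each pattern to subtract; this sketch assumes the amount is the least-squares coefficient of the profile on the pattern, so that the cleaned profile is orthogonal to the artifact. The function name and the data are hypothetical.

```python
def remove_artifact(profile, artifact):
    """Subtract the best-fitting multiple of one artifact pattern.

    Assumption: the coefficient is (profile . artifact) / (artifact . artifact).
    """
    denom = sum(a * a for a in artifact)
    if denom == 0:
        raise ValueError("artifact pattern must be non-zero")
    coeff = sum(p * a for p, a in zip(profile, artifact)) / denom
    cleaned = [p - coeff * a for p, a in zip(profile, artifact)]
    return cleaned, coeff

# Hypothetical example: a true signal plus twice an artifact pattern.
artifact = [1.0, 1.0, 0.0, 0.0]
signal = [1.0, -1.0, 3.0, 0.0]   # orthogonal to the artifact pattern
measured = [s + 2.0 * a for s, a in zip(signal, artifact)]

cleaned, coeff = remove_artifact(measured, artifact)
print(cleaned, coeff)  # [1.0, -1.0, 3.0, 0.0] 2.0
```

When several artifact patterns are present and are not mutually orthogonal, subtracting them one at a time in this way is only approximate; a joint least-squares fit over all patterns would be one alternative.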
The methods of the invention are preferably implemented with a computer system capable of executing cluster analysis and projection operations. In some embodiments, a computer system contains a computer-usable medium having computer readable program code embodied. The computer code is used to effect retrieving a definition of basis genesets from a database and converting a gene expression profile into a projected expression profile according to the retrieved definition.