1. Field of Invention
The present invention relates generally to compositions and methods of identifying groups of genes that can be used to differentiate between types of specific diseases such as breast cancer.
2. Description of Related Art
Sequential clustering method. The methods of high-throughput (HT) biology have deeply impacted disease research, particularly heterogeneous and complex conditions such as breast cancer. The rate of progress in OMICs technologies, data interpretation and applications is increasing. Over the last two to three years, a set of novel methods for functional analysis of large datasets (often collectively referred to as systems biology) has advanced into practice. After years of development “below the radar screen” of wet-lab research, data mining methods based on biological networks and pathway analysis are now being applied in disease studies. However, despite impressive progress in research, the results and methods of HT biology have not yet advanced into the clinic.
Global gene expression profiling in breast cancer. The second deadliest malignancy in Western women, breast cancer is a complex disease (or rather a collective “condition” describing a diverse set of disease states) that involves many tissues and cell subtypes. A broad range of factors is believed to contribute to clinical manifestations such as survival, response to therapy, time to relapse, and risk of metastasis. Breast cancer is diagnosed by imaging; no diagnostic DNA and/or protein markers are in use. It is now recognized that breast cancer cannot be properly classified with any single gene/protein/phenotype descriptor, and multiple gene/protein markers are needed to capture the complexity of the disease. The logistics for such multi-factorial analysis were provided by robust microarray and Serial Analysis of Gene Expression (SAGE) technologies, and a number of large-scale studies of “genome-wide” expression profiling of breast tumors have been conducted in the last 10 years. The two most clinically important issues addressed were the sub-categorization of cancers and the discovery of prognostic signatures for distant metastasis, the predominant cause of mortality due to breast cancer.
Sub-categorization of breast cancers. One of the best known studies in molecular profiling is the series of works on clustering of invasive breast cancers by T. Sørlie et al. The studies were carried out on cDNA arrays developed in Patrick Brown's lab at Stanford on large sample sets; they are well documented and directly comparable within the series. First, distinct patterns for basal and luminal epithelium, stroma and lymphocytes were identified as intrinsic “signatures” unique to tumors. In the follow-up studies, 122 tissue samples (including 115 invasive breast carcinomas and 4 normal breast samples) were tested by two-dimensional unsupervised hierarchical clustering of the “centroid” set of 540 genes. The expression patterns in tumors followed the signatures for particular cell lines: two types of luminal epithelium (A and B), basal epithelium, normal-like and ERBB+ clusters. The luminal A type was characterized by high expression of the estrogen receptor (ER) and several of its target genes: LIV-1, HNF3A, XBP1 and GATA3. The luminal B type featured lower ER expression and high expression of GGH, LAPTMB4, NCEP1 and CCNE1. The characteristic set for the basal-like type included keratins 5 and 17, annexin 8, CX3CL1 and TRIM29. The basal type was ER-negative and lacked expression of luminal A genes. Finally, the ERBB+ cluster featured high expression of some genes from the ERBB amplicon: ERBB2, GRB7 and TRAP100. The normal-like type had an expression pattern of adipose and other non-epithelial cell types. This clustering schema proved largely correct in another study on the same cDNA array with a set of 295 patients, and in independent datasets on different microarray platforms. In a more recent study, a different cluster, called “molecular apocrine”, was described: an androgen receptor (AR)-positive, ER-negative group outside the basal category, characterized by frequent ERBB2 amplification. Importantly, clinical phenotypes correlated with particular clusters.
For instance, the basal and ERBB+ subtypes were associated with the shortest patient survival times. However, despite its robustness and usefulness in research and the clinic, this “classic” sub-categorization schema is incomplete and suffers from drawbacks. First, about one-third of samples could not be assigned to any of the five categories in either the 122-sample or the 295-sample set. Such a high rate of unclassifiable patients limits the applicability of microarray clusters in the clinic. Second, the clusters remain heterogeneous. These drawbacks prompted us to develop the radically different classification approach described in this application.
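The unsupervised hierarchical clustering referred to above can be illustrated with a minimal sketch (not the Sørlie et al. pipeline): samples are agglomerated by average linkage using 1 − Pearson correlation, a distance commonly used for expression profiles. The sample labels, marker genes and values below are invented for illustration only.

```python
# Sketch of average-linkage hierarchical clustering of expression profiles.
# Distance between samples: 1 - Pearson correlation. All data are invented.
from math import sqrt

def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sqrt(sum((x - ma) ** 2 for x in a))
    vb = sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (va * vb)

def hier_cluster(profiles, n_clusters):
    """Agglomerative clustering: repeatedly merge the closest pair of clusters."""
    clusters = [[name] for name in profiles]          # each sample starts alone
    dist = lambda s, t: 1 - pearson(profiles[s], profiles[t])
    def linkage(c1, c2):                              # average pairwise distance
        return sum(dist(s, t) for s in c1 for t in c2) / (len(c1) * len(c2))
    while len(clusters) > n_clusters:
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Toy profiles over four hypothetical marker genes (e.g. ESR1, KRT5, ERBB2, GATA3)
profiles = {
    "tumor_A1": [9.1, 1.2, 2.0, 8.7],   # "luminal-like": high ER/GATA3
    "tumor_A2": [8.8, 1.0, 2.2, 8.9],
    "tumor_B1": [1.1, 8.5, 2.1, 1.0],   # "basal-like": high KRT5
    "tumor_B2": [1.3, 8.9, 1.8, 1.2],
}
print(hier_cluster(profiles, 2))
```

Cutting the merge process at a chosen number of clusters yields the sample groups; in a real study the same procedure runs in two dimensions (samples and genes) over hundreds of genes.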
Functional analysis of global gene expression and other OMICs data. With thousands of data points and whole-genome representation, OMICs datasets undergo rigorous statistical analyses (evaluation of the significance of data points, supervised and unsupervised clustering, analysis of variance (ANOVA), principal component analysis (PCA) and other tests). However, statistical analysis is insufficient for interpreting the “underlying biology” behind expression profiles. Typically, functional analysis of gene expression is limited to querying the statistically derived gene lists against cellular process ontologies such as GO. This is informative for “big picture” observations, for instance the association of a metastasis signature with “cell death, cell cycle, proliferation and DNA repair”. However, the resolution is inadequate for “fine mapping” of the dataset at the level of individual protein interactions and for revealing the combinations of biological events causative of the condition. We consider the major tools for data analysis to be pathways, cellular processes and networks (signaling and metabolic). Pathways represent linear chains of consecutive biochemical transformations or signaling protein interactions, experimentally confirmed as a whole. Functional processes are folders, or categories, of proteins needed for the execution of distinct cellular functions, such as apoptosis or fatty acid oxidation. Unlike pathways, the proteins within processes are not connected by functional links. Both pathways and process categories are static, i.e., experimental data can be mapped onto the pathways and processes (via linking of data point IDs) but cannot alter them. Biological networks are essentially different. They represent combinations of objects, or nodes (proteins, metabolites, genes, etc.), interconnected by links (binary physical protein interactions, single-step metabolic reactions, functional correlations) and assembled “on the fly” from interaction tables.
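The “on the fly” assembly of a network from interaction tables can be sketched as follows: a global table of binary links is filtered down to the subnetwork induced by the genes present in an experimental dataset. The gene names and edges below are invented for illustration; real interaction tables contain hundreds of thousands of curated links.

```python
# Sketch of network assembly from a binary interaction table, and extraction
# of the subnetwork induced by a dataset's genes. All edges are invented.
from collections import defaultdict

interactions = [                      # (protein, protein) binary links
    ("TP53", "MDM2"), ("TP53", "BAX"), ("MDM2", "AKT1"),
    ("ERBB2", "GRB7"), ("ERBB2", "AKT1"), ("GATA3", "ESR1"),
]

def build_network(edges):
    """Undirected adjacency map assembled from the interaction table."""
    net = defaultdict(set)
    for a, b in edges:
        net[a].add(b)
        net[b].add(a)
    return net

def induced_subnetwork(net, dataset_genes):
    """Keep only the dataset's nodes and the links among them."""
    keep = set(dataset_genes)
    return {g: net[g] & keep for g in keep if net[g] & keep}

net = build_network(interactions)
significant = ["TP53", "MDM2", "ERBB2", "GRB7"]   # e.g. differentially expressed
print(induced_subnetwork(net, significant))
```

The induced subnetwork is specific to each dataset, which is precisely what makes networks dynamic where pathways and process categories are static.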
Biological networks are dynamic, as they are built de novo out of building blocks (binary interactions) and are specific to each dataset. A combination of networks, the “interactome”, is probably the closest representation of the “cellular machinery” of active proteins which carry out cellular functions. The complete set of interactions defines the potential of the core cellular machinery: billions of physically possible multi-step combinations. Obviously, only a fraction of all possible interactions are activated under any given condition, as not all genes are expressed at one time in a tissue and only a fraction of the cellular protein pool is active. The subset of activated (or repressed) genes and proteins is unique to the experiment, and is captured by data “snapshots” such as microarrays. Once generated, the network(s) can be interpreted in terms of higher-level processes, and the mechanism of an effect can be unraveled. This is achieved by linking the network objects to Gene Ontology (GO) and other process ontologies, as well as to metabolic and signaling maps. The process of data analysis consists of narrowing down a list of potentially many thousands of data points to small sets of genes that are interconnected functionally and most affected in the condition. This is achieved by using scoring methods for the intersections between functional categories, and by calculating relevance to the dataset in question (the relative saturation of pathways and networks with the data).
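The “relative saturation” scoring mentioned above is conventionally expressed as an over-representation statistic: the probability, under the hypergeometric distribution, of observing at least the measured overlap between the dataset and a functional category by chance. A minimal sketch (category sizes and counts invented for illustration):

```python
# Hypergeometric over-representation test: scores the overlap between an
# experimental gene list and a pathway/category. Numbers are illustrative.
from math import comb

def enrichment_pvalue(N, K, n, k):
    """P(overlap >= k) for a category of K genes out of N total genes,
    given a dataset of n genes of which k fall in the category."""
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(K, n) + 1)) / comb(N, n)

# Example: 20,000 genes on the array; a 100-gene hypothetical "apoptosis"
# category; 500 significant genes, 15 of which fall in the category
# (expected by chance: 500 * 100 / 20000 = 2.5).
p = enrichment_pvalue(20000, 100, 500, 15)
print(f"p = {p:.3g}")
```

Categories are then ranked by p-value, so that the most saturated pathways and networks rise to the top of the analysis.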
Unforeseen toxicity of new compounds in humans is one of the leading reasons for drug withdrawal from the market and from clinical trials. A recent analysis of drugs entering preclinical studies suggested that between 2000 and 2002 only one of 13 molecules entering clinical studies reached the market. This and other studies together indicate that the cost of developing a drug, including marketing, is increasing and could be in excess of $1 billion. The inability to predict toxicity cost the pharmaceutical industry an estimated $8 billion in 2003. The critical significance of the drug failure problem has created a steadily growing market for novel predictive technologies, such as toxicogenomics (assessment of compound toxicity in humans based on pre-clinical molecular profiling) and pharmacogenomics (evaluation of individual susceptibility to drug efficacy and side effects based on genotype). The majority of toxicogenomic studies are drug-effect experiments on rat and mouse models, typically genome-wide gene expression profiling using a microarray platform. Current analytical software and empirical gene expression reference databases include Iconix DrugMatrix and GeneLogic's ToxExpress. This software has relatively low predictive power. However, this situation may change drastically with the development of the next generation of tools for functional analysis and systems biology.
Microarrays provide a snapshot of the expression of thousands of genes at a given time and condition, and one of the major challenges is to identify genes or groups of genes associated with different disease conditions. Most methods have been developed to identify significant changes in the expression levels of individual genes; however, expression changes can be experienced at the level of pathways or functionally related network modules. Less work has focused on the simultaneous use of gene expression and interaction information. Related studies include identifying biologically active sub-networks by using scoring functions over the sub-networks; other approaches focus on gene expression changes in a pre-defined set of genes which are known to be biologically related. Our approach is similar in nature, focusing on identifying, among pre-defined network modules, those that differentiate between any two treatment conditions, i.e., those characterized by an expression pattern which changes in a synchronous manner over the different treatment conditions. Traditional methods based on the expression changes of individual genes are unable to identify these relatively small but simultaneous expression pattern changes.
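One simple way to capture such module-level differentiation, shown here as an illustrative sketch rather than the claimed method, is to summarize each pre-defined module as its mean expression per sample and compare the two condition groups with a two-sample t statistic: a small but synchronous shift across all member genes then scores highly even when no individual gene changes dramatically. Gene names and values below are invented.

```python
# Sketch of module-level scoring between two treatment conditions:
# module activity = mean expression of member genes per sample, compared
# across conditions with Welch's two-sample t statistic. Data are invented.
from math import sqrt

def module_activity(sample, module):
    return sum(sample[g] for g in module) / len(module)

def t_statistic(xs, ys):
    nx, ny = len(xs), len(ys)
    mx, my = sum(xs) / nx, sum(ys) / ny
    vx = sum((x - mx) ** 2 for x in xs) / (nx - 1)
    vy = sum((y - my) ** 2 for y in ys) / (ny - 1)
    return (mx - my) / sqrt(vx / nx + vy / ny)

module = ["GENE1", "GENE2", "GENE3"]                     # a pre-defined module
control = [{"GENE1": 5.0, "GENE2": 4.9, "GENE3": 5.1},
           {"GENE1": 5.1, "GENE2": 5.0, "GENE3": 4.9}]
treated = [{"GENE1": 5.4, "GENE2": 5.5, "GENE3": 5.3},   # small, synchronous shift
           {"GENE1": 5.5, "GENE2": 5.4, "GENE3": 5.6}]

t = t_statistic([module_activity(s, module) for s in treated],
                [module_activity(s, module) for s in control])
print(f"module t = {t:.2f}")
```

Averaging over the module suppresses gene-level noise, which is why the coordinated 0.4-0.5 unit shift above yields a large statistic while each individual gene's change would be unremarkable.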
OMICs-based approaches in mechanistic toxicology. Regulatory assessment of potential toxicant hazards is mandated both for food and drug compounds (FDA) and for industrial chemicals and commercial compounds that will be released into the environment (EPA). For both of these broad categories, risk assessment of toxicity from chemical or drug exposure comprises two related processes: hazard characterization and exposure assessment. It is widely accepted that a mechanistic understanding of the interaction of chemicals and living systems is required for a full evaluation of drug and chemical safety. The majority of OMICs-based studies of mechanisms of chemical toxicity (toxicogenomics) are focused on transcriptional gene expression profiling; proteomic and metabolomic profiling methods are currently being developed to provide a comprehensive view of the toxicant response. Molecular profiling approaches are being increasingly integrated into pre-clinical drug safety evaluation and into clinical studies of toxicant-related disease. The underlying assumption behind toxicogenomics is that chemicals with similar mechanisms of toxicity will evoke similar patterns of gene expression in the affected target tissue. Therefore, the potential toxicity of a compound may be determined by comparing its gene expression pattern with a library of standardized expression patterns for xenobiotics of known toxicity. The liver and kidney are the primary sites of xenobiotic metabolism and, consequently, of hepatotoxicity and nephrotoxicity manifestations. The reference expression pattern libraries are therefore focused mostly on rat liver, as this is the most extensively utilized toxicity assay system. The patterns of gene expression can be presented in a variety of ways depending on the statistical procedures applied (from a list of genes most affected by expression amplitude to neural networks).
Drug expression responses are mainly presented in the form of compound- or class-specific “gene signatures”, i.e., lists of genes (expression biomarkers) characteristic of a certain type of toxicity. Gene signatures typically comprise 50-100 genes and are diagnostic for exposure to a specific chemical, with predictive power often >90%. Gene signatures are generated using different statistical techniques, such as unsupervised clustering and supervised pattern-matching algorithms. The reference, or compendium, databases of toxicant signatures were assembled initially for yeast and subsequently for rodents.
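Matching a new compound's profile against a library of class signatures can be sketched as a nearest-signature classification by correlation over the signature genes. This is an illustrative sketch only; the class names and signature values below are invented, though the marker genes (e.g. HMOX1 for oxidative stress, HAVCR1/LCN2 for kidney injury) are chosen to be plausible.

```python
# Sketch of signature-based toxicity classification: a compound's expression
# profile is correlated against reference class signatures; the best-matching
# class is reported. Signature values are invented for illustration.
from math import sqrt

def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    return cov / (sqrt(sum((x - ma) ** 2 for x in a)) *
                  sqrt(sum((y - mb) ** 2 for y in b)))

# Reference library: toxicity class -> {signature gene: mean log-ratio}
library = {
    "oxidative_stress": {"HMOX1": 2.1, "NQO1": 1.8, "GCLC": 1.5, "CYP1A1": -0.2},
    "nephrotoxicity":   {"HAVCR1": 2.5, "LCN2": 2.0, "HMOX1": 0.3, "GCLC": -0.1},
}

def classify(profile, library):
    """Return (best_matching_class, correlation) over each signature's genes."""
    scores = {}
    for tox_class, signature in library.items():
        genes = [g for g in signature if g in profile]
        scores[tox_class] = pearson([signature[g] for g in genes],
                                    [profile[g] for g in genes])
    return max(scores.items(), key=lambda kv: kv[1])

profile = {"HMOX1": 1.9, "NQO1": 1.6, "GCLC": 1.4, "CYP1A1": 0.0,
           "HAVCR1": 0.1, "LCN2": 0.2}
print(classify(profile, library))
```

Production systems use richer supervised models, but the principle is the same: a compound is assigned to the toxicity class whose reference pattern its profile most closely resembles.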
It is well recognized by federal regulatory agencies that molecular profiling of toxicant exposure must be integrated with pathological and clinical profiling data of disease states. To support this integration for drug safety evaluation, the FDA has released a “Guidance for Industry on Pharmacogenomic Data Submissions”. Similarly, the Science Policy Council has prepared an “Interim Policy on Genomics” and has organized its own internal “Framework for a Computational Toxicology Research Program in ORD”. Reference standards for experimental design and reporting for toxicogenomic studies have been loosely adopted as the MIAME-Tox standard. To date, no universally accepted standardized experimental platform or common database schema exists for toxicogenomic studies. At present, several public, academic, and commercial databases exist as specific repositories for toxicogenomic data (Table 1). Toxicogenomics and chemogenomics databases have been instrumental in finding additional class-specific gene signatures and in the prediction of both drug-induced oxidative stress-mediated hepatotoxicity and drug-induced renal tubule degeneration and nephrotoxicity.
TABLE 1
Public and Commercial Toxicogenomics Databases, Protocols and Resources

Database     Source             Type           Array
Tox-MIAME    MGED               Protocol       N/A
CEBS         NCT/NIEHS/NIH      DB/public      Multiple
ArrayTrack   FDA                DB/public      Multiple
CTD          MDIBL              DB/public
PharmGKB     Stanford U.        DB/academic    Multiple
EDGE         U. Wisconsin       DB/academic    mouse, custom
dbZach       Michigan State U.  DB/academic    rat, custom
ToxExpress   GeneLogic          DB/commercial  rat, Affymetrix
DrugMatrix   Iconix Pharma      DB/commercial  rat, GE
Consequently, there is a need in the pharmacogenetics industry for effective and efficient methods of analyzing pathway databases to predict potential new drugs or treatments. Such a method could be used to evaluate proposed new drugs or treatments against the pathway databases to predict the adverse and side effects that may be expected before expensive clinical trials are begun.