A central premise in modern cancer treatment is that patient diagnosis, prognosis, risk assessment, and treatment response prediction can be improved by stratification of cancers based on genomic, transcriptional and epigenomic characteristics of the tumor alongside relevant clinical information gathered at the time of diagnosis (for example, patient history, tumor histology and stage) as well as subsequent clinical follow-up data (for example, treatment regimens and disease recurrence events).
While several high-throughput technologies have been available for probing the molecular details of cancer, only a handful of successes have been achieved based on this paradigm. For example, 25% of breast cancer patients presenting with a particular amplification or overexpression of the ERBB2 growth factor receptor tyrosine kinase can now be treated with trastuzumab, a monoclonal antibody targeting the receptor (Vogel C, Cobleigh M A, Tripathy D, Gutheil J C, Harris L N, Fehrenbacher L, Slamon D J, Murphy M, Novotny W F, Burchmore M, Shak S, Stewart S J. First-line, single-agent Herceptin® (trastuzumab) in metastatic breast cancer. A preliminary report. Eur. J. Cancer 2001 January; 37 Suppl 1:25-29).
However, even this success story is clouded by the fact that fewer than 50% of patients with ERBB2-positive breast cancers actually achieve any therapeutic benefit from trastuzumab, emphasizing our incomplete understanding of this well-studied oncogenic pathway and the many therapy-resistance mechanisms intrinsic to ERBB2-positive breast cancers (Park J W, Neve R M, Szollosi J, Benz C C. Unraveling the biologic and clinical complexities of HER2. Clin. Breast Cancer 2008 October; 8(5):392-401).
This overall failure to translate modern advances in basic cancer biology is in part due to our inability to comprehensively organize and integrate all of the omic features now technically acquirable on virtually any type of cancer. Despite overwhelming evidence that histologically similar cancers are in reality a composite of many molecular subtypes, each with significantly different clinical behavior, this knowledge is rarely applied in practice due to the lack of robust signatures that correlate well with prognosis and treatment options.
Cancer is a disease of the genome that is associated with aberrant alterations that lead to dysregulation of the cellular system. What is not clear is how genomic changes feed into genetic pathways that underlie cancer phenotypes. High-throughput functional genomics investigations have made tremendous progress in the past decade (Alizadeh A A, Eisen M B, Davis R E, Ma C, Lossos I S, Rosenwald A, Boldrick J C, Sabet H, Tran T, Yu X, Powell J I, Yang L, Marti G E, Moore T, Hudson J, Lu L, Lewis D B, Tibshirani R, Sherlock G, Chan W C, Greiner T C, Weisenburger D D, Armitage J O, Warnke R, Levy R, Wilson W, Greyer M R, Byrd J C, Botstein D, Brown P O, Staudt L M. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 2000 February; 403(6769):503-511.; Golub T R, Slonim D K, Tamayo P, Huard C, Gaasenbeek M, Mesirov J P, Coller H, Loh M L, Downing J R, Caligiuri M A, Bloomfield C D, Lander E S. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 1999 October; 286(5439):531-537.; van de Vijver M J, He Y D, van 't Veer L J, Dai H, Hart A A M, Voskuil D W, Schreiber G J, Peterse J L, Roberts C, Marton M J, Parrish M, Atsma D, Witteveen A, Glas A, Delahaye L, van der Velde T, Bartelink H, Rodenhuis S, Rutgers E T, Friend S H, Bernards R. A Gene-Expression Signature as a Predictor of Survival in Breast Cancer. N Engl J Med 2002 December; 347(25):1999-2009.)
However, the goal of integrating multiple data sources to identify reproducible and interpretable molecular signatures of tumorigenesis and progression remains elusive. Recent pilot studies by TCGA and others make it clear that a pathway-level understanding of genomic perturbations is needed to understand the changes observed in cancer cells. These findings demonstrate that even when patients harbor genomic alterations or aberrant expression in different genes, these genes often participate in a common pathway. Even more strikingly, the alterations observed (for example, deletions versus amplifications) often alter the pathway output in the same direction, either all increasing or all decreasing the pathway activation. (See Parsons D W, Jones S, Zhang X, Lin J C H, Leary R J, Angenendt P, Mankoo P, Carter H, Siu I, Gallia G L, Olivi A, McLendon R, Rasheed B A, Keir S, Nikolskaya T, Nikolsky Y, Busam D A, Tekleab H, Diaz L A, Hartigan J, Smith D R, Strausberg R L, Marie S K N, Shinjo S M O, Yan H, Riggins G J, Bigner D D, Karchin R, Papadopoulos N, Parmigiani G, Vogelstein B, Velculescu V E, Kinzler K W. An Integrated Genomic Analysis of Human Glioblastoma Multiforme. Science 2008 September; 321(5897):1807-1812.; Cancer Genome Atlas Research Network. Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature 2008 October; 455(7216):1061-1068.)
Approaches for interpreting genome-wide cancer data have focused on identifying gene expression profiles that are highly correlated with a particular phenotype or disease state, and have led to promising results. Methods using analysis of variance, false-discovery rate control, and non-parametric tests have been proposed (Allison D B, Cui X, Page G P, Sabripour M. Microarray data analysis: from disarray to consolidation and consensus. Nat. Rev. Genet. 2006 January; 7(1):55-65.; Dudoit S, Fridlyand J. A prediction-based resampling method for estimating the number of clusters in a dataset. Genome Biol 2002 June; 3(7):RESEARCH0036-RESEARCH0036.21.; Tusher V G, Tibshirani R, Chu G. Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl. Acad. Sci. U.S.A. 2001 April; 98(9):5116-5121; Kerr M K, Martin M, Churchill G A. Analysis of variance for gene expression microarray data. J. Comput. Biol. 2000; 7(6):819-837; Storey J D, Tibshirani R. Statistical significance for genomewide studies. Proc. Natl. Acad. Sci. U.S.A. 2003 August; 100(16):9440-9445; and Troyanskaya O G, Garber M E, Brown P O, Botstein D, Altman R B. Nonparametric methods for identifying differentially expressed genes in microarray data. Bioinformatics 2002 November; 18(11):1454-1461.).
Several pathway-level approaches use statistical tests based on overrepresentation of genesets to detect whether a pathway is perturbed in a disease condition. In these approaches, genes are ranked based on their degree of differential activity, for example as detected by either differential expression or copy number alteration. A probability score is then assigned reflecting the degree to which a pathway's genes rank near the extreme ends of the sorted list, such as is used in gene set enrichment analysis (GSEA) (Subramanian A, Tamayo P, Mootha V K, Mukherjee S, Ebert B L, Gillette M A, Paulovich A, Pomeroy S L, Golub T R, Lander E S, Mesirov J P. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. U.S.A. 2005 October; 102(43):15545-15550.). Other approaches include using a hypergeometric test-based method to identify Gene Ontology (Ashburner M, Ball C A, Blake J A, Botstein D, Butler H, Cherry J M, Davis A P, Dolinski K, Dwight S S, Eppig J T, Harris M A, Hill D P, Issel-Tarver L, Kasarskis A, Lewis S, Matese J C, Richardson J E, Ringwald M, Rubin G M, Sherlock G. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 2000 May; 25(1):25-29.) or MIPS mammalian protein-protein interaction (Pagel P, Kovac S, Oesterheld M, Brauner B, Dunger-Kaltenbach I, Frishman G, Montrone C, Mark P, Stümpflen V, Mewes H, Ruepp A, Frishman D. The MIPS mammalian protein-protein interaction database. Bioinformatics 2005 March; 21(6):832-834.) categories enriched in differentially expressed genes (Tamayo P, Slonim D, Mesirov J, Zhu Q, Kitareewan S, Dmitrovsky E, Lander E S, Golub T R. Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proc. Natl. Acad. Sci. U.S.A. 1999 March; 96(6):2907-2912.).
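The two overrepresentation statistics mentioned above can be illustrated with a minimal Python sketch. It rests on several assumptions: the function names `hypergeom_pvalue` and `enrichment_score` are hypothetical, the enrichment score is a simplified unweighted running-sum statistic rather than the weighted statistic of the cited GSEA publication, and no permutation-based significance estimate is included.

```python
from math import comb


def hypergeom_pvalue(n_universe, n_pathway, n_selected, n_overlap):
    """Upper-tail hypergeometric probability P(X >= n_overlap).

    n_selected genes (e.g. the differentially expressed set) are drawn
    without replacement from n_universe genes, of which n_pathway belong
    to the pathway or category being tested.
    """
    max_overlap = min(n_pathway, n_selected)
    total = comb(n_universe, n_selected)
    tail = sum(
        comb(n_pathway, k) * comb(n_universe - n_pathway, n_selected - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total


def enrichment_score(ranked_genes, pathway):
    """Unweighted running-sum statistic (an assumption, not the GSEA default).

    Walk down the ranked list, stepping up at pathway genes and down at
    all other genes; return the signed maximum deviation from zero.
    """
    pathway = set(pathway)
    n_hit = sum(1 for gene in ranked_genes if gene in pathway)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        return 0.0
    running, best = 0.0, 0.0
    for gene in ranked_genes:
        running += 1.0 / n_hit if gene in pathway else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


if __name__ == "__main__":
    # Hypothetical toy numbers: 3 of 3 selected genes fall in a 3-gene pathway.
    print(hypergeom_pvalue(10, 3, 3, 3))
    # Pathway genes at the top of a hypothetical ranked list.
    print(enrichment_score(["A", "B", "C", "D", "E", "F"], {"A", "B"}))
```

Both functions treat every pathway gene identically and ignore how the genes are connected to one another.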
Overrepresentation analyses are limited in their efficacy because they do not incorporate known interdependencies among genes in a pathway that can increase the detection signal for pathway relevance. In addition, they treat all gene alterations as equal, which is not expected to be valid for many biological systems.
Further complicating the issue is the fact that many genes (for example, microRNAs) are pleiotropic, acting in several pathways with different roles (Maddika S, Ande S R, Panigrahi S, Paranjothy T, Weglarczyk K, Zuse A, Eshraghi M, Manda K D, Wiechec E, Los M. Cell survival, cell death and cell cycle pathways are interconnected: implications for cancer therapy. Drug Resist. Updat. 2007 January; 10(1-2):13-29). Because of these factors, overrepresentation analyses often miss functionally-relevant pathways whose genes have borderline differential activity. They can also produce many false positives when only a single gene is highly altered in a small pathway. Our collective knowledge about the detailed interactions between genes and their phenotypic consequences is growing rapidly.
While the knowledge was traditionally scattered throughout the literature and hard to access systematically, new efforts are cataloging pathway knowledge into publicly available databases. Some of the databases that include pathway topology are Reactome (Joshi-Tope G, Gillespie M, Vastrik I, D'Eustachio P, Schmidt E, de Bono B, Jassal B, Gopinath G R, Wu G R, Matthews L, Lewis S, Birney E, Stein L. Reactome: a knowledgebase of biological pathways. Nucleic Acids Res. 2005 January; 33(Database issue):D428-32.), KEGG (Ogata H, Goto S, Sato K, Fujibuchi W, Bono H, Kanehisa M. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 1999 January; 27(1):29-34.) and the NCI Pathway Interaction Database. Updates to these databases are expected to improve our understanding of biological systems by explicitly encoding how genes regulate and communicate with one another. A key hypothesis is that the interaction topology of these pathways can be exploited for the purpose of interpreting high-throughput datasets.
Until recently, few computational approaches were available for incorporating pathway knowledge to interpret high-throughput datasets. However, several newer approaches have been proposed that incorporate pathway topology (Efroni S, Schaefer C F, Buetow K H. Identification of key processes underlying cancer phenotypes using biologic pathway analysis. PLoS ONE 2007; 2(5):e425.). One approach, called Signaling Pathway Impact Analysis (SPIA), uses a method analogous to Google's PageRank to determine the influence of a gene in a pathway (Tarca A L, Draghici S, Khatri P, Hassan S S, Mittal P, Kim J, Kim C J, Kusanovic J P, Romero R. A novel signaling pathway impact analysis. Bioinformatics 2009 January; 25(1):75-82.). In SPIA, more influence is placed on genes that link out to many other genes. SPIA was successfully applied to different cancer datasets (lung adenocarcinoma and breast cancer) and shown to outperform overrepresentation analysis and Gene Set Enrichment Analysis for identifying pathways known to be involved in these cancers. While SPIA represents a major step forward in interpreting cancer datasets using pathway topology, it is limited to using only a single type of genome-wide data.
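The idea of weighting genes by how many other genes they link out to can be illustrated with a small PageRank-style power iteration. This is a sketch under stated assumptions and is not the SPIA perturbation formula: the function name `outlink_influence`, the damping value, and the toy edge list are all hypothetical, and the edges are reversed before iterating so that score accumulates on genes with many downstream targets.

```python
def outlink_influence(edges, damping=0.85, iterations=100):
    """PageRank-style score computed on the reversed pathway graph.

    edges: iterable of (regulator, target) pairs.
    Standard PageRank rewards nodes with many incoming links; reversing
    each edge makes the score reward genes that link OUT to many genes.
    """
    nodes = sorted({node for edge in edges for node in edge})
    # Reverse every edge so that score flows from targets to regulators.
    reversed_out = {node: [] for node in nodes}
    for regulator, target in edges:
        reversed_out[target].append(regulator)

    score = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        updated = {node: (1.0 - damping) / len(nodes) for node in nodes}
        for node in nodes:
            # A node with no outgoing (reversed) edge spreads its score evenly.
            receivers = reversed_out[node] or nodes
            share = damping * score[node] / len(receivers)
            for receiver in receivers:
                updated[receiver] += share
        score = updated
    return score


if __name__ == "__main__":
    # Hypothetical toy pathway; gene names are placeholders for illustration.
    toy_edges = [
        ("EGFR", "KRAS"),
        ("EGFR", "PIK3CA"),
        ("EGFR", "STAT3"),
        ("KRAS", "RAF1"),
    ]
    for gene, value in sorted(outlink_influence(toy_edges).items()):
        print(gene, round(value, 3))
```

In this toy graph the gene with the most outgoing links receives the largest score, which mirrors the qualitative behavior described for SPIA without reproducing its actual statistic.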
New computational approaches are needed to connect multiple genomic alterations such as copy number, DNA methylation, somatic mutations, mRNA expression and microRNA expression. Integrated pathway analysis is expected to increase the precision and sensitivity of causal interpretations for large sets of observations since no single data source is likely to provide a complete picture on its own.
In the past several years, approaches in probabilistic graphical models (PGMs) have been developed for learning causal networks compatible with multiple levels of observations. Efficient algorithms are available to learn pathways automatically from data (Friedman N, Goldszmidt M. (1997) Sequential Update of Bayesian Network Structure. In: Proceedings of the Thirteenth Conference on Uncertainty in Artificial Intelligence (UAI'97), Morgan Kaufmann Publishers, pp. 165-174; Murphy K, Weiss Y. Loopy belief propagation for approximate inference: An empirical study. In: Proceedings of Uncertainty in AI. 1999) and are well adapted to problems in genetic network inference (Friedman N. Inferring cellular networks using probabilistic graphical models. Science 2004 February; 303(5659):799-805.). As an example, graphical models have been used to identify sets of genes that form ‘modules’ in cancer biology (Segal E, Friedman N, Kaminski N, Regev A, Koller D. From signatures to models: understanding cancer using microarrays. Nat Genet 2005 June; 37 Suppl:S38-45.). They have also been applied to elucidate the relationship between tumor genotype and expression phenotypes (Lee S, Pe'er D, Dudley A M, Church G M, Koller D. Identifying regulatory mechanisms using individual variation reveals key role for chromatin modification. Proc. Natl. Acad. Sci. U.S.A. 2006 September; 103(38):14062-14067.), and to infer protein signal networks (Sachs K, Perez O, Pe'er D, Lauffenburger D A, Nolan G P. Causal protein-signaling networks derived from multiparameter single-cell data. Science 2005 April; 308(5721):523-529.) and recombinatorial gene regulatory code (Beer M A, Tavazoie S. Predicting gene expression from sequence. Cell 2004 April; 117(2):185-198.). In particular, factor graphs have been used to model expression data (Gat-Viks I, Shamir R. Refinement and expansion of signaling pathways: the osmotic response network in yeast. Genome Research 2007 March; 17(3):358-367.; Gat-Viks I, Tanay A, Raijman D, Shamir R. The Factor Graph Network Model for Biological Systems. In: Hutchison D, Kanade T, Kittler J, Kleinberg J M, Mattern F, Mitchell J C, Naor M, Nierstrasz O, Pandu Rangan C, Steffen B, Sudan M, Terzopoulos D, Tygar D, Vardi M Y, Weikum G, Miyano S, Mesirov J, Kasif S, Istrail S, Pevzner P A, Waterman M, editors. Berlin, Heidelberg: Springer Berlin Heidelberg; 2005 p. 31-47.; Gat-Viks I, Tanay A, Raijman D, Shamir R. A probabilistic methodology for integrating knowledge and experiments on biological networks. J. Comput. Biol. 2006 March; 13(2):165-181.).
Breast cancer is clinically and genomically heterogeneous and is composed of several pathologically and molecularly distinct subtypes. Patient responses to conventional and targeted therapeutics differ among subtypes, motivating the development of marker-guided therapeutic strategies. Collections of breast cancer cell lines mirror many of the molecular subtypes and pathways found in tumors, suggesting that treatment of cell lines with candidate therapeutic compounds can guide identification of associations between molecular subtypes, pathways and drug response. In a test of 77 therapeutic compounds, nearly all drugs show differential responses across these cell lines and approximately half show subtype-, pathway- and/or genomic aberration-specific responses. These observations suggest mechanisms of response and resistance that may inform clinical drug deployment as well as efforts to combine drugs effectively.
The accumulation of high throughput molecular profiles of tumors at various levels has been a long and costly process worldwide. Combined analysis of gene regulation at various levels may point to specific biological functions and molecular pathways that are deregulated in multiple epithelial cancers and reveal novel subgroups of patients for tailored therapy and monitoring. We have collected high throughput data at several molecular levels derived from fresh frozen primary tumor samples, with matched blood and known micrometastasis status, from approximately 110 breast cancer patients (further referred to as the MicMa dataset). These patients are part of a cohort of over 900 breast cancer cases with information about presence of disseminated tumor cells (DTC), long-term follow-up for recurrence and overall survival. The MicMa set has been used in parallel pilot studies of whole genome mRNA expression (Naume, B. et al., (2007), Presence of bone marrow micrometastasis is associated with different recurrence risk within molecular subtypes of breast cancer, 1: 160-171), arrayCGH (Russnes H G, Vollan H K M, Lingjaerde O C, Krasnitz A, Lundin P, Naume B, Sørlie T, Borgen E, Rye I H, Langerød A, Chin S, Teschendorff A E, Stephens P J, Måner S, Schlichting E, Baumbusch L O, Kåresen R, Stratton M P, Wigler M, Caldas C, Zetterberg A, Hicks J, Børresen-Dale A. Genomic architecture characterizes tumor progression paths and fate in breast cancer patients. Sci Transl Med 2010 June; 2(38):38ra47), DNA methylation (Rønneberg J A, Fleischer T, Solvang H K, Nordgard S H, Edvardsen H, Potapenko I, Nebdal D, Daviaud C, Gut I, Bukholm I, Naume B, Børresen-Dale A, Tost J, Kristensen V. Methylation profiling with a panel of cancer related genes: association with estrogen receptor, TP53 mutation status and expression subtypes in sporadic breast cancer. Mol Oncol 2011 February; 5(1):61-76), whole genome SNP and SNP-CGH (Van Loo, P. et al., (2010), Allele-specific copy number analysis of tumors, 107: 16910-16915), whole genome miRNA expression analyses (Enerly, E. et al., (2011), miRNA-mRNA Integrated Analysis Reveals Roles for miRNAs in Primary Breast Tumors, 6: e16915), TP53 mutation status-dependent pathways and high throughput paired end sequencing (Stephens, P. J. et al., (2009), Complex landscapes of somatic rearrangement in human breast cancer genomes, 462: 1005-1010). This is a comprehensive collection of high throughput molecular data generated by a single lab on the same set of primary tumors of the breast.
A topic of great importance in cancer research is the identification of genomic aberrations that drive the development of cancer. Utilizing whole-genome copy number and expression profiles from the MicMa cohort, we defined several filtering steps, each designed to identify the most promising candidates among the genes selected in the previous step. The first two steps involve identification of genes that are commonly aberrant and whose copy number is correlated in cis with expression, i.e. genes for which copy number changes have a substantial effect on expression. Subsequently, the method considers in-trans effects of the selected genes to further narrow down the potential novel candidate driver genes (Miriam Ragle Aure, Israel Steinfeld, Lars Oliver Baumbusch, Knut Liestøl, Doron Lipson, Bjorn Naume, Vessela N. Kristensen, Anne-Lise Børresen-Dale, Ole-Christian Lingjaerde and Zohar Yakhini, (2011), A robust novel method for the integrated analysis of copy number and expression reveals new candidate driver genes in breast cancer). Recently we developed an allele-specific copy number analysis of tumors (ASCAT), which enables us to accurately dissect the allele-specific copy number of solid tumors while simultaneously estimating and adjusting for both tumor ploidy and nonaberrant cell admixture (Van Loo, P. et al., (2010), Allele-specific copy number analysis of tumors, 107: 16910-16915). This allows calculation of genome-wide allele-specific copy-number profiles from which gains, losses, copy number-neutral events, and loss of heterozygosity (LOH) can accurately be determined. Observing DNA aberrations in an allele-specific manner allowed us to construct a genome-wide map of allelic skewness in breast cancer, indicating loci where one allele is preferentially lost, whereas the other allele is preferentially gained. We hypothesize that these alternative alleles have a different influence on breast carcinoma development.
We also observed that basal-like breast carcinomas have a significantly higher frequency of LOH compared with other subtypes, and their ASCAT profiles show large-scale loss of genomic material during tumor development, followed by a whole-genome duplication, resulting in near-triploid genomes (Van Loo et al. (2010) supra). Distinct global DNA methylation profiles have been reported in normal breast epithelial cells as well as in breast tumors.
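The first two filtering steps described above for the copy number and expression profiles (selecting commonly aberrant genes, then keeping those whose copy number correlates in cis with expression) can be sketched as follows. This is an illustrative assumption and not the published method: the function names, the use of a Pearson correlation, and all threshold values are hypothetical, and the in-trans step is omitted.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def candidate_cis_genes(copy_number, expression,
                        aberration_cutoff=0.3,   # hypothetical log-ratio cutoff
                        min_fraction=0.25,       # hypothetical recurrence cutoff
                        min_correlation=0.6):    # hypothetical cis cutoff
    """Return genes passing both filters, in input order.

    copy_number, expression: dicts mapping gene -> per-sample values,
    with samples in the same order in both dicts.
    """
    selected = []
    for gene, cn_values in copy_number.items():
        if gene not in expression:
            continue
        # Step 1: the gene must be aberrant in enough samples.
        aberrant = sum(1 for v in cn_values if abs(v) >= aberration_cutoff)
        if aberrant / len(cn_values) < min_fraction:
            continue
        # Step 2: copy number must track expression of the same gene (in cis).
        if pearson(cn_values, expression[gene]) >= min_correlation:
            selected.append(gene)
    return selected


if __name__ == "__main__":
    # Hypothetical toy data for three placeholder genes across five samples.
    cn = {
        "GENE_A": [0.8, 0.9, 0.0, 0.1, 0.7],
        "GENE_B": [0.0, 0.1, 0.0, -0.1, 0.05],
        "GENE_C": [0.8, 0.9, 0.0, 0.1, 0.7],
    }
    expr = {
        "GENE_A": [2.0, 2.2, 0.1, 0.3, 1.8],
        "GENE_B": [1.0, 1.1, 0.9, 1.0, 1.2],
        "GENE_C": [0.5, 0.4, 0.6, 0.5, 0.45],
    }
    print(candidate_cis_genes(cn, expr))
```

In the toy data, one gene is rarely aberrant and another is aberrant without a matching expression change, so only the gene passing both filters is retained.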
There is currently a need to provide methods that can be used in characterization, diagnosis, prevention, treatment, and determining outcome of diseases and disorders.