This section provides background information related to the present disclosure which is not necessarily prior art.
Together with the ability to generate a large amount of data per experiment, high-throughput technologies also brought the challenge of translating such data into a better understanding of the underlying biological phenomena. Independent of the platform and the analysis methods used, the result of a high-throughput experiment is, in many cases, a list of differentially expressed genes. The common challenge faced by all researchers is to translate such lists of differentially expressed genes into a better understanding of the underlying biological phenomena and, in particular, to put this in the context of the whole organism as a complex system. A computerized analysis approach using the Gene Ontology (GO) was proposed to deal with this issue. This approach takes a list of differentially expressed genes and uses a statistical analysis to identify the GO categories (e.g., biological processes) that are over- or under-represented in the condition under study. Given a set of differentially expressed genes, this approach compares the number of differentially expressed genes found in each category of interest with the number of genes expected to be found in the given category just by chance. If the observed number is substantially different from the one expected just by chance, the category is reported as significant. A statistical model (e.g., hypergeometric) can be used to calculate the probability of observing the actual number of genes just by chance, i.e., a p-value. Currently, there are over 20 tools using this over-representation approach (ORA). In spite of its wide adoption, this approach has a number of limitations related to the type, quality, and structure of the annotations available. An alternative approach considers the distribution of the pathway genes in the entire list of genes and performs a functional class scoring (FCS), which also allows adjustments for gene correlations.
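The over-representation calculation described above can be sketched as follows. This is a minimal illustration assuming a hypergeometric model and a one-sided test for over-representation only; the function names, parameter names, and gene counts are hypothetical and are not taken from any of the tools mentioned.

```python
from math import comb


def expected_by_chance(total_genes, category_genes, de_genes):
    """Number of differentially expressed (DE) genes expected in a category
    if the DE genes were drawn at random from all measured genes."""
    return de_genes * category_genes / total_genes


def hypergeom_upper_tail(total_genes, category_genes, de_genes, de_in_category):
    """Probability of seeing at least `de_in_category` category genes among
    `de_genes` genes drawn without replacement from `total_genes` genes,
    of which `category_genes` belong to the category."""
    max_overlap = min(category_genes, de_genes)
    all_draws = comb(total_genes, de_genes)
    tail = sum(
        comb(category_genes, k) * comb(total_genes - category_genes, de_genes - k)
        for k in range(de_in_category, max_overlap + 1)
    )
    return tail / all_draws


# Hypothetical numbers: 10,000 measured genes, 100 annotated to the category,
# 200 DE genes, 8 of which fall in the category.
expected = expected_by_chance(10_000, 100, 200)
p_value = hypergeom_upper_tail(10_000, 100, 200, 8)
print(f"expected by chance: {expected:.1f}, observed: 8, p-value: {p_value:.3g}")
```

Note that only the counts enter the calculation: the sketch never looks at how strongly each gene changes or where it sits in a pathway.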
Gene Set Enrichment Analysis (GSEA), arguably the state of the art in the FCS category, ranks all genes based on the correlation between their expression and the given phenotypes, and calculates a score that reflects the degree to which a given pathway P is represented at the extremes of the entire ranked list. The score is calculated by walking down the ranked list of genes: it is increased for every gene that belongs to P and decreased for every gene that does not. Statistical significance is established with respect to a null distribution constructed by permutations.
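The running score described above can be sketched as follows. This is a simplified, unweighted illustration and not the published GSEA implementation: it assumes equal step sizes for all pathway genes, and it builds the null distribution by shuffling the gene order rather than permuting phenotype labels. All names are hypothetical.

```python
import random


def enrichment_score(ranked_genes, pathway):
    """Walk down the ranked list, stepping up for genes in the pathway and
    down otherwise; return the running sum with the largest magnitude."""
    n = len(ranked_genes)
    hits = sum(1 for gene in ranked_genes if gene in pathway)
    if hits == 0 or hits == n:
        return 0.0
    step_up = 1.0 / hits          # steps are scaled so the walk ends at zero
    step_down = 1.0 / (n - hits)
    running = 0.0
    extreme = 0.0
    for gene in ranked_genes:
        running += step_up if gene in pathway else -step_down
        if abs(running) > abs(extreme):
            extreme = running
    return extreme


def permutation_p_value(ranked_genes, pathway, n_perm=1000, seed=0):
    """Fraction of random gene orderings whose score is at least as extreme
    as the observed one (assumption: gene-order shuffling as the null)."""
    rng = random.Random(seed)
    observed = abs(enrichment_score(ranked_genes, pathway))
    shuffled = list(ranked_genes)
    at_least_as_extreme = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(enrichment_score(shuffled, pathway)) >= observed:
            at_least_as_extreme += 1
    return (at_least_as_extreme + 1) / (n_perm + 1)


ranked = ["g1", "g2", "g3", "g4", "g5", "g6"]   # hypothetical ranked genes
print(enrichment_score(ranked, {"g1", "g2"}))     # pathway genes at the top
print(permutation_p_value(ranked, {"g1", "g2"}))
```

A pathway whose genes cluster at the top of the list yields a large positive score, and one whose genes cluster at the bottom yields a large negative score. Because only the ranks matter, the score is unchanged by any rescaling of the expression values that preserves the ordering.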
Both ORA and FCS techniques currently used are limited by the fact that each functional category is analyzed independently without a unifying analysis at a pathway or system level. This is not well suited to a systems biology approach that aims to account for system-level dependencies and interactions, as well as to identify perturbations and modifications at the pathway or organism level. Several pathway databases, such as KEGG, BioCarta, and Reactome, currently describe metabolic pathways and gene signaling networks, offering the potential for a more complex and useful analysis. A recent technique, ScorePage, has been developed in an attempt to take advantage of this type of data for the analysis of metabolic pathways. Unfortunately, no such technique currently exists for the analysis of gene signaling networks. All pathway analysis tools currently available use one of the ORA approaches above and fail to take advantage of the much richer data contained in these resources. GenMAPP/MAPPfinder and GeneSifter use a standardized Z-score. PathwayProcessor, PathMAPA, Cytoscape and PathwayMiner use Fisher's exact test. MetaCore uses a hypergeometric model, while ArrayXPath offers both Fisher's exact test and a false discovery rate (FDR). Finally, VitaPad and Pathway Studio focus on visualization alone and do not offer any analysis.
The approaches currently available for the analysis of gene signaling networks share a number of important limitations. Firstly, these approaches consider only the set of genes on any given pathway and ignore their position in those pathways. This may be unsatisfactory from a biological point of view. If a pathway is triggered by a single gene product or activated through a single receptor and if that particular protein is not produced, the pathway will be greatly impacted, probably completely shut off. For example, if the insulin receptor (INSR) is not present, the entire insulin pathway is shut off. Conversely, if several genes are involved in a pathway but they only appear somewhere downstream, changes in their expression levels may not affect the given pathway as much.
Secondly, some genes have multiple functions and are involved in several pathways but with different roles. For instance, the above INSR is also involved in the adherens junction pathway as one of the many receptor protein tyrosine kinases. However, if the expression of INSR changes, this pathway is not likely to be heavily perturbed because INSR is just one of many receptors on this pathway. Once again, none of these aspects is considered by any of the existing approaches.
Probably the most important challenge today is that the knowledge embedded in these pathways about how various genes interact with each other is not currently exploited. The very purpose of these pathway diagrams is to capture some of our knowledge about how genes interact and regulate each other. However, the existing analysis approaches consider only the sets of genes involved in these pathways, without taking into consideration their topology. In fact, our understanding of various pathways is expected to improve as more data are gathered. Pathways will be modified by adding, removing or re-directing links on the pathway diagrams. Most existing techniques are completely unable to even sense such changes. Thus, these techniques will provide identical results as long as the pathway diagram involves the same genes, even if the interactions between them are completely re-defined over time.
Finally, up to now the expression changes measured in these high-throughput experiments have been used only to identify differentially expressed genes (ORA approaches) or to rank the genes (FCS methods), but not to estimate the impact of such changes on specific pathways. Thus, ORA techniques will see no difference between a situation in which a subset of genes is differentially expressed just above the detection threshold (e.g., 2-fold) and the situation in which the same genes are changing by many orders of magnitude (e.g., 100-fold). Similarly, FCS techniques can provide the same rankings for entire ranges of expression values, if the correlations between the genes and the phenotypes remain similar. Even though analyzing this type of information in a pathway and system context would be extremely meaningful from a biological perspective, currently there is no technique or tool able to do this.
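The threshold effect described above can be illustrated with a small sketch. The gene names, the log2 fold-change values, and the 2-fold cutoff (a log2 value of 1.0) are all hypothetical and chosen only to make the point.

```python
def differentially_expressed(log2_fold_changes, threshold=1.0):
    """Genes whose absolute log2 fold change reaches the threshold
    (a threshold of 1.0 corresponds to a 2-fold change)."""
    return {
        gene
        for gene, value in log2_fold_changes.items()
        if abs(value) >= threshold
    }


# Two hypothetical experiments: the same genes change, but by very different amounts.
mild = {"geneA": 1.1, "geneB": -1.2, "geneC": 0.2}     # just above 2-fold
strong = {"geneA": 6.6, "geneB": -7.0, "geneC": 0.2}   # roughly 100-fold

mild_set = differentially_expressed(mild)
strong_set = differentially_expressed(strong)

# An ORA-style analysis receives only these sets, which are identical.
print(mild_set == strong_set)

# The magnitude information that distinguishes the experiments is discarded.
print(sum(abs(v) for v in mild.values()), sum(abs(v) for v in strong.values()))
```

Any statistic computed from the gene sets alone, such as the over-representation p-value, is therefore the same for both experiments.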
We propose a radically different approach for pathway analysis that attempts to capture all aspects above. An impact factor (IF) is calculated for each pathway incorporating parameters such as the normalized fold change of the differentially expressed genes, the statistical significance of the set of pathway genes, and the topology of the signaling pathway. We show on a number of real data sets that the intrinsic limitations of the classical analysis produce both false positives and false negatives while the impact analysis provides biologically meaningful results.