Acquisition of large-scale data sets representing a variety of data modalities has become a crucial aspect of the characterization of experimental systems. Such a strategy affords a broad capture of biological information in a short time and with a relatively small investment of effort. Rich data sets are collected in hopes that valuable biological insights might be gained. The amount of collected information, however, can be overwhelming, making interpretation of the data difficult, and subsequent detailed biological understanding elusive.
Researchers have developed several strategies to address the management of large-scale data sets, and these strategies offer some ability to interpret the data and develop biological insight. Many of these solutions are based on measurement enrichment. For example, Gene Set Enrichment Analysis determines whether members of a gene set tend to occur toward the top (or bottom) of a ranked list, in which case the gene set is correlated with a phenotypic class distinction. Enrichment can also be incorporated with pathway analysis where, for example, specific measurements are associated with elements of a particular biological pathway. In addition to visually connecting measurements in this way, enrichment scores can be generated using a pathway to define the set of genes. Rather than identifying the upstream pathways that lead to the data, many of these enrichment-based solutions interpret the data from a “consequence” point of view, assessing the functional impact of the changes themselves. This approach, however, requires certain assumptions about the data and its impact, such as assuming that mRNA expression is directly correlated with the activity of the encoded protein. In practice, the correlation of mRNA abundance with encoded protein abundance is variable. Focusing on strictly consequential perspectives also fails to capture a major facet of the data that can be harnessed from an upstream “causal” perspective. Additionally, from a use perspective, the output of many of these existing data interpretation strategies is a measure of statistical enrichment, ultimately yielding a Boolean decision about pathway enrichment/activation rather than a measure of activation intensity.
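By way of illustration only, the enrichment idea described above (whether members of a gene set tend to occur toward the top or bottom of a ranked list) can be sketched as an unweighted running-sum statistic. The function name `enrichment_score`, the gene labels, and the equal-weight step sizes are assumptions made for this sketch; published enrichment tools use weighted variants and permutation-based significance testing that are not reproduced here.

```python
def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score (illustrative sketch only).

    Walk down the ranked list, stepping up at each member of the gene set
    and down at each non-member. Step sizes are chosen so that the walk
    ends at zero. The score is the largest deviation from zero: positive
    when members cluster near the top, negative when near the bottom.
    """
    hits = [gene in gene_set for gene in ranked_genes]
    n_hits = sum(hits)
    n_misses = len(ranked_genes) - n_hits
    if n_hits == 0 or n_misses == 0:
        raise ValueError("need at least one member and one non-member in the list")

    running = 0.0
    extreme = 0.0
    for is_hit in hits:
        running += 1.0 / n_hits if is_hit else -1.0 / n_misses
        if abs(running) > abs(extreme):
            extreme = running
    return extreme


# Hypothetical ranked list, ordered from most up-regulated to most down-regulated.
ranked = ["g1", "g2", "g3", "g4", "g5", "g6"]
top_set_score = enrichment_score(ranked, {"g1", "g2"})
bottom_set_score = enrichment_score(ranked, {"g5", "g6"})
```

A score of this kind indicates only where the gene set falls in the ranking; it does not by itself quantify the intensity of activation, which is the limitation noted above.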
Alternative strategies have been described that focus on uncovering a characteristic “signature” of measurements that results from one or more perturbations to a biological process, and subsequently scoring the presence of that signature in additional data sets as a measure of specific activity of that process. Most previous work of this type involves identifying and scoring signatures that are correlated with a disease phenotype. These phenotype-derived signatures provide significant classification power, but the lack of a mechanistic or causal relationship between a single specific perturbation and the signature means that the signature may represent multiple distinct unknown perturbations that lead to the same disease phenotype. A number of studies, however, have focused instead on measuring causal signatures based on very specific upstream perturbations either performed directly in the system of interest, or from closely-related published data. Based on the simple, yet powerful, premise that modulation of cellular pathways and the components therein are associated with distinct signatures in measured node entities, causally-derived signatures enable the “cause” of the signature to be identified with high specificity from the measured “effect.” These studies have demonstrated the great potential of applying a causal pathway scoring strategy to clinical problems, for example, by providing prognosis predictions in gastric cancer patients and indications of specific drug efficacy.
Given the vast potential of the information contained within large-scale data sets and the increasing ease of obtaining these data, new ways of mining understanding from these data sets have begun to be developed. Thus, for example, U.S. Publication No. 20120030162, which is commonly-owned, describes a method by which known techniques for causal pathway analysis of large data sets are extended to provide for a measure of intensity, which facilitates the comparison of biological states based on degree or amplitude of perturbation rather than comparison of likelihood of perturbation based on enrichment. According to that application, one or more measurement signatures are derived (e.g., from a knowledge base of causal biological facts), where a signature is a collection of measured node entities and their expected directions of change with respect to a reference node. The knowledge base may be a directed network of experimentally-observed causal relationships among biological entities and processes, and a reference node represents a potential perturbation to a biological entity or process (i.e., an entity that is hypothetically perturbed). A “degree of activation” of a signature is then assessed by scoring one or more “differential” data sets against the signature to compute an amplitude score, sometimes referred to as the “network perturbation amplitude” (NPA) metric. A “differential” data set is a data set having first and second conditions, e.g., a “treated” versus a “control” condition. In one embodiment, the amplitude score quantifies fold changes of measurements in the signature. A fold change is a number describing how much a quantity changes going from an initial to a final value.
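The relationship among a signature, a differential data set, and fold changes can be illustrated with a minimal sketch. The scoring rule used here, a mean of log2 fold changes each multiplied by the expected direction of change, is an assumption chosen for simplicity and is not presented as the NPA formula of the referenced application. The function names, node labels, and measurement values are likewise hypothetical.

```python
import math


def log2_fold_change(control_value, treated_value):
    """Log2 of the ratio of the final (treated) to the initial (control) value."""
    if control_value <= 0 or treated_value <= 0:
        raise ValueError("fold change requires positive measurements")
    return math.log2(treated_value / control_value)


def amplitude_score(signature, control, treated):
    """Signed mean log2 fold change over the measured signature nodes.

    signature: dict mapping node name -> expected direction (+1 or -1)
               with respect to the reference node.
    control, treated: dicts mapping node name -> measured abundance in the
               two conditions of the differential data set.
    ASSUMPTION: this signed-mean rule is illustrative, not the NPA metric.
    """
    terms = [
        direction * log2_fold_change(control[node], treated[node])
        for node, direction in signature.items()
        if node in control and node in treated  # skip unmeasured nodes
    ]
    if not terms:
        raise ValueError("no signature nodes were measured")
    return sum(terms) / len(terms)


# Hypothetical signature: two nodes expected to increase, one to decrease.
signature = {"NODE_A": +1, "NODE_B": -1, "NODE_C": +1}
control = {"NODE_A": 10.0, "NODE_B": 8.0, "NODE_C": 5.0}
treated = {"NODE_A": 40.0, "NODE_B": 2.0, "NODE_C": 5.0}
score = amplitude_score(signature, control, treated)
```

Under this rule, a node that changes in its expected direction contributes positively and a node that changes against it contributes negatively, so the result varies continuously with the size of the observed changes rather than yielding a yes-or-no enrichment call.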
While the above-described techniques provide significant advantages, it is desirable to assess the uncertainty of the computed NPA score across specific experimental conditions to provide a confidence measure for the score. This disclosure describes such a solution.