Acquisition of large-scale data sets representing a variety of data modalities has become a crucial aspect of the characterization of experimental systems. Such a strategy affords a broad capture of biological information in a short time and with a relatively small investment of effort. Rich data sets are collected in the hope that valuable biological insights might be gained. The amount of collected information, however, can be overwhelming, making interpretation of the data difficult and subsequent detailed biological understanding elusive.
Researchers have developed several strategies to address the management of large-scale data sets, and these strategies offer some ability to interpret the data and develop biological insight. Many of these solutions are based on measurement enrichment. For example, Gene Set Enrichment Analysis determines whether members of a gene set tend to occur toward the top (or bottom) of a ranked list, in which case the gene set is correlated with a phenotypic class distinction. Enrichment can also be incorporated with pathway analysis where, for example, specific measurements are associated with elements of a particular biological pathway. In addition to visually connecting measurements in this way, enrichment scores can be generated using a pathway to define the set of genes. Rather than identifying the upstream pathways that lead to the data, many of these enrichment-based solutions interpret the data from a “consequence” point of view, assessing the functional impact of the changes themselves. This approach, however, requires certain assumptions about the data and its impact, such as assuming that mRNA expression is directly correlated with the activity of the encoded protein. In practice, even the correlation of mRNA abundance with the abundance of the encoded protein is variable. Focusing on strictly consequential perspectives also fails to capture a major facet of the data that can be harnessed from an upstream “causal” perspective. Additionally, from a use perspective, the output of many of these existing data interpretation strategies is a measure of statistical enrichment, ultimately yielding a Boolean decision about pathway enrichment/activation rather than a measure of activation intensity.
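To make the enrichment idea concrete, the following is a minimal sketch of an unweighted running-sum enrichment score of the kind described above. It is an illustration only: the function name, the gene identifiers, and the choice of an unweighted statistic are assumptions made here, and the published Gene Set Enrichment Analysis method additionally weights each step by the gene's ranking metric and assesses significance by permutation.

```python
from typing import Sequence, Set


def enrichment_score(ranked_genes: Sequence[str], gene_set: Set[str]) -> float:
    """Unweighted running-sum enrichment score (illustrative sketch).

    Walks down the ranked list, stepping up when a gene belongs to the
    set and down otherwise, and returns the signed deviation from zero
    with the largest magnitude. A positive score means the set clusters
    toward the top of the list; a negative score, toward the bottom.
    """
    hits = [gene in gene_set for gene in ranked_genes]
    n_hit = sum(hits)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("gene set must contain some, but not all, ranked genes")

    running = 0.0
    best = 0.0
    for is_hit in hits:
        # Step sizes are chosen so the walk starts and ends at zero.
        running += 1.0 / n_hit if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


# Hypothetical gene identifiers, ordered from most to least up-regulated.
ranked = ["g1", "g2", "g3", "g4", "g5", "g6"]
top_score = enrichment_score(ranked, {"g1", "g2"})
bottom_score = enrichment_score(ranked, {"g5", "g6"})
```

In this toy ranking, a set concentrated at the top of the list scores positive and a set concentrated at the bottom scores negative. Note that the result describes where the set falls in the ranking; turning it into an enriched/not-enriched call requires a separate significance test.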
Alternative strategies have been described that focus on uncovering a characteristic “signature” of measurements that results from one or more perturbations to a biological process, and subsequently scoring the presence of that signature in additional data sets as a measure of the specific activity of that process. Most previous work of this type involves identifying and scoring signatures that are correlated with a disease phenotype. These phenotype-derived signatures provide significant classification power, but the lack of a mechanistic or causal relationship between a single specific perturbation and the signature means that the signature may represent multiple distinct unknown perturbations that lead to the same disease phenotype. A number of studies, however, have focused instead on measuring causal signatures based on very specific upstream perturbations, either performed directly in the system of interest or drawn from closely related published data. Based on the simple, yet powerful, premise that modulation of cellular pathways and the components therein is associated with distinct signatures in downstream measurable entities, causally derived signatures enable the “cause” of the signature to be identified with high specificity from the measured “effect.” These studies have demonstrated the great potential of applying a causal pathway scoring strategy to clinical problems, for example, by providing prognosis predictions in gastric cancer patients and indications of specific drug efficacy.
Given the vast potential of the information contained within large-scale data sets and the increasing ease of obtaining these data, it is desirable to develop new ways of extracting understanding from these data sets.