Network reconstruction has become an important focus of current research because it bears on many areas of interest, including supply chains, social networks, food chains, and biological systems such as metabolic and gene regulatory networks. For example, a company that manufactures goods depends on a supply chain for both the raw materials and the subcomponents of its final product (or service). The network of suppliers may be a guarded secret of the company and not readily apparent to an outside observer. Nevertheless, it may be desirable to determine that network of suppliers.
Regarding biological systems, a fundamental task has been to understand how networks of genes interact to bring about cellular function. Pursuing this task requires an understanding of both the biology of the individual genes and gene products and the properties of complex networks. For an example illustrating the complexities of network theory, see S. H. Strogatz, Exploring Complex Networks, 410 NATURE 268-76 (2001).
Reverse engineering of chemical and biological networks from protein-protein interactions, medical literature, or time-series data from chemical reaction or gene expression experiments has been a subject of recent study. These efforts have sought to reconstruct the network connectivity, and in some cases the kinetic relations, of the system under study. Four hierarchical levels of reverse engineering can be defined.
Topology is a level of reverse engineering concerned with identifying which nodes interact in the system, the goal being to map or diagram non-directional connections between all interacting nodes. Some examples of methods at this level include literature-based networks and protein-protein interaction maps based on yeast two-hybrid studies.
Topology and causality is a level of reverse engineering encompassing topology, and further encompassing the directionality of the interactions. The goal of this level is to map or diagram the directionality between all directly interacting variables. One example of a method at this level is the mutual information-based reconstruction of mitochondrial metabolic reactions.
Qualitative connections is a level of reverse engineering encompassing topology and causality and also providing a qualitative description of the interactions. More specifically, this level seeks to identify all variables that can modify an output variable, together with a qualitative indicator of how the output will change, i.e., positively or negatively. Some examples of methods at this level include fuzzy logic analysis of facilitator/repressor groups in the yeast cell cycle and a Jacobian matrix elements method for chemical reactions.
Quantitative connections is a level of reverse engineering encompassing qualitative connections and also providing a quantitative description of the interactions. More specifically, for any given variable, this level seeks the mathematical relationship that gives its output as a function of its inputs. The goal of this level of reverse engineering is a set of equations that could simulate and reproduce the behavior of the actual system. Some examples of methods at this level include linear models of gene regulation and genetic algorithms for reconstructing synthetic data generated from E-cells, an in silico representation of an Escherichia coli cell.
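By way of illustration only, the nesting of the four levels can be sketched with a toy network. The three-node system, its node names, and its edge weights below are hypothetical and chosen purely for the example; the weights stand in for coefficients of an assumed linear model and are not taken from any of the methods cited above. Each lower level is obtained by discarding information from the level above it.

```python
# Hypothetical three-node system: A activates B, B represses C.
# Level 4 (quantitative connections): directed edges with signed weights,
# standing in for coefficients of an assumed linear model.
quantitative = {("A", "B"): 0.8, ("B", "C"): -1.5}


def qualitative(weights):
    """Level 3: keep direction and the sign of each effect, drop magnitude."""
    return {edge: ("+" if w > 0 else "-") for edge, w in weights.items()}


def causal(weights):
    """Level 2: keep only the directed edges (source, target)."""
    return set(weights)


def topology(weights):
    """Level 1: keep only which nodes interact, dropping direction."""
    return {frozenset(edge) for edge in weights}


if __name__ == "__main__":
    print(qualitative(quantitative))
    print(causal(quantitative))
    print(topology(quantitative))
```

Each function only removes information, which reflects why the higher levels are said to encompass the lower ones: a quantitative model determines the qualitative, causal, and topological descriptions, but not the reverse.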
At the qualitative connections level, methods for deducing chemical kinetic systems are described in M. Samoilov et al., On the Deduction of Chemical Reaction Pathways From Measurements of Time Series of Concentrations, 11 CHAOS 108-114 (2001). In these methods, some inputs are perturbed while mutual information is computed between all pairs of reactant concentrations, i.e., the nodes in the network. These methods attempt to determine the whole reaction scheme, or map, by considering which nodes are “closer” to or “farther” from each other in a metric-type space, determined by correlating the nodes.
A potential difficulty may arise in such methods when not all reactant concentrations can be determined to the same degree of accuracy. Thus, an encompassing, global reconstruction may not be feasible when one cannot obtain accurate data for all nodes. Moreover, with such methods the input signal must perturb all of the nodes so that the correlations reveal the unknown connections between the nodes. The methods, however, do not provide for determining such an input signal for arbitrary (random) and/or hidden networks.
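The pairwise mutual information computation referred to above can be sketched as follows. This is a minimal histogram-based estimator written for illustration, not the procedure of the cited reference; the equal-width binning, the default of four bins, and the function names are assumptions introduced here.

```python
import math
from collections import Counter


def discretize(series, n_bins):
    """Map each value to an equal-width bin index in [0, n_bins - 1]."""
    lo, hi = min(series), max(series)
    if hi == lo:
        return [0] * len(series)
    width = (hi - lo) / n_bins
    return [min(int((x - lo) / width), n_bins - 1) for x in series]


def mutual_information(xs, ys, n_bins=4):
    """Histogram estimate of mutual information (in bits) of two time series."""
    bx, by = discretize(xs, n_bins), discretize(ys, n_bins)
    n = len(bx)
    px, py, pxy = Counter(bx), Counter(by), Counter(zip(bx, by))
    mi = 0.0
    for (i, j), c in pxy.items():
        # p(i, j) * log2( p(i, j) / (p(i) * p(j)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (px[i] * py[j]))
    return mi


def pairwise_mutual_information(series_by_node, n_bins=4):
    """Mutual information for every unordered pair of nodes."""
    names = sorted(series_by_node)
    return {
        (a, b): mutual_information(series_by_node[a], series_by_node[b], n_bins)
        for k, a in enumerate(names)
        for b in names[k + 1:]
    }
```

Because every pair of nodes enters the computation, an inaccurately measured concentration affects every pairwise value involving that node, which is one way to see the difficulty noted above.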
Another common method of reconstructing biological networks is a Bayesian inference approach, such as that used to analyze gene expression data. See Friedman et al., Using Bayesian Networks to Analyze Expression Data, 7 J. COMPUT. BIOL. 601-620 (2000). The Bayesian inference approach attempts to construct a model that can “explain” data based on conditional probabilities between upstream nodes called “parents” and their dependent “child” nodes. The complete dependency structure is a directed acyclic graph that maps connections between nodes and hence reconstructs a network.
Some properties of the Bayesian approach make it hard to apply to certain systems, such as biological systems. For example, the Bayesian approach assumes that an infinite number of models can explain the data, and only the probability of any given model being the correct model can be determined. Although one could simply accept the most probable model, in practice the data may be quite noisy, and several very different models may have essentially the same ability to explain the data. Hence, a unique solution would not be found.
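The non-uniqueness described above can be illustrated with a minimal sketch that scores candidate structures by their maximum-likelihood log score on discrete data. The two-node data set, the variable names A and B, and the scoring function are hypothetical and written only for this example; practical Bayesian network learners typically use penalized or Bayesian scores rather than the raw likelihood used here.

```python
import math
from collections import Counter


def log_likelihood(data, parents):
    """Maximum-likelihood log score of discrete data under a given DAG.

    data:    list of dicts mapping variable name -> discrete value
    parents: dict mapping each variable name -> tuple of its parent names
    """
    total = 0.0
    for var, pa in parents.items():
        # Counts of (parent configuration, child value) and of parent configuration.
        joint = Counter((tuple(row[p] for p in pa), row[var]) for row in data)
        marginal = Counter(tuple(row[p] for p in pa) for row in data)
        for (pa_vals, _), count in joint.items():
            total += count * math.log(count / marginal[pa_vals])
    return total


# Hypothetical observations in which A and B tend to agree.
data = (
    [{"A": 0, "B": 0}] * 4
    + [{"A": 0, "B": 1}] * 1
    + [{"A": 1, "B": 0}] * 1
    + [{"A": 1, "B": 1}] * 4
)

a_to_b = {"A": (), "B": ("A",)}      # A is the parent of B
b_to_a = {"B": (), "A": ("B",)}      # B is the parent of A
independent = {"A": (), "B": ()}     # no edge

if __name__ == "__main__":
    for name, dag in [("A->B", a_to_b), ("B->A", b_to_a), ("none", independent)]:
        print(name, log_likelihood(data, dag))
```

In this sketch the two opposite edge directions receive the same score, and both score higher than the model with no edge, so the data support a connection between A and B without distinguishing which node is the parent.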
Bayesian networks, in principle, can handle continuous-valued variables, such as those found in biological networks. However, in practice, data (e.g., mRNA levels) must be discretized to allow for the computation of joint probabilities between input variables. The optimal discretization method is not easily discernible and must balance more faithful representation of the input data (many fine bins) against better estimation of joint probabilities (fewer large bins). Another problem may arise if feedback loops exist in the biological system, because the inferred Bayesian networks must be acyclic and hence cannot represent loops. In principle, this can be solved with dynamic Bayesian networks that can “unroll” loops. However, in practice, the amount of data needed to pursue this approach is currently infeasible to collect. Indeed, current approaches have considerable trouble constraining static Bayesian networks, and the use of dynamic Bayesian networks for biological data has not been reported.
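The discretization trade-off noted above can be made concrete with a short sketch. The equal-width binning, the synthetic series standing in for expression levels, and the assumption of three jointly tabulated variables are all hypothetical choices made for illustration only.

```python
def discretize(values, n_bins):
    """Equal-width binning; returns bin indices and the bin-center value of each sample."""
    lo, hi = min(values), max(values)
    width = ((hi - lo) / n_bins) or 1.0  # avoid a zero width for constant data
    indices = [min(int((v - lo) / width), n_bins - 1) for v in values]
    centers = [lo + (i + 0.5) * width for i in indices]
    return indices, centers


def tradeoff(values, n_bins, n_variables):
    """Return (mean absolute discretization error, mean samples per joint cell).

    n_variables is the number of variables tabulated jointly, so the joint
    table has n_bins ** n_variables cells.
    """
    _, centers = discretize(values, n_bins)
    error = sum(abs(v - c) for v, c in zip(values, centers)) / len(values)
    samples_per_cell = len(values) / n_bins ** n_variables
    return error, samples_per_cell


# Hypothetical measurements standing in for 40 expression-level samples.
values = [i / 10 for i in range(40)]

if __name__ == "__main__":
    for n_bins in (2, 4, 8):
        print(n_bins, tradeoff(values, n_bins, n_variables=3))
```

As the number of bins grows, the discretization error shrinks while the average number of samples available per joint cell falls rapidly, so neither extreme is clearly preferable for a fixed amount of data.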
While some methods exist to reconstruct networks, they have important limitations in the topologies of the networks that can be reconstructed, in the amount of data required and/or the difficulty of collecting it, and in the uniqueness of the reconstructed solutions. Thus, it would be desirable to have techniques for reverse engineering arbitrary and/or hidden networks, such as supply chain and biological networks, accurately and efficiently with data sets that can be reasonably collected.