The present invention relates generally to the creation of a network model of a dynamic system of interdependent variables from observed system transitions. More particularly, the present invention creates a network model from observed data by calculating a probability distribution over temporally dependent or high-order Markovian, Boolean or multilevel logic network rules to represent a dynamic system of interdependent variables from a plurality of observed, possibly noisy, state transitions.
Genetic regulatory networks (GRN) model systems of interdependent variables which change over time. A GRN comprises a plurality of variables, a system state defined as the value of the plurality of variables, and a plurality of regulatory rules corresponding to the plurality of variables which determine the next system state from previous system states. GRNs can model a wide variety of real world systems such as the interacting components of a company, the conflicts between different members of an economy and the interaction and expression of genes in cells and organisms.
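The three ingredients named above (variables, a system state, and one regulatory rule per variable) can be illustrated with a minimal sketch. It assumes Boolean variables, synchronous updating, and rules that read only the immediately preceding state; the three rules and the names `RULES`, `step` and `trajectory` are hypothetical and chosen only for illustration.

```python
from typing import Callable, List, Tuple

State = Tuple[int, ...]

# Hypothetical three-variable network. Rule i computes the next value of
# variable x_i from the previous system state.
RULES: Tuple[Callable[[State], int], ...] = (
    lambda s: s[2],          # x0 copies x2
    lambda s: s[0] & s[2],   # x1 = x0 AND x2
    lambda s: 1 - s[0],      # x2 = NOT x0
)


def step(state: State) -> State:
    """Apply every rule to the previous state to obtain the next state."""
    return tuple(rule(state) for rule in RULES)


def trajectory(state: State, n_steps: int) -> List[State]:
    """Return the initial state followed by n_steps successive states."""
    states = [state]
    for _ in range(n_steps):
        state = step(state)
        states.append(state)
    return states
```

Calling `trajectory((1, 0, 0), 3)` yields the kind of sequence of observed system states from which the rules would have to be inferred when they are not known in advance.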
Genes contain the information for constructing and maintaining the molecular components of a living organism. Genes directly encode the proteins which make up cells and synthesize all other building blocks and signaling molecules necessary for life. During development, the unfolding of a genetic program controls the proliferation and differentiation of cells into tissues. Since the function of a protein depends on its structure, and hence on its amino acid sequence and the corresponding gene sequence, the pattern of gene expression determines cell function and hence the cell's system state and the rules by which the state is changed.
In a GRN representing the interaction and expression of genes in cells and single-cell organisms, the variables represent the activation states of the genes, measured by the number of messenger RNA (mRNA) transcripts of the gene made per unit time or the number of proteins translated from the mRNAs per unit time. The regulatory rules are determined by the transcription regulatory sites next to each gene and the interactions between the gene products and these sites. The binding of molecules to these sites in various combinations and concentrations determines the degree of expression of the corresponding gene. Since these molecules are proteins or RNAs made by other genes, the network rules are functions of the activation states of the genes which they control. Genes are constantly exposed to varying concentrations of these controlling substances, so such a system can be considered as a GRN with an asynchronous, continuous time update rule.
The GRN of a system is the most critical information needed to diagnose and control it. In a complex system with many interacting components and rules, it is generally impossible to tell which components are the most critical without knowing its GRN. FIG. 1 illustrates this concept with a simple genetic regulatory network with N=3 genes. The arrows show functional dependence. The genes are updated continuously with gene A dependent on time and genes B and C dependent on the gene states. The update rules are A(t)=sin(wt)+0.5, B(t)=C(t)A(t), C(t)=e^(−A(t)). A naive view of the system in FIG. 1 might be that inhibiting the activation of A will lessen the activation of B since B is activated in proportion to A's activation. However, inhibiting A activates C, which in turn increases the activation of B. In fact, as A is activated from being off, B is first activated with increasing A and then, past a threshold value of A, deactivated with increasing A. Accordingly, without knowledge of the GRN, knowledge of the direct relationship between two genes is insufficient to see how one influences the other.
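The non-monotonic dependence of B on A can be checked numerically. The sketch below takes the rule for C to be C(t) = e^(−A(t)), so that B = A·e^(−A); the function names and the default frequency `w=1.0` are assumptions made for illustration only.

```python
import math


def gene_a(t: float, w: float = 1.0) -> float:
    """A(t) = sin(w t) + 0.5; the frequency w is an assumed parameter."""
    return math.sin(w * t) + 0.5


def gene_c(a: float) -> float:
    """C = e^(-A): inhibiting A raises C."""
    return math.exp(-a)


def gene_b(a: float) -> float:
    """B = C * A = A * e^(-A)."""
    return gene_c(a) * a


# B rises while A is below 1 and falls once A exceeds 1.
for a in (0.0, 0.5, 1.0, 1.5, 2.0):
    print(f"A={a:.1f}  C={gene_c(a):.3f}  B={gene_b(a):.3f}")
```

Under this reading, dB/dA = (1 − A)·e^(−A), so the threshold value of A mentioned above is A = 1, where B reaches its maximum of e^(−1), approximately 0.368.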
Identifying the GRN representing the interaction and expression of genes in a class of cells is of fundamental importance for medical diagnostic and therapeutic purposes. For example, normal and cancerous cells may have identical surface markers and surface receptors and can be difficult to distinguish with chemotherapeutic agents. A GRN model of the interaction and expression of genes in the cells can indicate functional differences between normal and cancerous cells that provide a basis for differentiation not dependent on cell surface markers. The GRN also provides a means to identify the receptors or genetic targets to which molecule design techniques such as combinatorial chemistry and high throughput screening should be directed to achieve given functional effects. Such techniques are frequently used now, and pharmaceutical and biotechnology companies suffer from uncertainty as to which targets and receptors are worthy of study. The approach described below can greatly assist in this process. See Gene Regulation and the Origin of Cancer: A New Method, A. Shah, Medical Hypotheses (1995) 45, 398-402 and Cancer progression: The Ultimate Challenge, Renato Dulbecco, Int. J. Cancer: Supplement 4, 6-9 (1989).
Generally, one does not have direct knowledge of the GRN regulatory rules and can only view the system states over time or after controlled interventions. The problem of inferring the regulatory rules from a succession of observed system states is known as the inverse genetic regulatory network problem. The inverse genetic regulatory network problem presents many difficulties. First, genes are expressed to different degrees and their protein and mRNA products vary over a wide range of concentrations. Second, genes express their products at varying times after their activation. Third, genes are activated at any time, not at regular time intervals. Fourth, the genes in a genome are not activated simultaneously, whatever the regularity of activation time intervals. Fifth, genetic systems have many influences which are unknown or cannot be modeled. Sixth, the interaction and expression of genes in cells and organisms are often stochastic processes. Seventh, measurement error often hinders proper measurement of the system states. Eighth, in such large systems as mammalian genomes, it is not presently possible to measure the expression levels of all genes or relevant molecules. Likewise, it is not possible to acquire state transition information by putting cells into arbitrary genetic expression system states to observe their reactions, as most gene expression patterns will kill a cell. Thus, the inverse GRN problem can be extremely difficult.
Previous solutions to the inverse genetic regulatory network problem address some of these difficulties to varying degrees. One solution deals with the asynchronous nature of the system by modeling rule behaviors with "logical parameters," which serve to determine the relative strength and, hence, activation order of the network rules. See E. H. Snoussi and R. Thomas, Logical identification of all steady states: The concept of feedback loop characteristic states, Bul. Math. Biol., 55:973-991, (1993). While logical parameters mixed with multilevel logic work well at modeling asynchronous systems with time lags, they cause the discrete time steps of the model to correspond to different actual time periods, depending on which gene functions are being activated. This effect makes converting the number of transitions to actual elapsed time difficult or impossible and limits the utility of this approach in analyzing real data.
Other approaches can create complete GRNs when only a fraction of the state transitions are observed. One such approach assumes a particular form of the regulatory rules and modifies them in a particular fashion until they match the observed transitions. See, R. Somogyi, S. Fuhrman, M. Ashkenazi, and A. Wuensche. The gene expression matrix: Towards the extraction of genetic network architectures. In Proc. of the Second World Congress of Nonlinear Analysis (WCNA96). Elsevier Science, 1996. Another such approach uses the "mutual information" between sites, a statistical measure of how the value of one site dictates that of another, to build a map of the network rules. See, S. Liang, REVEAL, A general reverse engineering algorithm for inference of genetic network architectures. In Pacific Symposium on Biocomputing, Vol. 3, pp. 18-29, 1998. However, these approaches are unable to distinguish between known and unknown transitions. With the limited data available from genetic experiments, the vast majority of transitions between states will be unknown, and these methods will yield networks containing many incorrect, tacitly assumed transitions for the states not in the data. Hence, this inability leads to dangerously misleading results when applying the resulting network, such as when inferring the controlling regulatory sites in a cancerous cell.
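For reference, the mutual information statistic mentioned above can be computed from paired observations of two sites as follows. This is a minimal sketch of the standard empirical estimate in bits, not a reproduction of the cited REVEAL algorithm; the function name and the example sequences are hypothetical.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def mutual_information(xs: Sequence[Hashable], ys: Sequence[Hashable]) -> float:
    """Empirical mutual information (in bits) between two paired sequences."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in count_xy.items():
        p_xy = c / n
        p_x = count_x[x] / n
        p_y = count_y[y] / n
        mi += p_xy * math.log2(p_xy / (p_x * p_y))
    return mi


# One site fully determines the other: 1 bit.
print(mutual_information([0, 1, 0, 1], [0, 1, 0, 1]))
# The two sites vary independently in the sample: 0 bits.
print(mutual_information([0, 0, 1, 1], [0, 1, 0, 1]))
```

The estimate is computed only from state pairs that appear in the sample, so by itself it carries no indication of which transitions were never observed.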
Accordingly, there exists a need for a method to create network models of dynamic systems of interdependent variables from observed system transitions that can:
(1) operate with data consisting of only a fraction of all possible transitions;
(2) accommodate measurement error on these transitions;
(3) produce a probability distribution over network functions (rather than simply giving one set of network functions that match the data);
(4) support asynchronous activation times for different genes;
(5) support varying delays between gene activation and gene expression for different genes;
(6) accommodate the stochastic nature of gene network operations; and
(7) support varying degrees of gene activation (not simply Boolean activation states).
The present invention provides a method to create a Boolean or multilevel logic network model of a dynamic system of interdependent variables from observed system state transitions that can:
(1) operate with data consisting of only a fraction of all possible transitions;
(2) accommodate measurement error on these transitions;
(3) produce a probability distribution over network functions (rather than simply giving one set of network functions that match the data);
(4) support asynchronous activation times for different genes;
(5) support varying delays between gene activation and gene expression for different genes;
(6) accommodate the stochastic nature of gene network operations;
(7) support varying degrees of gene activation (not simply Boolean activation states); and
(8) incorporate prior knowledge of the nature and limitations of the actual network functions being modeled.
It is an aspect of the present invention to provide a method for creating a network model of a real world system of interacting components having a plurality of expression levels from prior knowledge and a plurality of observations of the real world system comprising the steps of:
defining N network variables x_i, i=0, . . . , N−1, having values v_i, i=1, . . . , m, to represent the components of the real world system, said values defining a network state of said network model;
defining N network rules ƒ_i, i=0, . . . , N−1, corresponding to said N network variables from a space of possible network rules wherein said N network rules have outputs defining said network model;
defining N prior distributions corresponding to said N network rules expressing probabilities of said possible network rules from the prior knowledge of the real world system;
defining at least one likelihood function expressing the probability of making the system observations for said possible network rules; and
defining N posterior distributions as a product of said prior distributions and said at least one likelihood function wherein said posterior distributions express the probabilities of said possible network rules given the prior knowledge and the plurality of observations of the real world system.
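The steps above can be sketched for a single network variable as follows. The sketch makes several assumptions that the text does not fix: variables are Boolean, each rule reads only the immediately preceding state, the space of possible rules is every truth table over the inputs, and the likelihood treats each observed value as flipped independently with probability `noise`. All function names are hypothetical.

```python
from itertools import product
from typing import Dict, List, Sequence, Tuple

State = Tuple[int, ...]
Rule = Dict[State, int]                  # truth table: previous state -> next value
Transition = Tuple[State, State]         # (previous state, observed next state)


def all_rules(n_inputs: int) -> List[Rule]:
    """Enumerate the space of possible Boolean rules on n_inputs inputs."""
    states = list(product((0, 1), repeat=n_inputs))
    return [dict(zip(states, outputs))
            for outputs in product((0, 1), repeat=len(states))]


def likelihood(rule: Rule, transitions: Sequence[Transition],
               target: int, noise: float) -> float:
    """Probability of the observed next values of variable `target` under `rule`,
    assuming each observation is wrong independently with probability `noise`."""
    p = 1.0
    for prev, nxt in transitions:
        p *= (1.0 - noise) if rule[prev] == nxt[target] else noise
    return p


def posterior(rules: Sequence[Rule], prior: Sequence[float],
              transitions: Sequence[Transition], target: int,
              noise: float = 0.1) -> List[float]:
    """Posterior over rules: prior times likelihood, normalized to sum to 1."""
    weights = [pr * likelihood(rule, transitions, target, noise)
               for rule, pr in zip(rules, prior)]
    total = sum(weights)
    return [w / total for w in weights]


rules = all_rules(2)
uniform_prior = [1.0 / len(rules)] * len(rules)
# Only three of the four possible previous states are observed.
observed = [((0, 0), (0, 1)), ((0, 1), (1, 1)), ((1, 1), (1, 0))]
post = posterior(rules, uniform_prior, observed, target=0)
```

Because the state (1, 0) never appears as a previous state in `observed`, the two rules that agree with every observation but differ on (1, 0) receive equal posterior probability, rather than one of them being silently selected.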
It is a further aspect of the present invention to provide a method wherein said defining N prior distributions step comprises the step of defining said N prior probability distributions as an expression of a plurality of abstract properties of said N network rules.
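As one illustration of a prior expressed through an abstract property of the rules, the sketch below weights each Boolean rule by the number of inputs it actually depends on. The choice of property, the exponential form, and the `penalty` parameter are all assumptions made for this example; the text above does not specify which properties are used.

```python
import math
from itertools import product
from typing import Dict, List, Sequence, Tuple

State = Tuple[int, ...]
Rule = Dict[State, int]


def all_rules(n_inputs: int) -> List[Rule]:
    """Enumerate every Boolean truth table on n_inputs inputs."""
    states = list(product((0, 1), repeat=n_inputs))
    return [dict(zip(states, outputs))
            for outputs in product((0, 1), repeat=len(states))]


def effective_inputs(rule: Rule, n_inputs: int) -> int:
    """Count the inputs whose value changes the rule's output for some state."""
    count = 0
    for i in range(n_inputs):
        for state in product((0, 1), repeat=n_inputs):
            flipped = state[:i] + (1 - state[i],) + state[i + 1:]
            if rule[state] != rule[flipped]:
                count += 1
                break
    return count


def property_prior(rules: Sequence[Rule], n_inputs: int,
                   penalty: float = 1.0) -> List[float]:
    """Hypothetical prior: weight exp(-penalty * effective inputs), normalized."""
    weights = [math.exp(-penalty * effective_inputs(rule, n_inputs))
               for rule in rules]
    total = sum(weights)
    return [w / total for w in weights]
```

With two inputs, this assigns the highest prior probability to the two constant rules, a lower probability to the four rules that depend on a single input, and the lowest to the ten rules that depend on both.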
It is a further aspect of the present invention to provide a method for creating a network model of a real world system of interacting components having a plurality of expression levels from prior knowledge and a plurality of observations of the real world system comprising the steps of:
defining N network variables x_i, i=0, . . . , N−1, having values v_i, i=1, . . . , m, to represent the components of the real world system, said values defining a network state of said network model;
defining N activation delays T_i^(0→1), i=0, . . . , N−1, corresponding to said N network variables x_i, i=0, . . . , N−1, from a space of possible activation delays;
defining N deactivation delays T_i^(1→0), i=0, . . . , N−1, corresponding to said N network variables x_i, i=0, . . . , N−1, from a space of possible deactivation delays;
defining N network rules ƒ_i, i=0, . . . , N−1, corresponding to said N network variables from a space of possible network rules wherein said N network rules have outputs defining said network model; and
defining at least one posterior distribution wherein said posterior distribution expresses the probabilities of said possible network rules, said possible activation delays and said possible deactivation delays given the prior knowledge and the plurality of observations of the real world system.
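The role of the activation and deactivation delays can be sketched for one Boolean variable. The delay semantics used here are an assumption: the variable switches on only after its rule output has been 1 for `t_on` consecutive steps, and off only after the output has been 0 for `t_off` consecutive steps. For brevity the rule output (`drive`) is taken as given, the prior over delays is uniform, and observations are flipped independently with probability `noise`; a fuller treatment would place the posterior jointly over rules and delays. All names are hypothetical.

```python
from itertools import product
from typing import Dict, List, Sequence, Tuple


def simulate(drive: Sequence[int], t_on: int, t_off: int, x0: int = 0) -> List[int]:
    """Values of a delayed variable given its rule output at each step."""
    x, run, out = x0, 0, []
    for d in drive:
        if d != x:
            run += 1                              # steps the rule has disagreed with x
            if run >= (t_on if d == 1 else t_off):
                x, run = d, 0                     # delay elapsed: switch state
        else:
            run = 0                               # rule agrees with x: reset counter
        out.append(x)
    return out


def delay_posterior(drive: Sequence[int], observed: Sequence[int],
                    max_delay: int = 4, noise: float = 0.1
                    ) -> Dict[Tuple[int, int], float]:
    """Posterior over (t_on, t_off) pairs under a uniform prior."""
    weights = {}
    for t_on, t_off in product(range(1, max_delay + 1), repeat=2):
        predicted = simulate(drive, t_on, t_off)
        like = 1.0
        for p, o in zip(predicted, observed):
            like *= (1.0 - noise) if p == o else noise
        weights[(t_on, t_off)] = like
    total = sum(weights.values())
    return {pair: w / total for pair, w in weights.items()}


drive = [1, 1, 1, 0, 0, 0, 0]
observed = simulate(drive, t_on=2, t_off=3)
post = delay_posterior(drive, observed)
best = max(post, key=post.get)
```

In this example the observations are generated with an activation delay of 2 and a deactivation delay of 3, and `best` recovers that pair; the remaining probability mass is spread over the other delay pairs according to how many observations each one mispredicts.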