1. Field of the Invention
The present invention relates to bioinformatics and specifically to the representation and modeling of the function of biological systems and other systems with components that interact in a network.
2. Description of the Related Art
Some researchers believe that the behavior of biological systems may be best understood as an abstract computation. In this view, the units of biological heredity are packets of information, and the cell's metabolic machinery is a layer of computation evolved with the goal of replicating the data stored in the hereditary material. This point of view has some antecedents in the biological literature, but is not prevalent.
Certain researchers also believe that the basic operations of a cell, namely the joining, transformation, and splitting of molecules, are strongly analogous to the process of flipping and copying bits of information in a computer's memory. It is well known to practitioners of computer science and the mathematics of computation that both processes, while simple in isolation, can produce arbitrarily complex behavior when integrated into a network.
The basic interactions of metabolism (in this disclosure, metabolism refers to all activity of proteins in the cell, not just the activity traditionally described as metabolic in the biological literature) differ from the basic interactions of an electronic computer because the cell has many different types of molecules, and each type carries information about the state of the cell. The electronic computer has a homogeneous information carrier, the bit, and its fundamental interaction can be thought of as uniform, such as that described by the operation of a logical NAND gate. To describe biological networks, the language used must reflect the peculiar nature of the computational machinery of the cell.
The basic interactions of molecules, the reactions that are analogous to logical operations, are the binding and unbinding of one molecule to another. Previous attempts to provide a standard notation for interaction networks represented interactions at the most basic level. This is analogous to a computer language composed of logical single-bit instructions only. The most comprehensive previous attempt to describe the function of biomolecules is Kurt W. Kohn, Molecular Interaction Map of the Mammalian Cell Cycle Control and DNA Repair Systems, Molecular Biology of the Cell, 10:2703–2734, August 1999, incorporated herein by reference in its entirety. Kohn's technique is sufficient for annotating very simple networks, but it is apparent that a higher level of description is required for compactness and readability when attempting to annotate and model complex networks.
Kohn's notation describes all the reactions in a network one by one, and differs very little from a listing of all the chemical reactions in a biological system. This method of description has shortcomings that are shared by all other existing methods, and that render them incapable of representing biological function in any meaningful way.
First, existing notations are over-specified. When several molecules combine, the result does not necessarily depend on the order in which they bind. In Kohn's notation, binding is a primitive object, which means that when more than two objects come together, the collection is represented by a sequence of bindings. This is a disadvantage, because the order of binding might not be known, and it might not be important. The description requires the user to specify more than is necessary to describe the function of the network.
Thus, even very simple systems require a very verbose description. A molecule with three binding sites which can bind three molecules independently is represented by a forest of lines in Kohn's notation (see FIG. 1 of the attached drawings). The situation is no better in other notations. For example, a listing of all the chemical reactions in chemist's notation will be just as verbose.
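By way of illustration only, the growth in verbosity can be sketched by counting the entries that a reaction-by-reaction listing would need for a molecule with independent binding sites. The function name `enumerate_binding_reactions` and the representation of a state as a tuple of occupied/empty flags are assumptions made for this sketch; they are not taken from Kohn's notation or from any existing tool.

```python
from itertools import product


def enumerate_binding_reactions(n_sites):
    """List every single-site binding reaction for a molecule with n_sites sites.

    A state is a tuple of booleans, one per site (True means occupied).
    Each reaction fills exactly one empty site and is returned as a
    (state_before, state_after) pair. This is a hypothetical helper for
    illustration, not part of any published notation.
    """
    reactions = []
    for state in product((False, True), repeat=n_sites):
        for site, occupied in enumerate(state):
            if not occupied:
                after = state[:site] + (True,) + state[site + 1:]
                reactions.append((state, after))
    return reactions


reactions = enumerate_binding_reactions(3)
states = {state for pair in reactions for state in pair}
print(len(states), "states,", len(reactions), "binding reactions")
# prints "8 states, 12 binding reactions"
```

Under these assumptions, three independent sites already require twelve separate binding entries (and as many unbinding entries), although only three site-level facts are being stated. In general the count is n times 2 to the power (n - 1) for n sites.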
Additionally, existing notations are redundant: they offer no way to state that several different molecules act in the same way, so the same bindings must be specified again and again. This leads to overestimating the number of parameters required to describe the system, because analogies between molecules are missed. For example, if three molecules A, B, and C are similar and bind to D in the same way, only two parameters are needed to describe the binding: the shared binding and unbinding rates. A conventional description would count six independent parameters.
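The parameter count in the example of A, B, and C binding to D can be checked with a short sketch. The function `count_rate_parameters` and the rate-class labels are hypothetical names introduced here; the only assumption carried over from the text is that each distinct kind of binding contributes one binding rate and one unbinding rate.

```python
def count_rate_parameters(rate_class_of):
    """Return the number of independent rate parameters in a model.

    rate_class_of maps each (ligand, target) binding to a label naming
    the kind of binding it is. Each distinct label contributes two
    parameters: one binding rate and one unbinding rate.
    """
    return 2 * len(set(rate_class_of.values()))


# Reaction-by-reaction listing: every binding is treated as unrelated.
separate = {("A", "D"): "A-D", ("B", "D"): "B-D", ("C", "D"): "C-D"}

# Description that records the analogy: all three bind D the same way.
shared = {("A", "D"): "X-D", ("B", "D"): "X-D", ("C", "D"): "X-D"}

print(count_rate_parameters(separate))  # prints 6
print(count_rate_parameters(shared))    # prints 2
```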
Additionally, existing notations are limited in their domain of applicability—there is no way to define modules or higher order structure in existing notations, and abstract, high-level function cannot be described in the same way that low-level functions such as binding and unbinding are described.
Kohn's notation, and the more primitive notations traditionally used in biochemistry, suffer from these problems because they are notations, not true languages. In this disclosure, the word “notation” will be used for systems of symbols in one-to-one correspondence with the objects they represent. One piece of a notation can be translated into meaning by referencing only a bounded number of other pieces. In the computer science literature, notations are known as finite-state-automaton languages, or regular-expression languages. In the linguistics literature, they are known as type 3 languages. An example of a notation is the circuit diagram of electronics. These structures may be translated into circuit elements and wires by a direct map: a line is a wire, a pair of parallel lines is a capacitor, and so on. The word “language” is reserved for a system of symbols where the algorithm to produce the meaning requires reference to an arbitrarily large chunk of the utterance during the process of translation. This is a qualitative, not quantitative, difference, and it is well known to computer scientists. An example of a language is English, where the structure of a sentence, with its many parts that sometimes nest deeply and other times do not, is hierarchical. In linguistics, such languages are known as type 2, type 1, and type 0 languages, and in computer science they are known as context-free (pushdown-automaton) languages, context-sensitive (linear-bounded-automaton) languages, and general-automaton languages, respectively. Each of these classes contains the previous one, so that a context-free language is a special case of a context-sensitive language. For purposes of clarity, we will call a system of symbols that requires an automaton more complicated than a finite-state automaton a language. A language is different from a notation because, among other things, a language requires more than just a map of symbols to meanings to be understood.
A language requires a recursive algorithm to comprehend it. A language is a “formal language” when the algorithm for understanding all its utterances is known and can be executed by a mechanical or electronic device, such as a general purpose computer. English is not a formal language in this sense, but the computer language FORTRAN is. Unless indicated otherwise, the word “language” will mean a formal language that requires more than a finite-state automaton to parse.
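The distinction between a direct symbol-to-meaning map and a recursive algorithm can be illustrated with a toy example. The bracketed syntax below, in which parentheses group named items to arbitrary depth, is an assumption made purely for this sketch and is not the language of the present disclosure; the function names `parse_group` and `parse` are likewise hypothetical. Recovering the grouping requires the parser to call itself, because the closing parenthesis that matches a given opening parenthesis may lie arbitrarily far away.

```python
def parse_group(text, pos=0):
    """Parse one parenthesized group starting at text[pos].

    Returns (tree, next_pos), where tree is a nested list of names and
    sub-groups. The function calls itself for each inner group, which is
    the recursion that a fixed symbol-to-meaning lookup cannot supply.
    """
    if pos >= len(text) or text[pos] != "(":
        raise ValueError(f"expected '(' at position {pos}")
    pos += 1
    children = []
    while pos < len(text) and text[pos] != ")":
        if text[pos] == "(":
            child, pos = parse_group(text, pos)  # recursive descent
            children.append(child)
        elif text[pos].isspace():
            pos += 1
        else:
            start = pos
            while pos < len(text) and text[pos] not in "() \t\n":
                pos += 1
            children.append(text[start:pos])
    if pos >= len(text):
        raise ValueError("unbalanced input: missing ')'")
    return children, pos + 1


def parse(text):
    """Parse a complete utterance consisting of a single outermost group."""
    text = text.strip()
    tree, end = parse_group(text)
    if end != len(text):
        raise ValueError("trailing input after the outermost group")
    return tree


print(parse("(A (B C) (D (E)))"))
# prints ['A', ['B', 'C'], ['D', ['E']]]
```

A circuit-diagram style of translation would map each symbol to an element independently. Here the meaning of each parenthesis depends on how many groups are currently open, a quantity that is unbounded and therefore beyond a finite-state automaton.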