An important challenge in understanding biological systems and in developing diagnostics, prognostics, and predictors of drug response for complex, multi-factorial diseases is the identification and validation of biomarker profiles or “surrogate markers.” In many contexts, biomarker patterns or sets of biomolecules appear to carry more information than single markers. Development of such profiles will permit, among other utilities, a physician to characterize and diagnose homeostasis or disease states in patients. Typically, molecules from multiple levels of molecular biology, e.g., the polynucleotide (DNA or RNA), polypeptide, and metabolite levels, of the biological system under study, e.g., a human, will be considered simultaneously in the analysis.
The protocols for finding such biomarkers broadly involve analysis of the biomolecular content of a sample, such as a body fluid or stool, in a number of patients, including (1) a test group diagnosed as actually being in the biological state under study (e.g., suffering from a disease such as cirrhosis of the liver, or having a successful response to an experimental drug); and (2) a control group matched as closely as possible to the test group in terms of sex, race, diet, etc., but known not to be in the biological state under study. The analytical procedure is designed to find, ideally, one, but typically a plurality of, biomolecules in the biological samples from patients in the test group that are not present in the control group, or perhaps more typically, that vary reliably in abundance as compared with the control group. This biomolecule or set of biomolecules, or some relationship among the concentrations of the biomolecules, is taken as a candidate biomarker or profile of the disease. The candidate in turn is validated by analysis of patients with and without the disease to determine the sensitivity and specificity of the marker.
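The case-versus-control screen described above can be sketched in outline; the data, names, and the crude effect-size threshold below are illustrative assumptions, not a prescribed statistical procedure:

```python
from statistics import mean, stdev

def candidate_markers(test, control, min_effect=2.0):
    """Flag molecules whose mean abundance in the test group differs from
    the control mean by more than min_effect control standard deviations --
    a crude stand-in for a proper statistical test."""
    flagged = {}
    for molecule in test:
        t_mean = mean(test[molecule])
        c_mean = mean(control[molecule])
        c_sd = stdev(control[molecule]) or 1e-9  # guard against zero spread
        score = abs(t_mean - c_mean) / c_sd
        if score > min_effect:
            flagged[molecule] = score
    return flagged

# Hypothetical abundance measurements per patient group.
test = {"protein X": [9.1, 8.7, 9.4], "metabolite Y": [1.0, 1.1, 0.9]}
control = {"protein X": [5.0, 5.2, 4.8], "metabolite Y": [1.0, 0.9, 1.1]}
markers = candidate_markers(test, control)
```

Here "protein X" would be flagged as a candidate marker while "metabolite Y", which does not differ between groups, would not; a real screen would of course use validated statistics and far larger cohorts.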
In practice, many such putative profiles or biomarkers are found upon validation testing to have poor sensitivity or specificity or both, often so poor as to be essentially worthless as a basis for a diagnostic or prognostic test. There are many reasons for this, rooted in the complexity of human physiology and biochemistry. The markers may or may not be closely connected to the biology of the disease, and therefore often are unreliable. However, if such empirically determined profiles could be examined for biological state relevance before the costly validation step, the discovery of truly informative biomarker profiles would be facilitated. Furthermore, if a system could be devised that can discern the fundamental difference between biological states of a cell, tissue, organ, organ system, or organism, such as diseased and healthy states, then the biomolecular changes discovered to be inherent in the change from a healthy state to a diseased state could be used as clues to what biomolecular changes should occur in the blood (or urine, saliva, stool, CSF, tears, etc.). This would provide a theoretical basis for biomarker development, and remove it from the purely empirical realm.
The amount of biological information currently generated per unit time is increasing dramatically. It is estimated that the amount of information now doubles every four to five years. Because of the large amount of information that must be processed and analyzed, traditional methods of analyzing and understanding the meaning of information in the life science-related areas are breaking down. Statistical techniques, while useful, do not provide a biologically motivated explanation of function.
There are ongoing attempts to produce electronic models of biological systems designed to facilitate biological analysis. These involve compilation and organization of enormous amounts of data, and construction of a system that can operate on the data to simulate the behavior of a biological system. Because of the complexity of biology, and the sheer volume of data, the construction of such a system can take hundreds of man-years and multiple tens of millions of dollars. Furthermore, those seeking new insights and new knowledge in the life sciences are presented with the ever more difficult task of selecting the right data from within mountains of information gleaned from vastly different sources. Such knowledge bases, once enabled, could be used to discern the fundamental difference between, for example, diseased and healthy states of a tissue, organ, or organism, a successfully or unsuccessfully drugged organism, or a person who will benefit from a drug and one who will not, and theoretically could be valuable in the task of biomarker discovery.
One useful development in this area is disclosed in co-pending U.S. application Ser. No. 10/644,582 filed Aug. 20, 2003 (U.S. patent application Publication Number US2005-0038608A1) entitled “System, Method and Apparatus for Assembling and Mining Life Science Data,” the disclosure of which is incorporated herein by reference. This application discloses and enables exploitation of a new paradigm for the recording, organization, access, and application of life science data. The method and program enable establishment and ongoing development of a systematic, ontologically consistent, flexible, optimally accessible, evolving, organic life science knowledge base which can store biological information of many different types, from many different sources, and represent many types of relationships within the life science information. Furthermore, the knowledge base places life science information into a form that exposes the relationships within the information, facilitates efficient knowledge mining, and makes the information more readily comprehensible and available. This knowledge base is structured as a multiplicity of nodes indicative of life science knowledge using a life science taxonomy. Relationship descriptors are assigned to pairs of nodes that correspond to a relationship between the pair, and may themselves comprise nodes. A very large number of nodes are assembled to form the electronic data base, such that every node is joined to at least one other node. It was envisioned that the knowledge base could eventually incorporate the entirety of human life science knowledge from its finest detail to its global effect, and incorporate an endless diversity of biological relationships in thousands of other organisms. 
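The node-and-link structure described above can be sketched as a small directed graph; all class and field names here are hypothetical illustrations, not the structure disclosed in the '582 application:

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Node:
    """One unit of life science knowledge, e.g. an entity or activity."""
    name: str
    kind: str  # e.g. "protein", "activity", "concept"

@dataclass
class KnowledgeBase:
    """Nodes joined by relationship descriptors, some carrying causal direction."""
    nodes: set = field(default_factory=set)
    edges: list = field(default_factory=list)  # (source, descriptor, target) triples

    def add_assertion(self, source, descriptor, target):
        # Every node added here is joined to at least one other node.
        self.nodes.update({source, target})
        self.edges.append((source, descriptor, target))

kb = KnowledgeBase()
p = Node("kinase activity of protein P", "activity")
s = Node("quantity of phosphorylated protein S", "entity")
kb.add_assertion(p, "increases", s)
```

In this toy form a relationship descriptor is just an edge label; per the application, descriptors may themselves be nodes, which a fuller representation would model accordingly.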
Such a life science knowledge base can be used in a manner similar to a library, permitting researchers, physicians, students, drug discovery companies, and many others to access life science information in a way that enhances the understanding of the information, but it is far more powerful as a research resource. Small portions of the knowledge base may be represented graphically as a web of interrelated nodes, but for any significantly complex biological system, these representations are beyond rational comprehension because of their complexity.
A second valuable development came from the realization that querying this knowledge base in its holistic form to determine cause and effect relationships in a particular biological space was sometimes cumbersome, as the knowledgebase included vast amounts of data wholly unrelated to the space under investigation. This led to development of a second invention disclosed and claimed in co-pending U.S. application Ser. No. 10/794,407, filed Mar. 5, 2004 (U.S. patent application Publication Number US2005-0154535A1), entitled “Method, System and Apparatus for Assembling and Using Biological Knowledge,” the disclosure of which also is incorporated herein by reference. This application discloses and enables production of sub-knowledge bases and derived knowledge bases (called “assemblies”) from a global knowledge base by extracting a potentially relevant subset of life science-related data satisfying criteria specified by a user as a starting point, and reassembling a specially focused knowledge base. These then are refined and augmented, and then may be probed, displayed in various formats, and mined using human observation and analysis and using a variety of tools to facilitate understanding and revelation of hidden or subtle interactions and relationships in the biological system they represent, i.e., to produce new biological knowledge.
Another valuable group of inventions is disclosed and claimed in co-pending U.S. application Ser. No. 10/992,973, filed Nov. 19, 2004 (U.S. patent application Publication Number US2005-0165594A1), the disclosure of which is incorporated herein by reference. This application discloses a group of tools for use with the global knowledge base or with an assembly which facilitate hypothesis generation. The tools and methods perform logical simulations within a biological knowledge base and permit more efficient execution of discovery projects in the life sciences-related fields. Logical simulation resembles reasoning in many respects and includes backward logical simulation, upstream of cause and effect relationships, which proceeds from a selected node upstream through a path, typically comprising multiple branches, of relationship descriptor nodes to discern a node or group of nodes representing a biomolecule or activity which is hypothetically responsible for an experimentally observed or hypothesized change in the biological system. In short, this type of computation answers the question “What could have caused the observed change?” Logical simulation also includes forward simulation, downstream of cause and effect relationships, which travels from a target node downstream through a path of relationship descriptors to discern the extent to which a perturbation of the target node causes experimentally observed or hypothetical changes in the biological system. The logical simulation travels through a path of relationship descriptors containing at least one potentially causative node or at least one potential effector node to discern a pathway hypothetically linking the target nodes. This in turn permits the generation of new hypotheses concerning biological pathways based on the biological knowledge, and permits the user to design and conduct biological experiments involving biomolecules, cells, animal models, or a clinical trial to validate or refute a hypothesis.
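Backward and forward logical simulation correspond, in outline, to traversing causal links against or with their direction. A minimal sketch, assuming a simple adjacency-set representation (the function names and toy links are illustrative only, not the disclosed implementation):

```python
from collections import defaultdict

# Causal edges indexed both ways: cause -> effects and effect -> causes.
downstream = defaultdict(set)
upstream = defaultdict(set)

def add_causal_link(cause, effect):
    downstream[cause].add(effect)
    upstream[effect].add(cause)

def simulate(start, neighbors):
    """Iterative walk collecting every node reachable via causal links."""
    seen, frontier = set(), [start]
    while frontier:
        node = frontier.pop()
        for nxt in neighbors[node]:
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen

add_causal_link("drug A", "protein T inhibited")
add_causal_link("protein T inhibited", "acid secretion reduced")

# Forward: what could drug A cause?  Backward: what could explain the effect?
effects = simulate("drug A", downstream)
causes = simulate("acid secretion reduced", upstream)
```

Forward simulation over `downstream` answers "what would this perturbation change?", while the same walk over `upstream` answers "what could have caused the observed change?".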
The set of these paths comprise explanations for perturbations of the target nodes which hypothetically could be caused by perturbations of the source nodes. The perturbation is induced, for example, by a disease, toxicity, drug reaction, environmental exposure, abnormality, morbidity, aging, or another stimulus.
When an investigation is based on a hypothesized relationship or on an experimentally observed relationship between distinct biological elements, and the goal is to understand the underlying biochemistry and molecular biology causative of the relationship, it often will be the case that numerous potentially explanatory paths will emerge from an in silico analysis. Thus, the foregoing and potentially other related software based biological system analysis techniques can result in a large number of hypotheses including hypotheses that are mutually exclusive, and many which may in fact not be representative of real biology. This is not surprising in view of the extreme complexity of biological systems.
A method utilizing the foregoing technology in a novel way to conduct causal analysis in complex biological systems is disclosed and claimed in co-pending U.S. application Ser. No. 11/390,496, filed Mar. 27, 2006 (U.S. patent application Publication Number US2007-0225956A1), titled “Causal Analysis in Complex Biological Systems,” the disclosure of which is incorporated by reference. That application provides software-implemented methods of discovering active causative relationships in the biology, e.g., molecular biology, of complex living systems. The method is practiced within the domain of systems biology and is designed to discover the web of interactions of specific biological elements and activities causative of a given biological response or state. It may be practiced using a suitably programmed general purpose computer having access to a biological data base of the type disclosed herein.
The problem solved by this method may be analogized to the task of finding the right pathways within a vast, multi-dimensional array or web of selectively interconnected points, each representing something about a biological molecule or structure, its various activities, its structural variants, and its various relationships with other points to which it connects. A connection indicates that there is a relationship between the two points and, optionally, the directionality of the relationship; e.g., the node “kinase activity of protein P” might be linked to “quantity of phosphorylated form of protein S”, protein P's substrate, by indicia of directionality indicating that node “kaProtP” influences “PhosProtS”, and not vice versa. Suppose that from an observation it is known that when drug A is administered, it inhibits protein T, and induces a given biological state or states in the organism, e.g., reduced secretion of stomach acid, and in some subjects, induces the onset of inflammatory bowel disease. The question “what is the mechanism of the effects?” involves finding the pathways within this vast network of connected points that best explain the data and are most likely to represent real biology. There may be thousands or millions of such potential pathways in a knowledge base, and a large number even in a well targeted assembly.
Generally, the method of the '496 application comprises mapping operational data onto a knowledge base, preferably an assembly, of the type described therein to produce a large number of models—chains defining branching paths of causality propagated virtually through the knowledge base—and applying a series of algorithms to reject, based on various criteria, all or portions of the models judged not to be representative of real biology. This pruning or winnowing process ultimately can result in one or a small number of models which underlie an explanation of the operational data, i.e., reveals causative relationships that can be verified or refuted by experiment and can lead to new biological knowledge.
The method comprises the steps of first providing a knowledge base of biological assertions concerning a selected biological system. The knowledge base comprises a multiplicity of nodes representative of a network of biological entities, actions, functional activities, and biological concepts, and links between nodes indicative of there being a relationship therebetween, at least some of which include indicia of causal directionality. The knowledge base of the above-mentioned '582 application, or preferably an assembly of the type disclosed in the above-mentioned '407 application targeted to the selected biological system, is an example of such a knowledge base.
The purpose of the system is to aid in the understanding of the biochemical mechanisms explanatory of a data set, herein referred to as “operational data.” Operational data is data representative of a perturbation of a biological system, or characteristic of a biological system in a particular biological state, and comprises observed changes (observational data) in levels or states of biological components represented by one or more nodes, and optionally hypothesized changes (hypothetical data) in other nodes resulting from the perturbation(s). The operational data can comprise an effective increase or decrease in concentration or number of a biological element, stimulation or inhibition of activity of an element, alterations in the structure of an element, the appearance or disappearance of an element or phenotype, or the presence or absence of a SNP or allelic variant of a protein. Typically, the operational data is experimentally determined data, i.e., is generated from “wet biology” experiments. Preferably, all of the biological elements recorded as increasing or decreasing, etc., in the operational data are represented in the knowledge base or assembly.
Thus plural models or chains, i.e., paths along connections or links and through nodes within the data base, are identified by software. This typically is done by simulating in the network one or more perturbations of multiple individual root nodes (or starting point nodes) to initiate a cascade of activity through the relationship links along connected nodes, preferably to an intermediate or, most preferably, a terminal node that is representative of a biological element or activity in the operational data. This process produces plural (often 10^4, 10^5, or more) branching paths within the knowledge base, each potentially representing at least some portion of the biochemistry of the selected biological system.
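The generation of candidate branching paths from root nodes can be sketched as exhaustive enumeration of simple paths through a causal graph; the graph and node names below are hypothetical:

```python
def enumerate_paths(graph, root, terminals):
    """Depth-first enumeration of all simple (cycle-free) paths from a root
    node to any terminal node representing an element in the operational data."""
    paths, stack = [], [(root, [root])]
    while stack:
        node, path = stack.pop()
        if node in terminals:
            paths.append(path)
        for nxt in graph.get(node, []):
            if nxt not in path:  # avoid revisiting nodes (no cycles)
                stack.append((nxt, path + [nxt]))
    return paths

# Toy causal adjacency list: node -> nodes it influences.
graph = {
    "drug A": ["protein T", "protein U"],
    "protein T": ["acid secretion"],
    "protein U": ["acid secretion", "cytokine Z"],
}
paths = enumerate_paths(graph, "drug A", {"acid secretion", "cytokine Z"})
```

Even this four-node toy yields three candidate paths; in a real knowledge base the combinatorial growth is what produces the 10^4-10^5 or more candidates noted above, and why the subsequent pruning steps are needed.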
These branching paths constituting models are prioritized by applying algorithms to the models which estimate how well each model predicts the operational data. This is done by mapping the operational data onto each candidate model and counting the number of nodes in the model that are representative of, and/or correspond to, elements represented in the operational data.
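The prioritization step can be sketched as mapping the operational data onto each model and counting concordant nodes; the scoring rule here is an assumed simplification illustrating the idea, not the claimed algorithm:

```python
def score_model(model_predictions, operational_data):
    """Count nodes whose predicted direction of change ('up'/'down')
    matches the observed operational data."""
    return sum(
        1
        for node, direction in model_predictions.items()
        if operational_data.get(node) == direction
    )

# Hypothetical observed changes and two candidate models' predictions.
operational = {"protein S": "up", "gene G": "down", "metabolite M": "up"}
model_a = {"protein S": "up", "gene G": "down"}        # both predictions concordant
model_b = {"protein S": "down", "metabolite M": "up"}  # one concordant, one contradicted
ranked = sorted([("A", model_a), ("B", model_b)],
                key=lambda m: score_model(m[1], operational),
                reverse=True)
```

Models that correctly simulate more of the operational data rank higher; here model A outranks model B.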
This results in definition of a smaller set of branching paths comprising hypotheses potentially explanatory of the molecular biology implied by the data. Typically, after such a screening via the mapping algorithm(s), there still are many such branching paths, often hundreds or thousands, depending on the granularity of the assembly or of the knowledge base, on the question in focus, on the prioritization criteria, and on other factors.
The foregoing steps of generating, mapping and prioritizing pathways can be conducted in any order. For example, the software may first map the operational data onto the assembly, then search for branching paths and keep a ranking based on the amount of data correctly simulated, or it may be designed to first identify all possible paths involving a given data point, then map remaining data onto each path and prioritize as mapping proceeds, etc. Preferably, for efficiency, some or all of the operational data is mapped onto the knowledge base or assembly before raw path finding commences, and the paths discerned are constrained to paths which intersect a node corresponding to or at least involved with the data.
At this point, the system has identified a large number of hypotheses, represented as branching paths or models, each of which potentially explain at least some portion of the operational data. The next step in the method is to apply logic based criteria to each member of the set of models to reject paths or portions thereof as not likely representative of real biology. This “hypothesis pruning” leaves one or a small number of remaining models constituting one or more new active causative relationships.
As nonlimiting examples, the logic based criteria may be based on:

- A measure of consistency between the predictions resulting from simulation along a model and known biology.
- Using as a filter a group of models generated by mapping against random or control data to eliminate models from the set of models.
- An assessment of descriptor nodes associated with each model for consistency with known aspects of the biology of the selected biological system. For example, the assessment may be based on mutual anatomic accessibility of the nodes representing entities in a given branching path, and answers the question: are all biological elements in the path known to be accessible in vivo to their connected neighbors?
- A measure of consistency between the operational data and the predictions resulting from simulation along a branching path, which may seek to answer questions such as: does the perturbation of the root node correspond to the operational data, e.g., the observed wet biology data under examination? Does this path, which contains, e.g., 7 nodes corresponding to operational data points, predict their increase or decrease consistently with the operational data? What is the number of nodes perturbed in a linear path comprising a portion of a branching path which correspond to the operational data?
- A determination of a pair, triad, or higher number of branching paths which together best correlate with the operational data. Optimal combinations may be determined by applying combinatorial space search algorithms, such as genetic algorithms, simulated annealing, evolutionary algorithms, and the like, to the multiple branching paths, using as a fitness function the number of correctly simulated data points in the candidate path combinations.
- Whether a branching path comprises linear paths wherein plural nodes are perturbed in the same direction as the operational data, or comprises multiple connections to concept nodes, e.g., to nodes representing complex biological conditions or processes under study such as apoptosis, metastasis, hypoglycemia, inflammation, etc.
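The combinatorial search for a pair or triad of paths that together best correlate with the data can be illustrated with a brute-force stand-in for the genetic or annealing search mentioned above, using the number of correctly simulated data points as the fitness function (models and data are hypothetical):

```python
from itertools import combinations

def fitness(combo, operational):
    """Number of operational data points correctly simulated by at least
    one model in the combination."""
    covered = set()
    for model in combo:
        covered |= {n for n, d in model.items() if operational.get(n) == d}
    return len(covered)

operational = {"protein S": "up", "gene G": "down", "metabolite M": "up"}
models = [
    {"protein S": "up", "metabolite M": "down"},  # explains one data point
    {"gene G": "down"},                           # explains one data point
    {"metabolite M": "up", "protein S": "up"},    # explains two data points
]
# Exhaustively score every pair; a genetic algorithm or simulated annealing
# would search this space heuristically when it is too large to enumerate.
best = max(combinations(models, 2), key=lambda c: fitness(c, operational))
```

The winning pair is the one whose combined coverage of the operational data is greatest, even though neither member alone explains everything.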
The method may comprise the additional step of harmonizing a plurality of remaining paths to produce a larger path, to select a subgroup of paths, or to select an individual path comprising a model of a portion of the operation of the biological system. “Harmonizing” means that plural branching paths are combined to provide a more complete or more accurate model explanatory of the operational data, or that all branching paths except one are eliminated from further consideration.
The method may further comprise the step of simulating operation of the model to make predictions about the selected biological system, for example, to select biomarkers characteristic of a biological state of the selected biological system, or to define one or more biological entities for drug modulation of the system.
The method can be practiced by applying a plurality of logic based criteria to the set of branching paths to approach one or more hypotheses representative of real biology. This approach may employ a scoring system based on multiple criteria indicative of how closely a given hypothesis/branching path approaches an explanation of the operational data. Collectively, the various features of the hypothesis pruning protocols enable identification of one or more hypotheses consistent with known aspects of the biology of the selected biological system and the biological change under study.