Traditionally, cell biology research has largely been a manual, labor-intensive activity. With the advent of tools that can automate much of cell biology experimentation (see, for example, U.S. Pat. Nos. 5,989,835 and 6,103,479), the rate at which complex information is generated about the functioning of cells has increased dramatically. As a result, cell biology is not only an academic discipline, but also the new frontier for large-scale drug discovery.
Cells are the basic units of life and integrate information from Deoxyribonucleic Acid (“DNA”), Ribonucleic Acid (“RNA”), proteins, metabolites, ions and other cellular components. New compounds that may look promising at a nucleotide level may be toxic at a cellular level. Fluorescence-based reagents can be applied to cells to determine ion concentrations, membrane potentials, enzyme activities, and gene expression, as well as the presence of metabolites, proteins, lipids, carbohydrates, and other cellular components.
Innovations in automated screening systems for biological and other research are capable of generating enormous amounts of data. The massive volumes of data being generated by these systems, and the need to manage and use the information in that data effectively, have created a number of very challenging problems.
To fully exploit the potential of data from high-volume data generating screening instrumentation, there is a need for new informatic and bioinformatic tools. As is known in the art, “bioinformatic” techniques are used to address problems related to the collection, processing, storage, retrieval and analysis of biological information including cellular information. Bioinformatics is defined as the systematic development and application of information technologies and data processing techniques for collecting, analyzing and displaying data obtained by experiments, modeling, database searching, and instrumentation to make observations about biological processes.
Recent advances in the automation of molecular and cellular biology research including High Content and High Throughput Screening (“HCS” and “HTS,” respectively), automated genome sequencing, gene expression profiling via complementary DNA (“cDNA”) microarray and bio-chip technologies, and protein expression profiling via mass spectrometry and others are producing unprecedented quantities of data regarding the chemical constituents (i.e., proteins, nucleic acids, and small molecules) of cells relevant to health and disease.
There are several problems associated with analyzing chemical constituent data generated by automated screening systems. One problem is that there is a major bottleneck in the analysis and application of such data. Tasks such as pharmaceutical research typically require knowledgeable experts (i.e., molecular and cellular biologists) to place such data within a “biological context.” For example, given a gene expression profile indicating that expression of Gene X is inhibited in cells treated with Compound Y, this datum becomes significant for the drug discovery process only upon inspection by a cell biologist who is able to reason: “I know that the protein coded for by Gene X affects Protein Z, the over-activity of which underlies disease A. Therefore, these data indicate that Compound Y may prove useful as a drug for the treatment of disease A.” Such reasoning is also called an “inference.”
Such reasoning requires detailed knowledge of the sequences of physico-chemical interactions between molecules in cells (i.e., the cell biologist must know that the protein encoded by Gene X affects Protein Z). Such “manual” assessment of data's significance is becoming more and more unworkable as the rate of data production continues to increase.
Another problem is that analysis of biological data in light of molecular interactions is not easy to automate. Given a suitable electronic database of known physico-chemical interactions between molecules in cells, much of this manual inspection and reasoning could be automated, increasing the efficiency of tasks such as drug discovery and genetic analysis. However, as currently practiced in the art, constructing such a database would be an “expert systems engineering” task, requiring domain experts to enter into the database their explicit and implicit knowledge regarding known interactions between biological molecules.
As is known in the art, an “expert system” is an application program that makes decisions or solves problems in a particular field, such as biology or medicine, by using knowledge and analytical rules defined by experts in the field. An expert system typically uses two components, a knowledge base and an inference engine, to automatically form conclusions. Additional tools include user interfaces and explanation facilities, which enable the system to justify or explain its conclusions. “Manual expert system engineering” includes manually applying knowledge and analytical rules defined by experts in the field to form conclusions or inferences. Typically, such conclusions are then manually added to a knowledge base for a particular field (e.g., biology).
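The two components named above, a knowledge base and an inference engine, can be illustrated with a minimal sketch. The sketch is illustrative only and does not describe any particular expert system. It assumes a simple forward-chaining engine, and the fact strings, the rule contents, and the names Rule, infer and KNOWLEDGE_BASE are hypothetical. They are modeled on the Gene X, Protein Z and disease A example given earlier, with the further assumption that the protein encoded by Gene X activates Protein Z.

```python
from __future__ import annotations

from typing import NamedTuple


class Rule(NamedTuple):
    """One expert-defined rule: if all premises are known, add the conclusion."""
    premises: frozenset[str]
    conclusion: str


# Hypothetical knowledge base, entered by hand by a domain expert.
KNOWLEDGE_BASE = [
    Rule(
        frozenset({
            "compound_Y inhibits gene_X",
            "gene_X product activates protein_Z",  # assumed direction of effect
        }),
        "compound_Y reduces protein_Z activity",
    ),
    Rule(
        frozenset({
            "compound_Y reduces protein_Z activity",
            "protein_Z over-activity underlies disease_A",
        }),
        "compound_Y is a candidate treatment for disease_A",
    ),
]


def infer(facts: set[str], rules: list[Rule]) -> set[str]:
    """Forward-chaining inference engine: apply rules until nothing new is derived."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for rule in rules:
            if rule.premises <= known and rule.conclusion not in known:
                known.add(rule.conclusion)
                changed = True
    return known


if __name__ == "__main__":
    observed = {
        "compound_Y inhibits gene_X",                   # from a gene expression profile
        "gene_X product activates protein_Z",           # expert knowledge
        "protein_Z over-activity underlies disease_A",  # expert knowledge
    }
    for fact in sorted(infer(observed, KNOWLEDGE_BASE) - observed):
        print("inferred:", fact)
```

In this sketch every rule and every background fact must be typed in by an expert, which is the manual step that the passage above identifies as the limiting factor.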
In the human genome alone there are approximately 100,000 genes, encoding a like number of proteins (each of which may occur in several distinct forms due to splice variants and covalent modifications). In addition, there are a large but unknown number (e.g., thousands to tens of thousands) of different small organic molecules whose interactions with each other and with proteins and nucleic acids should also be represented in a comprehensive physico-chemical interaction database. It is very difficult to determine with any degree of certainty the total number of such interactions, or even the number of currently known interactions. However, the combinatorial problem presented by numbers of this magnitude prevents development of truly comprehensive and up-to-date biomolecule interaction databases when their construction is approached as an expert system engineering task based on direct input of knowledge by experts. As is known in the art, a “combinatorial problem” is a problem related to probability and statistics, involving the study of counting, grouping, and arrangement of finite sets of elements.
There have been attempts to create databases including biomolecule interactions with inferences via the manual “expert systems engineering” approach. However, such expert systems currently elect to severely restrict the scope of their coverage (e.g., to a few tens or hundreds of “key” proteins, or to the biomolecules of only the simplest organisms, such as bacteria and fungi, whose relatively small genomes encode many fewer proteins than does the human genome). In addition, such manual expert systems typically make little, if any, effort to incorporate new information in a timely fashion.
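The scale of the combinatorial problem can be made concrete with a short calculation. The sketch below is a rough illustration under stated assumptions: it counts only unordered pairs of distinct molecules, ignores splice variants, covalent modifications and interactions among three or more molecules, and uses the 100,000 figure given above. The function name pairwise_interactions is hypothetical, and the count is an upper bound on candidate pairs to be reviewed, not an estimate of actual interactions.

```python
from math import comb


def pairwise_interactions(n_molecules: int) -> int:
    """Number of unordered pairs among n distinct molecules: n * (n - 1) / 2."""
    return comb(n_molecules, 2)


if __name__ == "__main__":
    n_proteins = 100_000  # figure taken from the text above
    print(pairwise_interactions(n_proteins))  # prints 4999950000
```

Even under these simplifying assumptions there are roughly five billion candidate protein pairs, far more than experts could review and enter by hand.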
Such expert system engineering approaches include, for example: (1) Pangea Systems Inc.'s (1999 Harrison Street, Suite 1100, Oakland, Calif. 94612) “EcoCyc database.” Information on this database and the other databases can be found on the Internet. This database's coverage in general includes basic metabolic pathways of the bacterium, E. coli; (2) Proteome Inc.'s (100 Cummings Center, Suite 435M, Beverly, Mass. 01915) “Bioknowledge Library.” This is a suite of databases of curated information including in general sequenced genes of the yeast, S. cerevisiae, and the worm, C. elegans. A number of well-established protein-protein interactions are included; and (3) American Association for the Advancement of Science's (1200 New York Ave. NW, Washington, D.C. 20005) “Science's Signal Transduction Knowledge Environment.” This connections map database seeks to document some of the best-established biomolecular interactions in a select number of signal transduction pathways.
However, such selected databases, and others known in the art, take a manual “expert system engineering” approach or a semi-automated approach to populating the databases (e.g., human authorities manually input into a database their individual understandings of the details of what is known regarding individual biomolecular interactions).
Some of these problems have been overcome in co-pending application Ser. No. 09/769,169, entitled “Method and system for automated inference of physico-chemical interaction knowledge via co-occurrence analysis of indexed literature databases,” assigned to the same Assignee as the present application.
However, it is also highly desirable to automatically construct logical associations from the inferences created via co-occurrence analysis of indexed literature databases, to represent a temporal sequence of physico-chemical interactions actually used by living cells to regulate or to achieve a biological response. In molecular cell biology, such a temporal sequence of physico-chemical interactions is called a biological or cell “pathway.”
There have been attempts to collect and store data associated with biological pathways. Such attempts include, for example, “EcoCyc” from Pangea (see, e.g., Nucleic Acids Research 26:50-53 (1998), ISMB 2:203-211 (1994)); the “KEGG” pathway database from the Institute for Chemical Research, Kyoto University (see, e.g., Nucleic Acids Research 27:377-379 (1999), Nucleic Acids Research 27:29-34 (1999)); “CSNDB” from the Japanese National Institute of Health Sciences (see, e.g., Pac. Symp. Biocomput. 187-197 (1997)); “SPAD” from the Graduate School of Genetic Resources Technology, Kyushu University, Japan; “PUMA,” now called “WIT,” from Computational Biology in the Mathematics and Computer Science Division at Argonne National Laboratory; and others. However, such pathway databases typically do not use automated co-occurrence analysis of indexed literature databases to represent a temporal sequence of physico-chemical interactions.
Thus, it is desirable to automatically determine temporal sequences of physico-chemical interactions with co-occurrence analysis of indexed literature databases that can be used to determine biological pathways. Such an approach should help permit the construction of comprehensive databases of knowledge concerning temporal sequences of physico-chemical interactions to determine biological pathways.