The advent of high-throughput experimental technologies for molecular biology has resulted in an explosion of data and a rapidly increasing diversity of biological measurement data types. Examples of such measurement types include gene expression from DNA microarray or quantitative PCR experiments, protein identification from mass spectrometry or gel electrophoresis, cell localization information from flow cytometry, phenotype information from clinical data or knockout experiments, and genotype information from association studies and DNA microarray experiments. These data are rapidly changing; new technologies frequently generate new types of data. In addition to data from their own experiments, biologists also draw on a rich body of information from Internet-based sources, e.g., genomic and proteomic databases, and from the scientific literature. The structure and content of these sources are also rapidly evolving. The software tools used by molecular biologists therefore need to gracefully accommodate new and rapidly changing data types.
One manner in which biologists use these experimental data and other sources of information is to piece together interpretations and form hypotheses about biological processes, a practice also referred to as building biological models. Textual documents are often relied upon as a source of "known" information, which can be compared against experimental data and/or used in constructing biological diagrams and models, for example to confirm or refute hypotheses or data resulting from experimentation.
A large number of systems have been developed to automatically build biological models from these various sources of biological data. These tools, however, suffer from at least two major limitations: they lack accuracy in extracting the knowledge used to build the models, and they cannot incorporate a user's changing contexts and hence are not true to the user's intent. Manual model building has the strength that the resulting model is true to the user's intent: the model builder can capture all the nuances and subtleties that only a human can provide. Building models manually has significant disadvantages, however, in that the process is tedious and error prone, particularly as data and models grow larger and more complex.
Likewise, manual extraction of knowledge from text has the advantage that each extraction is individually chosen by the user, so only relevant data is generally extracted. Again, however, this method is extremely tedious, time-consuming, and inefficient.
There are currently systems that can generate biological network information, such as protein-protein interaction networks, via knowledge extraction from text, and that display their output as network diagrams. Examples include Ariadne Genomics (www.ariadnegenomics.com); Apelon (www.apelon.com); BioSentients (www.biosentients.com); BioWisdom (www.biowisdom.co.uk); Cellomics CellSpace™ (http://www.cellomics.com/products/cellsace/); Definiens (www.definiens.de); Gene Ed/Reel Two (www.geneed.com, www.reeltwo.com); Incellico (www.incellico.com); Ingenuity (www.ingenuity.com); Insightful (www.insightful.com); Iridescent (http://innovation.swmed.edu./Biocomnputing/Computing.htm); Pre-BIND (http://www.binddb.org); PubGene (http://www.pubgene.com/); Virtual Genetics (www.vglab.com); and XMine (http://www.x-mine.com/). These systems rely on statistical and linguistic natural language processing to automatically pre-compute protein-protein interactions from scientific text into a database. They therefore present a completely generated network to the user; there is no opportunity for the user to guide and/or improve the process of knowledge extraction by disambiguating results and/or assigning directionality or causality. These systems are also plagued by numerous inaccuracies and inconsistencies, leading to skepticism among would-be users in real practice.
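For illustration, the pre-computation step such systems perform can be reduced to its simplest form: proteins co-mentioned in the same sentence are treated as candidate interaction pairs. The sketch below is a hypothetical, simplified illustration (the protein lexicon, sample text, and function names are illustrative and not drawn from any of the listed systems); note that the extracted pairs are undirected, reflecting the absence of directionality or causality described above.

```python
import itertools
import re

# Hypothetical protein lexicon; real systems use large curated dictionaries.
PROTEIN_LEXICON = {"p53", "MDM2", "BRCA1", "RAD51"}

def extract_candidate_interactions(text):
    """Return undirected candidate pairs of proteins co-mentioned in a sentence."""
    pairs = set()
    # Naive sentence splitting on terminal punctuation.
    for sentence in re.split(r"[.!?]", text):
        tokens = re.findall(r"[A-Za-z0-9]+", sentence)
        mentioned = sorted({t for t in tokens if t in PROTEIN_LEXICON})
        # Every co-mentioned pair becomes a candidate "interaction";
        # the pair carries no direction and no causal interpretation.
        for a, b in itertools.combinations(mentioned, 2):
            pairs.add((a, b))
    return pairs

abstract = ("MDM2 binds p53 and inhibits its activity. "
            "BRCA1 interacts with RAD51 during repair.")
print(sorted(extract_candidate_interactions(abstract)))
# → [('BRCA1', 'RAD51'), ('MDM2', 'p53')]
```

Because the method records only co-occurrence, it cannot distinguish "MDM2 inhibits p53" from "p53 inhibits MDM2", nor a genuine interaction from an incidental co-mention, which is precisely the accuracy and disambiguation problem noted above.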
In view of the existing systems, what is needed are systems, methods, and tools capable not only of easily and semi-automatically (i.e., providing the opportunity for user input and/or editing) extracting knowledge or relevant information from textual documents, but that also provide for user interaction to guide and improve the extracted information, such as by error correction, disambiguation, and/or custom tailoring to the user's needs.