The explosion of published information in biology, biochemistry, genetics and related fields (collectively referred to herein as “genomics”) presents research scientists with the enormous challenge of searching and analyzing a massive amount of published information to find the particular information of interest. The majority of new genomics information is produced and stored in text form. Information stored in text form is unstructured and, other than through keyword searches of various types, relatively inaccessible to standard computer search techniques.
The process of culling and reviewing relevant information from the published literature is consequently laborious and time-consuming. Even the most basic queries about the function of a particular gene, even when run as sophisticated keyword searches, often return too many articles to be reviewed carefully in a reasonable amount of time, miss critical articles whose important findings are expressed in a non-standard manner and form, or both.
Text storage was never designed for and has not proven adequate to the task of describing and clarifying the complex, interrelated biochemical pathways involved in biological systems. Examples of high-level computational tasks that cannot be performed on text-based databases include: a) computational identification of clusters of diverse functionally interrelated genes that occur in genomic data sets; b) systematic, principled prediction of gene function using computation over links between uncharacterized genes and other genes in the genome, using all functional relationships available in the literature rather than just the available experimental genomic data sets; c) novel biological inferences in the knowledge base, based on computation over large bodies of existing, explicitly entered content; and d) flexible computation of the genes that constitute biological pathways, based on criteria such as upstream versus downstream genes, transcriptional versus phosphorylation targets, membrane-bound versus nuclear genes, etc.
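As an illustration of task d) above, the following is a minimal sketch of how upstream and downstream genes could be computed once findings are held in structured form rather than as text. The triple layout, the relation names, and the gene names (`geneA` through `geneE`) are hypothetical placeholders chosen for the example; they are not taken from any actual knowledge base or data set.

```python
from collections import deque

# Hypothetical structured findings: (source gene, relation, target gene).
# All names are placeholders for illustration only.
FINDINGS = [
    ("geneA", "phosphorylates", "geneB"),
    ("geneB", "activates_transcription_of", "geneC"),
    ("geneC", "activates_transcription_of", "geneD"),
    ("geneE", "phosphorylates", "geneC"),
]


def downstream(gene, findings, relations=None):
    """Return all genes reachable from `gene` by following findings.

    If `relations` is given, only findings whose relation is in that
    collection are followed.
    """
    edges = {}
    for source, relation, target in findings:
        if relations is None or relation in relations:
            edges.setdefault(source, set()).add(target)

    seen = set()
    queue = deque([gene])
    while queue:  # breadth-first traversal of the findings graph
        current = queue.popleft()
        for nxt in edges.get(current, ()):
            if nxt != gene and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def upstream(gene, findings, relations=None):
    """Return all genes from which `gene` can be reached."""
    reversed_findings = [(t, r, s) for s, r, t in findings]
    return downstream(gene, reversed_findings, relations)


all_downstream_of_a = downstream("geneA", FINDINGS)
phospho_upstream_of_c = upstream("geneC", FINDINGS, {"phosphorylates"})
```

Restricting the traversal by relation type is what distinguishes, for example, transcriptional targets from phosphorylation targets; a keyword search over article text has no equivalent operation.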
By limiting a researcher's ability to ask these types of questions when searching for information, the current text-based model of information storage is a serious obstacle to research in genomics. The ever-increasing volume of functional genetic data resulting from the biotechnology revolution further demonstrates that both the academic and industrial communities require a more readily computable means for archiving and mining genomics information.
The desirability of placing published genomics information into a structured format, and thus allowing easier and more useful searches, is known; one example is storing information extracted from text in a frame-based knowledge representation system. Although examples of frame-based knowledge representation systems are known in several fields, the difficulties in populating such a system with specific genomics information, leading to the creation of a true genomics knowledge base, are substantial.
The process of populating a frame-based knowledge representation system (herein “KRS”) with information, leading to the creation of what is called a “knowledge base” (“KB”), is known as knowledge acquisition (“KA”). KA is recognized as a slow, difficult and expensive process. KA is a major, and perhaps the major, bottleneck in building functional and useful KBs. A consequence of the difficulties associated with KA is that most KBs are small and concentrate on a very limited domain of interest.
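For readers unfamiliar with the term, a frame-based representation can be sketched roughly as follows. This is a generic illustration of frames with slots and inheritance, assuming a single-parent hierarchy; the class name `Frame`, the slot names, and the example frames are hypothetical and do not describe any particular KRS.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class Frame:
    """A named concept with slots; unset slots are inherited from the parent."""
    name: str
    parent: Optional["Frame"] = None
    slots: Dict[str, Any] = field(default_factory=dict)

    def get(self, slot: str) -> Any:
        # Walk up the parent chain until a frame defines the slot.
        frame: Optional[Frame] = self
        while frame is not None:
            if slot in frame.slots:
                return frame.slots[slot]
            frame = frame.parent
        raise KeyError(slot)


# Hypothetical frames for illustration only.
protein = Frame("Protein", slots={"molecule_type": "protein"})
kinase = Frame("Kinase", parent=protein, slots={"activity": "phosphorylation"})
example_kinase = Frame(
    "ExampleKinase", parent=kinase, slots={"localization": "membrane"}
)
```

Here `example_kinase.get("molecule_type")` is resolved by inheritance from `Protein`. Knowledge acquisition, in these terms, is the work of deciding which frames and slot values a given published finding should become.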
Known methods of performing the KA function require a knowledge representation expert or knowledge engineer (“KE”) with computer science training to work with the appropriate domain experts to manually capture and then organize the extracted information into the KRS. The KE transcribes, structures and embeds this information into the KB. KEs must understand the underlying formal machine representation of the KRS in order to extract the information from the text source and then insert it into the KRS in a consistent, accurate and appropriate manner. Often the KE works closely with scientific experts to classify and categorize the information properly. The need for two highly trained individuals to work together to structure and enter the information makes this approach to populating a KRS extremely time-consuming and expensive. These problems also greatly restrict the extent to which this process can be used as the amount of information to be captured increases.
As millions of findings must be captured and structured to create a KB of the size and scope necessary for useful genomics research, a method for efficiently and economically populating a genomics KRS with structured, codified information to create a usable KB is needed.