In a number of complex scientific and economic fields, or knowledge domains, most of the existing knowledge about common concepts is recorded only in free-text form and not in any structured form, such as fields in structured databases. This limits the use of that knowledge in the computerized analysis of data from such domains. Moreover, owing to the complexity and volume of the knowledge, navigating the knowledge spaces of such domains is cumbersome and would be much easier if some of this information could be extracted and structured around the pertinent concepts.
Conventional methods for data analysis, including but not limited to statistical methods and machine-learning methods, have few or no means for incorporating general background knowledge that is not explicitly encoded in the target data set to be analysed. Against this background, and given that much existing knowledge relevant to interpreting the results of conventional data analysis is recorded only in free text, there is a need for methods that automate the extraction of such knowledge in a form amenable to incorporation into data analysis. Methods for carrying out integrated data analysis that incorporate such extracted background knowledge into the analysis of a target data set are likewise highly desired.
Information retrieval in many knowledge domains tends to be complicated by concepts having several names, either one primary name with aliases or several equivalent synonyms. Synonymy hampers information retrieval because the user must know all alternative names of a concept in order to instruct a system to retrieve all the relevant information that exists. Polysemy, i.e., the phenomenon that several concepts have the same name, further complicates information retrieval. In this case, information retrieval systems tend to return not only documents relevant to the target concept but also documents that are irrelevant to the target yet relevant to other concepts with the same name.
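By way of illustration only, the two phenomena can be sketched with a small name index. The concept identifiers ("C1", "C2") and all names below are hypothetical placeholders, not drawn from any real database, and the functions `build_name_index`, `expand_query` and `ambiguous_names` are assumed helper names. Expanding a concept to all of its names addresses synonymy; listing names shared by more than one concept exposes polysemy.

```python
from collections import defaultdict

# Hypothetical concept table: concept identifier -> all names used for it.
# The name "AK1" is deliberately shared by both concepts (polysemy).
CONCEPTS = {
    "C1": ["alpha kinase", "AK1"],
    "C2": ["beta transporter", "AK1", "BT"],
}


def build_name_index(concepts):
    """Map each lower-cased name to the set of concepts that use it."""
    index = defaultdict(set)
    for concept_id, names in concepts.items():
        for name in names:
            index[name.lower()].add(concept_id)
    return dict(index)


def expand_query(concept_id, concepts):
    """Synonymy: return every known name of a concept, for query expansion."""
    return sorted({name.lower() for name in concepts[concept_id]})


def ambiguous_names(index):
    """Polysemy: return names that refer to more than one concept."""
    return {name: sorted(ids) for name, ids in index.items() if len(ids) > 1}


if __name__ == "__main__":
    index = build_name_index(CONCEPTS)
    print(expand_query("C1", CONCEPTS))
    print(ambiguous_names(index))
```

In such a sketch, a retrieval system would search on every name returned by `expand_query`, while any hit on a name reported by `ambiguous_names` would need further context before being assigned to a concept.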
Functional genomics is at an early stage, but with some genomes sequenced and others near completion, much attention will shift towards determining the biological function of genes and other functional parts of the genome sequences. High-throughput technologies for measuring gene expression and protein levels in living cells and tissue types will be critical tools in this research. Unsupervised methods for data analysis, such as hierarchical clustering, make no use of existing knowledge, leaving this to the user in the interpretation that follows the analysis. As thousands of genes or proteins can be measured in a single experiment, a computerized procedure utilizing the vast amounts of existing knowledge is highly desired.
A prerequisite for such procedures is the availability of background knowledge in a computer-readable format. Existing structured databases in the field cover only a small fraction of current knowledge; the majority of this knowledge is recorded only in free-text format, which is not readily available for incorporation in computerized data analysis. Related approaches for information extraction in this field have focused on specific relationships between molecular entities, such as protein-protein interactions, protein-gene interactions, protein-drug interactions, cellular location of proteins, and molecular binding relationships. Occurrences of entities have been detected by recognizing nouns or noun phrases and by using predefined keyword lists. Keyword indexing has been used to annotate proteins. Reference is made to Stapley, B. J. & Benoit, G., Biobibliometrics: Information retrieval and visualization from co-occurrence of gene names in Medline abstracts. Pac Symp Biocomput 5, 524–540 (2000). This discloses a gene network based on the co-occurrence of gene terms, constructed for genes of the yeast organism from 2,524 MEDLINE documents selected as being from the years 1997 or 1998 and containing the MeSH term ‘Saccharomyces cerevisiae’.
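As a rough illustration of how a network may be derived from the co-occurrence of gene terms in documents, the following minimal sketch counts, for every pair of terms, the number of documents in which both appear. It is not the method of the cited reference; the gene terms ("GENE1" to "GENE3"), the sample documents, and the function names `genes_in_text` and `cooccurrence_network` are hypothetical, and matching is simplified to exact, case-sensitive whole-token comparison against a predefined term list.

```python
import re
from collections import Counter
from itertools import combinations


def genes_in_text(text, gene_terms):
    """Return the gene terms that occur as whole tokens in the text."""
    tokens = set(re.findall(r"[A-Za-z0-9]+", text))
    return {term for term in gene_terms if term in tokens}


def cooccurrence_network(documents, gene_terms):
    """Count, per pair of gene terms, the documents mentioning both.

    The result maps (term_a, term_b) with term_a < term_b to a document
    count; each pair is an edge and the count is its weight.
    """
    edges = Counter()
    for text in documents:
        present = sorted(genes_in_text(text, gene_terms))
        for pair in combinations(present, 2):
            edges[pair] += 1
    return edges


if __name__ == "__main__":
    # Hypothetical term list and documents, for illustration only.
    terms = {"GENE1", "GENE2", "GENE3"}
    docs = [
        "GENE1 interacts with GENE2 in the stress response.",
        "GENE2 and GENE3 are co-expressed; GENE1 is also induced.",
        "No gene terms here.",
    ]
    for (a, b), weight in sorted(cooccurrence_network(docs, terms).items()):
        print(a, b, weight)
```

Counting documents rather than individual mentions keeps a single long article from dominating an edge weight; other weighting schemes are equally possible.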
At RECOMB2000 in Tokyo, Japan, Apr. 8–11, 2000, a printed pamphlet entitled “Pubgen: Discovering and visualizing gene-gene relations” was presented by Tor-Kristian Jenssen, Astrid Lagreid, Jan Komorowski and Eivind Hovig. This describes the creation of a network of gene relationships. To identify the genes referred to in an article, gene symbols and, to some extent, gene names were used. It was recognised that a few particular genes and symbols would require special consideration.