Much biological knowledge exists in published scientific text. In order to support the creation of databases and to enable the discovery of new relationships, there is great interest in extracting relationships automatically. Several efforts use manually created rules to define patterns of relationships between entities. These approaches are efficient when used in domains that are of limited scope, such as protein-protein interactions or protein transport. However, the complexity and diversity of the semantics used to describe relationships in broad or evolving domains, such as pharmacogenomics (PGx), are harder to capture. No general set of rules exists for extracting the relationships relevant to such fields, and creating/maintaining them manually would be tedious and time consuming.
Syntactic sentence parsers can identify the subject, object, and type of relationships using grammatical rules. General statistical parsing techniques have recently emerged, and there are several general-purpose parsers that yield reasonable results when applied to scientific text. These parsers depend on good domain-specific lexicons of key entities, since named-entity recognition for particular fields in science can be difficult.
Current methods of text mining to extract relationships have at least the following limitations: (1) modified entities are not recognized; (2) extracted relationships are restricted to a set of pre-defined relationships; and (3) extracted entities and relationships are not normalized in a manner that maps concepts into a common framework. The failure to recognize modified entities is a problem because the true relationship described in a sentence is often between the specific entities specified by the modifications of the seed terms. The relationships between two entities can also be diverse, and pre-specification of allowable relationships is time-consuming, non-robust, and infeasible given the varied types of relationships used in natural-language text. The lack of normalization is problematic because there is no way to collapse heterogeneous statements of the same relationship and thereby aggregate identical facts stated differently in the free-form literature.
Background for the teachings of the present invention includes efforts in building the Pharmacogenomic Knowledge Base, PharmGKB (http://www.pharmgkb.org/) (Klein T, Chang J, Cho M, Easton K, Fergerson R, Hewett M, et al. Integrating genotype and phenotype information: an overview of the PharmGKB project. Pharmacogenomics J 2001; 1(3):167-70; these and all other references cited herein are incorporated by reference for all purposes). PharmGKB is a curated database that summarizes published gene-drug-phenotype relationships; one of its aims is to catalog all knowledge of how human genetic variation impacts drug-response phenotypes.
The rapidly increasing size of the pharmacogenomic literature threatens to overwhelm the PharmGKB curators. Automatic approaches using NLP techniques are promising. Methods based on co-occurrence assume that entities occurring together in a sentence are related, but the semantics of the relationships are not typically captured. Nevertheless, these approaches efficiently identify potential relationships that can subsequently be evaluated. For example, the Pharmspresso system uses co-occurrence to group frequently co-mentioned genes, genomic variations, drugs, and diseases (Garten Y, Altman R B. Pharmspresso: a text mining tool for extraction of pharmacogenomic concepts and relationships from full text. BMC Bioinformatics 2009; 10 (S-2)). These groups are then used to assist curation. Li et al. used the co-occurrence of drug and disease names in MEDLINE abstracts to derive drug-disease relations and to build a disease-specific drug-protein network (Li J, Zhu X, Chen J Y. Building disease-specific drug-protein connectivity maps from molecular interaction networks and pubmed abstracts. PLoS Comput Biol 2009; 5(7):e1000450+). Blaschke et al. and Rosario et al. expanded this co-occurrence approach to extract more complete relations by searching for “tri-co-occurrence” (Blaschke C, Andrade M A, Ouzounis C, Valencia A. Automatic extraction of biological information from scientific text: protein-protein interactions. In: ISMB; 1999. p. 60-7; Rosario B, Hearst M A. Classifying semantic relations in bioscience texts. In: ACL; 2004. p. 430-7). Tri-co-occurrence refers to the co-occurrence of two named entities and one type of relationship in a unique piece of text. Statistical analysis of co-occurrence can help derive semantic similarities between entities (Cohen T, Widdows D. Empirical distributional semantics: methods and biomedical applications. J Biomed Inform 2009; 42(2):390-405).
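The difference between plain co-occurrence and tri-co-occurrence can be illustrated with a minimal Python sketch. It is not the method of Pharmspresso or of any cited system: the lexicons `GENES`, `DRUGS`, and `RELATION_WORDS`, the naive sentence splitter, and the function names are all hypothetical and chosen only for illustration. A real system would rely on curated lexicons and proper sentence segmentation.

```python
import re
from collections import Counter

# Hypothetical toy lexicons (assumptions, not taken from any cited system).
GENES = {"CYP2D6", "VKORC1"}
DRUGS = {"codeine", "warfarin"}
RELATION_WORDS = {"metabolizes", "inhibits"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def find_mentions(sentence):
    """Return the sets of gene, drug, and relation words found in a sentence."""
    toks = re.findall(r"[A-Za-z0-9]+", sentence)
    genes = {t for t in toks if t in GENES}  # gene symbols are case-sensitive
    drugs = {t.lower() for t in toks if t.lower() in DRUGS}
    rels = {t.lower() for t in toks if t.lower() in RELATION_WORDS}
    return genes, drugs, rels


def co_occurrences(text):
    """Count (gene, drug) pairs mentioned together in the same sentence."""
    counts = Counter()
    for sentence in split_sentences(text):
        genes, drugs, _ = find_mentions(sentence)
        for g in genes:
            for d in drugs:
                counts[(g, d)] += 1
    return counts


def tri_co_occurrences(text):
    """Count (gene, relation word, drug) triples found in the same sentence."""
    counts = Counter()
    for sentence in split_sentences(text):
        genes, drugs, rels = find_mentions(sentence)
        for g in genes:
            for r in rels:
                for d in drugs:
                    counts[(g, r, d)] += 1
    return counts
```

In this sketch a sentence such as "CYP2D6 and codeine were studied." contributes to the pair count but yields no triple, because no relation word is present. This mirrors the point above that co-occurrence alone does not capture the semantics of a relationship.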
In contrast to co-occurrence, syntactic parsing can explicitly identify relationships between two entities in text (Wermter J, Hahn U. You can't beat frequency (unless you use linguistic knowledge)—a qualitative evaluation of association measures for collocation and term extraction. In: ACL; 2006). Hand-coded parsing rules can extract protein-protein interactions and protein transport relationships (Hirschman L, Krallinger M, Wilbur J, Valencia A, editors. The biocreative II—critical assessment for information extraction in biology challenge, vol. 9, Genome Biology; 2008; Tsujii J, editor. In: Proceedings of the BioNLP 2009 workshop companion volume for shared task; 2009). Fundel et al. defined three general patterns of relations (specifying the semantic type of subjects and objects, and using a lexicon of association words) to identify protein-protein interactions (Fundel K, Kuffner R, Zimmer R. Relex—relation extraction using dependency parse trees. Bioinformatics 2007; 23(3):365-71). For example, their pattern “effector-relation-effectee” enables the capture of relationships of the form “protein A activates protein B”. The OpenDMAP system also uses patterns to identify protein interaction and transport (Hunter L, Lu Z, Firby J, Baumgartner Jr W A, Johnson H L, Ogren P V, Cohen K B. OpenDMAP: an open-source, ontology-driven concept analysis engine, with applications to capturing knowledge regarding protein transport, protein interactions and cell-type-specific gene expression. BMC Bioinformatics 9(78)). Ahlers et al. used vocabularies and semantic types of the UMLS (Unified Medical Language System) to specify patterns to extract gene-disease and drug-disease relationships (Ahlers C B, Fiszman M, Demner-Fushman D, Lang F-M, Rindflesch T C. Extracting semantic predications from MEDLINE citations for pharmacogenomics. In: Pacific Symposium on Biocomputing; 2007, pp. 209-220).
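The idea behind an “effector-relation-effectee” pattern can be sketched in a few lines of Python. This is a deliberate simplification and not the RelEx implementation: RelEx operates on dependency parse trees, whereas the sketch below matches a flat token sequence. The lexicons `PROTEINS` and `ASSOCIATION_WORDS` and the function name are hypothetical.

```python
import re

# Hypothetical lexicons, for illustration only.
PROTEINS = {"RAF1", "MEK1", "TP53"}
ASSOCIATION_WORDS = {"activates", "inhibits", "phosphorylates"}


def extract_effector_relations(sentence):
    """Return (effector, relation, effectee) triples from one sentence.

    For each association word, the nearest protein name to its left is
    taken as the effector and the nearest one to its right as the effectee.
    Token order stands in for the grammatical subject and object that a
    dependency parser would identify.
    """
    toks = re.findall(r"[A-Za-z0-9]+", sentence)
    triples = []
    for i, tok in enumerate(toks):
        if tok.lower() not in ASSOCIATION_WORDS:
            continue
        effector = next((t for t in reversed(toks[:i]) if t in PROTEINS), None)
        effectee = next((t for t in toks[i + 1:] if t in PROTEINS), None)
        if effector is not None and effectee is not None:
            triples.append((effector, tok.lower(), effectee))
    return triples
```

Because the sketch depends on word order and on an exact lexicon match, a passive construction such as "MEK1 is activated by RAF1" is not captured. Handling such variation is one reason syntactic approaches need either a parser or a large set of patterns.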
Several groups have used extracted relationships to create networks, including molecular interaction networks (Friedman C, Kra P, Yu H, Krauthammer M, Rzhetsky A. Genies: a natural-language processing system for the extraction of molecular pathways from journal articles. In: ISMB (supplement of bioinformatics); 2001. p. 74-82), gene-disease networks (Rindflesch T C, Libbus B, Hristovski D, Aronson A R, Kilicoglu H. Semantic relations asserting the etiology of genetic diseases. In: AMIA Annu Symp Proc 2003; 2003. p. 554-8), regulatory gene expression networks (Saric J, Jensen L J, Ouzounova R, Rojas I, Bork P. Extraction of regulatory gene/protein networks from MEDLINE. Bioinformatics 2006; 22(6):645-50), and gene-drug-disease networks (Tari L, Hakenberg J, Gonzalez G, Baral C. Querying parse tree database of MEDLINE text to synthesize user-specific biomolecular networks. In: Pacific symposium on biocomputing; 2009. p. 87-98). In order to be efficient, these syntactic approaches often rely on large sets of patterns and stable ontologies to guarantee performance on diverse sentence structures. Unfortunately, a systematic catalog of patterns for pharmacogenomics is not available (Dumontier M, Villanueva-Rosales N. Towards pharmacogenomics knowledge discovery on the semantic web. Briefings Bioinform 2009; 100:153-63; Coulet A, Smail-Tabbone M, Napoli A, Devignes M D. Suggested ontology for pharmacogenomics (SO-pharm): modular construction and preliminary testing. In: KSinBIT; 2006, LNCS 4277. p. 648-57).
The Semantic Web community has developed methods for learning ontologies from text using unsupervised approaches (Aussenac-Gilles N, Soergel D. Text analysis for ontology and terminology engineering. Appl Ontol 2005; 1(1):35-46; Buitelaar P, Cimiano P, Magnini B. Ontology learning from text: methods, evaluation and applications, vol. 123 of frontiers in artificial intelligence. IOS Press; 2005). Most of these efforts focus on learning hierarchies of concepts. Ciaramita et al. studied unsupervised learning of relationships between concepts (Ciaramita M, Gangemi A, Ratsch E, Saric J, Rojas I. Unsupervised learning of semantic relations between concepts of a molecular biology ontology. In: IJCAI; 2005. p. 659-64). Their method produces a network of concepts where edges are associated with precise semantics (e.g., Virus encodes Protein).
Other efforts have focused on enriching existing ontologies for NLP using Web content (Ontology Development Information Extraction (ODIE) project: http://www.bioontology.org/ODIE-project, [accessed 02.11.10]). Cilibrasi and Vitányi proposed a method to automatically learn the semantics of processed words, hypothesizing that semantically related words co-occur more frequently in Web pages than do unrelated words (Cilibrasi R, Vitányi PMB. Automatic meaning discovery using Google. In: Kolmogorov complexity and applications; 2006). Gupta and Oates used Web content to identify concept mappings for previously unrecognized words discovered while processing text (Gupta A, Oates T. Using ontologies and the web to learn lexical semantics. In: IJCAI; 2007. p. 1618-23).