1. Field of the Invention
This invention relates, generally, to text mining. More particularly, it relates to an automated, standardized method of mining text from various bodies of literature, for example, the extraction of relationships among proteins.
2. Description of the Prior Art
Biological text mining is known in the art [72], and example software applications and companies include GOPUBMED [73], PUBGENE [74], TPX [75], IPA by INGENUITY®, and NETPRO and XTRACTOR by Molecular Connections. Relationships between two terms, keywords or names, constitute a significant part of public knowledge. Much of this information is documented as unstructured text in different places and forms, such as books, articles and online pages. Although manual annotation has been improved, collecting this information from the literature must still be performed manually. This decreases efficiency, increases the incidence of error, hinders organization in a standardized format, and increases the cost of text mining.
A significant part of biological knowledge is centered on relationships among different biological terms including proteins, genes, small molecules, pathways, diseases, and gene ontology (GO) terms (collectively referred to herein as “bio-entities”). Information on bio-entity relationships, such as protein-protein interactions (PPIs), is indispensable for current understanding of the development of drugs and mechanisms of biological processes and complex diseases [1]. Due to the importance of such information, manual annotation has been used to extract information from scientific literature and deposit this information into various databases [2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19]. However, manual annotation is time- and resource-consuming, and it has become increasingly difficult to keep pace with the ever-increasing number of publications in the biomedical sciences. In recent years, computational methods have been developed to automatically extract molecular interaction information and other bio-entity relationships from the literature, and the resulting software has been used to assist human annotators in building databases [20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41].
Many computational studies have recently attempted to extract PPIs from published literature, mostly PubMed abstracts because they are easily accessible [42,43]. All methods detect PPIs based on some rules (or patterns, templates, etc.) that can be generated by two approaches: (1) specifying them manually [24,43,44,45,46,47,48,49,50,51,52,53,54,55], or (2) computationally inferring/learning them from manually annotated sentences [56,57,58].
Initial efforts at PPI detection were based on simple rules, such as co-occurrence, which assumes that two proteins likely interact with each other if they co-occur in the same sentence/abstract [59,60]. These approaches tend to produce a large number of false positives, and still require significant manual annotation.
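The co-occurrence rule described above can be illustrated with a minimal Python sketch. It is not the implementation of any cited method: the function name `cooccurring_pairs`, the naive sentence splitter, and the placeholder protein names `ProtA`, `ProtB` and `ProtC` are assumptions made only for illustration.

```python
import re
from itertools import combinations

def cooccurring_pairs(text, protein_names):
    """Return the protein pairs mentioned together in at least one sentence."""
    # Naive split after ., ! or ? followed by whitespace (an assumption;
    # real abstracts would need a proper sentence tokenizer).
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    pairs = set()
    for sentence in sentences:
        lowered = sentence.lower()
        # Whole-word, case-insensitive match of each known protein name.
        found = sorted(
            name for name in protein_names
            if re.search(r"\b" + re.escape(name.lower()) + r"\b", lowered)
        )
        pairs.update(combinations(found, 2))
    return pairs

# Hypothetical text with placeholder protein names.
abstract = (
    "ProtA binds ProtB in vitro. "
    "ProtC was not found to interact with ProtA. "
    "ProtB is expressed in many tissues."
)
pairs = cooccurring_pairs(abstract, ["ProtA", "ProtB", "ProtC"])
print(sorted(pairs))
```

In this example the pair (ProtA, ProtC) is reported even though the sentence negates the interaction, which is the kind of false positive that a pure co-occurrence rule tends to produce.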
Later studies, aiming to reduce the high false positive rate of earlier methods, used manually specified rules. Although such methods sometimes achieved a higher accuracy than co-occurrence methods by extracting only cases satisfying the rules, they have low coverage because they miss cases not covered by the limited number of manually specified rules [24,43,44,45,46,47,48,49,50,51,52,53,54,55].
Recently, machine learning based methods [56,57,58] have achieved better performance than other methods in terms of both decreasing the false positive rate and increasing the coverage by automatically learning the language rules from annotated texts. Huang [56] used a dynamic programming algorithm, similar to that used in sequence alignment, to extract patterns in sentences tagged by a part-of-speech tagger. Kim (2008a,b) used a kernel approach for learning genetic and protein-protein interaction patterns.
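To give a sense of how an alignment-style dynamic program can extract a shared pattern from tagged sentences, the following Python sketch computes the longest common subsequence of two tag sequences. It is a simplified stand-in, not the algorithm of reference [56]: the function name `common_tag_pattern`, the `PROTEIN` placeholder tag and the example tag sequences are assumptions.

```python
def common_tag_pattern(tags_a, tags_b):
    """Longest common subsequence of two tag sequences via dynamic programming."""
    n, m = len(tags_a), len(tags_b)
    # dp[i][j] = length of the LCS of tags_a[i:] and tags_b[j:].
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if tags_a[i] == tags_b[j]:
                dp[i][j] = 1 + dp[i + 1][j + 1]
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    # Trace forward through the table to recover one optimal pattern.
    pattern, i, j = [], 0, 0
    while i < n and j < m:
        if tags_a[i] == tags_b[j]:
            pattern.append(tags_a[i])
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return pattern

# Hypothetical tag sequences, e.g. "X interacts with Y" and "X directly binds Y".
sentence_1 = ["PROTEIN", "VBZ", "IN", "PROTEIN"]
sentence_2 = ["PROTEIN", "RB", "VBZ", "PROTEIN"]
pattern = common_tag_pattern(sentence_1, sentence_2)
print(pattern)
```

A pattern recovered this way (here, protein, verb, protein) could then be matched against unseen sentences; a practical system would also need scoring, gap penalties and filtering of uninformative patterns.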
Despite extensive studies, current techniques appear to have achieved only partial success on relatively small datasets. Specifically, Park tested their combinatory categorial grammar (CCG) method on 492 sentences and obtained a recall and precision of 48% and 80%, respectively [47]. The context-free grammar (CFG) method of Temkin et al. was tested on 100 randomly selected abstracts and obtained a recall and precision of 63.9% and 70.2%, respectively [46]. A preposition-based parsing method was tested on 50 abstracts with a precision of 70% [52]. A relational parsing method for extracting only inhibition relations was tested on 500 abstracts with a precision and recall of 90% and 57%, respectively [45]. Ono manually specified rules for four interaction verbs (interact, bind, complex, associate), which were tested on 1,586 sentences related to yeast and E. coli, and obtained an average recall and precision of 83.6% and 93.2%, respectively [53]. Huang et al. used a sequence alignment based dynamic programming approach and obtained a recall of 80.0% and a precision of 80.5% on 1,200 sentences extracted from online articles [56].
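For reference, the recall and precision figures quoted above follow the standard definitions: precision is the fraction of extracted interactions that are correct, and recall is the fraction of annotated interactions that were extracted. A minimal Python sketch follows; the function name `precision_recall` and the toy pairs are hypothetical and are not taken from any cited dataset.

```python
def precision_recall(predicted, gold):
    """Compute (precision, recall) for predicted pairs against annotated pairs."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Guard against empty sets to avoid division by zero.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical extracted and manually annotated interaction pairs.
predicted = {("A", "B"), ("A", "C"), ("B", "D")}
gold = {("A", "B"), ("B", "D"), ("C", "D"), ("A", "D")}
precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

Because both measures depend on the proportion of true interactions in the test set, figures obtained on differently composed datasets are not directly comparable.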
However, a closer analysis of Ono's and Huang's datasets shows that they are heavily biased in terms of the interaction words used. Ono's dataset contains just four interaction words, while in Huang's study, although more verbs were mentioned, the sentences containing “interact” and “bind” (and their variants) account for 59.3% of all 1,200 sentences. In Ono's dataset, there is an unrealistically high proportion of true samples (74.7%), making it much easier to obtain good recall and precision. In Huang's study, an arbitrary number of sentences were chosen from the 1,200 sentences as training data and the rest were used as testing data, whereas cross-validation tests should have been used. Kim et al. (2008b) developed a web server, PIE, tested their method on the BioCreative dataset [37,38,61], and achieved very good performance on the PPI article filter task.
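The cross-validation mentioned above, as opposed to a single arbitrary train/test split, can be sketched in Python as follows. The function name `k_fold_splits`, the choice of ten folds and the fixed random seed are assumptions for illustration; the integers stand in for sentence identifiers.

```python
import random

def k_fold_splits(items, k, seed=0):
    """Yield (train, test) lists so that every item is tested exactly once."""
    items = list(items)
    # Shuffle with a fixed seed so the split is reproducible.
    random.Random(seed).shuffle(items)
    # Deal items into k folds of near-equal size.
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

# 1,200 placeholder sentence identifiers split into 10 folds.
splits = list(k_fold_splits(range(1200), k=10))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

Averaging recall and precision over all folds gives an estimate that does not depend on which sentences happened to fall into a single test set.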
Accordingly, given the amount of information produced in digital format every day, what is needed is an automated, accurate, and thorough method of mining bio-entity information from the literature in a structured form. However, in view of the art considered as a whole at the time the present invention was made, it was not obvious to those of ordinary skill how the art could be advanced.
While certain aspects of conventional technologies have been discussed to facilitate disclosure of the invention, Applicants in no way disclaim these technical aspects, and it is contemplated that the claimed invention may encompass one or more of the conventional technical aspects discussed herein.
The present invention may address one or more of the problems and deficiencies of the prior art discussed above. However, it is contemplated that the invention may prove useful in addressing other problems and deficiencies in a number of technical areas. Therefore, the claimed invention should not necessarily be construed as limited to addressing any of the particular problems or deficiencies discussed herein.
In this specification, where a document, act or item of knowledge is referred to or discussed, this reference or discussion is not an admission that the document, act or item of knowledge or any combination thereof was at the priority date, publicly available, known to the public, part of common general knowledge, or otherwise constitutes prior art under the applicable statutory provisions; or is known to be relevant to an attempt to solve any problem with which this specification is concerned.