1. Field of the Invention
The present invention relates to an apparatus and method for recognizing biological named entities in biological literature based on the Unified Medical Language System (UMLS), in which the biological named entities are recognized and grouped.
2. Description of the Related Art
As active research in biology increases the volume of biological literature, demand for high-quality extraction of information from that literature increases as well. Protein names, gene names, and the names of components of laboratory organisms or living organisms constitute the core of the information in biological literature that describes the results of important biological studies. Information extraction is performed on literature to find the information subjects, the relations between the information subjects, and the information flow among them. Accordingly, to extract information from biological literature, the biological named entities that are the information subjects of the literature must first be accurately recognized and classified.
Generally, there are two methods of recognizing biological named entities. One is a rule-based method, in which an expert with biological knowledge creates various language resources and rules for a limited object domain, and named entities are recognized using those language resources and rules. The other is a statistics-based method, in which a large learning corpus of biological literature is constructed and a machine learning algorithm is applied to recognize named entities. The former method incurs a high cost in creating the language resources and rules, and the latter incurs a high cost in constructing the learning corpus of biological literature.
In the prior art, a technology for recognizing and extracting new names was registered as U.S. Pat. No. 5,819,265, “Processing names in a text,” on Oct. 6, 1998. However, that patent does not disclose the processing of biological literature based on UMLS. Moreover, the system according to that patent may operate erroneously when names appearing in the literature have similar forms or spellings but different meanings.
In other prior art, David A. Campbell and Stephen B. Johnson reported “A Technique for Semantic Classification of Unknown Words Using UMLS Resources” in Proceedings of the American Medical Informatics Association Symposium, pp. 716-720, in November 1999, and Irena Spasic, Goran Nenadic and Sophia Ananiadou reported “Using Domain-Specific Verbs for Term Classification” in Proceedings of the ACL 2003 Workshop on Natural Language Processing in Biomedicine, pp. 17-24, in July 2003. The methods for recognizing biological named entities disclosed in these articles require that UMLS and a corpus be used together, and their pattern rules are restricted to a specific form, which limits their ability to recognize the variety of newly coined named entities.