Field of the Invention
The present invention generally relates to the field of information processing technology, and more particularly, to a method and system for recognizing chemical names in a Chinese document.
Description of the Related Art
With the development of chemical science and technology, the volume of chemistry-related scientific literature, e.g., scientific papers and published patent documents, is increasing, and the number of such documents written in Chinese is increasing as well. Chemical name recognition technology is important for in-depth computer processing of these documents. Those skilled in the art will appreciate that chemical names refer to the names that appear in professional chemistry documents and can uniquely specify the corresponding chemical molecular structures. Chinese chemical names derive from the IUPAC nomenclature and Chinese common names. The objective of chemical name recognition technology is to automatically detect and identify chemical names in natural language documents, which is very useful for various data mining tasks in chemical or biochemical fields.
Research has been carried out on English chemical name recognition, and the existing approaches fall mainly into two types. The first uses a machine learning model that learns from training data to form annotators, and then uses the annotators to recognize chemical names in plain text documents. Such machine learning models mainly include the Hidden Markov Model (HMM) (Freitag and McCallum, 1999), the Maximum Entropy Markov Model (MEMM) (McCallum et al., 2000), and Conditional Random Fields (CRF) (Lafferty et al., 2001). The second carries out chemical name recognition based on dictionaries and on rules designed by experts.
To date, little technology exists for recognizing Chinese chemical names, for the following reasons. First, Chinese is linguistically more difficult to process than English: for example, Chinese text has no explicit word boundaries (whereas English words are separated by spaces), and Chinese has no capitalization information that can be utilized. These distinctive linguistic characteristics prevent English chemical name recognition technology from being applied directly to Chinese text. Second, current Chinese chemical nomenclature does not coincide precisely with the English chemical nomenclature system; rather, it is a mixture of traditional Chinese nomenclature and the IUPAC standard. Therefore, if a model learning approach is used, at least both the traditional Chinese nomenclature and the IUPAC standard should be taken into consideration. Third, far fewer Chinese chemical name resources are available than English ones, so it is difficult to carry out Chinese chemical name recognition by means of model learning.
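The word boundary difficulty noted above can be illustrated with a minimal sketch. It is an illustration only and does not represent the method of the present invention or any particular prior system. The two toy dictionaries, the sample sentences, and the function names `whitespace_lookup` and `longest_match_lookup` are all hypothetical. The sketch shows that a dictionary lookup over space-separated tokens finds chemical names in an English sentence but finds nothing in a Chinese sentence, where a character-level scan (here, a simple forward longest match) is needed instead.

```python
# Hypothetical toy dictionaries; a practical lexicon would be far larger.
ENGLISH_DICT = {"ethanol", "benzene"}
CHINESE_DICT = {"乙醇", "苯"}  # ethanol, benzene


def whitespace_lookup(text, lexicon):
    """Split on whitespace, strip punctuation, keep tokens in the lexicon."""
    tokens = (tok.strip(".,;。，；") for tok in text.split())
    return [tok for tok in tokens if tok in lexicon]


def longest_match_lookup(text, lexicon):
    """Scan character by character, taking the longest lexicon entry
    that starts at the current position (forward longest match)."""
    max_len = max(len(entry) for entry in lexicon)
    found, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in lexicon:
                found.append(candidate)
                i += size
                break
        else:
            # No entry starts here; move on by one character.
            i += 1
    return found


english = "The sample contains ethanol and benzene."
chinese = "该样品含有乙醇和苯。"  # same sentence in Chinese, no spaces

print(whitespace_lookup(english, ENGLISH_DICT))     # ['ethanol', 'benzene']
print(whitespace_lookup(chinese, CHINESE_DICT))     # []
print(longest_match_lookup(chinese, CHINESE_DICT))  # ['乙醇', '苯']
```

Even this character-level scan depends entirely on dictionary coverage, which ties in with the scarcity of Chinese chemical name resources mentioned above.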
Therefore, there is currently a need for a method and system for recognizing chemical names in a Chinese document.