1. Field of Invention
The present invention relates to language processing methods and systems, and more particularly, to a method and an apparatus for named entity recognition in natural language so as to extract language information and perform corresponding processes.
2. Description of Prior Art
A named entity refers to a set of words including person names, location names, organization names, time, quantity and the like. The recognition of named entities is widely used in information extraction and information retrieval.
Recently, gradual recognizing methods or chunk recognizing methods for named entities (NE) have shown higher performance; they are described, for example, in Chunking with Support Vector Machines, Taku Kudo, Yuji Matsumoto, NAACL, 2001. The major characteristic of these methods is that the recognition is divided into several successive steps, each of which scans one word of the input sentence and predicts a token for the current word by observing context features of the current word and applying a predetermined or stochastic method. Different methods use different sets of tokens, but these substantially include four types, B, I, E and O, which respectively represent the Beginning (B), Intermediate (I) and End (E) positions of a named entity and a word that is not part of a named entity (O). After the tokens of all the words in the input sentence have been determined, each string of consecutive B, I and E tokens forms a named entity. In each step of the recognition, the features used by the recognizer are local features contained in a feature window centered on the current word.
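Purely by way of illustration, the last step described above, in which strings of B, I and E tokens are collected into named entities, can be sketched as follows. The function name `extract_entities` and the untyped tokens "B", "I", "E" and "O" are assumptions made for this sketch and are not taken from the cited document; the sketch further assumes that an entity must open with B and close with E, and that an incomplete string is discarded.

```python
def extract_entities(words, tokens):
    """Collect strings of B, I, E tokens into entity spans.

    Returns a list of (start, end, entity_words), with end exclusive.
    Assumption of this sketch: an entity opens with "B" and closes with "E";
    an open span that meets "O" before its "E" is dropped.
    """
    if len(words) != len(tokens):
        raise ValueError("words and tokens must have the same length")

    entities = []
    start = None  # index of the currently open "B", if any
    for i, tok in enumerate(tokens):
        if tok == "B":
            start = i                      # open (or restart) a span
        elif tok == "I":
            continue                       # stays inside an open span; a stray "I" is ignored
        elif tok == "E":
            if start is not None:
                entities.append((start, i + 1, words[start:i + 1]))
            start = None                   # span closed
        else:                              # "O": not part of a named entity
            start = None
    return entities
```

For example, the tokens O, B, I, E, O over five words yield a single entity covering the second to fourth words.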
Table 1 is an example of a method for parsing a sentence from its beginning position to its end position, hereinafter referred to as forward parsing.
TABLE 1
PARSING DIRECTION: forward
WORDS       L2        L1        C               R1        R2
FEATURES    FL2       FL1       FC              FR1       FR2
TOKENS      TL2       TL1       TC              N/A*      N/A*
            prior words         current word    posterior words
In Table 1, C denotes the current word, L1 and L2 denote the left contexts of the current word, and R1 and R2 denote its right contexts. The size of the feature window is 5. FC, FL1, FL2, FR1 and FR2 are the features corresponding to the respective words in the feature window. TL1 and TL2 are the recognized tokens of the prior words. N/A denotes a feature that cannot be obtained at the current time instant, here the tokens of the posterior words, which have not yet been recognized.
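As a non-limiting sketch of the forward parsing of Table 1, the following shows how a feature window of size 5 might be assembled at each step. The names `window_features`, `forward_parse` and `predict` are hypothetical, the only features computed are the word and its length, and `predict` stands in for whatever predetermined or stochastic method the recognizer uses. The keys FL2, FL1, FC, FR1, FR2, TL2 and TL1 follow Table 1; tokens of the current and posterior words are omitted because they are not yet available.

```python
def window_features(words, prior_tokens, i, half_width=2):
    """Build the local features for position i (window size 2 * half_width + 1)."""
    feats = {}
    for offset in range(-half_width, half_width + 1):
        j = i + offset
        if offset == 0:
            slot = "C"
        elif offset < 0:
            slot = f"L{-offset}"
        else:
            slot = f"R{offset}"
        # Positions outside the sentence are represented by None.
        word = words[j] if 0 <= j < len(words) else None
        feats[f"F{slot}"] = {"word": word, "length": len(word) if word is not None else 0}
        if offset < 0:
            # Tokens are known only for prior words (already recognized).
            feats[f"T{slot}"] = prior_tokens[j] if 0 <= j < len(prior_tokens) else None
    return feats


def forward_parse(words, predict):
    """Scan the sentence from beginning to end, predicting one token per step."""
    tokens = []
    for i in range(len(words)):
        tokens.append(predict(window_features(words, tokens, i)))
    return tokens
```

Because `predict` receives only the dictionary returned by `window_features`, nothing outside the five-word window can influence the token of the current word.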
A feature is any information that can be observed in the context, for example, what the word is, the length of the word, the part of speech (POS) of the word, the token assigned to the word on the basis of the prior words, and the like, as shown in Table 2 below. The system designer decides which features to use in accordance with the characteristics of the application, so that the system achieves the highest recognition performance. In the forward parsing shown in Table 2, when the system observes all these features, it can predict the token “B-PER” for the current word (Deng).
TABLE 2
PARSING DIRECTION: forward
WORDS       (decision)          (inherit)           (Deng)              (Xiaoping)          (comrade)
FEATURES    {word = (decision), {word = (inherit),  {word = (Deng),     {word = (Xiaoping), {word = (comrade),
             length = 2,         length = 2,         length = 1,         length = 2,         length = 2,
             POS = adverb,       POS = verb,         POS = person name}  POS = person name}  POS = noun}
             token = O}          token = O}
TOKENS      O                   O                   B-PER*              N/A                 N/A
            PRIOR WORDS                             CURRENT WORD        POSTERIOR WORDS
In Table 2, the token “B-PER” denotes that the current word is the beginning of a person name.
Taking the word (inherit) in Table 2 as an example, the features of this word are shown in the 3rd row: the content of the word is (inherit), the length of the word is 2, the POS of the word is verb, and the token is O (it is not a named entity).
From the foregoing, it can be seen that a drawback of the gradual recognizing method is that only local features within one fixed-size feature window can be used. Since long-distance features are not used, a false alarm of the beginning border (token B) may result. That is, a position that is not the beginning border of a named entity might be recognized as one by the recognizer.
A variable length model is proposed in Named Entity Chunking Techniques in Supervised Learning for Japanese Named Entity Recognition, Manabu Sassano, Takehito Utsuro, COLING 2000: 705-711, in which the size of the feature window can vary within a predetermined range. However, this method cannot process features in a range of arbitrary length either. Some methods based on probabilistic models may use global features; see, for example, U.S. patent application Ser. No. 09/403,069, entitled “System for Chinese tokenization and named entity recognition”, submitted on Feb. 17, 2000. However, methods based on probabilistic models are greatly affected by data sparseness and need complicated decoding methods to search a huge candidate lattice space. When the training data or the computing resources are insufficient (for example, on an embedded device), the probabilistic models are not feasible.
In addition, the prior-art methods for recognizing named entities are greatly influenced by erroneous word segmentation. Named entity recognition based on word segmentation results cannot recover borders that were incorrectly divided during the word segmentation, so the correctness of the named entity recognition suffers. In the example of Table 3, a character string is falsely segmented, which directly causes a phrase to be falsely recognized as a named entity of type ORG (organization name). Actually, there is no named entity in that part of the sentence, but there is a named entity of type PER (person name) at the tail of the sentence. In this case, if a character-based model is used, the error caused by the erroneous word segmentation can be avoided.
TABLE 3
CHARACTERS                             [HEAD OF SENTENCE]
CORRECT WORD SEGMENTATION
PREDICTED WORD SEGMENTATION
PREDICTED POS                          ns    n    ns    Ng    n
WORD-BASED NAMED ENTITY RECOGNITION    ORG
In the document cited above, Kudo et al. select between the forward and backward recognition results by using a voting method so as to determine a final token, but the voting is applied to the token recognition result of each step, so only local features are still used. Additionally, many classifier combination methods are disclosed in other documents. However, global features are not used in these methods either.
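To illustrate why step-wise voting remains local, a deliberately simplified sketch is given below. It is not the voting scheme of Kudo et al.; the function name `vote_per_token`, the per-token confidence scores and the tie-breaking rule are all assumptions introduced for this illustration. Each position is decided independently from the forward and backward results for that position alone.

```python
def vote_per_token(forward, backward):
    """Choose a final token per position from two recognition passes.

    forward, backward: lists of (token, score) pairs, one per word.
    Assumption of this sketch: the higher score wins, ties go to the forward pass.
    """
    if len(forward) != len(backward):
        raise ValueError("both passes must cover the same words")

    final = []
    for (f_tok, f_score), (b_tok, b_score) in zip(forward, backward):
        # The decision looks only at this position; no sentence-level
        # (global) information about the candidate entity is consulted.
        final.append(f_tok if f_score >= b_score else b_tok)
    return final
```

Since every comparison involves a single position, a token B that is a false alarm in both passes survives the vote regardless of what the rest of the sentence contains.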