1. Field of the Invention
The present invention relates to a method for named-entity recognition and verification, and more particularly, to a method for named-entity recognition and verification suitable for different languages and application fields.
2. Description of Related Art
In information processing, named-entity (NE) recognition is an important task for many natural language applications, such as Internet search engines, document indexing, information extraction and machine translation. Its purpose is to find entities such as persons, locations, organizations, dates, times, percentages and monetary values in text documents. Moreover, in oriental languages (such as Chinese, Japanese and Korean), NE recognition is even more important because it significantly affects the performance of word segmentation, the most fundamental task in understanding texts in these languages. To provide better performance, it is therefore important to accurately incorporate named-entity information into the aforementioned natural language applications.
There are two major approaches to NE recognition: the handcrafted approach and the statistical approach. In the first approach, a system usually relies on a large number of handcrafted rules. For example, if the term “Mayor” appears in the text and the next word is a given name, the system will determine the subsequent words to be a family name. Systems of this type can be prototyped rapidly and allow a computer to process texts with ease. The shortcoming is that the number of rules may increase rapidly, so the systems become harder to maintain and difficult to scale up. Another serious problem with the handcrafted approach is that the system is hard to port across different domains (for example, porting a system originally designed to search for people's names to search for toponyms) and across different languages. Porting a handcrafted system usually means rewriting all of its rules.
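By way of illustration only, the “Mayor” rule described above may be sketched as follows. The word lists and the function name apply_title_rule are hypothetical and are not taken from any particular system, and the sketch assumes for simplicity that the family name is a single word.

```python
# Minimal sketch of one handcrafted rule (all names and word lists are hypothetical).
TITLES = {"Mayor"}               # assumed trigger terms
GIVEN_NAMES = {"John", "Mary"}   # assumed given-name lexicon


def apply_title_rule(tokens):
    """Return (start, end) index spans of person names found by the rule.

    Rule: a title followed by a known given name marks the word after the
    given name as a family name (assumed here to be exactly one word).
    """
    spans = []
    for i, tok in enumerate(tokens):
        has_room = i + 2 < len(tokens)  # need a given name and a family name
        if tok in TITLES and has_room and tokens[i + 1] in GIVEN_NAMES:
            spans.append((i + 1, i + 3))  # given name plus family name
    return spans


tokens = "Yesterday Mayor John Smith visited the school".split()
for start, end in apply_title_rule(tokens):
    print(" ".join(tokens[start:end]))  # prints "John Smith"
```

Each additional title, name pattern, domain or language would require further rules of this kind, which is the maintenance burden noted above.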
To eliminate the above problems, the statistical approach was developed. In general, the statistical approach to NE recognition can be viewed as a two-stage process. First, according to dictionaries and/or pattern matching rules, the input text is tokenized into tokens. Each token may be a word or an NE candidate, which can consist of more than one word. Then, a statistical model, such as an N-gram model, is used to select the most likely token sequence. Finally, the tokens labeled as NE candidates are picked out from the most likely token sequence. Although statistical NE recognition is much more scalable and portable, its performance is still not satisfactory. Furthermore, the design of each matching rule significantly influences the final result. A similar problem is thus encountered: the number of rules keeps growing and the system becomes larger. Therefore, the above conventional named-entity recognition methods need to be improved.
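By way of illustration only, the conventional statistical process described above may be sketched as follows. The sketch assumes a unigram model (the simplest case of an N-gram model), a toy NE candidate dictionary, and made-up probabilities; the names NE_CANDIDATES, LOG_PROB, candidate_tokens, best_sequence and extract_entities are all hypothetical.

```python
import math

# Hypothetical NE candidate dictionary: word tuple -> entity type.
NE_CANDIDATES = {
    ("New", "York"): "LOCATION",
    ("New", "York", "Times"): "ORGANIZATION",
}
# Made-up unigram log-probabilities for tokens (words and NE candidates).
LOG_PROB = {
    ("New", "York", "Times"): math.log(0.01),
    ("New", "York"): math.log(0.02),
    ("New",): math.log(0.01),
    ("York",): math.log(0.001),
    ("Times",): math.log(0.005),
}
UNKNOWN_LOG_PROB = math.log(1e-4)  # assumed score for unseen tokens


def candidate_tokens(words, start):
    """Stage 1: yield (end, token, label) for each token beginning at start."""
    yield start + 1, (words[start],), None  # a plain single-word token
    for cand, label in NE_CANDIDATES.items():
        end = start + len(cand)
        if tuple(words[start:end]) == cand:
            yield end, cand, label  # a multi-word NE candidate


def best_sequence(words):
    """Stage 2: dynamic-programming search for the most likely token sequence."""
    best = [(-math.inf, None)] * (len(words) + 1)
    best[0] = (0.0, None)
    for start in range(len(words)):
        score_so_far = best[start][0]
        for end, token, label in candidate_tokens(words, start):
            score = score_so_far + LOG_PROB.get(token, UNKNOWN_LOG_PROB)
            if score > best[end][0]:
                best[end] = (score, (start, token, label))
    # Trace back from the end of the text to recover the chosen tokens.
    sequence, pos = [], len(words)
    while pos > 0:
        start, token, label = best[pos][1]
        sequence.append((token, label))
        pos = start
    return sequence[::-1]


def extract_entities(words):
    """Pick out the tokens labeled as NE candidates from the best sequence."""
    return [(" ".join(tok), label) for tok, label in best_sequence(words) if label]


print(extract_entities("He reads the New York Times daily".split()))
```

With the toy figures assumed here, the three-word candidate outscores both the shorter candidate followed by a single word and the three separate words, so it is the one picked out. Changing the dictionary entries or the probabilities changes the result, which reflects the sensitivity to rule design noted above.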