With the spread of computer technology and the rapid development of the Internet, an abundance of information is becoming available in electronic document form. To meet the challenge posed by this information explosion, automatic tools are urgently needed to help people extract the pieces of information they need from a vast sea of information. Information extraction (IE) emerged against this background.
Information extraction is a form of shallow text processing that locates a specified set of relevant information (entities, events, etc.) in a natural-language document, with the objective of structuring and tabulating the text information. The primary function of an information extraction system is to extract particular entity information. An information extraction process typically 1) identifies and 2) extracts specific information located in unstructured textual data, and 3) generates the output as requested. Such technology is disclosed by, for example, N. Catala, N. Castell, M. Martin, ESSENCE: a Portable Methodology for Acquiring Information Extraction Patterns, Proceedings of the 14th European Conference on Artificial Intelligence (ECAI-2000), 411-415, Berlin, 2000, which is herewith incorporated by reference. The extracted information is described structurally and can be stored directly in a database for user queries, further analysis, and utilization.
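The three steps above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the information of interest can be captured by two regular expressions; the pattern table `PATTERNS`, the function `extract_records`, and the sample sentence are hypothetical and are not taken from the cited reference.

```python
import re

# Hypothetical patterns chosen for illustration only; a real IE system
# would use far richer rules or learned models.
PATTERNS = {
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


def extract_records(text):
    """Return one structured record per piece of information found in text."""
    records = []
    for entity_type, pattern in PATTERNS.items():
        # Steps 1 and 2: identify where the information sits and extract it.
        for match in pattern.finditer(text):
            records.append({
                "type": entity_type,
                "value": match.group(0),
                "start": match.start(),
                "end": match.end(),
            })
    # Step 3: generate the output as structured rows, ordered by position.
    return sorted(records, key=lambda record: record["start"])


sample = "Contact alice@example.com before 2000-08-20 about the ECAI paper."
records = extract_records(sample)
for record in records:
    print(record)
```

Each record is a flat row of fields, so a list of such records could be written to a database table without further processing.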
There are two main approaches to the design of IE systems: the Knowledge Engineering Approach and the Automatic Training Approach. These approaches are discussed in more detail in, for example, Appelt, D. E. and Israel, D. J., Introduction to Information Extraction Technology, in Proceedings of the 16th International Joint Conference on Artificial Intelligence, 1999, which is herewith incorporated by reference.
The Knowledge Engineering Approach is characterized by manually compiled rules that enable the IE system to handle information extraction tasks in a particular knowledge domain. It requires the “knowledge engineer” who compiles the rules to be thoroughly familiar with that domain, so the skill of the knowledge engineer plays an important role in the level of performance achieved by the overall system. In addition to requiring skill and detailed knowledge of a particular IE system, the Knowledge Engineering Approach usually requires a great deal of additional labor to optimize the system's performance. Building a high-performance system is usually an iterative process: a set of rules is written, the system is run over an annotated training corpus, and the output is examined to see where the rules under-generate or over-generate. The knowledge engineer then modifies the rules accordingly and repeats the process until a complete set of rules is achieved. This is a difficult and time-consuming task that demands a high level of expertise.
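One pass of this write, run, and examine loop might look like the following sketch. It assumes that each rule is a single regular expression and that the annotated corpus pairs each text with the names an annotator marked in it; `evaluate_rule`, the three draft rules, and the two-sentence corpus are hypothetical examples, not part of any particular IE system.

```python
import re


def evaluate_rule(pattern, annotated_corpus):
    """Run one rule over an annotated corpus and report its two error types."""
    rule = re.compile(pattern)
    missed, spurious = [], []
    for text, gold in annotated_corpus:
        found = set(rule.findall(text))
        # Under-generation: annotated items the rule failed to extract.
        missed.extend(sorted(set(gold) - found))
        # Over-generation: extracted items that were never annotated.
        spurious.extend(sorted(found - set(gold)))
    return missed, spurious


# Tiny hand-annotated corpus (hypothetical): text plus the person names in it.
corpus = [
    ("Dr. Smith met Mr. Jones in Paris.", ["Smith", "Jones"]),
    ("Prof. Lee wrote to Dr. Wu.", ["Lee", "Wu"]),
]

too_broad = r"\b[A-Z][a-z]+\b"                 # any capitalized word
too_narrow = r"Dr\.\s([A-Z][a-z]+)"            # only names after "Dr."
revised = r"(?:Dr|Mr|Prof)\.\s([A-Z][a-z]+)"   # names after a known title

for draft in (too_broad, too_narrow, revised):
    print(draft, evaluate_rule(draft, corpus))
```

The knowledge engineer would inspect the two lists after each run and adjust the rule until both are empty on the training corpus, which is the manual effort the paragraph above describes.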
The Automatic Training Approach does not require such a professional knowledge engineer; that is, it is not necessary to have someone on hand with detailed knowledge of how the IE system works or how to write rules for it. Instead, the user must provide a large, representative training corpus. The system is trained mainly on annotated samples and derives its rules from them. Anyone who knows enough about the domain and the task can take a corpus of texts and annotate it, according to pre-defined criteria, for the information being extracted. Once trained, the system can process entirely new texts. Typically, the annotations focus on one particular aspect of the system's processing. For example, a name recognizer would be trained by annotating a corpus of texts with the domain-relevant proper names. Once a suitable training corpus has been annotated, a training algorithm is run, yielding information that the system can employ in analyzing novel texts.
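The name recognizer example can be made concrete with a deliberately simple sketch. It assumes a toy training algorithm that learns "trigger" tokens, meaning tokens that repeatedly appear immediately before an annotated name; this learner, the functions `train` and `recognize`, the `min_count` threshold, and the sample corpus are all hypothetical and stand in for whatever training algorithm a real system would use.

```python
from collections import Counter


def train(annotated, min_count=2):
    """Learn tokens that repeatedly precede an annotated name."""
    triggers = Counter()
    for tokens, labels in annotated:
        for i in range(1, len(tokens)):
            # Count the token to the left of each annotated name.
            if labels[i] == "NAME" and labels[i - 1] != "NAME":
                triggers[tokens[i - 1]] += 1
    # Keep only triggers seen often enough to be trusted.
    return {token for token, count in triggers.items() if count >= min_count}


def recognize(tokens, triggers):
    """Tag capitalized tokens that directly follow a learned trigger."""
    return [token for previous, token in zip(tokens, tokens[1:])
            if previous in triggers and token[:1].isupper()]


# Annotated training corpus (hypothetical): tokens plus one label per token.
training_corpus = [
    ("Dr. Smith met Dr. Jones".split(),
     ["O", "NAME", "O", "O", "NAME"]),
    ("Yesterday Mr. Lee called Mr. Wu".split(),
     ["O", "O", "NAME", "O", "O", "NAME"]),
    ("Paris welcomed Dr. Chen".split(),
     ["O", "O", "O", "NAME"]),
]

triggers = train(training_corpus)
novel_text = "Mr. Brown thanked Dr. Gray in Rome".split()
names = recognize(novel_text, triggers)
print(names)
```

The annotator only labels names in text; no rule is written by hand, and the quality of the result depends on how much annotated text is supplied.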
Although many methods have been proposed for extracting information from unstructured text, none of them produces satisfactory results, owing to the limitations of existing learning and training algorithms. In the Knowledge Engineering Approach, constructing IE patterns is very time-consuming and requires the knowledge engineer who writes the rules to have in-depth domain knowledge that an ordinary user lacks. The Automatic Training Approach is less time-consuming than the Knowledge Engineering Approach, but it requires sufficient training data to ensure high processing quality. The major limitation of existing Automatic Training Approaches for building IE patterns is their dependence on linguistic processing, machine learning, or data mining techniques. Most of these methods need an annotated training corpus, and annotation is tedious work that must be done by a domain expert.
In addition, in the traditional scenario, the tools used to write electronic documents are independent of the tools that users employ to manage documents, such as information extraction (IE) systems. Currently, a writer preparing content does not consider how the reader will make use of it. At the same time, from the information-access point of view, the user finds it very difficult to obtain exactly the information he or she wants.
Moreover, current technologies work mainly at the level of word understanding, whereas real-world applications, such as electronic document management tools and electronic document information extraction tools, need sentence-level and document-level understanding together with semantic capabilities in order to truly meet customers' requirements.