1. Field of the Invention
The present invention generally relates to knowledge information processing and, more particularly, to a general architecture of a framework for information extraction from natural language (NL) documents. The framework can be configured and integrated in applications and may be extended by user-built information extractors.
2. Background Description
Businesses and institutions generate many documents in the course of their commerce and activities. These are typically written for exchange between persons without any plan for machine storage and retrieval. The documents, for purposes of differentiation, are described as “natural language” documents as distinguished from documents or files written for machine storage and retrieval.
Natural language documents have for some time been archived on various media, originally as images and more recently as converted data. More specifically, documents available only in hard copy form are scanned and the scanned images processed by optical character recognition software to generate machine language files. The generated machine language files can then be compactly stored on magnetic or optical media. Documents originally generated by a computer, such as with word processor, spreadsheet or database software, can of course be stored directly to magnetic or optical media. In the latter case, the formatting information is part of the data stored, whereas in the case of scanned documents, such information is typically lost.
There is a significant advantage from a storage and archival standpoint to storing natural language documents in this way, but there remains a problem of retrieving information from the stored documents. In the past, this has been accomplished by separately preparing an index to access the documents. Of course, the effectiveness of this technique depends largely on the design of the index. A number of full text search software products have been developed which will respond to structured queries to search a document database. These, however, are effective only for relatively small databases and are often application dependent; that is, capable of searching only those databases created by specific software applications.
The natural language documents of a business or institution represent a substantial resource for that business or institution. However, that resource is only as valuable as the ability to access the information it contains. Considerable effort is now being made to develop software for the extraction of information from natural language documents. Such software is generally in the field of knowledge based or expert systems and uses such techniques as parsing and classifying. The general applications, in addition to information extraction, include classification and categorization of natural language documents and automated electronic data transmission processing and routing, including E-mail and facsimile.
It is therefore an object of the present invention to provide a framework for information extraction from natural language documents which is application independent and provides a high degree of reusability.
It is another object of the invention to provide a framework for information extraction which integrates different Natural Language/Machine Learning techniques, such as parsing and classification.
According to the invention, there is provided an architecture of a framework for information extraction from natural language documents which is integrated in an easy-to-use access layer. The framework performs general information extraction, classification/categorization of natural language documents, automated electronic data transmission (e.g., E-mail and facsimile) processing and routing, and parsing.
Inside the framework, requests for information extraction are passed to the actual extractors. The framework can handle both pre- and post-processing of the application data, control the extractors, and enrich the information they extract. The framework can also suggest necessary actions the application should take on the data. To achieve the goal of easy integration and extension, the framework provides an integration (outside) application program interface (API) and an extractor (inside) API. The outside API is for the application program that wants to use the framework, allowing the framework to be integrated by calling simple functions. The extractor API is the API for doing the actual processing. The architecture of the framework allows the framework to be extended by providing new libraries exporting certain simple functions.
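By way of illustration only, the two-API arrangement described above might be sketched as follows. This is a minimal sketch and not the disclosed implementation: every name in it (Framework, register_extractor, extract, ExtractionResult, keyword_classifier) and the routing rule are hypothetical, and a trivial keyword test stands in for a real classifier or parser.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical extractor (inside) API: an extractor is any callable that
# maps document text to a dictionary of extracted values.
Extractor = Callable[[str], Dict[str, str]]


@dataclass
class ExtractionResult:
    """Assumed result shape: extracted fields plus suggested actions."""
    fields: Dict[str, str] = field(default_factory=dict)
    suggested_actions: List[str] = field(default_factory=list)


class Framework:
    def __init__(self) -> None:
        self._extractors: Dict[str, Extractor] = {}

    # Inside API: extend the framework with a new extractor.
    def register_extractor(self, name: str, extractor: Extractor) -> None:
        self._extractors[name] = extractor

    # Outside API: the single call an application makes.
    def extract(self, document: str) -> ExtractionResult:
        text = self._preprocess(document)
        result = ExtractionResult()
        # Control of the extractors: run each one and namespace its output.
        for name, extractor in self._extractors.items():
            for key, value in extractor(text).items():
                result.fields[f"{name}.{key}"] = value
        self._postprocess(result)
        return result

    def _preprocess(self, document: str) -> str:
        # Pre-processing (assumed): collapse runs of whitespace.
        return " ".join(document.split())

    def _postprocess(self, result: ExtractionResult) -> None:
        # Enrichment (assumed): record how many fields were extracted.
        result.fields["meta.field_count"] = str(len(result.fields))
        # Action suggestion (assumed rule): route complaints.
        if result.fields.get("category.label") == "complaint":
            result.suggested_actions.append("route_to_customer_service")


def keyword_classifier(text: str) -> Dict[str, str]:
    """Hypothetical user-built extractor: a toy keyword classifier."""
    label = "complaint" if "refund" in text.lower() else "other"
    return {"label": label}


framework = Framework()
framework.register_extractor("category", keyword_classifier)
result = framework.extract("Please   send me a refund\n for order 17.")
print(result.fields)
print(result.suggested_actions)
```

In this sketch the application touches only extract, while a user-built extractor touches only register_extractor and the callable signature, so either side can change without affecting the other.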