The amount of textual data in modern society is growing continuously. The reasons for this are varied, but important driving forces are the widespread deployment of personal computer systems and databases, and the continuously increasing volume of electronic mail. The result is the widespread creation, diffusion and required storage of document data in various forms and manifestations.
While the overall trend is positive, as the diffusion of knowledge through society is generally deemed to be a beneficial goal, a problem is created in that the amount of document data can far exceed the abilities of an interested person or organization to read, assimilate and categorize the document data.
While textual data may at present represent the bulk of document data, and is primarily discussed in the context of this patent application, increasingly documents are created and distributed in multi-media form, such as in the form of a document that contains both text and images (either static or dynamic, such as video clips), or in the form of a document that contains both text and audio.
In response to the increasing volume of text-based document data, it has become apparent that some efficient means to manage this increasing corpus of document data must be developed. This field of endeavor can be referred to as unstructured information management, and may be considered to encompass both the tools and methods that are required to store, access, retrieve, navigate and discover knowledge in (primarily) text-based information.
For example, as business methods continue to evolve, there is a growing need to process unstructured information in an efficient and thorough manner. Examples of such information include recorded natural language dialog, multi-lingual dialog, text translations, scientific publications, and others.
Commonly assigned U.S. Pat. No. 6,553,385 B2, “Architecture of a Framework for Information Extraction from Natural Language Documents”, by David E. Johnson and Thomas Hampp-Bahnmueller, describes a framework for information extraction from natural language documents that is application independent and that provides a high degree of reusability. The framework integrates different Natural Language/Machine Learning techniques, such as parsing and classification. The architecture of the framework is integrated in an easily-used access layer. The framework performs general information extraction, classification/categorization of natural language documents, automated electronic data transmission (e.g., e-mail and facsimile) processing and routing, and parsing. Within the framework, requests for information extraction are passed to information extractors. The framework can accommodate both pre-processing and post-processing of application data and control of the extractors. The framework can also suggest necessary actions that applications should take on the data. To achieve the goal of easy integration and extension, the framework provides an integration (external) application program interface (API) and an extractor (internal) API.
The disclosure of U.S. Pat. No. 6,553,385 B2 is incorporated herein by reference insofar as it does not conflict with the teachings of this invention.
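By way of illustration only, the two-API arrangement summarized above for the referenced framework (an integration (external) API used by applications, and an extractor (internal) API implemented by information extractors, with pre-processing and post-processing around each request) might be sketched as follows. This is a minimal sketch and not the implementation of U.S. Pat. No. 6,553,385 B2: every class, method and parameter name shown (Extractor, KeywordCategorizer, ExtractionFramework, register, request, suggest_action) is a hypothetical name assumed for this example, and the keyword matching merely stands in for the parsing and classification techniques that such a framework may integrate.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Protocol


class Extractor(Protocol):
    """Hypothetical extractor (internal) API: what each extractor implements."""

    def extract(self, text: str) -> Dict[str, object]:
        ...


class KeywordCategorizer:
    """Toy extractor (assumed for illustration) that categorizes by keyword."""

    def __init__(self, keywords: Dict[str, List[str]]) -> None:
        self.keywords = keywords

    def extract(self, text: str) -> Dict[str, object]:
        lowered = text.lower()
        for category, words in self.keywords.items():
            if any(word in lowered for word in words):
                return {"category": category}
        return {"category": "unknown"}


@dataclass
class ExtractionFramework:
    """Hypothetical integration (external) API: what an application calls."""

    extractors: Dict[str, Extractor] = field(default_factory=dict)
    preprocessors: List[Callable[[str], str]] = field(default_factory=list)
    postprocessors: List[Callable[[Dict[str, object]], Dict[str, object]]] = field(
        default_factory=list
    )

    def register(self, name: str, extractor: Extractor) -> None:
        self.extractors[name] = extractor

    def request(self, name: str, document: str) -> Dict[str, object]:
        # Pre-process the application data before it reaches the extractor.
        for pre in self.preprocessors:
            document = pre(document)
        # Pass the request to the named information extractor.
        result = self.extractors[name].extract(document)
        # Post-process the result, e.g. to suggest an action to the application.
        for post in self.postprocessors:
            result = post(result)
        return result


def suggest_action(result: Dict[str, object]) -> Dict[str, object]:
    """Assumed post-processor: map a category to a suggested routing action."""
    routes = {"billing": "route_to_accounts"}
    action = routes.get(str(result["category"]), "manual_review")
    return {**result, "suggested_action": action}


framework = ExtractionFramework()
framework.preprocessors.append(str.strip)
framework.postprocessors.append(suggest_action)
framework.register(
    "categorize", KeywordCategorizer({"billing": ["invoice", "payment"]})
)
result = framework.request("categorize", "  Please find the attached INVOICE.  ")
print(result)
```

In this sketch the application sees only register and request, while each extractor sees only extract, so an extractor may be added or replaced without changing application code.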
What is needed is an ability to efficiently and comprehensively process documentary data from a variety of sources and in a variety of formats to extract desired information from the documentary data for purposes that include, but are not limited to, searching, indexing, categorizing, and data and text mining.