1. The Field of the Present Invention
The present invention relates generally to an apparatus, system and method for the creation of a fully configurable, customizable, adaptive, and scriptable natural language processing software application development system. The invention consists of document management, semantic exploration, application development and deployment, and application feedback systems, each with task-specific graphical interfaces.
2. General Background
Information extraction (IE) applications are natural language processing (NLP) systems used on their own or coupled with information retrieval, text analytics, and text mining systems to identify, normalize, and remove duplicate information elements found in documents. IE applications are used to discover and organize the latent meaningful and fine-grained content elements of documents. These content elements include such information “entities” as persons, places, times, objects, events, and relationships among them. For example, an IE task in finance and business might consist of processing business articles and press releases to identify, normalize, and relate the names of companies, stock ticker symbols, employees and corporate officers, times, and events such as mergers and acquisitions. These information elements are thereby made suitable for storage and retrieval by database and information retrieval systems. In the finance and business example, these data might be used to alert investors, bankers, and brokers of significant business transactions or to assist in the detection of business fraud or insider trading.
IE is related to but distinct from information retrieval (IR). IR is concerned with searching and retrieving documents or document passages that provide information that is relevant to a user's query, usually supplied in natural language as a few terms. Document clustering and classification are related NLP techniques that can provide other types of high-level document search or navigation aids. They complement IR by organizing documents or sections of documents into meaningfully related groups and sub-groups based on content. Another related NLP technology is document summarization, which attempts to find a small number of passages in one or more documents that characterize their content succinctly. Still another related NLP technology is question answering, which attempts to find passages in documents or construct answers from documents that represent the answers to questions such as “When was Abraham Lincoln born?” or “Why is the sky blue?”.
IE plays a role in IR because it identifies and normalizes information in natural language documents. This information improves the quality of search indexes and enables alternative navigation methods. It also brings IR closer to fielded database search because the diversity of expression in text documents has been disciplined through normalization. For instance, in the mergers and acquisitions example, the names of companies, persons, products, times, and events would be represented in a uniform and conventional manner. This makes it significantly easier to identify business activities for a given company such as IBM even if the original texts mentioned the company in many different ways (for example, using fully expressed names such as “International Business Machines Corporation” and “International Business Machines” or the acronym “IBM”).
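The normalization described above can be sketched as a simple alias lookup. This is only an illustrative sketch: the alias table `CANONICAL_NAMES` and the function `normalize_company` are hypothetical names introduced here, and a practical system would derive its aliases from curated or learned data rather than a hand-written dictionary.

```python
import re

# Hypothetical alias table (an assumption for illustration only).
# Keys are lower-cased surface forms; values are the canonical name.
CANONICAL_NAMES = {
    "international business machines corporation": "IBM",
    "international business machines": "IBM",
    "ibm": "IBM",
}


def normalize_company(mention: str) -> str:
    """Map a company mention to its canonical form.

    Whitespace is collapsed and case is ignored before lookup.
    Mentions that are not in the alias table are returned unchanged.
    """
    key = re.sub(r"\s+", " ", mention).strip().lower()
    return CANONICAL_NAMES.get(key, mention)
```

With such a mapping, a search index can store one uniform value for every variant mention, which is what brings text search closer to fielded database search.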
Although the development process for any given IE application will differ in important details from that for other IE applications, development normally requires a series of steps that are broadly similar over the spectrum of IE applications. A developer (or, for more complex information extraction tasks, a team of developers) creates an IE application. The code and data for the application may be entirely new, or may be based on existing materials. The problem the IE application is to solve may be well understood in advance, or problem analysis and exploration may be an essential part of the process. In some cases, the exploration and analysis of a specific data set may be the final objective of an IE application, rather than the creation of a deployable system for analyzing similar documents.
During development, the developer may need to evaluate the IE application, by running the application on training and benchmark data, to understand how development is progressing and how accurately the application is expected to perform its intended task.
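One common way to quantify such an evaluation is to compare the application's output on benchmark data against hand-labeled reference annotations. The sketch below assumes precision, recall, and F1 as the accuracy measures; the text above does not name specific metrics, so this choice, the function name `score_extractions`, and the representation of each extraction as a (text, label) pair are assumptions made for illustration.

```python
def score_extractions(predicted: set, gold: set) -> tuple:
    """Return (precision, recall, f1) for a set of predicted extractions.

    Each extraction is assumed to be a hashable item such as a
    (text, label) pair. An extraction counts as correct only when it
    matches a gold item exactly.
    """
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Tracking these figures across development iterations gives the developer a rough indication of whether changes are improving the application or degrading it.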
When the application performs with the required accuracy, it is deployed. IE applications may be deployed in a wide variety of environments, for many different purposes. In some environments, applications process large amounts of text in batch. In others, applications process individual documents, or small numbers of them, from time to time on behalf of specific end-users. In yet others, end-users may explore a document collection for well-established or “ad hoc” categories of information. In some environments, it is possible for end-users to provide feedback about the output of an extraction application. Given an interface with the necessary features, users can specify data in documents that the application should have found but did not, or found partially or completely in error. Under some circumstances, such feedback can be used for immediate adaptation of the underlying extraction system, where the application's accuracy is improved by automatically adjusting its behavior in response to the feedback.
In any case, when user feedback is available, it can be used by developers to make improvements to the application. Such improvements are deployed in subsequent versions of the application, whether the goal is to improve the application's performance, or to adapt or revise the application for a different domain or set of information extraction targets.
Traditionally, IE applications have been developed by labor-intensive construction of hand-crafted rules; and more recently by applying machine-learning techniques on the basis of hand-annotated document sets, or by some combination of the two approaches. Both approaches have proved to be expensive and time-consuming, to demand significant discipline and quality control, and to require extensive domain knowledge and specialized expertise. IE applications have consequently been costly and hard to develop, maintain, and customize for specific or different environments or needs. These factors have limited the market for IE applications to organizations or companies with significant financial resources, and to information tasks for which the financial return on the investment made in development is high.
In addition to the complexities already mentioned, there are numerous ways an IE application may need to be customized or adapted. For example, a developer must determine which document structures (such as headings, sections, lists, or tables) or genres (E-mails, letters, or reports) the IE application should treat in a specific manner, or even ignore. Solutions to this problem, in existing systems, are often fragile and difficult to generalize since they are written for a specific application, domain, site, user, genre, or document structure. A developer must also determine which linguistic components (such as lexicons, word tokenization, morphology, and syntactic analysis) must be created or modified to deal with the unique linguistic properties of documents for the proposed extractions in the proposed domains. As a rule, linguistic components do not produce equally good results for all domains and document genres. For example, the style, vocabulary, and syntax of medical documents differ significantly from those of news articles. Linguistic components tuned to perform well in one domain are often less accurate in other domains.
A developer must likewise determine which specific domain- or site-specific information elements and relationships (such as persons, organizations, places, and other entities, times, events, and relationships among them) should be extracted. Experience has demonstrated that IE for a given information element developed for one domain often does not perform well in another domain, or even for another information source in the same domain. Furthermore, different domains often require completely different extraction targets. For instance, a biomedical application may be interested only in biochemical and genetic information while a business application may be interested only in stock prices.
A developer must also determine how IE targets and associated concepts should be organized: that is, the developer must create an “ontology” of the concepts relevant to the information extraction task. An ontology organizes and disciplines the development process (specifying the extraction categories, how they are defined, and how they relate to each other) and also provides inferencing capabilities for the IE application and applications built on top of the IE application. For example, in the ontology for a medical IE application, if “diabetes mellitus” is an “endocrine system disorder”, it is possible to relate it to “acromegaly” and “hypothyroidism” and vice versa since they are also endocrine system disorders. Ontological relationships make it much easier to normalize, organize, and relate extracted entities; and consequently to search and navigate across them. Furthermore, medical ontologies such as SNOMED International's SNOMED CT, a complex clinical medical nomenclature, possess rich semantic inter-connections to many other types of medical knowledge and allow a user to relate, for example, “diabetes mellitus” to the “pancreas” (anatomical site) and “insulin” (in two ways: deficient production of this hormone results in diabetes; and insulin injections, a medication, are used to treat diabetes).
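The kind of inference an ontology supports can be illustrated with a toy “is-a” hierarchy. The sketch below is an assumption-laden illustration, not a representation of SNOMED CT or of any particular ontology format: the dictionary `IS_A`, the added top-level concept “disorder”, and the functions `ancestors` and `siblings` are all hypothetical.

```python
# Toy "is-a" hierarchy: each concept maps to its single parent concept.
# The top-level "disorder" node is an assumption added for illustration.
IS_A = {
    "diabetes mellitus": "endocrine system disorder",
    "acromegaly": "endocrine system disorder",
    "hypothyroidism": "endocrine system disorder",
    "endocrine system disorder": "disorder",
}


def ancestors(concept: str) -> list:
    """Return the chain of parent concepts, nearest parent first."""
    chain = []
    while concept in IS_A:
        concept = IS_A[concept]
        chain.append(concept)
    return chain


def siblings(concept: str) -> set:
    """Return the other concepts that share this concept's parent."""
    parent = IS_A.get(concept)
    if parent is None:
        return set()
    return {c for c, p in IS_A.items() if p == parent and c != concept}
```

Here `siblings("diabetes mellitus")` relates the concept to “acromegaly” and “hypothyroidism” through their shared parent, which mirrors the example in the text. Real medical ontologies allow multiple parents and many relationship types beyond “is-a”, which this single-parent sketch does not model.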
At present, developing, customizing, or adapting an IE application demands weeks or months of labor by highly skilled specialists. Substantially shorter times, less expertise, and significantly less effort are necessary for IE applications to find a wider audience.
Machine-learning classifiers have been demonstrated to be highly successful techniques for identifying targets of interest for information extraction such as entities (persons, places, organizations), events, times, and relationships among them. Nevertheless, they are still not commonly used in commercial IE applications principally because of the difficulties and associated expense in obtaining sufficient labeled training data.
Information extraction research has also demonstrated how large unlabeled document collections and targeted developer feedback (such as in “active learning”) can be used to train production classifiers either singly or in combination. These techniques likewise have been rarely employed in commercial IE applications. The result is that, even when classifiers are used, they are typically created during the development process and are subsequently “frozen,” that is, treated as static components in the deployed application. It is well recognized that natural language systems cannot anticipate the diversity and complexity of linguistic expression. This is the principal reason that text and speech applications incorporate adaptation and feedback techniques. For example, spell checkers include at a minimum a “user dictionary” for words not found in the standard production word list. Speech recognition systems perform regular acoustic and language model adaptation to align themselves with the speech patterns of their users. These adaptive features increase the usability of such applications when they are deployed in specific environments for specific tasks. In contrast, the errors of IE applications may be so noticeable or frustrating that users, in the absence of any techniques to reduce these errors, may abandon an application entirely as defective. There is therefore a need for an IE application that can adapt to the data it works on and the behavior of its users, showing improvement when mistakes are detected and corrected. For example, an IE application could learn from its successes and from its mistakes, such as when a person name has been mislabeled as an organization name, a company name has not been properly normalized, or an “employee-of” relationship between a person and a company is mistaken or missing.
One factor that has limited the exploitation of user feedback in information extraction applications is the difficulty of discerning the source of the error in the complex cascade of prior decisions that produced the erroneous result. Even if the source of the error can be established, it is unlikely that users, as opposed to highly skilled developers and information extraction experts, will be able to know how to modify the system or propose which application component should be adapted with the user feedback.
Furthermore, users often want to understand how complex IE applications make decisions. Providing explanations for the results of information extraction applications that rely on a complex cascade of analyses is very difficult even for someone intimately knowledgeable about the workings of the given IE application.
Documents are not just continuous sequences of words. Documents are organized through the use of text, text attributes, punctuation, whitespace, and graphics to separate, consolidate, and create relationships among the meaningful text elements that constitute the content of documents. Some document structures are so widely used that they have almost become formulas (for example, the address blocks and time expressions of business letters). Some document structures are encoded in consistent and standardized formats, such as E-mail headers, because purely automatic means exploit these formats to route documents over the Internet. However, the usual situation is far more complex.
For example, SEC financial reports must observe certain requirements for what kinds of financial information must be reported. Nevertheless, the organization and presentation of this information is as varied as the companies who file them. This problem can be reduced significantly if only a few sections of these financial reports are to be analyzed; all the same, the amount of variation in, say, profit and loss statements, is still daunting.
Irregularity and lack of uniformity of document structure is the rule rather than the exception. Document format standards are almost entirely absent in almost all fields in which information extraction is employed. Of course, information extraction applications are designed to process unstructured or partially structured documents. However, most information extraction applications have a very limited ability to accurately identify, categorize, normalize, and generally manage document structure and the content it encompasses. This is unfortunate because “unstructured” documents do typically have structure that is relevant to information extraction. Information extraction applications would benefit if the structure of documents were easily and reliably available. There are some sections of documents that are significant for information extraction. For example, the headers and footers of medical documents contain demographic metadata such as department name, report type, patient and physician name, patient age, patient sex, and so forth that may be important to extract. This metadata may in turn be critical to evaluating the content of the document: for example, progress notes normally are structured quite differently from discharge summaries. Similarly, breast cancer for a female patient is coded with a different ICD-9-CM medical billing code than breast cancer for a male patient.
A reliable source of information is also important. Some structural elements reliably contain information essential for information extraction. For example, E-mail headings contain the time the E-mail was sent. This time, in turn, can be used to resolve time references in the body of the E-mail (for example, “today”, “yesterday”, or “last week”). The “Medications” section of a medical discharge summary is generally a reliable source of current medications.
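The resolution of relative time references against an E-mail header can be sketched as follows. This is a minimal illustration under stated assumptions: the offset table `RELATIVE_OFFSETS` and the function `resolve_relative_date` are hypothetical, only the three expressions mentioned above are handled, and “last week” is simplified to the date seven days before the sending date rather than a week-long interval. The header is parsed with the standard-library function `email.utils.parsedate_to_datetime`.

```python
from datetime import date, timedelta
from email.utils import parsedate_to_datetime
from typing import Optional

# Hypothetical offsets for the expressions named in the text.
# "last week" is simplified to a single date seven days earlier.
RELATIVE_OFFSETS = {
    "today": timedelta(days=0),
    "yesterday": timedelta(days=1),
    "last week": timedelta(weeks=1),
}


def resolve_relative_date(expression: str, date_header: str) -> Optional[date]:
    """Resolve a relative time expression against an E-mail Date header.

    Returns the resolved calendar date in the sender's time zone,
    or None if the expression is not in the offset table.
    """
    offset = RELATIVE_OFFSETS.get(expression.strip().lower())
    if offset is None:
        return None
    sent = parsedate_to_datetime(date_header)
    return (sent - offset).date()
```

For a message whose header reads `Tue, 14 Mar 2006 09:30:00 -0500`, the expression “yesterday” in the body would resolve to 13 March 2006.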
Some structural elements of documents should be ignored (or “filtered”) during information extraction. For example, page turns and their associated document headers and footers should be ignored so that text is contiguous and uninterrupted. Furthermore, a selective extraction technique is usually desired: different sections of documents may need to be processed or interpreted in different ways. For example, a physician concerned with the on-going treatment of a patient may not want to give prominence to the diseases or procedures mentioned in the “History” sections of medical documents.
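Selective processing of this kind presupposes that a document has already been divided into labeled sections. Assuming such a division is available, filtering can be sketched as below; the function `select_sections`, the (heading, text) representation, and the sample section names in the usage note are hypothetical and serve only as an illustration.

```python
def select_sections(sections: list, skip_headings: set) -> list:
    """Return the (heading, text) pairs whose heading is not skipped.

    Headings are compared case-insensitively. The input is assumed to
    be the output of an earlier document structure analysis step.
    """
    skip = {heading.lower() for heading in skip_headings}
    return [(h, t) for h, t in sections if h.lower() not in skip]
```

For instance, calling `select_sections(sections, {"History"})` on a discharge summary divided into “History”, “Medications”, and “Assessment” sections would pass only the latter two on to extraction. The difficult part in practice is the section identification itself, which this sketch takes as given.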
Other types of semi-structured data require special processing. Section headings should not be run into the text that precedes or follows them. Lists should be interpreted as sequences of individual items. The elements of tables should be interpreted as items identified by column and row labels.
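The interpretation of table elements as items identified by column and row labels can be sketched as follows. The function `table_to_items` and its input layout (a list of column labels plus a list of rows, each a row label with its cell values) are assumptions made for illustration; recovering that layout from raw text is a separate and harder problem.

```python
def table_to_items(column_labels: list, rows: list) -> list:
    """Flatten a table into items keyed by row and column label.

    `rows` is a list of (row_label, cell_values) pairs, where
    cell_values lines up positionally with column_labels.
    """
    items = []
    for row_label, cells in rows:
        for column_label, value in zip(column_labels, cells):
            items.append(
                {"row": row_label, "column": column_label, "value": value}
            )
    return items
```

Each resulting item carries its own row and column context, so a cell value is never read as free-standing text detached from the labels that give it meaning.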
In spite of the evident significance of document structure analysis to information extraction, present IE applications typically have weak and fragile methods for managing or incorporating document structure in their processing. Ad hoc labor-intensive manual techniques are the state of the art. This process involves inspecting a set of (hopefully representative) examples of the documents to be processed and programming by trial and error a set of text processing programs or “scripts” to identify the data relevant for information extraction. This process is laborious, error-prone, and time-consuming. Furthermore, the resulting analysis frequently fails to generalize to other sets of documents, with the result that this process must be repeated for each document genre encountered.
In addition, when the structure of the documents being processed changes (e.g., a financial report may be required to present new types of financial information), the structure processing scripts must be re-inspected and possibly revised. These changes usually take place unannounced, so an information extraction application may be ineffectual for some time until the document structure change is recognized and taken into account.
There is consequently a need for a document structure analysis system designed to remedy these recognized deficiencies of state-of-the-art systems.
A persistent problem for developers and users of IE applications is that they are not always fully aware of the nature of the documents to be analyzed or of the information extraction targets that ought to be sought. In addition to the previously discussed techniques for document structure analysis, one effective development method is to employ semantic exploration techniques. Semantic exploration engages unsupervised and partially supervised methods for identifying hidden but salient semantic patterning in document collections. Semantic exploration assists IE in several ways: to understand the contents and organization of document collections used for developing IE applications; to reveal semantic categories that are potential extraction targets; to better understand the nature of already proposed extraction targets; and to provide sources of semantic data, such as “gazetteers” (lists of semantically related proper names), that play a role in creating an IE application.
Semantic exploration can be used to identify valuable extraction targets that are found in documents but were not considered important or even known to exist. For example, a collection of astronomy papers may also include the names and characteristics of astronomical instruments that are used to observe stellar spectrum data. Sometimes the categories identified by semantic exploration may not themselves be information extraction targets per se, but they can provide useful and sometimes highly reliable contexts for other information extraction targets. For example, an IE application may not be directly interested in place names or the names of saints, but place and saint names can often provide very reliable contexts for finding hospital names. Some semantic categories may be valuable because identifying them avoids confusing them with other extraction targets. Including such “negative” categories often improves the performance of IE applications. In addition, there is a need for semantic exploration techniques that can take advantage of structured knowledge sources such as ontologies and thesauri. This allows developers to identify concepts in documents that have been previously incorporated into these structured knowledge sources.
In spite of these evident benefits, to date semantic exploration techniques have only infrequently been integrated into the development of IE applications. It is desirable that such a semantic exploration system provide several ways for developers and users to gain semantic insight into documents targeted for information extraction. On the one hand, a semantic exploration system is needed with the ability to discover semantic categories in document collections quickly, without complex linguistic pre-processing, and demanding only limited developer input and feedback. On the other hand, a semantic exploration system is also needed to explore patterning such as document clusters, document categories, similar documents, and related terms in document collections. This system should exploit but not require semantic knowledge bases and annotations, if available, and may require greater developer input and feedback.