1. The Field of the Present Invention
The present invention relates generally to an apparatus, system and method for improving the quality of natural language processing applications. The invention consists of an ensemble of per-user, adaptive, on-line machine-learning classifiers that adapt to document content and judgments of users by continuously and automatically incorporating feedback from application results and corrections that users apply to these results.
2. General Background
Information extraction (IE) and text mining systems are natural language processing (NLP) systems that identify, normalize, and remove duplicates of information elements found in documents. Information extraction systems are used to discover and organize the latent meaningful and fine-grained content elements of documents. These content elements include such entities as persons, places, times, objects, events, and relationships among them. For example, an information extraction task in finance and business might consist of processing business articles and press releases to identify and relate the names of companies, stock ticker symbols, employees and officers, times, and events such as mergers and acquisitions. These information elements are suitable for storage and retrieval by database and information retrieval systems. In the finance and business example, these data might be used to alert investors, bankers, and brokers of significant business transactions.
Information extraction is related to but distinct from information retrieval (IR). Information retrieval is concerned with searching and retrieving documents or document passages that correspond to a user's query, usually supplied in natural language as a few terms or even a question. Document clustering and classification are related natural language processing (NLP) techniques that can provide other types of high-level document navigation aids to complement IR by organizing documents into meaningfully related groups and sub-groups based on content. Additional related NLP technologies are document summarization, which attempts to find the passages of one or more documents that characterize their content succinctly, and question answering, which attempts to find passages in documents or construct answers from documents that represent the answers to questions such as “When was Abraham Lincoln born?” or “Why is the sky blue?”
Information extraction plays a role in IR because it identifies and normalizes information in natural language documents and thereby makes this information searchable. It also brings information retrieval closer to fielded database search because the diversity of expression in text documents has been disciplined through normalization. In the mergers and acquisitions example, the names of companies, persons, products, times, and events would be represented in a uniform manner. This makes it significantly easier to identify business activities for a given company such as IBM even if the original texts had many different ways of mentioning the company (e.g., “IBM”, “International Business Machines Corporation”, “International Business Machines”).
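The normalization step described above can be illustrated with a minimal sketch. It assumes a small hand-written alias table; the names ALIASES and normalize_company are hypothetical and are used only for illustration, not taken from any particular system.

```python
# Hypothetical alias table mapping lowercased surface forms to one canonical name.
# A real system would curate or learn these mappings; this is an assumption.
ALIASES = {
    "ibm": "IBM",
    "international business machines": "IBM",
    "international business machines corporation": "IBM",
}


def normalize_company(mention: str) -> str:
    """Return the canonical name for a mention, or the mention itself if unknown."""
    # Lowercase and collapse runs of whitespace before the lookup.
    key = " ".join(mention.lower().split())
    return ALIASES.get(key, mention)


mentions = [
    "IBM",
    "International Business Machines Corporation",
    "International  Business Machines",
]
# All three surface forms collapse to a single searchable value.
canonical = {normalize_company(m) for m in mentions}
```

Once every mention maps to one canonical value, a search for the company becomes a simple equality test, much as in a fielded database.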
Information extraction systems have traditionally been developed by labor-intensive construction of hand-crafted rules and, more recently, by applying machine-learning techniques to hand-annotated document sets. Both approaches have been expensive and time-consuming, and both demand significant discipline, quality control, extensive domain knowledge, and specialized expertise. Information extraction systems have consequently been hard and costly to develop, maintain, and customize for specific or different environments or needs. This has limited the audience for information extraction systems.
There are numerous ways an information extraction system needs to be customized or adapted. For example, information extraction systems are typically customized to determine which document structures (such as headings, sections, lists, or tables) or genres (E-mails, letters, or reports) should be treated in a specific manner, or ignored. Solutions to this problem, in existing systems, are often fragile and difficult to generalize since they are written for a specific application, domain, site, user, genre, or document structure.
In addition, the linguistic components of information extraction systems (such as lexicons, word tokenization, morphology, and syntactic analysis) must often be customized to deal with the unique language properties of documents in the proposed domains. It is sometimes claimed that generalized linguistic components produce good results irrespective of the domain or genre, but experience does not support this contention. For example, the kind of language found in medical documentation differs significantly from that found in news articles in vocabulary and syntax, among other things. Experience shows that linguistic components tuned to perform well in one of these domains are likely to be much less accurate in the other.
Furthermore, it must also be determined which domain- or site-specific information extraction elements (such as persons, organizations, places, other entities, times, and events) and which relationships among them should be extracted. Experience demonstrates that extraction of a given entity type, when developed for one domain, often does not perform well in other domains. Different domains often demand completely different extraction targets. For instance, a biomedical application may be interested in biochemical and genetic information while a business application may be interested in stock prices.
Lastly, it is necessary to determine how the information extraction elements should be understood and related to each other in an ontology. An ontology not only organizes and disciplines the development process (what are the extraction categories, how are they defined, and how do they relate to each other), but also provides inferencing capabilities for applications that use the output of an information extraction system. For example, if “diabetes mellitus” is an “endocrine system disorder”, it is possible to relate it to “acromegaly” and “hypothyroidism” and vice versa. Ontological relationships make it much easier to normalize, organize, and relate extracted entities, and consequently to search and navigate across them. Furthermore, rich medical ontologies such as SNOMED CT possess interconnections to many other types of medical knowledge and allow a user to relate “diabetes mellitus” to the “pancreas” (anatomical site) and to “insulin” in two specific ways: deficient production of the hormone results in diabetes, and insulin injections are used to treat diabetes.
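The kind of inference described in the diabetes example can be sketched as follows. This is a toy illustration under stated assumptions: the IS_A table and the related_by_parent function are hypothetical names, and the table holds only the three concepts mentioned above rather than any actual SNOMED CT content.

```python
# Toy is-a table containing only the concepts named in the text (an assumption;
# a real ontology is far larger and allows multiple parents per concept).
IS_A = {
    "diabetes mellitus": "endocrine system disorder",
    "acromegaly": "endocrine system disorder",
    "hypothyroidism": "endocrine system disorder",
}


def related_by_parent(concept: str) -> set:
    """Return the other concepts that share the same parent category."""
    parent = IS_A.get(concept)
    if parent is None:
        return set()
    return {c for c, p in IS_A.items() if p == parent and c != concept}


siblings = related_by_parent("diabetes mellitus")
```

Because the relation is derived from the shared parent, it holds in both directions: looking up "acromegaly" returns "diabetes mellitus" as well.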
At present, developing, customizing, or adapting information extraction systems demands weeks or months of labor by highly skilled specialists. Substantially shorter times, less expertise, and significantly less effort are necessary for information extraction systems to find a wider audience.
Machine-learning classifiers and classifier ensembles have been used extensively in information extraction. They are highly successful techniques for identifying targets of interest for information extraction such as entities (persons, places, organizations), events, and times, as well as relationships among them.
It has become more and more common to use large unlabeled document collections and user feedback (for example, using “active learning”) to train production classifiers either singly or in combination. However, the resulting classifiers are typically “frozen” or “static” after this initial development. Specifically, these classifiers do not adapt or improve further from user feedback as the information extraction application generates results or as the user modifies or corrects information extraction results.
Furthermore, when an information extraction result is erroneous, it is difficult, even for experts, to discern the source of the error in the complex cascade of prior decisions that produced it. Even if the source of the error can be discerned, it is unlikely that users, as opposed to highly skilled experts, will know how to modify the system or which classifier should be adapted with the user feedback.
Finally, users often want to understand how complex systems make decisions. Providing explanations for the results of information extraction applications that rely on a complex cascade of analyses is very difficult even for someone intimately knowledgeable about the workings of the given information extraction application.
3. Deficiencies of the Prior Art
Meta machine-learning methods such as classifier bagging, boosting, stacking, co-training, and combination have demonstrated improved prediction quality over the best stand-alone machine-learning algorithms. Other meta machine-learning techniques that involve user review and feedback, such as active learning, have been developed to reduce the time and cost of developing machine-learning applications. These techniques provide a repertoire of machine-learning techniques that not only have improved accuracy, but are also better adapted to the anticipated environment of use of the machine-learning application. However, these techniques have been applied only to developing a machine-learning application, not to improving an already developed and deployed application. As a rule, natural language processing systems developed using machine-learning techniques are fixed and immutable. Once a natural language processing application is deployed, it is fixed until a new version is developed and deployed, even though the documents it is intended to process may be different or may change markedly over time; even though the user's vocabulary, syntax, and style may be substantially different from anything seen during development; even though the user's behavior on the given task may differ from what the development data would predict; and even though the task may change. When this happens, an application may perform very poorly. This can result in lowered productivity, reduced user satisfaction, and sometimes in the abandonment of the application entirely. At present, the sole recourse is to have the development staff identify a set of important problems experienced by the user and modify the system accordingly. As the number of users increases and the variability and complexity of document collections and user environments grow, these requests become correspondingly more frequent and difficult to manage.
Furthermore, the problems also become more and more difficult and time-consuming to remedy since a repair to one user's problems may cause errors elsewhere for the same or other users.
The sole exception to this practice in natural language processing applications has been speech recognition. Single-user large-vocabulary speech recognition applications use acoustic and language model adaptation first when the user begins using a speech recognition system and then periodically as the user creates more and more dictated text. Adaptation is not done continuously because the underlying speech recognition statistical models do not adapt well with small increments of training data and the computational effort of updating acoustic models is high. Speech recognition systems do typically become better adapted to their users and have reduced error rates, but the adaptation process may happen only after a considerable amount of dictation has taken place. One consequence of this delay is that a user is forced to correct the same recognition errors repeatedly until sufficient data has been collected to train the speech recognition system not to make these errors.
Functionality that continuously tunes the application using information about the user's document collection, environment of use, and behavior is therefore needed.
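The difference between a frozen classifier and one that adapts to each correction can be made concrete with a minimal sketch. It uses a simple mistake-driven perceptron update as a stand-in for on-line learning in general; it is not presented as the method of the invention, and the class name OnlinePerceptron and the feature strings are hypothetical.

```python
class OnlinePerceptron:
    """Binary classifier updated one example at a time (illustrative only)."""

    def __init__(self):
        # Sparse weights keyed by feature name; unseen features count as 0.
        self.weights = {}

    def predict(self, features):
        score = sum(self.weights.get(f, 0.0) for f in features)
        return 1 if score > 0 else 0

    def update(self, features, label):
        """Adjust weights only when the current prediction disagrees with the label."""
        if self.predict(features) != label:
            delta = 1.0 if label == 1 else -1.0
            for f in features:
                self.weights[f] = self.weights.get(f, 0.0) + delta


clf = OnlinePerceptron()
# Hypothetical features for a token the user marks as a company name.
features = ["word=acme", "next=corp"]

before = clf.predict(features)  # untrained model predicts the negative class
clf.update(features, 1)         # a single user correction is applied at once
after = clf.predict(features)   # the same input is now classified as positive
```

In this sketch a single correction changes the next prediction for the same input, in contrast to the delayed, batch-style adaptation described above for speech recognition.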