1. The Field of the Present Invention
The present invention relates generally to an apparatus, system and method for exploring and organizing document collections. The present invention provides techniques for identifying related terms and for exploring relationships among concepts in a document collection using structured knowledge bases. Concepts may be represented by document metadata, annotations, or linguistic patterns identified in the document collection and structured knowledge bases.
2. General Background
Information extraction (IE) and text mining systems are natural language processing (NLP) systems that identify and normalize information elements found in documents and remove duplicates among them. Information extraction systems are used to discover and organize the latent meaningful and fine-grained content elements of documents. These content elements include such entities as persons, places, times, objects, events, and relationships among them. For example, an information extraction task in finance and business might consist of processing business articles and press releases to identify and relate the names of companies, stock ticker symbols, employees and officers, times, and events such as mergers and acquisitions. These information elements are suitable for storage and retrieval by database and information retrieval systems. In the finance and business example, these data might be used to alert investors, bankers, and brokers of significant business transactions.
Information extraction is related to but distinct from information retrieval (IR). Information retrieval is concerned with searching and retrieving documents or document passages that correspond to a user's query, usually supplied in natural language as a few terms or even a question. Document clustering and classification are related natural language processing (NLP) techniques that can provide other types of high-level document navigation aids to complement IR by organizing documents into meaningfully related groups and sub-groups based on content. Additional related NLP technologies are document summarization, which attempts to find the passages of one or more documents that characterize their content succinctly or generate summaries based on these passages, and question answering, which attempts to find passages in documents or construct answers from documents that represent the answers to questions such as “When was Abraham Lincoln born?” or “Why is the sky blue?”
Information extraction plays a role in IR because it identifies and normalizes information in natural language documents and thereby makes this information searchable. It also brings information retrieval closer to fielded database search because the diversity of expression in text documents has been disciplined through normalization. In the mergers and acquisitions example, the names of companies, persons, products, times, and events would be represented in a uniform manner. This makes it significantly easier to identify business activities for a given company such as IBM even if the original texts had many different ways of mentioning the company (e.g., “IBM”, “International Business Machines Corporation”, “International Business Machines”).
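By way of illustration only, the normalization described above may be sketched in Python as follows. The alias table and the function name normalize_company are hypothetical and are assumed solely for this example; a practical system would derive such mappings from far richer resources.

```python
# Hypothetical alias table: lowercased surface form -> canonical name.
ALIASES = {
    "ibm": "IBM",
    "international business machines": "IBM",
    "international business machines corporation": "IBM",
}


def normalize_company(mention):
    """Return the canonical name for a mention, or the mention unchanged."""
    key = " ".join(mention.lower().split())  # case-fold and collapse whitespace
    return ALIASES.get(key, mention)


mentions = [
    "IBM",
    "International Business Machines Corporation",
    "International  Business Machines",
]
# All three surface forms collapse to one searchable value.
canonical = {normalize_company(m) for m in mentions}
```

Once every mention carries the same canonical value, a search for that value behaves much like a fielded database lookup.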
Information extraction systems have traditionally been developed by labor-intensive construction of hand-crafted rules, and more recently by applying machine-learning techniques on the basis of hand-annotated document sets. Both approaches have been expensive and time-consuming, and both demand significant discipline and quality control as well as extensive domain knowledge and specialized expertise. Information extraction systems have consequently been difficult and costly to develop, maintain, and customize for specific or different environments or needs. This has therefore limited the audience for information extraction systems.
There are numerous ways an information extraction system needs to be customized or adapted. For example, information extraction systems are typically customized to determine which document structures (such as headings, sections, lists, or tables) or genres (E-mails, letters, or reports) should be treated in a specific manner or ignored. Solutions to this problem, in existing systems, are often fragile and difficult to generalize since they are written for a specific application, domain, site, user, genre, or document structure.
In addition, the linguistic components of information extraction systems (such as lexicons, word tokenization, morphology, and syntactic analysis) must often be customized to deal with the unique language properties of documents in the proposed domains. It is sometimes claimed that generalized linguistic components produce good results irrespective of the domain or genre, but experience does not support this contention. For example, the kind of language found in medical documentation differs significantly from that found in news articles in vocabulary and syntax, among other things. Experience shows that linguistic components tuned to perform well in one of these domains are likely to be much less accurate in the other.
Furthermore, it also must be determined which domain- or site-specific information extraction elements and relationships (such as persons, organizations, places, and other entities, times, events, and relationships among them) should be extracted. Experience demonstrates that information extraction for a given entity developed for one domain often does not perform well in other domains. Different domains often demand completely different extraction targets. For instance, a biomedical application may be interested in biochemical and genetic information while a business application may be interested in stock prices.
Lastly, it is necessary to determine how the information extraction elements should be understood and related to each other in an ontology. An ontology organizes and disciplines the development process by defining the extraction categories and their interrelationships, and also provides inferencing capabilities for applications that use the output of an information extraction system. For example, if “diabetes mellitus” is an “endocrine system disorder”, it is possible to relate it to “acromegaly” and “hypothyroidism” and vice versa. Ontological relationships make it much easier to normalize, organize, and relate extracted entities; and consequently to search and navigate across them. Furthermore, rich medical ontologies such as SNOMED CT possess inter-connections to many other types of medical knowledge and allow a user to relate “diabetes mellitus” to the “pancreas” (anatomical site) and “insulin” (in two ways: deficient production of this hormone results in diabetes; and insulin injections are used to treat diabetes).
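The kind of inference described above may be illustrated, under simplifying assumptions, by the following Python sketch. The three-entry "is-a" table is a toy fragment constructed for this example and is not actual SNOMED CT data; the function name related_by_parent is likewise hypothetical.

```python
from typing import List

# Toy "is-a" table (child -> parent); a hypothetical fragment, not SNOMED CT data.
IS_A = {
    "diabetes mellitus": "endocrine system disorder",
    "acromegaly": "endocrine system disorder",
    "hypothyroidism": "endocrine system disorder",
}


def related_by_parent(concept: str) -> List[str]:
    """Return the other concepts that share the given concept's parent."""
    parent = IS_A.get(concept)
    if parent is None:
        return []
    return sorted(c for c, p in IS_A.items() if p == parent and c != concept)


siblings = related_by_parent("diabetes mellitus")
```

Because the relation is computed from the shared parent, it holds in both directions: looking up "acromegaly" returns "diabetes mellitus" as well.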
At present, developing, customizing, or adapting information extraction systems demands weeks or months of labor by highly skilled specialists. Substantially shorter times, less expertise, and significantly less effort are necessary for information extraction systems to find a wider audience.
Machine-learning classifiers and classifier ensembles have been used extensively in information extraction. They are highly successful techniques for identifying targets of interest for information extraction such as entities (persons, places, organizations), events, and times; and relationships among them.
It has become more and more common to use large unlabeled document collections and user feedback (for example, using “active learning” and “co-training”) to train production classifiers either singly or in combination. However, the resulting classifiers are typically “frozen” or “static” after this initial development. Specifically, these classifiers do not adapt or improve further from user feedback as the information extraction application generates results, and the user modifies or corrects information extraction results.
Furthermore, when an information extraction result is erroneous, it is difficult, even for experts, to discern the source of the error in the complex cascade of prior decisions that produced it. Even if the source of the error can be discerned, users, as opposed to highly skilled experts, are unlikely to know how to modify the system or which classifier should be adapted with the user feedback.
Finally, users often want to understand how complex systems make decisions. Providing explanations for the results of information extraction applications that rely on a complex cascade of analyses is very difficult even for someone intimately knowledgeable about the workings of the given information extraction application.
Semantic exploration and discovery (SED) refers to a range of unsupervised and supervised methods for identifying salient latent semantic patterning in document collections. SED results play two important roles in information extraction: to assist in understanding and organizing the content of document collections; and to reveal the latent semantic categories that might play a role in designing an information extraction system.
Developers of information extraction systems are not always fully aware of the nature of the documents to be analyzed or of the information extraction targets to be sought. SED lets “the data speak for itself” to the developers. A developer typically starts the development of an information extraction application with a rough notion of the information extraction targets and how they manifest themselves in natural language. An analysis of a document set might reveal that it contains additional information extraction targets that could be helpful to the application's users. For example, a collection of astronomical papers may include the names and characteristics of astronomical instruments that complement stellar spectrum data. Furthermore, the proposed information extraction targets may manifest themselves in natural language quite differently or unexpectedly in a document set. This may suggest modifications to the definitions of the information extraction targets and how and where they are to be extracted. SED therefore can play a significant role in the development of information extraction systems.
First, SED can be used to create an initial pool of relevant examples for the IE system by identifying information extraction targets that users will find valuable to identify regularly in new documents.
Second, SED can be used to identify supporting contexts that can improve the performance of an IE system. Some words, phrases, text patterns, and other linguistic contexts may not be significant as information extraction targets per se, but they may be helpful to the information extraction process itself. For example, an information extraction system may not be directly interested in place names, but lists of place names may indirectly provide reliable contexts for extraction patterns for other information extraction targets, such as the names of hospitals and government buildings. Similarly, if in a medical information extraction application it is observed that current medications are reliably dictated in a limited number of document sections, for instance, the medications and plan sections, then identifying these sections can markedly improve the accuracy of medication extraction.
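The section-based context in the medical example may be sketched in Python as follows. The report text, the heading format (an upper-case label followed by a colon on a single line), and the function name sections_of_interest are all assumptions made for this illustration; real dictated reports are considerably less regular.

```python
import re

# Hypothetical report; each section is assumed to fit on one line.
REPORT = """HISTORY: Patient previously took warfarin.
MEDICATIONS: metformin, lisinopril
PLAN: continue metformin; start atorvastatin"""

TARGET_SECTIONS = {"MEDICATIONS", "PLAN"}
HEADING = re.compile(r"^([A-Z ]+):\s*(.*)$")


def sections_of_interest(report):
    """Keep only the text of sections whose heading is a target section."""
    kept = {}
    for line in report.splitlines():
        match = HEADING.match(line)
        if match and match.group(1) in TARGET_SECTIONS:
            kept[match.group(1)] = match.group(2)
    return kept


kept = sections_of_interest(REPORT)
```

A medication extractor applied only to the kept text would not see "warfarin", which appears in the history section and is not a current medication in this example.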
Finally, SED can be used to identify negative examples: some categories are valuable precisely because they should be excluded as extraction targets (that is, they are “negative” evidence). In other words, the accuracy of an IE application can be improved by reliably excluding text content that the information extraction target cannot be. For example, by reliably identifying Social Security and telephone numbers, an information extraction application reduces the size of the pool of hyphenated numbers such as year ranges and IDs and ensures that they are not misidentified as Social Security and telephone numbers.
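The use of negative evidence in the foregoing example may be sketched in Python as follows. The patterns assume US-style formats (three-two-four digits for Social Security numbers and three-three-four digits for telephone numbers); these patterns, the sample text, and the function name remaining_candidates are illustrative assumptions only.

```python
import re

# Assumed US-style shapes; real data would need more robust recognizers.
SSN = re.compile(r"\d{3}-\d{2}-\d{4}")
PHONE = re.compile(r"\d{3}-\d{3}-\d{4}")
HYPHENATED = re.compile(r"\b\d+(?:-\d+)+\b")


def remaining_candidates(text):
    """Hyphenated numbers left after excluding SSN- and phone-shaped strings."""
    found = HYPHENATED.findall(text)
    return [n for n in found if not (SSN.fullmatch(n) or PHONE.fullmatch(n))]


text = "SSN 123-45-6789, call 555-867-5309, employed 1999-2004, badge 17-204."
candidates = remaining_candidates(text)
```

Only the year range and the badge identifier remain in the candidate pool, so later extraction steps for those categories never have to consider the two excluded strings.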
SED methods can also take advantage of structured knowledge sources such as ontologies, taxonomies, and thesauri. SED methods provide two ways for developers and users to gain semantic insight into documents targeted for information extraction: lightweight and heavyweight. Lightweight SED methods perform fast semantic analyses of document collections by eschewing complex linguistic and statistical pre-processing. Heavyweight SED produces richer and generally more reliable semantic analyses of document collections, but at the expense of complex linguistic and statistical pre-processing.
3. Deficiencies of the Prior Art
When developing a natural language processing (NLP) application, it is essential to understand how the document collection is organized by format and content, what concepts are found in the document collection, how they are expressed, and how these concepts relate to each other.
A developer must use only documents that are relevant to the particular application for development and evaluation. For example, the developer of a financial reporting NLP application should use only financial and business articles from a newspaper or journal document collection. To do this, the developer needs accurate and efficient techniques for clustering and categorizing documents by topic, task-specific graphical user interfaces to display clustering and categorization information, and additional techniques to improve categorization and clustering through user review and feedback.
A developer must understand what concepts are found in these documents. As a rule, information extraction application development begins with speculation about what concepts are found in the target document collection and how they are expressed there. What is really found in the document collection, however, is an empirical issue. For this, the developer needs to explore the document collection, identify important semantic categories, understand the ways in which these semantic categories are expressed, and, if appropriate, create lists of terms semantically relevant to the information extraction targets. For example, a developer may want to start an investigation of a document collection by collecting information about the desired information extraction targets. One common approach to this is to identify a set of documents that are retrieved by a query, and display those terms that are most strongly associated with that query in context. A developer may want to find terms in the document collection that are most strongly associated with a given cluster of documents. A developer may want to quickly reveal the many ways a particular concept is expressed. These tasks would be prohibitively time-consuming if they required the developer to read and review documents manually. Using a standard information retrieval tool provides very little improvement in productivity since the developer must still review each document. To perform this task efficiently, the developer needs accurate techniques for identifying semantically related terms and documents, task-specific graphical user interfaces to display these semantically related terms and documents, and additional techniques to improve accuracy of identifying semantically related terms and documents through user review and feedback.
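The common approach mentioned above, in which terms associated with a query are displayed, may be sketched in Python as follows. The four-document collection, the function name associated_terms, and the scoring rule (the share of matching documents containing a term minus the share of all documents containing it) are assumptions chosen for brevity; they are one simple possibility among many association measures.

```python
from collections import Counter

# Hypothetical miniature document collection.
DOCS = [
    "acme announced merger with globex",
    "globex shares rose after the merger",
    "the weather was sunny after the rain",
    "rain delayed the match",
]


def associated_terms(query, docs, top_n=3):
    """Rank terms by how much more often they occur in documents matching query."""
    tokenized = [set(d.split()) for d in docs]
    hits = [t for t in tokenized if query in t]
    if not hits:
        return []
    df_all = Counter(w for t in tokenized for w in t)  # document frequency overall
    df_hit = Counter(w for t in hits for w in t)       # document frequency in hits
    scores = {
        w: df_hit[w] / len(hits) - df_all[w] / len(tokenized)
        for w in df_hit
        if w != query
    }
    # Highest score first; ties are broken alphabetically.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_n]


top = associated_terms("merger", DOCS)
```

In this toy collection "globex" ranks first because it appears in every document matching "merger" but in only half of the collection, while a common word such as "the" scores below zero.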
A developer sometimes must organize the documents and the concepts found in the document collection according to a structured knowledge base. These may be highly structured knowledge bases, such as an ontology, taxonomy, or thesaurus, or only partially structured, such as a dictionary or topic-specific reference works or manuals. For example, medical discharge reports are often coded for billing purposes using complex medical administrative coding systems such as ICD-9-CM and CPT-4. To accomplish this task the developer needs accurate and efficient techniques for relating concepts found in documents to concepts found in knowledge bases, task-specific graphical user interfaces to display this information, and additional techniques to improve accuracy of this information through user review and feedback.
Reference works such as dictionaries, taxonomies, thesauri, technical reference works, manuals, and textbooks are partially structured knowledge bases, and they may be used to create searchable knowledge bases. Such searchable knowledge bases may be used directly as standalone reference works with extensive semantic query and navigation capabilities or as an adjunct to information retrieval to improve the quality of queries. Both of these uses provide a partial solution to one of the fundamental problems facing information extraction and semantic search: how to create a rich structured knowledge base (ontology) for a given field of interest efficiently.
There are many well-understood and widely used natural language processing techniques for searching, categorizing, clustering, and summarizing document collections. More recently, techniques have been developed for searching document collections semantically, from the point of view of a structured knowledge base such as an ontology, taxonomy or thesaurus. These recent techniques improve information retrieval by enhancing queries with additional semantically relevant terms. This functionality is valuable for document retrieval.
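The query enhancement described above may be sketched in Python as follows. The thesaurus fragment and the function name expand_query are hypothetical and serve only to illustrate the general idea of adding semantically relevant terms to a query.

```python
# Hypothetical thesaurus fragment: term -> semantically related terms.
THESAURUS = {
    "merger": ["acquisition", "takeover"],
    "company": ["corporation", "firm"],
}


def expand_query(query):
    """Return each query term followed by its related terms, without duplicates."""
    expanded = []
    for term in query.lower().split():
        for candidate in [term, *THESAURUS.get(term, [])]:
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


expanded = expand_query("company merger")
```

A retrieval engine given the expanded term list can match documents that mention "takeover" or "firm" even though the user typed neither word.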
However, searching, categorizing, clustering, finding related concepts, and other semantic navigation and organizational methods have one unavoidable problem: how to search, organize or navigate a document collection when it is not clear what information is actually in the collection. This problem is found in search, when a user attempts to determine which terms should be used to construct a query for a given concept; in categorization, when a developer attempts to determine which categories of information appear in the document collection and organize the documents into coherent groupings; and in document clustering, when a developer attempts to determine which documents are strongly associated with each other and how these associations are related to the semantic content and structure of these documents. Furthermore, these tasks are not independent of a user's information needs. For instance, an identical collection of business articles may be categorized entirely differently by an economist looking for economic trends than by a sociologist looking for data about consumer behavior.
Most proposed solutions to these problems have proven unsatisfactory. Predetermined or fixed sets of categories are often unrelated to or unaligned with a user's information needs. Unsupervised methods such as document clustering produce results that more often than not do not align with a user's intuitions or information needs. Document categorization requires representative sets of already categorized documents, but this simply shifts the question to how the initial document categorization was done and whether or not it aligns with the user's needs. It is not surprising, then, that developing domain- or application-specific knowledge bases has proved very complex and expensive.
Finally, independent techniques exist for performing these tasks, but they are typically done in isolation. None of the prior art combines these into a single application development environment, provides task-specific graphical interfaces and techniques to improve accuracy through user review and feedback, and uses a uniform representation underlying all of these techniques.