Since the 1980s, the increasing sophistication of machine learning and computer technologies has enabled solutions to a variety of challenges facing the Natural Language Processing (NLP) community. Knowledge discovery systems are of interest to commercial, industrial, and government organizations that use computer processing to perform transactions, evaluate consumer demands, and, more generally, draw conclusions or make decisions that depend on a knowledge base. Construction of such a knowledge base often depends on the automatic extraction of relational information and, more fundamentally, of related named entities (e.g., people, organizations) from a collection, or corpus, of text documents (e.g., e-mail, news articles). Consequently, the reliability of these systems is vulnerable to extraction errors.
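To make the extraction step concrete, the following is a minimal, illustrative sketch of a named-entity extractor; it is not any real system described here, but a toy heuristic (runs of capitalized words treated as candidate entities) whose pattern, function name, and sample document are all invented for illustration:

```python
import re

def extract_candidate_entities(text):
    """Toy named-entity extractor: treat runs of capitalized words as
    candidate entities. A crude stand-in for a real NER system; it will
    both miss entities (e.g., lowercase brand names) and raise false
    alarms (e.g., sentence-initial words)."""
    pattern = r"\b(?:[A-Z][a-z]+)(?:\s+[A-Z][a-z]+)*\b"
    return re.findall(pattern, text)

doc = "Alice Johnson joined Acme Corp after leaving Stanford University."
print(extract_candidate_entities(doc))
# → ['Alice Johnson', 'Acme Corp', 'Stanford University']
```

Even this trivial heuristic shows the two error modes discussed below: entities it fails to report (misses) and spurious strings it reports as entities (false alarms).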
Even state-of-the-art extraction tools and technologies, also referred to as extractors, can be vulnerable to variations in (1) the source and domain of a corpus and its adherence to conventional lexical, syntactic, and grammatical rules; (2) the availability and reliability of manually annotated data; and (3) the complexity of the semantic object types targeted for extraction. Under these and other challenging conditions, extractors can produce a range of interdependent errors that distort output and fall short of the accuracy rates needed for practical use. However, many extractors, distinguished by the nature of their underlying algorithms, possess complementary characteristics that may be combined to selectively amplify their attractive attributes (e.g., low miss or false-alarm rates) and reduce their respective errors.
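One simple way such complementary extractors can be combined is by voting, sketched below under assumed inputs: the extractor outputs, the `min_votes` threshold, and the entity strings are all hypothetical, and real combination schemes (weighted voting, stacking, etc.) are considerably more elaborate:

```python
from collections import Counter

def combine_extractors(outputs, min_votes=2):
    """Voting combiner: keep an entity only if at least `min_votes` of
    the extractors reported it. Raising min_votes suppresses false
    alarms at the cost of more misses; lowering it does the reverse."""
    votes = Counter()
    for entities in outputs:
        votes.update(set(entities))  # one vote per extractor per entity
    return sorted(e for e, v in votes.items() if v >= min_votes)

# Hypothetical outputs from three extractors over the same document.
a = ["Acme Corp", "Alice Johnson", "Stanford"]
b = ["Acme Corp", "Alice Johnson"]
c = ["Acme Corp", "Bob"]
print(combine_extractors([a, b, c]))
# → ['Acme Corp', 'Alice Johnson']
```

The `min_votes` threshold is the knob that trades miss rate against false-alarm rate, which is precisely the kind of selective amplification of attractive attributes described above.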