The present application relates generally to an improved data processing apparatus and method and more specifically to mechanisms for using renaming directives to bootstrap industry-specific knowledge and lexical resources.
Documents include information in many forms. For example, textual information arranged as sentences and paragraphs conveys information in a narrative form. Some types of information are presented in a referential form. For example, a document can include a name, a word, a phrase, or a text segment that occurs repeatedly in the document. Many documents designate a replacement phrase or text to stand in for the name, word, phrase, or text segment, and use the replacement text (the moniker) for each subsequent occurrence of the name, word, phrase, or text segment (the full name expression) after the first occurrence.
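The pairing of a full name expression with its moniker can be sketched in code. The following is a minimal illustrative sketch only, not any claimed embodiment: it assumes a common directive pattern such as 'Full Name ("Moniker")' or 'Full Name (hereinafter "Moniker")'; the regex and function name are invented for illustration.

```python
import re

# Hypothetical pattern: a capitalized full name expression followed by a
# parenthesized, quoted moniker, optionally introduced by "hereinafter".
DIRECTIVE = re.compile(
    r'(?P<full>[A-Z][\w .,&-]+?)\s+\((?:hereinafter\s+)?"(?P<moniker>[^"]+)"\)'
)

def extract_renaming_directives(text):
    """Return a dict mapping each moniker to its full name expression."""
    return {m.group("moniker"): m.group("full").strip()
            for m in DIRECTIVE.finditer(text)}

doc = ('International Business Machines Corporation (hereinafter "IBM") '
       'announced that IBM will expand its research division.')
print(extract_renaming_directives(doc))
```

Once such a mapping is extracted, later occurrences of the moniker can be resolved back to the full name expression during subsequent analysis.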
Natural language processing (NLP) is a technique that facilitates exchange of information between humans and data processing systems. For example, one branch of NLP pertains to transforming given content in a human-usable language or form to a computer-usable form. For example, NLP can accept a document whose content is in a human-readable form and produce a document whose corresponding content is in a computer-specific language or form.
NLP is used in many different ways including recent use in Question and Answer (QA) systems. That is, with the increased usage of computing networks, such as the Internet, humans are currently inundated and overwhelmed with the amount of information available to them from various structured and unstructured sources. However, information gaps abound as users try to piece together what they can find that they believe to be relevant during searches for information on various subjects. To assist with such searches, recent research has been directed to generating Question and Answer (QA) systems which may take an input question, analyze it using a variety of techniques including NLP techniques, and return results indicative of the most probable answer to the input question. QA systems provide automated mechanisms for searching through large sets of sources of content, e.g., electronic documents, and analyzing them with regard to an input question to determine an answer to the question and a confidence measure as to how accurate the answer is for answering the input question.
One such QA system is the Watson™ system available from International Business Machines (IBM) Corporation of Armonk, N.Y. The Watson™ system is an application of advanced natural language processing, information retrieval, knowledge representation and reasoning, and machine learning technologies to the field of open domain question answering. The Watson™ system is built on IBM's DeepQA™ technology used for hypothesis generation, massive evidence gathering, analysis, and scoring. DeepQA™ takes an input question, analyzes it, decomposes the question into constituent parts, generates one or more hypotheses based on the decomposed question and results of a primary search of answer sources, performs hypothesis and evidence scoring based on a retrieval of evidence from evidence sources, performs synthesis of the one or more hypotheses, and, based on trained models, performs a final merging and ranking to output an answer to the input question along with a confidence measure.
Various United States Patent Application Publications describe various types of question and answer systems. U.S. Patent Application Publication No. 2011/0125734 discloses a mechanism for generating question and answer pairs based on a corpus of data. The system starts with a set of questions and then analyzes the set of content to extract answers to those questions. U.S. Patent Application Publication No. 2011/0066587 discloses a mechanism for converting a report of analyzed information into a collection of questions and determining whether the collection of questions is answered or refuted from the information set. The results data are incorporated into an updated information model.
Lexical resources are utilized to label, categorize, and interpret individual tokens or sequences of tokens in text, prior to applying structural/grammatical analysis to discover additional relations between the tokens within a larger phrase or sentence. Examples of lexical resources are:
- Dictionaries: contain features such as gender, part-of-speech, and semantic category/type of common words in a language. A dictionary is also referred to as a 'lexicon' in language processing terminology.
- Gazetteers: a special kind of dictionary for proper names; a gazetteer indicates what semantic category a name is an instance of, e.g., Person, Municipality, Geographical Region, etc., and possibly gender and other features.
- Ontologies: an inventory of semantic categories/types, typically organized as a hyponym/hypernym tree (e.g., "Basenji is a kind of Hound").
- Selectional restrictions: words such as prepositions and verbs become predicates in a relation tuple, and the argument positions in those predicates can sometimes be filled only by entities of particular semantic categories. For example, a plant can wilt but a car cannot. The dictionary entry for "wilt" may be augmented to reflect this semantic affinity with plants.
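The selectional restriction on "wilt" described above can be sketched as a simple dictionary lookup. This is an illustrative sketch only; the entries and type names are invented examples, not the contents of any real lexical resource.

```python
# Toy dictionary: each entry records part-of-speech and, for verbs, an
# optional selectional restriction on the subject position.
dictionary = {
    "wilt":  {"pos": "verb", "subject_type": "Plant"},
    "plant": {"pos": "noun", "type": "Plant"},
    "car":   {"pos": "noun", "type": "Vehicle"},
}

def subject_allowed(verb, noun):
    """Check whether the noun's semantic type satisfies the verb's
    selectional restriction on its subject argument position."""
    restriction = dictionary[verb].get("subject_type")
    return restriction is None or dictionary[noun].get("type") == restriction

print(subject_allowed("wilt", "plant"))  # a plant can wilt
print(subject_allowed("wilt", "car"))    # a car cannot wilt
```

In practice, such restrictions guide the structural analysis stage by ruling out tuples whose argument fillers have incompatible semantic types.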
One key purpose for these resources is to record the affinity between words that specify individuals at the base instance level (proper names such as “Petey” or common nouns such as “dog”) and type/category symbols. Many language processing tasks require the system to make an inference between instances and categories. In Question Answering systems in particular, that task is a very high priority, because the question often expresses a category restriction, e.g., “What Eastern European artist wrapped the Reichstag in 1995?”, and candidate answers must be judged as belonging to the category or not (in this example, each candidate answer would be scored as to its likelihood of fitting the category ‘Eastern European artist’).
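The instance-to-category inference described above can be sketched by walking a hypernym chain from a gazetteer entry up through an ontology. The entries, category names, and answer type below are invented for illustration and are not part of any claimed embodiment.

```python
# Toy gazetteer (instance -> category) and ontology (hyponym -> hypernym).
gazetteer = {"Christo": "Sculptor", "Petey": "Dog"}
ontology = {"Sculptor": "Artist", "Artist": "Person", "Dog": "Animal"}

def fits_category(instance, lexical_answer_type):
    """Return True if the instance's category reaches the lexical answer
    type by following hypernym links in the ontology."""
    category = gazetteer.get(instance)
    while category is not None:
        if category == lexical_answer_type:
            return True
        category = ontology.get(category)
    return False

# Judging candidate answers against the answer type 'Artist':
for candidate in ["Christo", "Petey"]:
    print(candidate, fits_category(candidate, "Artist"))
```

A production QA system would typically produce a graded score rather than a Boolean judgment, but the underlying inference from instance to category follows this pattern.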
Documents from specialized domains utilize terminology that is not in standard dictionaries/gazetteers as well as novel semantic types that are not in standard ontologies. For example, company-internal documents describe business units, products, processes, etc. Legal or medical documents include jargon not familiar to non-practitioners. An NLP application that needs to process documents with such specialized vocabulary will encounter words not present in the system's given lexical resources.