The present application relates generally to an improved data processing apparatus and method, and more specifically to mechanisms for performing context-based synonym filtering for natural language processing systems.
Natural language processing (NLP) is a field of computer science, artificial intelligence, and linguistics concerned with the interactions between computers and human (natural) languages. As such, NLP is related to the area of human-computer interaction. Many challenges in NLP involve natural language understanding, i.e. enabling computers to derive meaning from human or natural language input.
Modern NLP algorithms are based on machine learning, especially statistical machine learning. The paradigm of machine learning is different from that of most prior attempts at language processing. Prior implementations of language-processing tasks typically involved the direct hand coding of large sets of rules whereas the machine-learning paradigm calls instead for using general learning algorithms (often, although not always, grounded in statistical inference) to automatically learn such rules through the analysis of large corpora of typical real-world examples. A corpus (plural, “corpora”) is a set of documents (or sometimes, individual sentences) that have been hand-annotated with the correct values to be learned.
Many different classes of machine learning algorithms have been applied to NLP tasks. These algorithms take as input a large set of “features” that are generated from the input data. Some of the earliest-used algorithms, such as decision trees, produced systems of hard if-then rules similar to the systems of hand-written rules that were then common. Increasingly, however, research has focused on statistical models, which make soft, probabilistic decisions based on attaching real-valued weights to each input feature. Such models have the advantage that they can express the relative certainty of many different possible answers rather than only one, producing more reliable results when such a model is included as a component of a larger system.
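By way of illustration only, the contrast between a hard if-then rule and a soft, probabilistic decision may be sketched as follows. The feature names, labels, and weight values are hypothetical and chosen purely for the example; they are not taken from any particular system. The sketch assumes a simple linear model whose per-label scores are normalized with a softmax.

```python
import math

def hard_rule(features):
    """Hard if-then rule: returns exactly one label and no certainty."""
    if features.get("ends_with_ing", 0.0) > 0:
        return "VERB"
    return "NOUN"

def soft_decision(features, weights):
    """Soft decision: weight each feature per label, then normalize
    the resulting scores into probabilities with a softmax."""
    scores = {
        label: sum(w.get(name, 0.0) * value for name, value in features.items())
        for label, w in weights.items()
    }
    top = max(scores.values())  # subtract the maximum for numerical stability
    exps = {label: math.exp(score - top) for label, score in scores.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}

# Hypothetical real-valued weights attached to each input feature.
WEIGHTS = {
    "VERB": {"ends_with_ing": 1.5, "follows_determiner": -2.0},
    "NOUN": {"ends_with_ing": 0.2, "follows_determiner": 1.8},
}

# Features for the word "building" in the phrase "the building".
features = {"ends_with_ing": 1.0, "follows_determiner": 1.0}

hard_label = hard_rule(features)
probabilities = soft_decision(features, WEIGHTS)
```

In this sketch the hard rule commits to a single label based on one feature, whereas the weighted model returns a probability for every label, which a downstream component of a larger system could combine with other evidence.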
One type of NLP system is a search engine, such as an Internet search engine, e.g., Google™, Yahoo!™, or the like. Such search systems receive one or more terms, search a corpus of content for matching terms, and return results indicating the sources of content having the specified terms. In some instances, more advanced processing of the search terms is performed, including the application of NLP algorithms to improve the results generated by the search engine.
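By way of illustration only, the basic term-matching behavior described above may be sketched with a small inverted index. This is a minimal sketch under simplifying assumptions (whitespace tokenization, case folding, and a requirement that every query term be present); the function names and the sample corpus are hypothetical, and the sketch does not represent the operation of any particular commercial search engine.

```python
from collections import defaultdict

def build_index(corpus):
    """Map each lower-cased term to the set of source identifiers containing it."""
    index = defaultdict(set)
    for source_id, text in corpus.items():
        for term in text.lower().split():
            index[term].add(source_id)
    return index

def search(index, query):
    """Return the sources of content that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    matches = set(index.get(terms[0], set()))
    for term in terms[1:]:
        matches &= index.get(term, set())
    return matches

# Hypothetical corpus of content keyed by source identifier.
corpus = {
    "doc1": "natural language processing systems",
    "doc2": "question answering systems",
    "doc3": "natural language search",
}

index = build_index(corpus)
results = search(index, "natural language")
```

A search for the terms "natural language" against this sample corpus returns the identifiers of the two sources that contain both terms.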
Another type of NLP system is a Question and Answer (QA) system, which receives an input question, analyzes the input question using NLP algorithms, and returns results indicative of the most probable answer to the input question. QA systems provide automated mechanisms for searching through large sets of sources of content, e.g., electronic documents, and analyzing them with regard to an input question to determine an answer to the question and a confidence measure indicating how accurately that answer addresses the input question.
One such QA system is the IBM Watson™ system available from International Business Machines (IBM) Corporation of Armonk, N.Y. The IBM Watson™ system is an application of advanced natural language processing, information retrieval, knowledge representation and reasoning, and machine learning technologies to the field of open domain question answering. The IBM Watson™ system is built on IBM's DeepQA™ technology used for hypothesis generation, massive evidence gathering, analysis, and scoring. DeepQA™ takes an input question, analyzes it, decomposes the question into constituent parts, generates one or more hypotheses based on the decomposed question and results of a primary search of answer sources, performs hypothesis and evidence scoring based on a retrieval of evidence from evidence sources, performs synthesis of the one or more hypotheses, and, based on trained models, performs a final merging and ranking to output an answer to the input question along with a confidence measure.
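By way of illustration only, the general sequence of stages described above (question analysis, hypothesis generation from a primary search, evidence scoring, and final ranking with a confidence measure) may be sketched in highly simplified form. The sketch below is not the DeepQA™ implementation; the function name, the stop-word list, the substring-based matching, and the count-based scoring are all hypothetical stand-ins chosen to keep the example short, and no trained model is involved.

```python
# Hypothetical stop words removed during question analysis.
STOP_WORDS = {"what", "who", "is", "the", "of", "a", "an", "in"}

def answer_question(question, answer_sources, evidence_sources):
    """Toy QA pipeline returning (candidate, confidence) pairs, best first."""
    # 1. Analyze and decompose the question into content-bearing terms.
    terms = [t.strip("?.,").lower() for t in question.split()]
    terms = [t for t in terms if t and t not in STOP_WORDS]

    # 2. Primary search: a candidate becomes a hypothesis when its source
    #    passage mentions at least one question term.
    hypotheses = [
        candidate
        for candidate, passage in answer_sources.items()
        if any(t in passage.lower() for t in terms)
    ]

    # 3. Evidence scoring: count evidence passages that mention both the
    #    candidate and at least one question term.
    scores = {}
    for candidate in hypotheses:
        scores[candidate] = sum(
            1
            for passage in evidence_sources
            if candidate.lower() in passage.lower()
            and any(t in passage.lower() for t in terms)
        )

    # 4. Merge and rank: normalize scores into a simple confidence measure.
    total = sum(scores.values())
    ranked = [
        (candidate, (score / total) if total else 0.0)
        for candidate, score in scores.items()
    ]
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked

# Hypothetical answer sources (candidate -> passage) and evidence passages.
answer_sources = {
    "Paris": "Paris is the capital of France.",
    "Lyon": "Lyon is a city in France.",
    "Berlin": "Berlin is the capital of Germany.",
}
evidence_sources = [
    "Paris, the capital of France, lies on the Seine.",
    "The French capital Paris hosts the Louvre.",
    "Lyon is known for its cuisine in France.",
]

ranked = answer_question(
    "What is the capital of France?", answer_sources, evidence_sources
)
```

In this sketch the confidence measure is merely each candidate's share of the supporting evidence; a system of the kind described above would instead derive its final ranking and confidence from trained models.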
Various United States Patent Application Publications describe different types of question and answer systems. U.S. Patent Application Publication No. 2011/0125734 discloses a mechanism for generating question and answer pairs based on a corpus of data. The system starts with a set of questions and then analyzes the set of content to extract answers to those questions. U.S. Patent Application Publication No. 2011/0066587 discloses a mechanism for converting a report of analyzed information into a collection of questions and determining whether the questions in the collection are answered or refuted by the information set. The results data are incorporated into an updated information model.