The present application relates generally to an improved data processing apparatus and method and more specifically to mechanisms for determining the answers to questions input to a Question and Answer (QA) system when the answer is not explicitly provided in the corpus of information operated on by the QA system.
With the increased usage of computing networks, such as the Internet, humans are currently inundated and overwhelmed with the amount of information available to them from various structured and unstructured sources. However, information gaps abound as users try to piece together what they find and believe to be relevant during searches for information on various subjects. To assist with such searches, recent research has been directed to generating Question and Answer (QA) systems which may take an input question, analyze it, and return results indicative of the most probable answer to the input question. QA systems provide automated mechanisms for searching through large sets of sources of content, e.g., electronic documents, and analyzing them with regard to an input question to determine an answer to the question and a confidence measure indicating how accurately the answer addresses the input question.
One such QA system is the Watson™ system available from International Business Machines (IBM) Corporation of Armonk, N.Y. The Watson™ system is an application of advanced natural language processing, information retrieval, knowledge representation and reasoning, and machine learning technologies to the field of open domain question answering. The Watson™ system is built on IBM's DeepQA™ technology used for hypothesis generation, massive evidence gathering, analysis, and scoring. DeepQA™ takes an input question, analyzes it, decomposes the question into constituent parts, generates one or more hypotheses based on the decomposed question and results of a primary search of answer sources, performs hypothesis and evidence scoring based on a retrieval of evidence from evidence sources, performs synthesis of the one or more hypotheses, and, based on trained models, performs a final merging and ranking to output an answer to the input question along with a confidence measure.
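The stages named above (question analysis, hypothesis generation from a primary search, evidence scoring, and final merging and ranking with a confidence measure) can be illustrated with a minimal sketch. This is a toy illustration only and is not the DeepQA™ implementation: the names `Candidate`, `tokenize`, and `answer_question`, the stopword list, the dictionary-based corpus, and the term-overlap scoring are all assumptions introduced here in place of trained models.

```python
import re
from dataclasses import dataclass

# Hypothetical stopword list used only for this sketch.
STOPWORDS = {"the", "a", "an", "of", "is", "in", "was", "did", "what", "who", "which"}


@dataclass
class Candidate:
    answer: str
    confidence: float


def tokenize(text):
    """Lowercase the text and keep alphanumeric terms that are not stopwords."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]


def answer_question(question, corpus):
    """Rank candidate answers for a question.

    corpus maps each candidate answer to a list of evidence passages
    (an assumed, simplified stand-in for answer and evidence sources).
    """
    # Question analysis and decomposition: reduce the question to its terms.
    terms = set(tokenize(question))

    scored = []
    for answer, passages in corpus.items():
        # Primary search: count question terms found in each passage.
        overlaps = [len(terms & set(tokenize(p))) for p in passages]
        if not any(overlaps):
            continue  # no supporting passage, so no hypothesis is generated
        # Evidence scoring: fraction of question terms in the best passage.
        scored.append((answer, max(overlaps) / len(terms)))

    # Final merging and ranking: normalize scores into confidences and sort.
    total = sum(score for _, score in scored)
    ranked = [Candidate(answer, score / total) for answer, score in scored]
    ranked.sort(key=lambda c: c.confidence, reverse=True)
    return ranked


if __name__ == "__main__":
    corpus = {
        "Armonk": ["IBM is headquartered in Armonk, New York."],
        "Poughkeepsie": ["IBM operates a plant in Poughkeepsie."],
        "Paris": ["Paris is the capital of France."],
    }
    for candidate in answer_question("Where is IBM headquartered?", corpus):
        print(candidate.answer, round(candidate.confidence, 2))
```

In this sketch the confidence is simply a normalized overlap score; a system of the kind described would instead combine many evidence scorers using trained models.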
As QA systems, such as the Watson™ system, are built to answer complex questions, new data and literature are loaded into the system to fine-tune the capabilities of the system and to better answer such questions. Better data input into the QA system generally results in better answers from the system. The data input to the system may include structured and unstructured data such as documents, spreadsheets, and presentations. The data may have already been reviewed (e.g., evaluated and analyzed) as part of a typical document editing lifecycle. The typical lifecycle includes a group of people creating, editing, and reviewing the content included in a document. Various commonly used word processors have different processes for tracking changes, or revisions, in documents during the lifecycle. These editing lifecycle features modify and add metadata to the underlying document. Oftentimes, the underlying document is rich with alternate versions, spellings, and mistakes which have been corrected or modified during the lifecycle.
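To make the notion of revision metadata concrete, the following minimal sketch models a document as a sequence of text segments, each marked as unchanged, inserted, or deleted during review. The `Segment` type, its field names, and the two helper functions are hypothetical and do not correspond to the storage format of any particular word processor; the sketch only shows how both the original wording and the corrected wording can remain recoverable from a single underlying document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One run of text plus assumed revision metadata."""
    text: str
    kind: str = "unchanged"  # "unchanged", "inserted", or "deleted"
    author: str = ""         # hypothetical reviewer identifier


def accepted_text(segments):
    """Text after all tracked changes are accepted (deletions dropped)."""
    return "".join(s.text for s in segments if s.kind != "deleted")


def original_text(segments):
    """Text before any tracked change was made (insertions dropped)."""
    return "".join(s.text for s in segments if s.kind != "inserted")


# A reviewer corrected a misspelling; both versions remain in the document.
document = [
    Segment("The patient was given "),
    Segment("asprin", kind="deleted", author="reviewer1"),
    Segment("aspirin", kind="inserted", author="reviewer1"),
    Segment(" daily."),
]

if __name__ == "__main__":
    print(original_text(document))
    print(accepted_text(document))
```

Under these assumptions, the deleted and inserted segments together record an alternate spelling and its correction, which is the kind of information the editing lifecycle leaves behind in the document.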