The present disclosure relates generally to a question and answer system and more particularly to pre-processing questions before they are input into such a system.
Cognitive question/answer (QA) systems analyze a question posed by a user in natural language, access a database to return results indicative of the most probable answer to the input question, and then determine and formulate an answer for the user. In operation, users submit one or more questions through an application's user interface (UI) or application programming interface (API) to the QA system. In turn, the questions are processed to generate answers, which are then returned to the user. QA systems provide automated mechanisms for searching through large sets of sources of content, e.g., electronic documents, and analyzing them with regard to an input question to determine an answer to the question and a confidence measure indicating how accurately that answer addresses the input question. Examples of QA systems are Siri (trademark) from Apple Corporation, Cortana (trademark) from Microsoft Corporation, and the question answering pipeline of IBM Watson (trademark) from IBM Corporation.
The IBM Watson system is an application of advanced natural language processing, information retrieval, knowledge representation and reasoning, and machine learning technologies to the field of open domain question answering. The IBM Watson system is built on IBM's DeepQA (trademark) technology used for hypothesis generation, massive evidence gathering, analysis, and scoring. DeepQA takes an input question, analyzes it, decomposes the question into constituent parts, generates one or more hypotheses based on the decomposed question and results of a primary search of answer sources, performs hypothesis and evidence scoring based on a retrieval of evidence from evidence sources, performs synthesis of the one or more hypotheses, and, based on trained models, performs a final merging and ranking to output an answer to the input question along with a confidence measure.
Standard QA systems work with single QA pairs, i.e., a specific question is paired with a specific answer in a one-to-one relationship. The system compiles a database of QA pairs, e.g., by ingesting content from one or more corpora comprising content from various data sources, or from training corpora and processes. When a user asks a question, the database is searched for the same question; if that question is found, its paired answer is provided as a response to the user's question. However, humans communicating in natural language, e.g., when talking and also when inputting text through a keyboard, often do not pose a query as a single cohesive sentence. Rather, humans tend to formulate a query in the form of several sentences, or sentence fragments, some of which may be linguistically formulated as questions and others as statements intended to further specify the nature of the questions.
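By way of illustration only, the one-to-one lookup described above may be sketched as follows. This is a minimal sketch and not the implementation of any named QA system: the class name QAPairStore, the normalize helper, and the example question and answer are all hypothetical, and exact matching on a normalized string is an assumption made for simplicity. The sketch shows that a query posed as a single sentence finds its paired answer, whereas the same question embedded among accompanying statements does not.

```python
import re
from typing import Dict, Optional


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace (hypothetical helper)."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())


class QAPairStore:
    """Hypothetical store in which each question is paired with exactly one answer."""

    def __init__(self) -> None:
        self._pairs: Dict[str, str] = {}

    def add(self, question: str, answer: str) -> None:
        # One-to-one relationship: a normalized question maps to a single answer.
        self._pairs[normalize(question)] = answer

    def lookup(self, question: str) -> Optional[str]:
        # Exact match on the normalized question; returns None when not found.
        return self._pairs.get(normalize(question))


store = QAPairStore()
store.add("What is the capital of France?", "Paris")

# A query posed as a single cohesive sentence matches the stored question.
single_sentence_result = store.lookup("what is the capital of France")

# The same question surrounded by statements no longer matches any stored pair.
multi_sentence_result = store.lookup(
    "I am planning a trip. What is the capital of France? "
    "I need to book a hotel there."
)

print(single_sentence_result)
print(multi_sentence_result)
```

In this sketch the multi-sentence query fails only because the whole input is compared against stored questions as one string, which is the limitation that motivates pre-processing a query before it is input into the QA system.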