Embodiments disclosed herein relate to deep question answering systems. More specifically, embodiments disclosed herein relate to reducing concept “noise” in deep question answering systems.
In deep question answering (Deep QA) systems, such as IBM's Watson®, analysis programs are used to identify ontological information (i.e., concepts and their relationships within a domain) in both the question being posed and in one or more candidate answers. For example, an analysis program may identify medical concepts with the aid of a specific medical ontology or knowledge base. When determining the correct answer, the system employs algorithms which attempt to match concepts from a given candidate answer to concepts contained in the question. These algorithms produce scores which the Deep QA system uses to help it choose the correct answer with the highest degree of confidence.
However, it is currently very difficult to identify and reduce noise when matching and scoring concepts. A meaningless or noisy concept is one that is often identified in a candidate answer, but rarely matches meaningful concepts in the question. Alternatively, a noisy concept may match concepts in the question without contributing to confidence that an answer is correct or incorrect. This type of noise lowers the overall concept matching score for the candidate answer, and can ultimately result in an incorrect answer being chosen.
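The effect of noisy concepts on a matching score can be illustrated with a minimal sketch. The scoring rule below (the fraction of a candidate answer's concepts that also occur in the question) is a hypothetical stand-in chosen for illustration, not the scoring algorithm of any actual Deep QA system; the function name `concept_match_score` and the example concepts are likewise assumptions.

```python
def concept_match_score(question_concepts, answer_concepts):
    """Hypothetical score: share of answer concepts that also occur in the question."""
    answer = set(answer_concepts)
    if not answer:
        return 0.0
    matched = answer & set(question_concepts)
    return len(matched) / len(answer)


# Illustrative concepts only; a real system would extract these with an ontology.
question = {"fever", "headache", "stiff neck"}
clean_answer = {"fever", "headache"}
# Same answer, but the analysis also picked up two generic, uninformative concepts.
noisy_answer = {"fever", "headache", "patient", "history"}

clean = concept_match_score(question, clean_answer)
noisy = concept_match_score(question, noisy_answer)
print(clean, noisy)
```

Under this assumed rule, the two unmatched generic concepts halve the score of an otherwise identical candidate answer, which is the kind of degradation described above.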
Current solutions have used inverse document frequency (IDF) scores in an attempt to identify the most significant or impactful terms in a document or collection of documents. IDF assigns higher scores to terms that appear in fewer documents of the collection. This approach is often useful, but does not always filter out noise, as low-frequency terms are not necessarily significant when it comes to predicting right or wrong answers.
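The limitation of IDF can be seen in a short sketch. It assumes the common unsmoothed form idf(t) = log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing term t; other variants exist. The function name and the toy corpus are hypothetical and serve only as illustration.

```python
import math


def inverse_document_frequency(documents):
    """Return {term: idf} using the unsmoothed form idf(t) = log(N / df(t))."""
    n_docs = len(documents)
    doc_freq = {}
    for doc in documents:
        # Count each term at most once per document.
        for term in set(doc):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


# Toy corpus (an assumption): "copyright" stands in for a rare but meaningless term.
docs = [
    ["patient", "fever", "aspirin"],
    ["patient", "cough", "aspirin"],
    ["patient", "fever", "copyright"],
    ["patient", "rash", "ibuprofen"],
]
idf = inverse_document_frequency(docs)
for term in ("patient", "fever", "rash", "copyright"):
    print(term, round(idf[term], 3))
```

In this toy corpus the term "copyright" appears in only one document and therefore receives the same top score as the clinically relevant "rash", even though it says nothing about whether an answer is right or wrong.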