The present invention relates to the field of information retrieval systems, and more particularly to Question Answering (QA) systems which retrieve or construct answers to queries using a corpus of documents or information.
Queries (e.g. questions) are typically posed in natural language, so complex Natural Language Processing (NLP) techniques may be needed to handle them correctly. QA systems may therefore operate on an underlying natural language corpus (such as Wikipedia™), wherein content must first be ingested, processed, and analyzed by the system (e.g. using NLP techniques) in order to answer questions. Such QA systems can exhibit poor accuracy when the required information is missing from the underlying data source.
In the context of enterprise systems, this problem can be more pronounced, because the underlying corpus of documents or information is normally limited to the organization's immediate business area or internal processes (e.g. a specific domain) where documents are authored by a small number of experts. The following problems are therefore commonplace for domain-specific QA systems, especially when queries are composed in natural language:
Natural language expression from a large number of human users is far more varied than what can be described with the limited corpus of a domain-specific QA system;
Questions relating to background information on common concepts may not be answered correctly, as such data is often omitted from the corpus of a domain-specific QA system by subject-matter experts;
Questions about related information that does not exist in the corpus will be answered in an incorrect, incomplete, unhelpful or misleading manner, which could even cause a loss of revenue, for example.
To address such problems, various approaches have been proposed. One such approach relies on recognizing queries that are not related to the immediate context of the enterprise corpus and then handling those queries in a special manner. However, this is very difficult when queries appear to be related to the context of the corpus but the corpus content is insufficient to generate a correct answer. Another proposed approach is to manually expand the corpus of the system with extra hand-written data. Although this can provide a good solution in some cases, it is very expensive and requires significant investment from domain experts.
Yet another proposed approach comprises automatically expanding a corpus with general data from known open-domain data sources, such as Wikipedia™, Lexis Nexis™, DBPedia™, streaming sources, etc. Unless it is done in a strategic manner, this can add large quantities of unrelated data. However, strategically expanding a corpus with high-quality related data from known sources is typically a time-consuming manual process, and therefore expensive. Additionally, the quality of such data is typically not as good as that of content authored by domain experts. Without an automatic method of assessing semantic relatedness, human error can also be a problem when expanding a corpus. For example, documents that look like appropriate expansion candidates to a domain expert may not help with the generation of answers to questions that are not currently covered by the corpus.
Further issues have been identified in relation to improving the corpus of a domain-specific QA system. For example, selecting the most relevant and helpful external content to complement the content of a domain-specific (e.g. enterprise) corpus is a non-trivial task, because most open-domain corpora typically contain millions of documents spanning many significantly varied domains. Also, even for external domain-specific corpora, identifying related documents whose content is not currently covered by a domain-specific corpus is a significant challenge.
Expanding the corpus of the system with too much unrelated data can also have a negative effect. The larger the total corpus ingested by a QA system, the greater the system complexity in the generation and ranking of answers, leading to a general reduction in accuracy and an increased demand for computational resources (e.g. memory, disk storage, CPU usage, etc.) in order to process user queries in real time.
Accordingly, there exists a problem of how to improve upon the abovementioned processes, and a solution that does this accurately and efficiently, and even automatically, would be of significant value.