In question answering (QA) systems, such as those described at http://en.wikipedia.org/wiki/Question_answering, it is common to locally store and index text collections, e.g., encyclopedias and newswire corpora, that are expected to provide reasonable coverage of the information required for a given QA task. However, compared to the Web, these sources are less redundant and may not contain the information sought by a question.
QA systems need to be improved with regard to the following common types of failures: 1) Source failures, i.e. the sources do not contain the information sought by a question. 2) Search and candidate extraction failures, i.e. the system is unable to retrieve or extract a correct answer, often because of insufficient keyword overlap with the question. 3) Answer ranking failures, i.e. the answer is outscored by a wrong answer, often because of insufficient supporting evidence in the sources or because it was ranked low in the search results.
Performing query expansion or using pseudo-relevance feedback in the search can (at most) address the above-mentioned failures of types 2) and 3). In practice, these approaches can introduce noise and may hurt QA performance. Often, they are only applied as a fallback solution if an initial query yields low recall.
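By way of illustration only, pseudo-relevance feedback can be sketched as follows: an initial query is run, the top-ranked documents are assumed to be relevant, and the terms occurring most often in those documents are appended to the query. The toy corpus, the stopword list, the term-overlap scoring and all function names below are assumptions made for this sketch and are not taken from any particular QA or IR system.

```python
import re
from collections import Counter

# Hypothetical stopword list, kept tiny for the example.
STOPWORDS = {"the", "a", "of", "in", "is", "was", "and", "to", "by", "who"}


def tokenize(text):
    """Lowercase the text and return its alphanumeric tokens minus stopwords."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]


def score(query_terms, doc_terms):
    """Toy relevance score: number of distinct query terms found in the document."""
    return len(set(query_terms) & set(doc_terms))


def search(query_terms, corpus, top_k):
    """Return the top_k documents ranked by the toy score (ties keep corpus order)."""
    ranked = sorted(corpus, key=lambda doc: score(query_terms, tokenize(doc)), reverse=True)
    return ranked[:top_k]


def expand_query(query, corpus, feedback_docs=2, new_terms=2):
    """Append the most frequent non-query terms of the top-ranked documents."""
    query_terms = tokenize(query)
    feedback = search(query_terms, corpus, feedback_docs)
    counts = Counter(
        term
        for doc in feedback
        for term in tokenize(doc)
        if term not in query_terms
    )
    return query_terms + [term for term, _ in counts.most_common(new_terms)]


corpus = [
    "Alexander Graham Bell invented the telephone in 1876",
    "The telephone was patented by Bell in 1876",
    "Paris is the capital of France",
]

expanded = expand_query("Who invented the telephone", corpus)
print(expanded)
```

In this toy corpus the added terms happen to be useful, but nothing in the procedure guarantees that: if the top-ranked documents are off-topic, their frequent terms are added just the same, which is one way such expansion can introduce noise.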
While current web search engines typically must be used as black boxes, local sources can be indexed with an open-source IR system such as Indri or Lucene, which provides full control over the retrieval model and search results. Local sources can also be preprocessed and annotated with syntactic and semantic information, which can be leveraged in structured queries that better describe the information need expressed in a question. Furthermore, a live web search may be infeasible in applications where speed and high availability are important, where the knowledge sources contain confidential data or restricted-domain knowledge, or where a self-contained system is required. Moreover, the Web and the algorithms used by web search engines change constantly, which makes results obtained from a live web search difficult to reproduce.
While QA systems often utilize the Web as a large, redundant information source, it has also been noted in the QA research community that there are situations where a local search is preferable. For instance, Clarke et al., in the reference entitled “The impact of corpus size on question answering performance” (In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2002), analyze the impact of locally stored web crawls on QA performance using a TREC QA dataset. They found that crawls of over 50 GB were required to outperform the 3 GB reference corpus used in TREC, and that performance actually declined if the crawl exceeded about 500 GB.
It would be highly desirable to provide an effective strategy for improving the performance of a QA system.