Over the years, computer systems designers have pursued the challenge of developing computational architectures that can generate answers to freely posed questions. General question-answering systems typically depend on automated processes for analyzing questions and for composing answers from a large corpus of poorly structured information. In recent years, systems have been developed that employ the resources of the Web as a corpus of information for answering questions. Web-based question-answering systems typically employ rewriting procedures for converting components of questions into sets of queries posed to search engines, and for converting the query results received from the search engines into one or more answers.
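As an illustration of the rewriting procedures mentioned above, the following is a minimal sketch of how a question might be converted into a set of search-engine queries. The function name `rewrite_question`, the word lists, and the two rewrite rules (an exact-phrase query and a conjunctive keyword query) are assumptions made for this example; they are not drawn from any particular system described here.

```python
import re

# Hypothetical word lists chosen for this sketch only.
WH_WORDS = {"who", "what", "when", "where", "which", "why", "how"}
STOPWORDS = WH_WORDS | {"is", "was", "the", "a", "an", "of", "did", "do", "does"}


def rewrite_question(question):
    """Return search-engine query strings for a question, most specific first."""
    tokens = re.findall(r"[A-Za-z0-9']+", question)

    # Drop a leading wh-word so the remainder can match declarative text.
    if tokens and tokens[0].lower() in WH_WORDS:
        remainder = tokens[1:]
    else:
        remainder = tokens

    queries = []
    if remainder:
        # Rewrite 1: the remainder as a quoted, exact-phrase query.
        queries.append('"' + " ".join(remainder) + '"')

    # Rewrite 2: a looser conjunctive query over the content words.
    content = [t for t in remainder if t.lower() not in STOPWORDS]
    if content:
        queries.append(" AND ".join(content))

    return queries


print(rewrite_question("Who killed Abraham Lincoln?"))
```

Under these assumptions, the question "Who killed Abraham Lincoln?" yields the exact-phrase query "killed Abraham Lincoln" followed by a keyword query over the same three words. A practical system would likely use many more rewrite rules and would weight the results of more specific queries more heavily.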
Many text retrieval systems operate at the level of entire documents; in searching the Web, for example, complete Web pages or documents are returned. There has been a recent surge of interest in finer-grained analyses focused on obtaining answers to questions, rather than on retrieving potentially relevant documents or best-matching passages in response to queries, which are the tasks that information retrieval (IR) systems typically perform. Question answering nevertheless hinges on applying several key concepts from information retrieval, information extraction, machine learning, and natural language processing (NLP).
Automatic question answering from a single, constrained corpus is extremely challenging. Consider the difficulty of gleaning an answer to the question “Who killed Abraham Lincoln?” from a source which contains only the text “John Wilkes Booth altered history with a bullet. He will forever be known as the man who ended Abraham Lincoln's life.” As can be appreciated, however, question answering is far easier when the vast resources of the Web are brought to bear, since hundreds of Web pages contain the literal string “killed Abraham Lincoln.”
Most approaches to question answering use NLP techniques to augment standard information retrieval techniques. Systems typically identify candidate passages using IR techniques, and then perform more detailed linguistic analyses of the question and matching passages to find specific answers. A variety of linguistic resources (part-of-speech tagging, parsing, named entity extraction, semantic relations, dictionaries, WordNet, etc.) can be employed to support question answering.
In contrast to these rich natural language approaches, others have developed question-answering systems that attempt to solve the difficult matching and extraction problems by leveraging large amounts of data. In one such system, the redundancy provided by the Web is exploited to support question answering. Redundancy, as captured by multiple, differently phrased answer occurrences, facilitates question answering in two key ways. First, the larger the information source, the more likely it is that answers bearing close resemblance to the query can be found. It is quite straightforward to identify the answer to “Who killed Abraham Lincoln?” given the text, “John Wilkes Booth killed Abraham Lincoln in Ford's Theatre.” Second, even when no exact answer can be found, redundancy can facilitate the recognition of answers by enabling procedures to accumulate evidence across multiple matching passages. To exploit redundancy, however, a plurality of variously phrased queries may have to be submitted to one or more search engines. This type of approach may place an unacceptable performance burden on the search engines responding to the queries, especially considering the number of users that potentially utilize network resources.
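The accumulation of evidence across multiple matching passages can be illustrated with a simple n-gram voting scheme. This is a minimal sketch under stated assumptions, not the method of any system described above: the function names (`tokenize`, `candidate_ngrams`, `rank_answers`), the stopword list, and the tie-breaking rule that prefers longer n-grams are all hypothetical choices made for the example.

```python
import re
from collections import Counter

# Hypothetical stopword list for this sketch; "s" absorbs possessive endings.
STOPWORDS = {"the", "a", "an", "in", "by", "was", "is", "of", "s"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def candidate_ngrams(tokens, excluded, max_n=3):
    """Yield n-grams (as tuples) that contain no excluded token."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            if not any(tok in excluded for tok in gram):
                yield gram


def rank_answers(question, snippets, max_n=3):
    """Rank candidate answers by the number of snippets that contain them."""
    # Words from the question itself cannot be part of the answer.
    excluded = STOPWORDS | set(tokenize(question))
    votes = Counter()
    for snippet in snippets:
        # Using a set means each snippet votes at most once per candidate.
        votes.update(set(candidate_ngrams(tokenize(snippet), excluded, max_n)))
    # Sort by vote count, then prefer longer n-grams among ties.
    ranked = sorted(votes.items(), key=lambda kv: (-kv[1], -len(kv[0])))
    return [(" ".join(gram), count) for gram, count in ranked]


snippets = [
    "John Wilkes Booth killed Abraham Lincoln in Ford's Theatre.",
    "Abraham Lincoln was shot by John Wilkes Booth.",
    "The man who ended Abraham Lincoln's life was John Wilkes Booth.",
]
print(rank_answers("Who killed Abraham Lincoln?", snippets)[0])
```

In this sketch, no single passage needs to match the question literally; the candidate "john wilkes booth" rises to the top because it recurs in all three differently phrased passages. Note that the snippets would come from several queries, which is the source of the search-engine load discussed above.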