A significant number of natural language questions (e.g., “What is a hard disk”) are submitted to search engines on the web every day, and an increasing number of search services on the web specifically target natural language questions. For example, some services use databases of pre-compiled information, metasearching, and other proprietary methods, while other services facilitate interaction with human experts.
Many web search engines treat a natural language question as a list of terms and retrieve documents similar to the original query. However, documents with the best answers may contain few of the terms from the original query and therefore be ranked low by the search engine. These queries could be answered more precisely if a search engine recognized them as questions.
Often, it is not sufficient to submit a natural language question (e.g., “How do I tie shoelaces?”) to a search engine in its original form. Most search engines will treat such a query as a bag of terms and retrieve documents similar to the original query. Unfortunately, the documents with the best answers may contain only one or two terms present in the original query. Such useful documents may then be ranked low by the search engine, and will never be examined by typical users who do not look beyond the first page of results.
Consider the question “What is a hard disk?” The best documents for this query are probably not company websites of disk storage manufacturers, which may be returned by a general-purpose search engine, but rather hardware tutorials or glossary pages with definitions or descriptions of hard disks. A good response might contain an answer such as: “Hard Disk. One or more rigid magnetic disks rotating about a central axle with associated read/write heads and electronics, used to store data . . .” This definition can be retrieved by transforming the original question into a query {“hard disk” NEAR “used to”}. Intuitively, requiring the phrase “used to” biases most search engines towards retrieving this answer as one of the top-ranked documents.
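The transformation described above can be illustrated with a minimal sketch. The function name `transform_question`, the `TRANSFORMS` table, and the string form of the NEAR query are assumptions made for illustration only; the sole pairing taken from the text is the question phrase “what is a” with the answer phrase “used to”.

```python
import re

# Hypothetical table mapping a question-opening phrase to candidate answer
# phrases. Only the "what is a" -> "used to" pair comes from the text above.
TRANSFORMS = {
    "what is a": ["used to"],
}


def transform_question(question):
    """Rewrite a natural language question as a list of NEAR queries.

    Falls back to the unmodified question when no known question
    phrase matches.
    """
    # Lowercase and drop trailing punctuation and whitespace.
    normalized = re.sub(r"[?.!\s]+$", "", question.strip().lower())
    for prefix, answer_phrases in TRANSFORMS.items():
        if normalized.startswith(prefix + " "):
            topic = normalized[len(prefix):].strip()
            return ['"{}" NEAR "{}"'.format(topic, phrase)
                    for phrase in answer_phrases]
    return [question]


queries = transform_question("What is a hard disk?")
print(queries)
```

Whether a given search engine accepts a NEAR operator, and with what syntax, varies; the query string here simply mirrors the notation used in the text.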
A number of systems aim to extract answers from documents. For example, certain systems process the documents returned by the information retrieval system to extract answers. Questions are classified into one of a set of known “question types” that identify the type of entity corresponding to the answer. Documents are tagged to recognize entities, and passages surrounding entities of the correct type for a given question are ranked using a set of heuristics. Other systems re-rank and postprocess the results of regular information retrieval systems with the goal of returning the best passages. There are systems that combine statistical and linguistic knowledge for question answering and employ sophisticated linguistic filters to postprocess the retrieved documents and extract the most promising passages to answer a question.
The systems above generally retrieve documents or passages that are similar to the original question, using variations of standard TF-IDF term weighting schemes. The most promising passages are then chosen from the returned documents using heuristics and/or hand-crafted regular expressions. This approach is not optimal, because the initial retrieval step favors documents that are similar to the question, whereas the user is actually looking for documents containing an answer, and these documents may contain few of the terms used to ask the original question. This limitation is particularly important when retrieving documents is expensive or limited to a certain number of documents, as is the case with web search engines.
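The limitation can be seen in a small sketch of TF-IDF retrieval with cosine similarity. The three documents, the whitespace tokenizer, and the particular weighting (raw term frequency times log(N/df)) are assumptions chosen for illustration, not the scheme of any specific system mentioned above.

```python
import math
from collections import Counter


def tfidf_vectors(docs, query):
    """Return TF-IDF vectors (as dicts) for each document and for the query."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log(n_docs / count) for term, count in df.items()}

    def vectorize(tokens):
        tf = Counter(tokens)
        # Terms unseen in the collection have no idf and are skipped.
        return {t: tf[t] * idf[t] for t in tf if t in idf}

    return [vectorize(t) for t in tokenized], vectorize(query.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


docs = [
    "what is a hard disk ask our hard disk experts",                  # echoes the question
    "hard disk one or more rigid magnetic disks used to store data",  # contains the answer
    "shoelaces are tied with a bow knot",                             # unrelated
]
doc_vecs, query_vec = tfidf_vectors(docs, "what is a hard disk")
scores = [cosine(query_vec, vec) for vec in doc_vecs]
ranking = sorted(range(len(docs)), key=lambda i: -scores[i])
print(ranking, [round(s, 3) for s in scores])
```

In this toy collection the page that merely repeats the question scores well above the page that actually defines a hard disk, because the definition shares only the terms “hard” and “disk” with the query.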
Also related are methods for automatically expanding queries based on the relevance of terms in the top-ranked documents. One approach automatically expands a query based on the co-occurrence of the query terms with the terms in the top-ranked documents for the original query. In general, automatic query expansion systems expand queries at run time on a query-by-query basis, using an initial set of top-ranked documents returned by the information retrieval system in response to the original query.
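A simplified sketch of run-time query expansion follows. It assumes document-level co-occurrence counts over an already retrieved list of top-ranked documents; the function name `expand_query`, the parameter `num_new_terms`, and the sample documents are hypothetical, and real systems typically use more refined term weighting and stop-word handling.

```python
from collections import Counter


def expand_query(query, top_docs, num_new_terms=2):
    """Append the terms that most often co-occur with the query terms.

    Co-occurrence is counted per document: a candidate term scores one
    point for every top-ranked document that contains it together with
    at least one query term.
    """
    query_terms = query.lower().split()
    query_set = set(query_terms)
    counts = Counter()
    for doc in top_docs:
        tokens = set(doc.lower().split())
        if tokens & query_set:
            counts.update(tokens - query_set)
    # Sort by descending count, then alphabetically for a stable order.
    ranked = sorted(counts, key=lambda term: (-counts[term], term))
    return query_terms + ranked[:num_new_terms]


top_docs = [
    "hard disk drives store data",
    "a hard disk is used to store data",
    "disk storage prices",
]
expanded = expand_query("hard disk", top_docs)
print(expanded)
```

Because the expansion terms are drawn from documents returned for the original query, this kind of expansion needs an initial retrieval round for every query before the expanded query can be issued.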