This invention relates to search engines and other information retrieval tools.
With the explosive growth of information on the World Wide Web, there is an acute need for search engine technology to keep pace with users' need for searching speed and precision. Today's popular search engines, such as "Yahoo!" and "MSN.com", are used by millions of users each day to find information. Unfortunately, the basic search method has remained essentially the same as that of the first search engine introduced years ago.
Search engines have undergone two main evolutions. The first evolution produced keyword-based search engines. The majority of search engines on the Web today (e.g., Yahoo! and MSN.com) rely mainly on keyword searching. These engines accept a keyword-based query from a user and search in one or more index databases. For instance, a user interested in Chinese restaurants in Seattle may type in "Seattle, Chinese, Restaurants" or a short phrase "Chinese restaurants in Seattle".
Keyword-based search engines interpret the user query by focusing only on identifiable keywords (e.g., "restaurant", "Chinese", and "Seattle"). Because of their simplicity, keyword-based search engines can produce unsatisfactory search results, often returning many irrelevant documents (e.g., documents on the Seattle area or restaurants in general). In some cases, the engines return millions of documents in response to a simple keyword query, which often makes it impossible for a user to find the needed information.
This poor performance is primarily attributable to the inability of simple keywords to capture the complex search semantics a user wishes to express in the query. Keyword-based search engines simply interpret the user query without ascribing any intelligence to the form and expression entered by the user.
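The limitation described above can be illustrated with a minimal sketch of keyword matching over an inverted index. The documents, the stopword list, and the function names (`build_index`, `keyword_search`) are hypothetical and chosen only for illustration; they are not taken from any particular search engine.

```python
from collections import defaultdict

# Hypothetical stopword list; real engines use far larger ones.
STOPWORDS = {"in", "the", "a", "of", "on", "and"}


def tokenize(text):
    """Lowercase the text and strip simple punctuation from each word."""
    words = (w.strip(".,?!").lower() for w in text.split())
    return [w for w in words if w]


def build_index(docs):
    """Map each non-stopword keyword to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            if word not in STOPWORDS:
                index[word].add(doc_id)
    return index


def keyword_search(index, query):
    """Return every document that contains at least one query keyword."""
    hits = set()
    for word in tokenize(query):
        if word not in STOPWORDS:
            hits |= index.get(word, set())
    return sorted(hits)


# Illustrative documents only.
docs = {
    "d1": "Chinese restaurants in Seattle",
    "d2": "Seattle weather and ferry schedules",
    "d3": "Italian restaurants in Portland",
}
index = build_index(docs)
results = keyword_search(index, "Chinese restaurants in Seattle")
print(results)
```

In this toy example all three documents are returned, although only `d1` concerns Chinese restaurants in Seattle: `d2` matches on "Seattle" alone and `d3` on "restaurants" alone, because the relationship between the keywords is ignored.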
In response to this problem of keyword-based engines, a second generation of search engines evolved to go beyond simple keywords. The second-generation search engines attempt to characterize the user's query in terms of predefined frequently asked questions (FAQs), which are manually indexed from user logs along with corresponding answers. One key characteristic of FAQ searches is that they take advantage of the fact that commonly asked questions are far fewer than the total number of questions, and thus can be manually entered. By analyzing user logs, these engines can determine which questions are most commonly asked. With these search engines, one level of indirection is added by asking the user to confirm one or more rephrased questions in order to find an answer. A prime example of a FAQ-based search engine is the engine employed at the Web site "Askjeeves.com".
Continuing the example of locating a Chinese restaurant in Seattle, suppose a user at the "Askjeeves.com" site enters the following search query:
"What Chinese restaurants are in Seattle?"
In response to this query, the search engine at the site rephrases the question as one or more FAQs, as follows:
How can I find a restaurant in Seattle?
How can I find a yellow pages listing for restaurants in Seattle, Wash.?
Where can I find tourist information for Seattle?
Where can I find geographical resources from Britannica.com on Seattle?
Where can I find the official Web site for the city of Seattle?
How can I book a hotel in Seattle?
If any of these rephrased questions accurately reflects the user's intention, the user is asked to confirm the rephrased question to continue the searching process. Results from the confirmed question are then presented.
An advantage of this style of interaction and cataloging is much higher precision. Whereas the keyword-based search engines might return thousands of results, the FAQ-based search engine often yields a few very precise results as answers. It is plausible that this style of FAQ-based search engines will enjoy remarkable success in limited domain applications, such as web-based technical support.
However, the FAQ-based search engines are also limited in their understanding of the user's query, because they only look up frequently occurring words in the query, and do not perform any deeper syntactic or semantic analysis. In the above example, the search engine still experiences difficulty locating "Chinese restaurants", as exemplified by the omission of the modifier "Chinese" in any of the rephrased questions. While FAQ-based second-generation search engines have improved search precision, there remains a need for further improvement in search engines.
Another problem with existing search engines is that most people are dissatisfied with the user interface (UI). The chief complaint is that the UI is not designed to allow people to express their intention. Users often browse the Internet with the desire to obtain useful information. For keyword-based search engines, two main problems hinder the discovery of user intention. First, it is not easy for users to express their intention with simple keywords. Second, keyword-based search engines often return too many results unrelated to the users' intention. For example, a user may want to get travel information about Beijing. On entering 'travel' as a keyword query in Yahoo!, the user is given 289 categories and 17925 sites, and the travel information about Beijing is nowhere in the first 100 items.
Existing FAQ-based search engines offer UIs that allow entry of pseudo natural language queries to search for information. However, the underlying engine does not try to understand the semantics of the query or users' intention. Indeed, the user's intention and the actual query are sometimes different.
Accordingly, there is a further need to improve the user interface of search engines to better capture the user's intention as a way to provide higher quality search results.
A search engine architecture is designed to handle a full range of user queries, from complex sentence-based queries to simple keyword searches. The search engine architecture includes a natural language parser that parses a user query and extracts syntactic and semantic information. The parser is robust in the sense that it not only returns fully-parsed results (e.g., a parse tree), but is also capable of returning partially-parsed fragments in those cases where more accurate or descriptive information in the user query is unavailable. This is particularly beneficial in comparison to previous efforts that utilized full parsers (i.e., not robust parsers) in information retrieval. Whereas full parsers tended to fail on many reasonable sentences that were not strictly grammatical, the search engine architecture described herein always returns the best fully-parsed or partially-parsed interpretation possible.
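The fallback behavior of a robust parser can be sketched as follows. This is a toy illustration under stated assumptions, not the parser of the described architecture: the regular-expression "grammar", the `ParseResult` type, and the function `robust_parse` are all hypothetical. A real natural language parser would use a far richer grammar.

```python
import re
from dataclasses import dataclass, field


@dataclass
class ParseResult:
    kind: str                                      # "full", "partial", or "keywords"
    concepts: dict = field(default_factory=dict)   # extracted semantic slots
    keywords: list = field(default_factory=list)   # plain words, used as a last resort


# Hypothetical one-rule "grammar" for a complete question.
FULL_PATTERN = re.compile(
    r"^what (?P<cuisine>\w+) restaurants are in (?P<city>\w+)\??$",
    re.IGNORECASE,
)

# Hypothetical fragment rules used when the full rule does not apply.
FRAGMENT_PATTERNS = {
    "city": re.compile(r"\bin (?P<value>\w+)", re.IGNORECASE),
    "cuisine": re.compile(r"\b(?P<value>\w+) restaurants?\b", re.IGNORECASE),
}


def robust_parse(query):
    """Return a full parse if possible, else fragments, else bare keywords."""
    text = query.strip()
    match = FULL_PATTERN.match(text)
    if match:
        concepts = {k: v.lower() for k, v in match.groupdict().items()}
        return ParseResult("full", concepts=concepts)

    concepts = {}
    for name, pattern in FRAGMENT_PATTERNS.items():
        found = pattern.search(text)
        if found:
            concepts[name] = found.group("value").lower()
    if concepts:
        return ParseResult("partial", concepts=concepts)

    words = [w.lower() for w in re.findall(r"\w+", text)]
    return ParseResult("keywords", keywords=words)


for q in ("What Chinese restaurants are in Seattle?",
          "restaurants in Seattle please",
          "Seattle Chinese"):
    print(q, "->", robust_parse(q))
```

The point of the sketch is that the parser never fails outright: an ungrammatical query degrades to fragments, and a bare word list degrades to keywords.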
The search engine architecture has a question matcher to match the fully-parsed output and the partially-parsed fragments to a set of frequently asked questions (FAQs) stored in a database. The question matcher correlates the questions with a group of possible answers arranged in standard templates that represent possible solutions to the user query.
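One way to picture the question matcher is as a lookup from parsed concepts to FAQ entries, each paired with an answer template. The sketch below is only an assumption about how such matching might work: the `FaqEntry` fields, the `topic` concept, the sample FAQ entries, and the answer template strings are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FaqEntry:
    topic: str      # subject the FAQ is about
    slots: tuple    # concept names the FAQ needs filled in
    question: str   # question text with {slot} placeholders
    answer: str     # answer template with {slot} placeholders


# Hypothetical FAQ database.
FAQ_DB = [
    FaqEntry("restaurant", ("cuisine", "city"),
             "Where can I find {cuisine} restaurants in {city}?",
             "restaurant_listing(cuisine={cuisine}, city={city})"),
    FaqEntry("restaurant", ("city",),
             "How can I find a restaurant in {city}?",
             "restaurant_listing(city={city})"),
    FaqEntry("hotel", ("city",),
             "How can I book a hotel in {city}?",
             "hotel_booking(city={city})"),
]


def match_questions(concepts, faq_db=FAQ_DB):
    """Return (question, answer) pairs whose slots the parsed concepts can fill.

    Entries that use more of the parsed concepts are listed first.
    """
    matches = []
    for entry in faq_db:
        if entry.topic != concepts.get("topic"):
            continue
        if all(slot in concepts for slot in entry.slots):
            matches.append((len(entry.slots),
                            entry.question.format(**concepts),
                            entry.answer.format(**concepts)))
    matches.sort(key=lambda m: m[0], reverse=True)
    return [(question, answer) for _, question, answer in matches]


fully_parsed = {"topic": "restaurant", "cuisine": "Chinese", "city": "Seattle"}
fragment = {"topic": "restaurant", "city": "Seattle"}
print(match_questions(fully_parsed))
print(match_questions(fragment))
```

With the fully-parsed concepts, the more specific question about Chinese restaurants ranks first; with only the fragment, the matcher can still offer the general restaurant question.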
The search engine architecture also has a keyword searcher to locate other possible answers by searching on any keywords returned from the parser. The search engine may be configured to search content in databases or on the Web to return possible answers.
The search engine architecture includes a user interface to facilitate entry of a natural language query and to present the answers returned from the question matcher and the keyword searcher. The user is asked to confirm which answer best represents his/her intentions when entering the initial search query.
The search engine architecture logs the queries, the answers returned to the user, and the user's confirmation feedback in a log database. The search engine has a log analyzer to evaluate the log database and glean information that improves performance of the search engine over time. For instance, the search engine uses the log data to train the parser and the question matcher. As part of this training, the log analyzer is able to derive various weighting factors indicating how relevant a question is to a parsed concept returned from the parser, or how relevant a particular answer is to a particular question. These weighting factors help the search engine obtain results that are more likely to be what the user intended based on the user's query.
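As a rough illustration of how weighting factors might be derived from logged feedback, the sketch below estimates the relevance of an answer to a question as the fraction of times the answer was confirmed when shown. This ratio is an assumed, simple estimator chosen for illustration; the log record layout and the function `derive_weights` are likewise hypothetical.

```python
from collections import Counter


def derive_weights(log):
    """Estimate answer-to-question relevance as confirmations / times shown."""
    shown = Counter()
    confirmed = Counter()
    for record in log:
        question = record["question"]
        for answer in record["answers_shown"]:
            shown[(question, answer)] += 1
        if record["confirmed"] is not None:
            confirmed[(question, record["confirmed"])] += 1
    return {pair: confirmed[pair] / count for pair, count in shown.items()}


# Hypothetical log records: the question matched, the answers presented,
# and the answer the user confirmed (None if the user confirmed nothing).
log = [
    {"question": "Q1", "answers_shown": ["A1", "A2"], "confirmed": "A1"},
    {"question": "Q1", "answers_shown": ["A1", "A2"], "confirmed": "A1"},
    {"question": "Q1", "answers_shown": ["A1", "A2"], "confirmed": "A2"},
    {"question": "Q1", "answers_shown": ["A1", "A2"], "confirmed": None},
]
weights = derive_weights(log)
print(weights)
```

Here answer `A1` was confirmed in two of the four sessions in which it was shown and `A2` in one of four, so `A1` would be ranked ahead of `A2` for question `Q1` in later searches.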
In this manner, depending upon the intelligence provided in the query, the search engine's ability to identify relevant answers can be statistically measured in terms of a confidence rating. Generally, the confidence ratings of an accurate and precise search improve with the ability to parse the user query. Search results based on a fully-parsed output typically garner the highest confidence rating because the search engine uses most of the information in the user query to discern the user's search intention. Search results based on a partially-parsed fragment typically receive a comparatively moderate confidence rating, while search results based on keyword searching are given the lowest confidence rating.
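The tiered confidence ratings can be sketched as a base rating per result source, scaled by a relevance weight. The numeric values in `BASE_CONFIDENCE` and the multiplication used to combine them with a weight are assumptions made purely for illustration; only their ordering (full above partial above keywords) reflects the description above.

```python
# Illustrative base ratings; only their relative order is meaningful here.
BASE_CONFIDENCE = {"full": 0.9, "partial": 0.6, "keywords": 0.3}


def rank_answers(candidates):
    """Order candidate answers by confidence, highest first.

    Each candidate is (answer, source_kind, weight), where source_kind names
    how the answer was found and weight is a relevance factor in [0, 1].
    """
    scored = [(BASE_CONFIDENCE[kind] * weight, answer)
              for answer, kind, weight in candidates]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [answer for _, answer in scored]


# Hypothetical candidates gathered from the three search paths.
candidates = [
    ("keyword_hit", "keywords", 1.0),
    ("faq_from_full_parse", "full", 1.0),
    ("faq_from_fragment", "partial", 1.0),
]
print(rank_answers(candidates))
```

With equal weights, the answer derived from the fully-parsed output is listed first, the answer from the partially-parsed fragment second, and the keyword hit last.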