Today, Internet technologies link people together regardless of location. The rapid growth of Internet use and the explosion of technological innovations it has engendered have fueled the growth of Web-based solutions designed to help individuals deal with the overwhelming amount of online information available on their desktops.
One such Web-based solution is a search engine that allows individuals to search for and retrieve information across a network of changing resources. However, a simple search will typically return too many matching documents to be useful, and many if not all of the returned documents may be irrelevant to the user's need. Thus, few Web-based information search and retrieval applications enable users to discover specific answers to a question or even locate the documents most likely to contain the answer. This outcome is especially true when a user's query or question includes commonly used words and/or refers to generic concepts or trends.
To address the need to find more precise answers to a user's query, a new generation of search engines, or “question answering systems”, has been developed (e.g., the AskJeeves® question answering system that is located on the Web at http://www.askjeeves.com). Unlike traditional search engines, which only use keywords to match documents, this new generation of search engines first attempts to “understand” user questions by suggesting similar questions that other people have often asked and for which the system already has the correct answers. (The correct answers are typically pre-canned because they have been prepared in advance by human editors.) Thus, if a system-suggested question is truly similar to the user's question, the answer provided by the system will be relevant.
The common assumption behind such question answering systems is that many people are typically interested in the same questions, which are also called the “Frequently Asked Questions/Queries”, or “FAQs”. If the system can correctly identify these FAQs, then various forms of the user questions can be answered with more precision.
A number of human editors typically work to improve the contents of a search engine's hosting Website so that users can find relevant information on the Website in a more precise manner. Their work mainly concerns the following two aspects: (1) if the search engine does not provide sufficient information for some often asked questions, the editors will add more documents to it to answer these questions; and (2) if many users ask the same questions (FAQs) in a certain period of time (a hot topic), then the answers to these questions will be checked manually and directly linked to the questions. However, it is not a simple procedure for human editors to evaluate which user-submitted questions/queries are FAQs and which are not. One reason is that user-submitted queries typically differ not only in form but also in intention.
For example, consider one query clustering approach wherein queries are represented as respective sets of keywords. If a first query and a different second query share one or more of these keywords, they are considered to be somewhat similar queries. By extension, it is traditionally thought that the more keywords two queries share, the greater the similarity between them, and the more important those shared keywords are in identifying other similar queries.
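The keyword-set comparison described above can be sketched roughly as follows. This is only an illustration, not the method of any particular search engine: the tokenizer, the stopword list, and the choice of the Jaccard ratio (shared keywords divided by all distinct keywords) as the normalization are assumptions made for this example, as are the names `keywords` and `keyword_similarity`.

```python
import re

# Hypothetical stopword list, chosen only for this illustration.
STOPWORDS = {"a", "an", "the", "of", "to", "in", "for", "is", "how", "do", "i", "what"}


def keywords(query: str) -> set[str]:
    """Represent a query as the set of its lowercase non-stopword tokens."""
    tokens = re.findall(r"[a-z0-9]+", query.lower())
    return {t for t in tokens if t not in STOPWORDS}


def keyword_similarity(q1: str, q2: str) -> float:
    """Jaccard overlap of the two keyword sets (assumed normalization).

    The score grows as the queries share more keywords and is 0.0 when
    they share none or when either query has no keywords at all.
    """
    k1, k2 = keywords(q1), keywords(q2)
    if not k1 or not k2:
        return 0.0
    return len(k1 & k2) / len(k1 | k2)


if __name__ == "__main__":
    # Two keywords shared out of four distinct keywords.
    print(keyword_similarity("how to install python", "install python on windows"))
```

Under this measure, two queries are grouped together whenever their score exceeds some chosen threshold; the threshold itself would be a tuning decision and is not specified here.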
Unfortunately, there are a number of problems associated with traditional query clustering techniques. One problem, for example, is that a particular keyword that is shared by two queries may not represent the same information need in each query (e.g., the keyword “table” may refer to a computer software data structure, an image in a document, a furniture item, and so on). Additionally, different keywords may refer to the same concept as the particular keyword (e.g., the concept denoted by the keyword “table” may also be referenced in other queries with the following keywords: “diagram”, “bench”, “schema”, “desk”, etc.). Therefore, the calculated similarity between two semantically similar queries may be small, while the calculated similarity between two semantically unrelated queries may be high, especially when queries are short.
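A small sketch may make this failure mode concrete. The three queries below are hypothetical, and the whitespace tokenizer and Jaccard-style `overlap` function are assumptions made for illustration only; they stand in for any measure that counts shared keywords.

```python
def overlap(q1: str, q2: str) -> float:
    """Share of distinct words that the two queries have in common."""
    a, b = set(q1.lower().split()), set(q2.lower().split())
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical queries: the first and third express the same need
# (a database table), while the second concerns furniture.
software_query = "sql table design"
furniture_query = "dining table design"
synonym_query = "database schema layout"

# Different information needs, yet two of four distinct words are shared.
unrelated_score = overlap(software_query, furniture_query)

# The same information need, yet no words are shared.
related_score = overlap(software_query, synonym_query)

if __name__ == "__main__":
    print(unrelated_score, related_score)
```

Here the unrelated pair scores higher than the related pair, which is the opposite of what a clustering step needs. With queries this short, a single ambiguous word such as “table” dominates the score.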
In view of the above, it is apparent that traditional query clustering techniques are often ineffective because of common non-correspondence between keywords and keyword meanings. Accordingly, the following described subject matter addresses these and other problems associated with evaluating the similarity between various queries so that similar queries can be clustered together to rapidly determine FAQs.