Developing search expressions that both convey a user's information need and match the way that need is expressed within the vocabulary of target documents has long been recognized as a difficult cognitive task for users of text search engines. A large majority of search engine users begin their search for a document with a query having only one or two words, and are then disappointed when they do not find the document or documents they want within the first ten or so results produced by the search engine. While user satisfaction can be improved, at least for some searches, by improving the manner in which results are ranked, very broad search queries cannot satisfy the more specific information desires of many different search engine users. One way to help a user refine a query expression is to offer term suggestions, just as a librarian might do in a face-to-face interaction with an information seeker. Doing this automatically, however, is quite different, since the system must “guess” which terms, out of hundreds that may be conceptually related to a query, are most likely to be relevant to users conducting a search. Common approaches for choosing related terms include consulting an online thesaurus or a database of previously logged queries (which can be searched to find previous queries that contain one or more words in the current query). A weakness of such approaches is that there is no guarantee that the related terms so generated actually reflect the subject matter or vocabulary used within the corpus of documents itself. For this reason, alternative approaches that attempt to glean related terms dynamically from the actual results of a query have received much interest.
Some prior approaches that use a search result set to generate refinement suggestions include term relevance feedback (e.g. Vélez et al., Fast and Effective Query Refinement, in Proceedings of SIGIR'97, pp. 6-15), Hyperindex (Bruza and Dennis, Query Reformulation on the Internet: Empirical Data and the Hyperindex Search Engine, in Proceedings of RIAO'97, pp. 500-509), Paraphrase (Anick and Tipirneni, The Paraphrase Search Assistant: Terminological Feedback for Iterative Information Seeking, in Proceedings of SIGIR'99, pp. 153-159) and clustering (Zamir and Etzioni, Web Document Clustering: A Feasibility Demonstration, in Proceedings of SIGIR'98, pp. 46-54). Most relevance feedback methods have been designed for partial match search engines and typically involve broadening a query expression by the addition of multiple weighted terms derived from computations over a subset of retrieved documents explicitly tagged as relevant or non-relevant by a user. Hyperindex runs a syntactic analyzer over snippets returned by a search engine to extract noun phrases that contain the query term. Paraphrase extracts noun phrases from result set documents and chooses feedback terms to display based on lexical dispersion. Clustering approaches attempt to cluster result set snippets and derive representative query terms from the terms appearing within the respective clusters. While many of these approaches are functional, they are somewhat unsatisfactory for very large web search engines, either for reasons of runtime performance or relevance of feedback terms generated. There remains a need in the art for effective methods for assisting a user in identifying relevant search terms to improve a search.
To better understand the limitations of the prior art, a closer review of Vélez et al., Fast and Effective Query Refinement, in Proceedings of SIGIR'97, pp. 6-15, is warranted. Vélez et al. provides a system and method for query refinement in which terms from automated suggestions are added to an initial query in order to refine the initial query. In Vélez et al., the authors build upon the generic query refinement program DM. As put forth in Vélez et al., DM has the following steps:
Let
    C = document corpus
    q = user query
    r = number of matching documents to consider
    Wfcn(S) = algorithm-specific weight of term set S

Then,
    1. Compute the set of documents D(q) ⊆ C that match the query q.
    2. Select a subset Dr(q) of the top r matching documents.
    3. Compute the set of terms T(q) from the documents Dr(q) such that T(q) = {t | ∃d ∈ Dr(q): t ∈ d}, where d is a document and t is a term.
    4. Compute the subset S of n terms from T(q) with the highest weight Wfcn(S).
    5. Present S to the user as the set of term suggestions.

As noted in Vélez et al., this approach is unsatisfactory because it is an expensive run-time technique. In other words, computing the set of term suggestions S using DM takes an unsatisfactory amount of time in cases where the document database (corpus) is large.
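For illustration only, the DM steps listed above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation of Vélez et al.: the function name dm_suggest, the Boolean AND matching, the use of corpus order in place of a ranking function, the exclusion of query terms from the suggestions, and the use of document frequency within Dr(q) as a per-term stand-in for Wfcn are all hypothetical choices made here.

```python
from collections import Counter


def dm_suggest(corpus, query, r=10, n=5):
    """Sketch of the generic DM loop.

    corpus: list of documents, each a list of lowercase terms (assumed format).
    query:  whitespace-separated query string.
    r:      number of matching documents to consider.
    n:      number of suggestions to return.
    """
    query_terms = set(query.lower().split())

    # Step 1: D(q), the documents that contain every query term
    # (simple Boolean AND matching; an assumption of this sketch).
    matches = [doc for doc in corpus if query_terms <= set(doc)]

    # Step 2: Dr(q), the top r matches. No ranking function is modeled,
    # so corpus order stands in for rank (assumption).
    top = matches[:r]

    # Steps 3 and 4: collect T(q) and weight each term by the number of
    # documents in Dr(q) that contain it (a stand-in for Wfcn).
    doc_freq = Counter()
    for doc in top:
        doc_freq.update(set(doc) - query_terms)

    # Step 5: return the n highest-weighted terms, ties broken alphabetically.
    ranked = sorted(doc_freq.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:n]]
```

Every call scans the matching documents and recounts their terms at query time, which is the run-time cost noted above.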
Vélez et al. seeks to improve on the speed of DM by precomputing a substantial amount of the work that is done dynamically by DM. In this precomputation phase, Vélez et al. generates a data structure that maps each single-word term t in the corpus to a respective set of terms m that the DM algorithm would suggest given the single-term query t. Then, at run time, an arbitrary query is received from the user. The query typically comprises a set of terms. In response to the query, Vélez et al. collects the respective sets of terms m corresponding to each of the terms in the query and merges these sets into a single set that is then returned to the user as suggestions for an improved search. For example, consider the case in which the user enters the query “space shuttle”. In this instance, Vélez et al. would obtain the set of terms m that has been precomputed for the word “space” and the set of terms m that has been precomputed for the word “shuttle”, and would merge them together in order to derive a set of suggested terms for the query “space shuttle”.
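The precompute-then-merge scheme just described can be illustrated with the following Python sketch. It is a simplified illustration, not the method of Vélez et al.: the function names, the document-frequency weighting, and the order-preserving union used as the merge step are assumptions made here, since the merge procedure is not detailed above.

```python
from collections import Counter


def single_term_suggestions(corpus, term, n=3):
    """Suggestions a DM-style pass would give for the one-word query `term`.

    Weighting by document frequency is an assumption of this sketch.
    """
    doc_freq = Counter()
    for doc in corpus:
        terms = set(doc)
        if term in terms:
            doc_freq.update(terms - {term})
    ranked = sorted(doc_freq.items(), key=lambda kv: (-kv[1], kv[0]))
    return [t for t, _ in ranked[:n]]


def precompute(corpus, n=3):
    """Off-line phase: map every single-word term t to its suggestion set m."""
    vocabulary = {t for doc in corpus for t in doc}
    return {t: single_term_suggestions(corpus, t, n) for t in vocabulary}


def suggest(table, query):
    """Run-time phase: look up each query term and merge the stored sets.

    The merge is a plain order-preserving union (assumption).
    """
    query_terms = query.lower().split()
    merged = []
    for qt in query_terms:
        for s in table.get(qt, []):
            if s not in merged and s not in query_terms:
                merged.append(s)
    return merged


# Hypothetical toy corpus: each document is a list of terms.
corpus = [
    ["space", "shuttle", "nasa"],
    ["space", "bar", "keyboard"],
    ["space", "bar", "typing"],
    ["shuttle", "bus", "airport"],
    ["shuttle", "bus", "hotel"],
]
table = precompute(corpus)
print(suggest(table, "space shuttle"))
# prints ['bar', 'keyboard', 'nasa', 'bus', 'airport', 'hotel']
```

In this toy corpus the run-time step is only a pair of dictionary lookups, but the merged list is assembled from what each word suggests on its own, without regard to the two words appearing together in the query.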
While this approach improves runtime performance by precomputing a subset of term relationships off-line, the Vélez et al. approach has drawbacks. First, there is a context problem. The Vélez et al. approach relies on the assumption that the set of terms m relevant to a given term t is the same regardless of whether the term t appears by itself or as part of a multi-term query. However, this assumption is not always true. A term appearing within a multi-term phrase can in some instances express a completely different meaning relative to the term appearing by itself. Because of the underlying assumption in Vélez et al., the approach can potentially lead to inappropriate search term suggestions in some instances or else miss other suggestions that would be more relevant within the context of the entire query. Second, when the corpus (document database) changes, the Vélez et al. approach requires that the sets of terms m respectively associated with terms t in the corpus be recomputed, because each set of terms m depends on the contents of a plurality of files in the corpus including, possibly, files that have recently been added to the corpus.
Xu and Croft, SIGIR'97, pp. 4-11, describe another approach in which sets of terms that are related to a given concept are precomputed before a search query, which may include several concepts (search terms), is received. Like the Vélez et al. approach, the Xu and Croft method relies on the construction of static cross-document data structures and statistics that necessitate extensive recomputation of terms associated with concepts as the corpus changes over time. Accordingly, the computational demands of Xu and Croft are unsatisfactory for very large, dynamic document databases.
Given the above background, it would be desirable to provide assistance to users in refining their search queries into more narrowly defined queries, so as to produce search results more to their liking.