People increasingly rely on the World Wide Web (“Web”) to satisfy diverse information needs. To meet these needs, existing search engine technology allows users to input a query consisting of one or more keywords to search for Web documents containing those keywords. Users typically select such keywords because they are thought to be related to the information being sought. Often, however, the selected keywords are not good descriptors of relevant document contents.
One reason for this is that most words in natural language have inherent ambiguity. Such ambiguity often results in search engine keyword/document term mismatch problems. Very short queries amplify such mismatch problems. Additionally, vocabularies used by Web content authors can vary greatly. In light of this, generating a search engine query that returns a list of documents relevant to a user is a difficult problem. In an effort to address this problem, search engine services typically expand queries (i.e., add terms/keywords). Unfortunately, existing query expansion techniques are limited in several respects.
One limitation, for example, is that global analysis query expansion techniques do not typically address term mismatch. Global analysis techniques are based on the analysis of a corpus of data to generate statistical similarity matrixes of term pair co-occurrences. Such corpus-wide analysis is typically resource intensive, requiring substantial computer processing, memory, and data storage resources. The similarity matrixes are used to expand a query with additional terms that are most similar to the terms already in the query. By only adding “similar” terms to the query, and by not addressing the ambiguities that are inherent between words in language, this global analysis approach to query expansion does not address term mismatch, which is one of the most significant problems in query expansion.
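The global analysis approach described above can be illustrated with a minimal sketch. It assumes whitespace tokenization, document-level co-occurrence counts, and a cosine-style normalization; the function names `build_similarity` and `expand_query` are hypothetical and are not taken from any particular system.

```python
from collections import Counter
from itertools import combinations
import math


def build_similarity(corpus):
    """Corpus-wide pass: score term pairs by document-level co-occurrence."""
    doc_freq = Counter()   # number of documents containing each term
    pair_freq = Counter()  # number of documents containing both terms of a pair
    for doc in corpus:
        terms = sorted(set(doc.lower().split()))
        doc_freq.update(terms)
        pair_freq.update(combinations(terms, 2))
    sim = {}
    for (a, b), n in pair_freq.items():
        # Assumed normalization: co-occurrences over the geometric mean of frequencies.
        score = n / math.sqrt(doc_freq[a] * doc_freq[b])
        sim.setdefault(a, {})[b] = score
        sim.setdefault(b, {})[a] = score
    return sim


def expand_query(query, sim, k=2):
    """Append the k terms most similar to the terms already in the query."""
    terms = query.lower().split()
    scores = Counter()
    for t in terms:
        for other, s in sim.get(t, {}).items():
            if other not in terms:
                scores[other] += s
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return terms + [t for t, _ in ranked[:k]]


corpus = ["jaguar car engine", "jaguar car dealer", "jaguar cat jungle"]
sim = build_similarity(corpus)
expanded = expand_query("jaguar", sim, k=1)
```

In this toy corpus the query "jaguar" is expanded with "car" simply because that pairing is the most frequent, whether or not the user meant the animal. This is the sense in which adding only "similar" terms leaves the ambiguity unaddressed.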
In another example, some query expansion techniques require explicit relevance information from the user, which can only be obtained by interrupting the task that the user is currently performing. To obtain this information, after submitting a query to a search engine and receiving a list of documents, rather than browsing the documents in the document list or submitting a new query, the user is asked to manually rank the relevance of the documents in the list. This may be accomplished by check-box selection, enumeration, or otherwise indicating that particular ones of the documents in the list are more relevant than others.
If the user volunteers and manually ranks the documents in the list, subsequent queries submitted to the search engine are then expanded with term(s) extracted from the documents that the user specifically marked as being relevant. Unfortunately, users are often reluctant to interrupt their immediate activities to provide such explicit relevance feedback. Without that feedback, the search engine has no indication of whether the user considered one document to be more relevant than another, and therefore no indication of which term(s) are more relevant than others to a particular query. For this reason, explicit relevance feedback techniques are seldom used to expand queries.
In another example, some query expansion techniques automatically assume that the top-ranked document(s) that are returned to the user in response to a query are relevant. The original queries from the user are then expanded with term(s) extracted from such top-ranked document(s). This technique becomes substantially problematic when a large fraction of the top-ranked documents are actually not relevant to the user's information need. In this situation, words drawn from such documents and added to the query are often unrelated to the information being sought and the quality of the documents retrieved using such an expanded query is typically poor.
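A minimal sketch of this top-ranked-document assumption follows. It assumes whitespace tokenization and raw term counts, and the function name `pseudo_relevance_expand` and its parameters `top_n` and `k` are hypothetical, chosen only for illustration.

```python
from collections import Counter


def pseudo_relevance_expand(query, ranked_docs, top_n=2, k=1):
    """Expand a query with the most frequent terms of the top_n ranked documents.

    The top_n documents are assumed to be relevant; no user judgment is consulted.
    """
    terms = query.lower().split()
    counts = Counter()
    for doc in ranked_docs[:top_n]:
        counts.update(w for w in doc.lower().split() if w not in terms)
    # Sort by descending count, breaking ties alphabetically for determinism.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return terms + [w for w, _ in ranked[:k]]


ranked_docs = ["jaguar car dealer", "jaguar car price", "jaguar cat habitat"]
expanded = pseudo_relevance_expand("jaguar", ranked_docs, top_n=2, k=1)
```

If the user was actually looking for the animal, both top-ranked documents are non-relevant, and the added term "car" steers the expanded query further from the information being sought.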
In another example, some query expansion techniques extract noun groups or “concepts” from a set of top-ranked documents. These noun groups are selected based on their co-occurrence with query terms, not on the frequency with which the term(s) appear in the top-ranked documents. This technique rests on the hypothesis that a common term from the top-ranked documents will tend to co-occur with all query terms within those documents. This hypothesis is not always true and often leads to improper query expansion. Moreover, this technique operates in the document space only, without considering any judgments from users. It requires a distinct difference between the cluster of relevant documents and the cluster of non-relevant documents in the retrieval result. That requirement is met in many cases but does not always hold, especially for inherently ambiguous queries.
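The co-occurrence hypothesis can be sketched as follows. This is a simplified illustration, not the scoring function of any specific technique: it treats single whitespace-separated words as candidate "concepts" and scores each candidate by the product of its document-level co-occurrence counts with every query term, so a candidate that never co-occurs with one of the query terms scores zero. The name `cooccurrence_expand` is hypothetical.

```python
def cooccurrence_expand(query, top_docs, k=1):
    """Pick expansion terms that co-occur with all query terms in the top documents."""
    q_terms = query.lower().split()
    docs = [set(d.lower().split()) for d in top_docs]
    candidates = set().union(*docs) - set(q_terms)
    scores = {}
    for c in candidates:
        score = 1
        for q in q_terms:
            # Number of top documents containing both the candidate and this query term.
            score *= sum(1 for d in docs if c in d and q in d)
        if score > 0:
            scores[c] = score
    # Sort by descending score, breaking ties alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return q_terms + [c for c, _ in ranked[:k]]


top_docs = [
    "java island indonesia travel",
    "java programming language",
    "island travel guide",
]
expanded = cooccurrence_expand("java island", top_docs, k=1)
```

Here "travel" is selected because it co-occurs with both query terms, while "programming" is excluded because it never appears alongside "island". The selection depends entirely on which documents were ranked highly; no user judgment enters the computation.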
In light of the above, further innovation to select relevant terms for query expansion is greatly desired.