Term mismatch can be a challenge when performing a search. For instance, a query and its relevant documents are often composed using different vocabularies and language styles, which can cause term mismatch. Conventional algorithms utilized by search engines to match documents to queries may be detrimentally impacted by term mismatch, and thus, query expansion (QE) is oftentimes employed to address this challenge. Query expansion can expand a query issued by a user with additional relevant terms, called expansion terms, so that more relevant documents can be retrieved.
Various conventional QE techniques have been implemented for information retrieval (IR). Some traditional QE techniques based on automatic relevance feedback (e.g., explicit feedback and pseudo-relevance feedback (PRF)) can enhance performance of IR. Yet, such techniques may not be directly applicable to a commercial web search engine because relevant documents may be unavailable. Moreover, generation of pseudo-relevant documents can employ multi-phase retrieval, which may be expensive and time-consuming to perform in real time.
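The multi-phase nature of pseudo-relevance feedback can be illustrated with a minimal sketch. It assumes a toy in-memory corpus, whitespace tokenization, and a simple term-overlap ranking; the names `retrieve` and `prf_expand` are hypothetical and do not correspond to any particular search engine. The point of the sketch is that the corpus is searched twice per query, which is the source of the real-time cost noted above.

```python
from collections import Counter


def retrieve(query_terms, corpus, n):
    """Rank documents by the number of distinct query terms they contain."""
    scored = []
    for doc_id, text in enumerate(corpus):
        overlap = len(set(query_terms) & set(text.lower().split()))
        if overlap > 0:
            scored.append((overlap, doc_id))
    # Highest overlap first; ties broken by document id for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:n]]


def prf_expand(query, corpus, feedback_docs=2, k=2):
    """Return (expansion_terms, final_ranking) using two retrieval phases."""
    query_terms = query.lower().split()

    # Phase 1: initial retrieval; the top results are assumed to be relevant.
    top = retrieve(query_terms, corpus, feedback_docs)

    # Count terms in the pseudo-relevant documents that are not in the query.
    counts = Counter()
    for doc_id in top:
        counts.update(
            t for t in corpus[doc_id].lower().split() if t not in query_terms
        )
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    expansion = [term for term, _ in ranked[:k]]

    # Phase 2: retrieve again with the expanded query.
    return expansion, retrieve(query_terms + expansion, corpus, len(corpus))


corpus = [
    "jaguar car speed engine",
    "jaguar car dealer engine",
    "big cat habitat",
    "engine repair car manual",
]
expansion, results = prf_expand("jaguar", corpus)
```

In this toy corpus, the last document does not contain the original query term and is reachable only through the expansion terms found in the first phase.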
More recently developed QE techniques utilize search logs (e.g., clickthrough data). These techniques, called log-based QE, can also derive expansion terms for a query from a (pseudo-)relevant document set. However, unlike techniques based on automatic relevance feedback, log-based QE techniques can identify the relevant set from user clicks recorded in search logs. For example, the set of (pseudo-)relevant documents of an input query can be formed by including the documents that have been previously clicked for the query. Many conventional log-based QE techniques use a global model that is pre-computed from search logs. The model can capture the correlation between query terms and document terms, and can be used to generate expansion terms for the input query on the fly.
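As a rough illustration of such a global model, the sketch below pre-computes term correlations from (query, clicked document) pairs and then uses them to expand a new query. It is a simplified co-occurrence estimate, not the model of any specific log-based QE technique; the function names, the whitespace tokenization, and the toy click log are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def build_correlation_model(click_log):
    """Estimate P(document term | query term) from (query, clicked doc) pairs."""
    counts = defaultdict(Counter)
    for query, document in click_log:
        doc_terms = set(document.lower().split())
        for q_term in set(query.lower().split()):
            # Each click contributes one co-occurrence per (q_term, doc term).
            counts[q_term].update(doc_terms)

    model = {}
    for q_term, doc_counts in counts.items():
        total = sum(doc_counts.values())
        model[q_term] = {w: c / total for w, c in doc_counts.items()}
    return model


def expand_query(query, model, k=3):
    """Return up to k expansion terms, scored by summing over query terms."""
    query_terms = query.lower().split()
    scores = Counter()
    for q_term in query_terms:
        for word, prob in model.get(q_term, {}).items():
            if word not in query_terms:
                scores[word] += prob
    # Highest score first; ties broken alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:k]]


# Hypothetical click log: each pair is a query and the text of a clicked result.
click_log = [
    ("cheap flights", "airline tickets discount airfare"),
    ("cheap hotels", "discount hotel rooms"),
    ("flights to paris", "airline tickets paris airfare"),
]
model = build_correlation_model(click_log)
expansion_terms = expand_query("cheap flights", model)
```

The model is built once, offline, and expansion at query time reduces to dictionary lookups. A query whose terms never appear in the log receives no expansion terms at all.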
Despite the effectiveness of the log-based QE techniques, such approaches can suffer from various problems. For instance, data sparseness can impact effectiveness of log-based QE techniques. A significant portion of queries may have few or no clicks in the search logs, consistent with the long-tailed query frequency distribution described by Zipf's law. Moreover, ambiguity of search intent can detrimentally impact log-based QE techniques. For example, a term correlation model may fail to distinguish the search intent of the query term “book” in “school book” from that in “hotel booking”. Although the problem can be partially alleviated by using correlation models based on phrases and concepts, there may be scenarios where the search intent cannot be correctly identified without use of global context. For instance, the query “why six bottles in one wrap” can be about a package, and the intent of the query “Acme baked bread” can concern looking for a bakery in California. In such cases, a (pseudo-)relevant document set of the input query, if available, can be more likely to preserve the original search intent than the global correlation model.