Many query processing tasks, such as query suggestion, query reformulation, and query expansion, implicitly or explicitly calculate and utilize query similarity. In query suggestion, a user may be provided with similar queries that represent the same or a related search intent as the user's current query. Query reformulation attempts to transform the current query into a similar but better-formed query to help a user find more relevant documents. In query expansion, similar terms (e.g., words) are added to an original query in order to improve the relevance of the search results for the original query. In some sense, the expanded query can also be viewed as a query similar to the original query.
Various approaches for measuring query similarity have been proposed for different applications using different data sources. Most of these approaches focus on common or frequent queries. However, in general, the distribution of web search queries follows a heavy-tailed power-law distribution. This implies that a large proportion of queries are in fact rare queries, that is, queries that have been issued infrequently.
User behavior data, including click-through and session data, has been widely accepted as an important data source for accurately calculating query similarity. The basic idea behind the various methods that utilize user behavior data for query similarity calculation is that two queries should be similar to each other if they share similar behavior patterns (e.g., many co-clicked URLs or co-occurring sessions). These methods generally work well on common queries, but often perform poorly on rare queries, as there is typically insufficient user behavior data for rare queries (i.e., data sparsity).
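The co-click idea described above can be illustrated with a minimal sketch. It is not the method of any particular system: the click log, the example queries, and the function names (`click_vector`, `cosine`, `coclick_similarity`) are all hypothetical, and cosine similarity over per-URL click counts is only one of several plausible ways to compare behavior patterns.

```python
from collections import Counter
from math import sqrt

# Hypothetical click-through log of (query, clicked URL) pairs.
CLICK_LOG = [
    ("cheap flights", "flights.example.com"),
    ("cheap flights", "travel.example.com"),
    ("low cost airfare", "flights.example.com"),
    ("low cost airfare", "travel.example.com"),
    ("apple pie recipe", "recipes.example.com"),
]


def click_vector(click_log, query):
    """Count how often each URL was clicked for the given query."""
    return Counter(url for q, url in click_log if q == query)


def cosine(u, v):
    """Cosine similarity of two sparse count vectors; 0.0 if either is empty."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = sqrt(sum(c * c for c in u.values()))
    norm_v = sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def coclick_similarity(click_log, q1, q2):
    """Similarity of two queries based only on the URLs clicked for each."""
    return cosine(click_vector(click_log, q1), click_vector(click_log, q2))
```

In this toy log, "cheap flights" and "low cost airfare" share every clicked URL and so score close to 1, even though they have no words in common. A rare query that never appears in the log has an empty click vector and scores 0.0 against every other query, which is the data sparsity problem in its most extreme form.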
Several methods have been proposed to determine similarities between queries based on user behavior data, but such methods do not overcome the problem of data sparsity on rare queries. Additionally, in an extreme case, some rare queries may be missing from the user behavior data, or may never have been entered into an associated search engine before.
Traditionally, cosine similarity and/or bag-of-words assumptions have been widely employed to measure the similarities between text strings. However, with these traditional techniques, there are two fundamental impediments to accurate similarity calculation: term ambiguity (i.e., terms that are literally the same but semantically different) and term dependency (i.e., terms that are literally different but semantically similar). These issues are exacerbated in query similarity calculation, since queries are typically very short.
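Both impediments can be seen in a minimal sketch of bag-of-words cosine similarity. The whitespace tokenizer, the example queries, and the function names (`bag_of_words`, `bow_cosine`) are assumptions made for illustration only.

```python
from collections import Counter
from math import sqrt


def bag_of_words(text):
    """Lowercase the text and count whitespace-separated terms."""
    return Counter(text.lower().split())


def bow_cosine(a, b):
    """Cosine similarity between the bag-of-words vectors of two strings."""
    u, v = bag_of_words(a), bag_of_words(b)
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm = sqrt(sum(c * c for c in u.values())) * sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0


# Literally different but semantically similar: no shared term, so the score is zero.
same_intent = bow_cosine("cheap flights", "low cost airfare")

# Literally the same term ("apple") used in different senses: the score is non-zero.
different_intent = bow_cosine("apple pie recipe", "apple laptop price")
```

Here `same_intent` is 0.0 although the two queries express the same need, while `different_intent` is roughly one third although the queries are unrelated. With only two or three terms per query, a single matching or missing term dominates the score, which is why short queries make both problems worse.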