This disclosure relates to providing n-gram analysis for search queries.
A search engine allows a user to provide a search query and returns search results in response. Some search engines can analyze the query to identify n-grams. N-grams are groups of words that have a statistically significant probability of appearing adjacent to one another when compared to their statistical chance of appearing next to other words. For example, if a user enters the search query “hot dog,” the user is probably attempting to retrieve information about the bigram “hot dog,” rather than just any document that includes the words “hot” and “dog.” Thus, the terms “hot” and “dog” are constituent terms of a bigram. Search systems commonly use bigram language modeling to identify and weight the occurrence of bigrams within a document (see, e.g., Srikanth, M. and Srihari, R., “Biterm Language Models for Document Retrieval,” Special Interest Group on Information Retrieval '02 (SIGIR '02), Aug. 11-15, 2002; and Song, F. and Croft, W. B., “A General Language Model for Information Retrieval,” Conference on Information and Knowledge Management '99 (CIKM '99)).
However, identifying n-grams (e.g., bigrams) can be computationally intensive when a search query includes many terms. For example, a query containing five terms can describe four potential bigrams, one for each pair of adjacent terms, and each of the potential bigrams is analyzed to determine whether it is in fact a bigram. Inspecting each of the potential bigrams can be inefficient. Moreover, traditional bigram analysis assumes complete sentences, correct grammar, etc., whereas search queries are often expressed as a sequence of keywords. It can therefore be difficult to determine whether two consecutive words within a search query are intended to be an n-gram or separate keywords.
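To illustrate the counting argument, the following minimal Python sketch enumerates the potential bigrams in a query. It assumes simple whitespace tokenization, and the function name `candidate_bigrams` and the sample query are hypothetical and not part of this disclosure. It only lists candidates; deciding whether each candidate is in fact a bigram would still require a separate statistical test.

```python
def candidate_bigrams(query):
    """Return each pair of adjacent terms in a whitespace-tokenized query."""
    terms = query.split()
    # Pair every term with its right-hand neighbor: n terms yield n - 1 pairs.
    return list(zip(terms, terms[1:]))


# Hypothetical five-term query used only for illustration.
pairs = candidate_bigrams("cheap hot dog stand chicago")
for first, second in pairs:
    print(first, second)
```

For a five-term query the sketch yields four candidate pairs, each of which a conventional system would have to inspect individually.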