In a simple information retrieval system, a user typically enters a query comprising one or more query terms and receives a list of documents containing the query terms. Documents that do not contain the query terms are ignored. However, “recall,” or the fraction of the documents relevant to the query that are successfully retrieved, is low for such a simple information retrieval system. As a result, documents that may be of interest to the user may not be identified in response to the query, and thus are never presented to the user.
One technique used to increase recall is known as “stemming,” which involves stripping prefixes or suffixes from a word to reduce it to its stem. Such prefixes and suffixes are common in the English language, and are seen in other languages. Conventionally, stemming is applied when indexing a body of documents. For example, an occurrence of the word “tickets” in a document would be indexed as “ticket.” When a query is provided to the search engine, the query terms are stemmed (also known as “term reduction”) using the same kind of transformation performed during indexing, and the index is accessed using the stemmed query terms. As an example, a search for “ticketing” on a search engine employing stemming would return documents containing the word “ticket” (the stem of “ticketing”) and documents containing the word “tickets” (which has the same stem, “ticket,” as “ticketing”).
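The indexing and query flow described above can be sketched as follows. This is a minimal illustration only: the suffix list, the `stem`, `build_index`, and `search` helpers, and the sample documents are all assumptions made for this example, and a practical stemmer would apply far richer rules.

```python
from collections import defaultdict

# Toy suffix list (an assumption for illustration; not a real stemming algorithm).
SUFFIXES = ("ing", "s")


def stem(word):
    """Strip one known suffix, keeping at least three characters of stem."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_index(docs):
    """Map each stemmed token to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[stem(token)].add(doc_id)
    return index


def search(index, query):
    """Stem each query term the same way and collect matching document IDs."""
    results = set()
    for term in query.split():
        results |= index.get(stem(term), set())
    return results


# Hypothetical sample documents.
docs = {1: "buy a ticket online", 2: "tickets sold out", 3: "train schedule"}
index = build_index(docs)
print(sorted(search(index, "ticketing")))  # prints [1, 2]
```

Because the same transformation is applied at indexing time and at query time, “ticketing,” “ticket,” and “tickets” all resolve to the same index entry.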
Another technique used to increase recall is known as “query expansion,” in which one or more query terms are supplemented with additional related query terms. One known technique for identifying related terms is to analyze the co-occurrence of terms, or their co-occurrence with similar terms, as observed in documents during indexing and in query terms submitted in previous search queries (typically obtained by processing query logs), to produce a thesaurus of semantically related terms. Such a technique may, for example, determine that “plane” and “aircraft” are related, and that “hospital” and “medical” are related. In such an example, a search query including the term “hospital” may be expanded to also include the term “medical.” In some cases, a weighting may be applied to an added term based on the observed pairwise degree of co-occurrence between the original term and the expanded term. Such weighting signals to a result ranking process that a document retrieved based on an expanded term with a low degree of co-occurrence should be ranked lower among the retrieved documents.
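A rough sketch of weighted query expansion is shown below. It assumes, purely for illustration, that the degree of co-occurrence is measured as the Jaccard overlap between the sets of documents containing each term; the text above does not prescribe a particular measure. The function names, the `min_weight` threshold, and the sample documents are likewise hypothetical.

```python
from itertools import combinations


def cooccurrence_weights(docs):
    """Pairwise term weights: Jaccard overlap of the document sets per term.

    Jaccard is an assumed measure chosen for this sketch.
    """
    term_docs = {}
    for doc_id, text in enumerate(docs):
        for term in set(text.lower().split()):
            term_docs.setdefault(term, set()).add(doc_id)
    weights = {}
    for a, b in combinations(sorted(term_docs), 2):
        shared = len(term_docs[a] & term_docs[b])
        if shared:
            w = shared / len(term_docs[a] | term_docs[b])
            weights[(a, b)] = weights[(b, a)] = w
    return weights


def expand_query(terms, weights, min_weight=0.5):
    """Add related terms; each added term carries its co-occurrence weight."""
    expanded = {term: 1.0 for term in terms}  # original terms keep full weight
    for (a, b), w in weights.items():
        if a in terms and b not in terms and w >= min_weight:
            expanded[b] = max(expanded.get(b, 0.0), w)
    return expanded


def score(doc, expanded):
    """Sum the weights of the expanded-query terms present in the document."""
    tokens = set(doc.lower().split())
    return sum(w for term, w in expanded.items() if term in tokens)


# Hypothetical corpus standing in for indexed documents and query logs.
docs = [
    "hospital medical staff",
    "medical hospital emergency",
    "plane aircraft engine",
    "aircraft plane hospital",
]
weights = cooccurrence_weights(docs)
expanded = expand_query(["hospital"], weights)
print(expanded)
```

With this sample data, “medical” is added to a query for “hospital” with a weight below 1.0, so a document matching only “medical” scores lower than one matching “hospital” itself.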
Although stemming and query expansion each generally increase recall, they also generally result in reduced “precision,” or the fraction of the documents retrieved that are relevant to the query. As a result, a search may return many documents that are not of interest to the user. There is a need to improve search results by increasing recall while avoiding this loss of precision, and/or to improve the ranking of the search results.
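The trade-off between the two measures can be made concrete with a small calculation. In the sketch below, the `precision_recall` helper and the document IDs are made up for illustration: `exact` stands for the result set of a simple term match, and `broadened` for the larger result set after stemming or query expansion.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical document IDs, chosen only to illustrate the trade-off.
relevant = {1, 2, 3, 4}          # documents actually relevant to the query
exact = {1, 2}                   # simple term match
broadened = {1, 2, 3, 5, 6, 7}   # after stemming or query expansion

print(precision_recall(exact, relevant))      # prints (1.0, 0.5)
print(precision_recall(broadened, relevant))  # prints (0.5, 0.75)
```

In this example, broadening the query raises recall from 0.5 to 0.75 but lowers precision from 1.0 to 0.5, because three of the six retrieved documents are not relevant.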