With the ever-increasing amount of data stored on various servers, the task of efficient information retrieval becomes ever more imperative. Taking the Internet as an example, there are millions and millions of resources available, and several search engines (such as GOOGLE™, YAHOO!™, YANDEX™, BAIDU™, and the like) aim to provide users with a convenient tool for finding relevant information that is responsive to a user's search intent.
A typical search engine server executes a crawling function. More specifically, the search engine executes a robot that “visits” various resources available on the Internet and indexes their content. Specific algorithms and schedules for the crawling robots vary, but at a high level, the main goal of the crawling operation is to (i) identify a particular resource on the Internet, (ii) identify key themes associated with the particular resource (themes being represented by key words and the like), and (iii) index the key themes to the particular resource.
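By way of a non-limiting illustration, the indexing step (iii) can be sketched as an inverted index that maps each key word to the resources containing it. The function name `build_index`, the example URLs, and the whitespace tokenization are assumptions made for this sketch only; an actual crawler would fetch and parse the resources and would identify key themes in a more elaborate manner.

```python
from collections import defaultdict


def build_index(resources):
    """Map each key word to the set of resource identifiers containing it."""
    index = defaultdict(set)
    for url, text in resources.items():
        # Assumption: key themes are approximated by lower-cased words.
        for word in text.lower().split():
            index[word].add(url)
    return index


# Hypothetical crawled resources (URL -> extracted text).
resources = {
    "http://example.com/a": "izakaya in Montreal",
    "http://example.com/b": "gastropub in London",
}
index = build_index(resources)
print(sorted(index["in"]))
```

Looking up a query term in `index` then yields the candidate resources that a search ranker would subsequently order.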
Once a search query from the user is received by the search engine, the search engine identifies all the crawled resources that are potentially related to the user's search query. The search engine then executes a search ranker to rank the so-identified potentially relevant resources. The key goal of the search ranker is to organize the identified search results by placing the potentially most relevant search results at the top of the search engine results list.
A typical search query comprises a string of words typed by the user. However, users often fail to select effective terms when typing the string of words. For example, an English gourmet lover desirous of expanding his culinary experience may enter the search query “Japanese gastropub in Montreal”, whereas most of the relevant pages are indexed with the term “izakaya” rather than “gastropub”. Thus, documents that satisfy the user's information needs may use terms that differ from the specific query terms used by the user.
Generally speaking, there exist a few types of computer-based approaches to modifying or expanding the query terms to better match the user's search intent. For example, a simple approach is to use a pre-constructed semantic database, such as a thesaurus database. However, the construction of the thesaurus database is expensive and is generally restricted to one language.
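As a minimal sketch of the thesaurus-based approach, the following example expands a query using a pre-constructed mapping of terms to related terms. The `THESAURUS` contents and the function name `expand_query` are hypothetical and serve only to illustrate the idea using the query from the example above.

```python
# Hypothetical, hand-built thesaurus database (term -> related terms).
THESAURUS = {"gastropub": ["izakaya"]}


def expand_query(query, thesaurus):
    """Return the query terms followed by any thesaurus terms not already present."""
    terms = query.lower().split()
    expanded = list(terms)
    for term in terms:
        for synonym in thesaurus.get(term, []):
            if synonym not in expanded:
                expanded.append(synonym)
    return expanded


expanded = expand_query("Japanese gastropub in Montreal", THESAURUS)
print(expanded)
```

The sketch also makes the stated limitation visible: every entry in `THESAURUS` has to be authored in advance, and separately for each language.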
U.S. Pat. No. 7,890,521 discloses a system that automatically generates synonyms for words from documents. During operation, this system determines co-occurrence frequencies for pairs of words in the documents. The system also determines closeness scores for pairs of words in the documents, wherein a closeness score indicates whether a pair of words are located so close to each other that the words are likely to occur in the same sentence or phrase. Finally, the system determines whether pairs of words are synonyms based on the determined co-occurrence frequencies and the determined closeness scores. While making this determination, the system can additionally consider correlations between words in a title or an anchor of a document and words in the document as well as word-form scores for pairs of words in the documents.
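For illustration only, the two statistics described above may be approximated as follows. This sketch is not the method of the cited patent: the document-level co-occurrence count, the token `window` used as a proxy for "same sentence or phrase", and the function name `pair_statistics` are all assumptions, and the sketch does not attempt to reproduce how the cited system combines the scores into a synonym decision.

```python
from collections import Counter
from itertools import combinations


def pair_statistics(documents, window=3):
    """Count, per word pair, co-occurring documents and the share in which the pair is close."""
    cooccur = Counter()  # number of documents containing both words
    close = Counter()    # number of those documents where the words are within `window` tokens
    for doc in documents:
        positions = {}
        for i, token in enumerate(doc.lower().split()):
            positions.setdefault(token, []).append(i)
        # Pairs are stored in alphabetical order so each pair has one key.
        for a, b in combinations(sorted(positions), 2):
            cooccur[(a, b)] += 1
            if any(abs(i - j) <= window for i in positions[a] for j in positions[b]):
                close[(a, b)] += 1
    closeness = {pair: close[pair] / count for pair, count in cooccur.items()}
    return cooccur, closeness


documents = ["car rental deals", "auto rental deals"]
cooccur, closeness = pair_statistics(documents, window=1)
print(cooccur[("deals", "rental")], closeness[("car", "deals")])
```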
U.S. Pat. No. 9,158,841 discloses a method of evaluating semantic differences between a first item in a first semantic space and a second item in a second semantic space. The method includes: calculating a first ordered list of N nearest neighbors of the first item within the first semantic space; calculating a second ordered list of N nearest neighbors of the second item within the second semantic space; and computing a plurality of similarity measures between the first n nearest neighbors of the first item and the first n nearest neighbors of the second item, wherein n and N are positive integers and 1≤n≤N.
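The comparison of nearest-neighbor lists can be sketched as shown below. The use of cosine similarity to find neighbors, the use of Jaccard overlap as the similarity measure for each n, the toy two-dimensional vectors, and the function names are assumptions made for this illustration; the cited patent is not limited to, and may not use, these particular choices.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbors(item, space, N):
    """Ordered list of the N items in `space` most similar to `item`."""
    others = [w for w in space if w != item]
    # Ties are broken alphabetically so the ordering is deterministic.
    others.sort(key=lambda w: (-cosine(space[item], space[w]), w))
    return others[:N]


def overlap_profile(list_a, list_b):
    """Jaccard overlap of the first n neighbors of each list, for n = 1..N."""
    N = min(len(list_a), len(list_b))
    profile = []
    for n in range(1, N + 1):
        a, b = set(list_a[:n]), set(list_b[:n])
        profile.append(len(a & b) / len(a | b))
    return profile


# Hypothetical semantic spaces in which "bank" is used in different senses.
space_1 = {"bank": (1.0, 0.0), "money": (0.9, 0.1), "loan": (0.8, 0.3), "river": (0.0, 1.0)}
space_2 = {"bank": (0.0, 1.0), "money": (1.0, 0.0), "loan": (0.9, 0.2), "river": (0.1, 1.0)}

neighbors_1 = nearest_neighbors("bank", space_1, N=3)
neighbors_2 = nearest_neighbors("bank", space_2, N=3)
profile = overlap_profile(neighbors_1, neighbors_2)
print(neighbors_1, neighbors_2, profile)
```

In this sketch, low overlap at small n indicates that the item's closest neighbors differ between the two spaces, even though the overlap necessarily reaches 1 once n covers the whole shared vocabulary.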
US2015/0046152 discloses a method for generating a set of concept blocks, wherein the concept blocks are words in a corpus of documents that can be processed to extract trends, build an efficient inverted search index, or generate a summary report of the content. The method entails generating a plurality of target words from the corpus, determining context strings for the target words, obtaining pattern types that are based on the number of words and the position of words relative to the target words, and assigning weights to each of the context strings having a particular pattern type. The target words are then expressed as vectors that reflect the weights of the context strings. The vectors are compared and grouped into clusters based on similarity. Target words in the resulting clusters are concept blocks. A subgroup of clusters may be selected for another iteration of the process to catch new concept blocks.
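A simplified sketch of expressing target words as weighted context vectors and grouping them by similarity is given below. It is not the method of the cited application: the two pattern types (one word to the left, one word to the right), the values in `PATTERN_WEIGHTS`, the cosine threshold, and the greedy grouping are all assumptions chosen to keep the example short.

```python
import math
from collections import Counter

# Hypothetical pattern types and weights: one word to the left or right of the target.
PATTERN_WEIGHTS = {"left": 1.0, "right": 1.0}


def context_vector(target, sentences):
    """Sparse vector of weighted context strings observed around `target`."""
    vec = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, token in enumerate(tokens):
            if token != target:
                continue
            if i > 0:
                vec[("left", tokens[i - 1])] += PATTERN_WEIGHTS["left"]
            if i + 1 < len(tokens):
                vec[("right", tokens[i + 1])] += PATTERN_WEIGHTS["right"]
    return vec


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


def cluster(targets, sentences, threshold=0.5):
    """Greedily group targets whose vectors are similar to a cluster's first member."""
    vectors = {t: context_vector(t, sentences) for t in targets}
    clusters = []
    for t in targets:
        for group in clusters:
            if cosine(vectors[t], vectors[group[0]]) >= threshold:
                group.append(t)
                break
        else:
            clusters.append([t])
    return clusters


sentences = ["the izakaya serves sake", "the gastropub serves beer", "a crawler indexes pages"]
clusters = cluster(["izakaya", "gastropub", "crawler"], sentences)
print(clusters)
```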