Searching the Internet to locate relevant documents and advertisements can be challenging because search queries and web documents/advertisements often use different language styles and vocabularies. Current Internet search technologies suffer from two related issues. Often, a query contains terms that are different from, but related to, the terms in the relevant document, which leads to a well-known information retrieval problem referred to as the lexical gap problem. In addition, when a query contains terms that have multiple meanings and are therefore ambiguous, a search engine may retrieve many documents that do not match the user's intent, which may be referred to as the noisy proliferation problem. Both of these issues are substantially more prevalent in Internet search because search queries and web documents are composed by a large variety of people and in very different language styles.
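By way of illustration only, the two problems described above can be reproduced with a minimal sketch of exact term matching. The tokenizer, the scoring function, and the toy queries and documents are all hypothetical and are not taken from any particular search system.

```python
def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def exact_match_score(query, document):
    """Count the query terms that appear verbatim in the document."""
    doc_terms = set(tokenize(document))
    return sum(1 for term in tokenize(query) if term in doc_terms)


# Lexical gap: the document is relevant but shares no terms with the query.
lexical_gap_query = "cheap laptop"
relevant_doc = "affordable notebook computers on sale"
gap_score = exact_match_score(lexical_gap_query, relevant_doc)

# Noisy proliferation: an ambiguous term ("jaguar") matches documents about
# different senses equally well, so the unintended one is retrieved too.
ambiguous_query = "jaguar speed"
intended_doc = "top speed of the jaguar big cat"
unintended_doc = "jaguar sports car speed and performance"
intended_score = exact_match_score(ambiguous_query, intended_doc)
unintended_score = exact_match_score(ambiguous_query, unintended_doc)

print(gap_score, intended_score, unintended_score)
```

In this toy setting the relevant document receives no credit because "cheap" and "affordable" (or "laptop" and "notebook") are different strings, while the two "jaguar" documents are indistinguishable by term overlap alone.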
Typical information retrieval methods developed in the research community, in spite of their state-of-the-art performance on benchmark datasets (e.g., the Text Retrieval Conference (TREC) collections), are based on bag-of-words and exact term matching schemes, and cannot deal with these issues effectively. Some methods employ ad-hoc measures that tend to worsen the noisy proliferation problem. Although several approaches have been proposed to determine relationships between the terms in queries and the terms in documents, most of these approaches rely on inadequate measures of term similarity (e.g., cosine similarity) computed from term co-occurrences across queries and documents. For example, in a paid search system, it is desirable to locate documents (which may include advertisements) that are relevant to the search query and of potential interest to the user, so that users are more likely to click on them; however, known techniques often return irrelevant documents because of the lexical gap problem and/or the noisy proliferation problem caused by a language discrepancy between document content and the search query.
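For illustration, a co-occurrence-based term similarity of the general kind mentioned above may be sketched as follows. This is a simplified, assumed formulation and not the method of any specific prior approach: each term is represented by its occurrence counts over a small, hypothetical collection of texts, and two terms are compared by cosine similarity.

```python
import math
from collections import Counter


def term_vectors(texts):
    """Map each term to a Counter of {text index: occurrence count}."""
    vectors = {}
    for idx, text in enumerate(texts):
        for term in text.lower().split():
            vectors.setdefault(term, Counter())[idx] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[key] for key, count in u.items() if key in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical mix of queries and document snippets.
texts = [
    "cheap laptop deals",
    "cheap notebook deals",
    "laptop notebook comparison",
    "cheap flights",
]

vectors = term_vectors(texts)
sim_laptop_notebook = cosine(vectors["laptop"], vectors["notebook"])
sim_laptop_flights = cosine(vectors["laptop"], vectors["flights"])
print(sim_laptop_notebook, sim_laptop_flights)
```

In this sketch the score depends entirely on which texts two terms happen to share: "laptop" and "notebook" are credited only for the single text that contains both, which suggests why such a measure can be a weak signal of how query terms relate to document terms.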