1. Technical Field
The disclosed embodiments relate to systems and methods for normalizing query words in web search and, more specifically, for handling abbreviations detected in queries by expanding them when doing so is likely to improve search results.
2. Related Art
Internet advertising is a multi-billion-dollar industry that has been growing at double-digit rates in recent years. It is also the major revenue source for Internet companies, such as Yahoo! of Sunnyvale, Calif. or Google of Mountain View, Calif., which provide advertising networks that connect advertisers, publishers, and Internet users. A major portion of that revenue has historically come from sponsored search advertisements and other advertising related to search through search engines.
A search engine is a computer program running on a server that helps a user to locate information. Using a search engine, a user can enter one or more search query terms and obtain a list of resources that contain or are associated with subject matter that matches those search query terms. While search engines may be applied in a variety of contexts, search engines are especially useful for locating resources that are accessible through the Internet. Resources that may be located through a search engine include, for example, files whose content is composed in a page description language such as Hypertext Markup Language (HTML). Such files are typically called pages. One can use a search engine to generate a list of Uniform Resource Locators (URLs) and/or HTML links to files, or pages, that are likely to be of interest.
Some search engines order a list of web pages before presenting the list to a user. To order a list of web pages, a search engine may assign a rank to each web page in the list. When the list is sorted by rank, a web page with a relatively higher rank may be placed closer to the head of the list than a web page with a relatively lower rank. The user, when presented with the sorted list, sees the most highly ranked web pages first. To aid the user in the search, a search engine may rank the web pages according to relevance. Relevance is a measure of how closely the subject matter of the web page matches the query terms.
To find the most relevant web pages, search engines typically try to select, from among a plurality of web pages, web pages that include many or all of the words that a user entered into a search request. Unfortunately, the web pages in which a user may be most interested are too often web pages that do not literally include the words that the user entered into the search request. If the user has misspelled a word in the search request, then the search engine may fail to select web pages in which the correctly spelled word occurs. Typically, eight to ten percent of queries to web search engines have at least one query term that is misspelled. Abbreviations pose a similar problem although they are technically not misspellings: an abbreviated word in a query may not be recognized, or may not appear in its abbreviated form in many of the web pages.
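By way of illustration only, the limitation of literal word matching described above may be sketched as follows. The relevance measure (the fraction of query words that literally occur in a page), the function names, the URLs, and the page contents are all hypothetical assumptions made for this example and do not describe any particular search engine.

```python
def relevance(query, page_text):
    """Assumed toy measure: fraction of query words that literally occur in the page."""
    query_words = query.lower().split()
    page_words = set(page_text.lower().split())
    if not query_words:
        return 0.0
    hits = sum(1 for word in query_words if word in page_words)
    return hits / len(query_words)


def rank_pages(query, pages):
    """Return (url, score) pairs sorted so the highest-scoring page comes first."""
    scored = [(url, relevance(query, text)) for url, text in pages.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical pages: the first spells the word out, the second uses the abbreviation.
PAGES = {
    "http://example.com/dmv": "department of motor vehicles office hours",
    "http://example.com/dept-store": "dept store opening hours",
}

# The abbreviated query favors the page that happens to contain "dept" literally.
abbreviated = rank_pages("dept hours", PAGES)

# The same query with the abbreviation expanded favors the other page.
expanded = rank_pages("department hours", PAGES)
```

Under these assumptions, the page that spells out "department" matches only one of the two words in the abbreviated query, and it moves to the head of the list only once the abbreviation is expanded.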
The core, or organic, search results are usually based on some relevancy model, while other parts of the search results web page are set apart for sponsored search advertisements, which advertisers pay to have returned with the organic search results for specific keywords. Without relevant results, however, user satisfaction with a search engine is likely to decline and, with it, the number of advertisers interested in sponsored search listings that target those users. Accordingly, a search engine needs to return results as relevant as possible to the entered search terms, even when a query contains an abbreviation that may be poorly recognized throughout the web pages available to the search engine.
Mining text associations, especially word associations, is important in Information Retrieval (IR) for achieving semantic matches instead of literal word matches. Most automatic text association-finding methods are based on word co-occurrence information. Though these techniques are effective in document modeling, they often fail in query modeling because of (i) lack of information in queries, (ii) noise in data resources, especially for web data, and (iii) the difficulty of achieving precise text associations; e.g., it may be easy to associate “apple” with “fruit” but hard to associate specifically “the most popular Japanese apple” with “Fuji,” even though the two have the same search intent.
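For illustration, one common co-occurrence measure, pointwise mutual information (PMI), may be sketched as follows. PMI is chosen here only as a familiar example of a co-occurrence-based association score; it is not asserted to be the measure used by any method discussed above. The function name and the four-document corpus are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def cooccurrence_pmi(documents):
    """Score word pairs by PMI, treating each document as one co-occurrence window."""
    word_counts = Counter()
    pair_counts = Counter()
    for doc in documents:
        # Sort the unique words so each pair is always stored in alphabetical order.
        words = sorted(set(doc.lower().split()))
        word_counts.update(words)
        pair_counts.update(combinations(words, 2))
    n = len(documents)
    pmi = {}
    for (a, b), count in pair_counts.items():
        # PMI = log( P(a, b) / (P(a) * P(b)) ), with probabilities estimated
        # as document frequencies over the corpus.
        joint = count / n
        independent = (word_counts[a] / n) * (word_counts[b] / n)
        pmi[(a, b)] = math.log(joint / independent)
    return pmi


# Hypothetical toy corpus.
DOCS = [
    "apple fruit orchard",
    "apple fruit juice",
    "fuji apple japan",
    "fuji mountain japan",
]

scores = cooccurrence_pmi(DOCS)
```

In this toy corpus the pair ("apple", "fruit") receives a positive score while ("apple", "fuji") receives a negative one, which is consistent with the observation that broad associations are easier to recover from co-occurrence than precise ones. A word-level score of this kind also has no representation for a phrase such as “the most popular Japanese apple”.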