1. Technical Field
The disclosed embodiments relate to systems and methods for normalizing query words in web search, and more specifically, to reformulating queries received from users before submission to a search engine to account for split words, joined words, hyphens, and apostrophes.
2. Related Art
Internet advertising is a multi-billion dollar industry that has been growing at double-digit rates in recent years. It is also the major revenue source for Internet companies, such as Yahoo!® or Google®, which provide advertising networks that connect advertisers, publishers, and Internet users. A major portion of that revenue has historically come from sponsored search advertisements and other advertising related to search through search engines.
A search engine is a computer program running on a server that helps a user to locate information. Using a search engine, a user can enter one or more search query terms and obtain a list of resources that contain or are associated with subject matter that matches those search query terms. While search engines may be applied in a variety of contexts, search engines are especially useful for locating resources that are accessible through the Internet. Resources that may be located through a search engine include, for example, files whose content is composed in a page description language such as Hypertext Markup Language (HTML). Such files are typically called pages. One can use a search engine to generate a list of Uniform Resource Locators (URLs) and/or HTML links to files, or pages, that are likely to be of interest.
Some search engines order a list of files (or web pages) before presenting the list to a user. To order the list, a search engine may assign a rank to each web page in the list. When the list is sorted by rank, a web page with a relatively higher rank may be placed closer to the head of the list than a web page with a relatively lower rank. The user, when presented with the sorted list, sees the most highly ranked web pages first. To aid the user in the search, a search engine may rank the web pages according to relevance. Relevance is a measure of how closely the subject matter of a web page matches the query terms.
To find the most relevant web pages, search engines typically try to select, from among a plurality of web pages, files that include many or all of the words that a user entered into a search request. Unfortunately, the web pages in which a user may be most interested are too often web pages that do not literally include the words that the user entered into the search request. If the user has misspelled a word in the search request, then the search engine may fail to select files in which the correctly spelled word occurs. Typically, eight to ten percent of queries to web search engines have at least one query term that is misspelled.
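The ranking and literal-matching behavior described above can be illustrated with a minimal sketch. The function names `relevance` and `rank`, the term-overlap score, and the example URLs are assumptions made only for illustration; actual search engines use far richer relevance models.

```python
def relevance(query: str, page_text: str) -> int:
    """Count how many distinct query terms appear literally in the page text."""
    page_terms = set(page_text.lower().split())
    return sum(1 for term in set(query.lower().split()) if term in page_terms)


def rank(query: str, pages: dict[str, str]) -> list[tuple[str, int]]:
    """Return (url, score) pairs sorted by descending score.

    Pages that match no query term are dropped, mirroring a purely
    literal selection of pages.
    """
    scored = [(url, relevance(query, text)) for url, text in pages.items()]
    matched = [(url, score) for url, score in scored if score > 0]
    # Sort by score (highest first), then by URL for a stable order.
    return sorted(matched, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical pages used only for this illustration.
pages = {
    "a.example": "cheap digital camera reviews",
    "b.example": "camera repair shop",
}

well_spelled = rank("digital camera", pages)
misspelled = rank("digtal camra", pages)
```

With these hypothetical pages, the correctly spelled query ranks the page containing both terms first, while the misspelled query selects no pages at all, because neither misspelled term occurs literally in any page.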
The core, or organic, search results are usually based on some relevancy model, while other parts of the search results web page are set apart for sponsored search advertisements, which advertisers pay to have returned with the organic search results for specific keywords. If a search engine does not return relevant results, however, user satisfaction with the search engine is likely to decline and, with it, the interest of advertisers in sponsored search listings that target those users.
Accordingly, a search engine needs to return results that are as relevant as possible to the entered search terms, regardless of whether a searching user properly spells or types the terms of his or her search. Models have been developed to correct or normalize misspelled or mistyped terms, a process also referred to as canonicalization. The canonicalization of search terms used in queries and search listings removes common irregularities in the search terms entered by searchers and web site promoters, such as capital letters and pluralizations, in order to generate relevant results.
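As a minimal sketch of the kind of canonicalization described above, the following lowercases each term and strips a simple trailing plural. The function names and the deliberately naive plural rule are assumptions for illustration only; they are not taken from any particular system.

```python
def canonicalize_term(term: str) -> str:
    """Lowercase a term and strip a simple trailing plural 's' (naive rule)."""
    term = term.strip().lower()
    # Naive plural handling: drop one trailing "s", but leave short words
    # and words ending in "ss" (e.g. "glass") untouched.
    if len(term) > 3 and term.endswith("s") and not term.endswith("ss"):
        term = term[:-1]
    return term


def canonicalize_query(query: str) -> list[str]:
    """Canonicalize every whitespace-separated term of a query."""
    return [canonicalize_term(term) for term in query.split()]
```

Under this rule, a query such as "Digital Cameras" and a listing keyword such as "digital camera" reduce to the same term list, so they can be matched despite differences in case and number.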
The algorithms and systems for dealing with misspellings, however, do not handle certain special cases of irregularities well, and re-programming or re-designing such algorithms and systems would take significant time, effort, and expense. Such irregularities include, for instance, split words that should be joined, joined words that should be split, and words that should include or exclude a hyphen or an apostrophe.
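To make the four special cases concrete, the following sketch enumerates candidate rewrites of a query for each case. It is an illustration of the problem only, not the disclosed method: the function names are hypothetical, the `vocabulary` set stands in for whatever dictionary or query-log evidence a real system would consult, and nothing here decides which candidate is actually correct.

```python
def join_variants(terms: list[str]) -> list[str]:
    """Join each adjacent pair of terms, directly and with a hyphen."""
    out = []
    for i in range(len(terms) - 1):
        for sep in ("", "-"):
            merged = terms[i] + sep + terms[i + 1]
            out.append(" ".join(terms[:i] + [merged] + terms[i + 2:]))
    return out


def split_variants(terms: list[str], vocabulary: set[str]) -> list[str]:
    """Split a term in two wherever both halves are known words."""
    out = []
    for i, term in enumerate(terms):
        for cut in range(1, len(term)):
            left, right = term[:cut], term[cut:]
            if left in vocabulary and right in vocabulary:
                out.append(" ".join(terms[:i] + [left, right] + terms[i + 1:]))
    return out


def punctuation_variants(terms: list[str]) -> list[str]:
    """Replace hyphens with a space or nothing, and drop apostrophes."""
    out = []
    for i, term in enumerate(terms):
        alternatives = set()
        if "-" in term:
            alternatives.add(term.replace("-", " "))
            alternatives.add(term.replace("-", ""))
        if "'" in term:
            alternatives.add(term.replace("'", ""))
        # Sort so the output order is deterministic.
        for alt in sorted(alternatives):
            out.append(" ".join(terms[:i] + [alt] + terms[i + 1:]))
    return out
```

For example, `join_variants(["note", "book"])` proposes "notebook" and "note-book", while `punctuation_variants(["e-mail"])` proposes "e mail" and "email". Choosing among such candidates is the hard part and is outside the scope of this sketch.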