This specification relates to query expansion for users submitting queries to search engines.
Search engines, and Internet search engines in particular, aim to identify resources (e.g., web pages, images, text documents, multimedia content) that are relevant to a user's needs and to present information about the resources in a manner that is most useful to the user. Internet search engines return search results in response to a user-submitted query. If a user is dissatisfied with the search results returned for a query, the user can attempt to refine the query to better match the user's needs.
Some search engines provide a user with suggested alternative queries, for example, expanded queries, that the search engine identifies as being related to the user's query. Techniques for finding synonyms of query words for query expansion typically depend on natural language models or user search log data. The identified synonyms of query words can be used to expand a query in an attempt to identify additional or more relevant resources and thereby improve the user's search experience.
Electronic documents are written in many different languages. Each language is normally expressed in a particular writing system (i.e., a script), which is usually characterized by a particular alphabet. For example, the English language is expressed using the Latin alphabet while the Hindi language is normally expressed using the Devanāgarī alphabet. The scripts used by some languages include a particular alphabet that has been extended to include additional marks or characters. In transliteration, the script of one language is used to represent words normally written in the script of another language. For example, a transliterated term can be a term that has been converted from one script to another script or a phonetic representation in one script of a term in another script. Techniques for finding synonyms of query words for query expansion may not work well for finding synonyms of query terms that are transliterated terms. For example, current natural language techniques do not work well with transliterated data, and search log data typically provide poor coverage for most transliterated variations.
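By way of illustration only, the following minimal Python sketch shows why transliterated terms can be difficult for synonym lookup. It assumes a hypothetical helper, scripts_in, that infers the script of a term from Unicode character names, and a hypothetical synonym table keyed on a single Latin-script spelling of the Hindi word for "thank you". The spellings and table contents are illustrative assumptions and are not drawn from any actual search log or from any particular implementation.

```python
import unicodedata


def scripts_in(term):
    """Return the set of script names used by the letters in `term`.

    The script is approximated by the first word of each character's
    Unicode name (e.g. 'LATIN', 'DEVANAGARI'). Combining marks and
    vowel signs are skipped because str.isalpha() is False for them.
    """
    return {unicodedata.name(ch).split()[0] for ch in term if ch.isalpha()}


# A Hindi term in its native script and three plausible Latin-script
# transliterations of it (illustrative spellings, assumed for this sketch).
native_term = "धन्यवाद"
variants = ["dhanyavaad", "dhanyavad", "dhanyawad"]

# Hypothetical synonym table that happens to know only one spelling.
synonyms = {"dhanyavaad": ["thanks"]}

# An exact-match lookup treats every other spelling as an unknown term.
missed = [v for v in variants if v not in synonyms]

print(scripts_in(native_term))
print(scripts_in(variants[0]))
print(missed)
```

In this sketch, each variant is written in the Latin script yet represents the same Devanāgarī term, and a lookup keyed on exact spelling finds synonyms for only one of the three variants.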