Uniform Resource Locators (URLs) corresponding to web pages have been shown to contain useful information for measuring the relevance of web pages to search queries. A great deal of work has addressed the use of URLs to improve the quality of search-result relevance ranking. This work has traditionally focused on western-language web pages, because URLs are composed of strings of characters from the US-ASCII character set (referred to herein as encoding characters) and the alphabets of those languages can be represented by such characters.
For languages that include characters that are not allowed for use in URLs (i.e., “non-encoding characters” (NECs), as found in, for example, Chinese, Japanese, Korean, and other similar languages), matching queries to URLs tends to be difficult because the URLs are represented by encoding characters. To utilize URLs more effectively for relevance ranking in NEC-language markets, it is desirable that a search query and the corresponding URLs be represented in the same format. A consistent format can be achieved in one of two ways. The first is to alter the query at online serving time: the NEC query is converted into English words, pinyin representations (i.e., romanized pronunciations of Chinese characters), numeric characters, or a combination of these, based on a mapping table built offline according to rules of similar meaning or pronunciation between the NEC words and their corresponding encoding-language forms. The other, more robust, approach is to transform meaningful parts of the URL into NEC words and build the transformed URL into the web index during index generation.
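The two approaches can be illustrated with a minimal Python sketch. It is not an implementation taken from this description: the mapping table `NEC_TO_ENCODING`, its entries, the function names `convert_query` and `transform_url`, and the example URL are all hypothetical, and the query is assumed to be already segmented into words.

```python
import re
from urllib.parse import urlsplit

# Hypothetical mapping table, assumed to be built offline. Each NEC word maps
# to encoding-character forms related by meaning (English) or by
# pronunciation (pinyin). The entries are illustrative only.
NEC_TO_ENCODING = {
    "新浪": ["sina", "xinlang"],
    "新闻": ["news", "xinwen"],
    "体育": ["sports", "tiyu"],
}

# Inverted table for the second approach: encoding form -> NEC word.
ENCODING_TO_NEC = {
    form: nec for nec, forms in NEC_TO_ENCODING.items() for form in forms
}


def convert_query(query_terms):
    """First approach: at serving time, replace each NEC query term with its
    encoding-character forms. Terms missing from the table pass through."""
    converted = []
    for term in query_terms:
        converted.extend(NEC_TO_ENCODING.get(term, [term]))
    return converted


def transform_url(url):
    """Second approach: at index time, split the host and path of a URL into
    tokens and keep the NEC words for the tokens found in the table."""
    parts = urlsplit(url)
    tokens = re.split(r"[^a-z0-9]+", (parts.netloc + parts.path).lower())
    return [ENCODING_TO_NEC[t] for t in tokens if t in ENCODING_TO_NEC]


if __name__ == "__main__":
    print(convert_query(["新浪", "体育"]))
    print(transform_url("http://sports.sina.com.cn/news/index.html"))
```

In this sketch the first function expands the query into encoding-character terms that can be matched against raw URLs, while the second yields NEC words that could be stored alongside the page in the index, so that the original NEC query can be matched without any change at serving time.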