Geocoding, that is, determining a geographic match for a text search, is a relatively well-studied problem. Numerous web-based and commercial mapping products are used for route planning, fuel/cost estimation, and simple travel planning. Such products include Google Maps™, Yahoo Maps™, and Windows Live Local™. Each of these products uses the same language (such as English) in both the underlying database and the user interface. Attempts to geocode location queries in a language different from that of the underlying database, particularly when the query language uses a different script or alphabet, have a very low success rate, if they succeed at all. For example, queries in Hindi, Arabic, or Japanese made against an English-language geo-database may have a very low success rate.
Geocoding also involves more than one type of query. One type is a structured address. For example, 233 South Wacker Drive, Chicago, Ill., 60606 is well formed in a conventional U.S. address format, with no misspellings. Such queries, in English, usually return accurate results. The other type is an unstructured query, such as Sears Tower or the Loop, which may also return accurate results for some well-known landmarks or features.
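The distinction between the two query types can be illustrated with a minimal sketch. The pattern below is an assumption made for illustration only: it recognizes one simplified form of a conventional U.S. address (number, street, city, state, ZIP code) and treats everything else as unstructured. The names US_ADDRESS and parse_query are hypothetical, and addresses in practice vary far more than this pattern allows.

```python
import re

# Simplified, hypothetical pattern for "number street, city, state, ZIP".
# It is an illustration, not a complete U.S. address grammar.
US_ADDRESS = re.compile(
    r"""^\s*
    (?P<number>\d+)\s+              # house number
    (?P<street>[^,]+?)\s*,\s*       # street name up to the first comma
    (?P<city>[^,]+?)\s*,\s*         # city up to the next comma
    (?P<state>[A-Za-z.]{2,})\s*,?\s*  # state name or abbreviation
    (?P<zip>\d{5}(?:-\d{4})?)\s*$   # 5-digit ZIP, optional +4 extension
    """,
    re.VERBOSE,
)


def parse_query(query):
    """Return address fields for a structured query, or None if unstructured."""
    match = US_ADDRESS.match(query)
    return match.groupdict() if match else None


structured = parse_query("233 South Wacker Drive, Chicago, Ill., 60606")
unstructured = parse_query("Sears Tower")
print(structured)
print(unstructured)
```

Under these assumptions, the structured address is split into named fields, while a landmark name such as Sears Tower yields no fields and would have to be resolved by some other means, such as a landmark lookup.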
However, several factors can greatly reduce the accuracy of results for geocoding queries. One factor is ill-formed queries, with data either missing or in a non-standard sequence. Another factor is misspellings in the query. A third factor is queries in a language different from that of the underlying database, which may be further complicated when the query language uses an alphabet or character set different from that of the underlying geographic database.
Other factors that reduce the accuracy of results for geocoding queries include different address formats across national boundaries, extraneous terms (that do not match anything), and non-unique identifiers (for example, there are over 1000 “1st Cross” roads in Bangalore, India).
Geographic data, and particularly map data, are intrinsically tied to given regions, and hence are available predominantly in local languages. In addition, business, resource, and interoperability considerations often dictate that such data are created only for a small set of languages. Yet in today's increasingly globalized world, there is a clear need for accessing geographic information across languages. Examples range from Indian citizens who want to query, in their own local languages, land records traditionally created in English, through crosslingual geographic indexing of documents, to visitors at the 2008 Olympics who will want to find Beijing locations using many languages other than Mandarin Chinese. Despite the clear motivation for crosslingual location searches, to the best of our knowledge, there are no academic or commercial systems that support general crosslingual location search.
A possible approach to crosslingual location search would be to create and represent all geographic entities in all languages, but this is financially and logistically unviable (for example, a country the size of the US has several million unique streets, localities, landmarks, etc., and these are updated on a continual basis). Alternatively, one could use a machine translation/transliteration system to convert the query terms to the target language, and then process the results in a monolingual geocoder in the target language. However, the linguistic ambiguities inherent in the process increase the search space exponentially, because each ambiguous term multiplies the number of candidate queries, and they greatly degrade the accuracy of results. In addition, the fact that descriptions of locations and addresses are structured differently in different regions (or may be unstructured altogether) makes crosslingual location search a particularly difficult challenge.
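The growth of the search space under translation/transliteration ambiguity can be illustrated with a minimal sketch. The candidate spellings below are invented for illustration and are not the output of any actual transliteration system; the names candidates, expand_query, and search_space_size are likewise hypothetical. The sketch assumes only that each query term may map to several target-language spellings and that a monolingual geocoder would have to be tried on every combination.

```python
from itertools import product
from math import prod

# Hypothetical target-language candidates for each term of a three-term query.
# The spellings are illustrative assumptions, not real transliterator output.
candidates = [
    ["mahatma", "mahathma", "mahatama"],
    ["gandhi", "gandi", "ghandi", "gandhee"],
    ["road", "rode"],
]


def expand_query(candidates):
    """Return every query formed by picking one candidate per term."""
    return [" ".join(combo) for combo in product(*candidates)]


def search_space_size(candidates):
    """Number of candidate queries: the product of the per-term counts."""
    return prod(len(options) for options in candidates)


queries = expand_query(candidates)
print(search_space_size(candidates), "candidate queries for",
      len(candidates), "terms")
```

With three, four, and two candidates for the three terms, the geocoder would face 3 × 4 × 2 = 24 candidate queries. With k candidates for each of n terms the count is k to the power n, which is the exponential growth described above.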