This specification relates to transliteration.
Electronic documents are typically written in many different languages. Each language is normally expressed in a particular writing system (e.g., a script), which is usually characterized by a particular alphabet. For example, the English language can be expressed using Latin characters while the Japanese language can be expressed using Katakana characters. The scripts used by some languages include a particular alphabet that has been extended to include additional marks or characters. In transliteration, a first writing system is used to represent words normally represented by a second writing system. For example, a transliterated term can be a term that has been converted from one script to another script or a phonetic representation in one script of a term in another script. Transliterations can differ from translations because the meanings of terms are not reflected in the transliterated terms.
Techniques for extracting transliteration pairs may require annotated training data or language-specific data. For example, conventional techniques for transliteration use rules, which specify that one or more particular characters in a first script can be mapped to one or more particular characters in a second script. These rules are typically language-specific and may require annotated training data and/or parallel training data (e.g., comparable training data in the first and second scripts).
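As an illustration of the conventional rule-based approach described above, the following is a minimal sketch, not an implementation taken from this specification. It assumes a small hand-written rule table (`RULES`) mapping Latin character sequences to Katakana characters and applies the rules with greedy longest-match. The table contents, the function name `transliterate`, and the longest-match strategy are all assumptions chosen for illustration.

```python
# Hypothetical, hand-written rule table (an assumption for illustration only):
# one or more Latin characters map to one or more Katakana characters.
RULES = {
    "a": "ア",
    "i": "イ",
    "n": "ン",
    "ka": "カ",
    "ki": "キ",
    "ta": "タ",
    "na": "ナ",
    "su": "ス",
    "shi": "シ",
}


def transliterate(text, rules):
    """Apply character-mapping rules left to right, preferring the longest match."""
    max_len = max(len(source) for source in rules)
    output = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate source sequence first, then shorter ones.
        for length in range(max_len, 0, -1):
            chunk = text[pos:pos + length]
            if chunk in rules:
                output.append(rules[chunk])
                pos += len(chunk)
                break
        else:
            # No rule covers this character, so copy it through unchanged.
            output.append(text[pos])
            pos += 1
    return "".join(output)


print(transliterate("katakana", RULES))
print(transliterate("sushi", RULES))
```

Even in this toy form, the rule table is specific to one language pair and must be written or learned in advance, which is the dependency on language-specific data noted above.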