This specification relates to detecting writing systems and languages.
A writing system uses symbols, e.g., characters or graphemes, to represent sounds of a language. A collection of symbols in a writing system can be referred to as a script. For example, a Latin writing system, including a collection of Roman characters in one or more Roman scripts, can be used to represent the English language. A particular writing system can be used to represent more than one language. For example, the Latin writing system can also be used to represent the French language.
Conversely, a given language can be represented by more than one writing system. For example, the Chinese language can be represented by a first writing system, e.g., Pinyin (or Romanized Chinese). The Chinese language can also be represented by a second writing system, e.g., Bopomofo or Zhuyin Fuhao (“Zhuyin”). As yet another example, the Chinese language can be represented by a third writing system, e.g., Hanzi.
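The many-to-many relationship described above can be illustrated with a minimal sketch. The sketch is not part of the specification. It assumes that the script of a character can be approximated by the first word of its Unicode character name, and the names `SCRIPT_TO_LANGUAGES`, `dominant_script`, and `candidate_languages` are hypothetical, as is the deliberately incomplete mapping table, which lists only the examples given above.

```python
import unicodedata
from collections import Counter

# Hypothetical, incomplete mapping from a script label to the languages
# (or language/writing-system pairs) mentioned in the text above.
SCRIPT_TO_LANGUAGES = {
    "LATIN": {"English", "French", "Chinese (Pinyin)"},
    "BOPOMOFO": {"Chinese (Zhuyin)"},
    "CJK": {"Chinese (Hanzi)"},
}


def dominant_script(text):
    """Return the most frequent script label among the letters of `text`.

    Assumption: the first word of the Unicode character name
    (e.g. "LATIN", "BOPOMOFO", "CJK") is used as a rough script label.
    Returns None when the text contains no letters.
    """
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation, and whitespace
        name = unicodedata.name(ch, "")
        if name:
            counts[name.split()[0]] += 1
    return counts.most_common(1)[0][0] if counts else None


def candidate_languages(text):
    """Return the set of languages consistent with the dominant script."""
    return SCRIPT_TO_LANGUAGES.get(dominant_script(text), set())


if __name__ == "__main__":
    for sample in ["bonjour", "hànyǔ", "ㄓㄨㄧㄣ", "汉字"]:
        print(sample, dominant_script(sample), sorted(candidate_languages(sample)))
```

In this sketch, Latin-script input remains ambiguous among several languages, while Chinese input can map to three different script labels, so identifying the script alone does not identify the language.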
Because one writing system can represent several languages and one language can be represented by several writing systems, automatically detecting languages from input text is difficult. The accuracy and precision of such detection can depend on the amount and quality of the training data used to train a classifier.