This specification relates to digital data processing, and more particularly to identifying a property of an electronic document, such as the language in which the document is written or the encoding used for the text of the document.
Given an electronic document containing encoded text, which typically is stored or transmitted as a sequence of raw bytes, it is useful to identify both the encoding that was used for the text and the language in which the text is written. These two problems are commonly known as encoding detection and language detection, respectively.
Encoding detection and language detection are closely related to each other. On one hand, a given language can usually be represented in several different encodings; for example, Japanese can be encoded in EUC-JP, Shift-JIS, JIS, and UTF-8. On the other hand, most encodings can represent more than one language; for example, CP1251 can encode Russian, Ukrainian, Bulgarian, and Macedonian. As a result, it typically is difficult to identify either the language or the encoding of a piece of text independently of the other. In language detection, for example, text in the same language will produce vastly different byte patterns when it is encoded in different encodings.
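By way of illustration only, the following Python sketch shows both halves of this relationship. The sample strings and helper name are assumptions chosen for the example and do not come from this specification; the codec names are those of the Python standard library, with `iso2022_jp` standing in for the encoding referred to above as JIS.

```python
# Illustrative sketch only; sample text and names are assumptions for the example.

JAPANESE_SAMPLE = "日本語"  # the word for "Japanese language"
JAPANESE_ENCODINGS = ["euc_jp", "shift_jis", "iso2022_jp", "utf_8"]


def encode_all(text, encodings):
    """Return a mapping from encoding name to the raw bytes of `text`."""
    return {name: text.encode(name) for name in encodings}


# One language, several encodings: the same text yields different byte sequences.
japanese_bytes = encode_all(JAPANESE_SAMPLE, JAPANESE_ENCODINGS)
for name, raw in japanese_bytes.items():
    print(f"{name:>10}: {raw.hex()}")

# One encoding, several languages: CP1251 can represent words from each of
# these languages, all as single bytes in the upper half of the byte range.
CYRILLIC_SAMPLES = {
    "Russian": "язык",
    "Ukrainian": "мова",
    "Bulgarian": "език",
    "Macedonian": "јазик",
}
cp1251_bytes = {
    language: word.encode("cp1251") for language, word in CYRILLIC_SAMPLES.items()
}
for language, raw in cp1251_bytes.items():
    print(f"{language:>10}: {raw.hex()}")
```

The four Japanese byte sequences share no common representation, whereas the four CP1251 sequences occupy the same byte range, so the byte values alone indicate neither the language nor the encoding with certainty.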
One conventional solution to language detection is to pair each language with each encoding in which it may appear and to create one class for each unique pair. For example, Chinese in GB and Chinese in UTF-8 are treated as belonging to two different classes and are recognized by two different models. Because the number of classes grows with the number of language and encoding combinations, this approach leads to a large number of distinct classes for language detection.
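The growth in the number of classes under this conventional approach can be sketched as follows. The table below is a hypothetical, non-exhaustive example; in particular, listing UTF-8 for the Cyrillic-script languages is an assumption made for illustration, and the function name is not taken from any existing system.

```python
# Illustrative sketch only; the table is a hypothetical, non-exhaustive example.

ENCODINGS_BY_LANGUAGE = {
    "Chinese": ["GB", "UTF-8"],
    "Japanese": ["EUC-JP", "Shift-JIS", "JIS", "UTF-8"],
    "Russian": ["CP1251", "UTF-8"],
    "Ukrainian": ["CP1251", "UTF-8"],
    "Bulgarian": ["CP1251", "UTF-8"],
    "Macedonian": ["CP1251", "UTF-8"],
}


def pair_classes(table):
    """Return one (language, encoding) class per unique pair in `table`."""
    return [
        (language, encoding)
        for language, encodings in table.items()
        for encoding in encodings
    ]


classes = pair_classes(ENCODINGS_BY_LANGUAGE)
print(f"{len(ENCODINGS_BY_LANGUAGE)} languages -> {len(classes)} classes")
```

Even with only six languages, this small table yields fourteen classes, each of which would require its own model under the pairing approach.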