The present invention relates to data processing by digital computer, and more particularly to identifying a language encoded in a data string.
Text is typically stored electronically as a string of data. The data string, for example, can represent text characters as code points, each having a hexadecimal byte value. The hexadecimal byte value maps to a particular character in one or more coded character sets. A character set describes a relationship between each character and a respective code point, where each code point codes for a single character. There are a number of ISO character sets, for example, ISO 8859-1, which maps a set of Latin characters to their respective code points. Other character sets are created by different entities; for example, there are a number of Microsoft Windows character sets. FIG. 1 illustrates a character set for ISO 8859-1 (Latin 1). In byte form, each Latin 1 character has a one-byte hexadecimal code point value. For example, the letter “A” has a value of 0x41. The same code point value can correspond to different characters in different character sets. For example, the code point 0xC4 corresponds to the symbol “Ä” in ISO 8859-1, but the same code point corresponds to the symbol “Δ” in ISO 8859-7 (Greek). Thus, the hexadecimal byte value defines a coordinate position within a character set but does not identify to which of the available character sets the value belongs. In other character sets, a code point includes more than one byte.
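The dependence of a byte value on its character set can be illustrated with a minimal Python sketch. It is only an illustration, not part of the described method: the helper name `decode_byte` is hypothetical, and the codec names `iso8859_1` and `iso8859_7` are the Python standard library's names for ISO 8859-1 and ISO 8859-7.

```python
# Illustrative only: decode the same one-byte code point under two
# different character sets using Python's built-in codecs.

def decode_byte(data: bytes, charset: str) -> str:
    """Return the text that `data` represents under the named charset.

    `decode_byte` is a hypothetical helper introduced for this sketch.
    """
    return data.decode(charset)

raw = bytes([0xC4])  # a single byte with hexadecimal value 0xC4

latin1_char = decode_byte(raw, "iso8859_1")  # interpreted as Latin 1
greek_char = decode_byte(raw, "iso8859_7")   # interpreted as Greek

# A byte in the ASCII range, by contrast, reads the same in both sets.
ascii_a = decode_byte(bytes([0x41]), "iso8859_1")

print(latin1_char, greek_char, ascii_a)
```

The byte itself carries no marker of which table applies; the caller has to supply the character set name.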
Consequently, there can be a string of hexadecimal byte values that each define a position that exists in multiple different character sets. The identity of the character residing in the position corresponding to a hexadecimal byte value depends on the particular character set. As a result, it is often difficult to identify the character itself. Without information in the data string identifying the language of the hexadecimal byte values, it can also be difficult to determine the language corresponding to the data string. When translating the data string into another form, for example, into Unicode or to display the text, it is necessary to know the language corresponding to the data string.
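As a hedged illustration of why conversion to Unicode needs this information, the following Python sketch converts one byte string under two assumed character sets. The function name `to_utf8` and the sample word are assumptions made for the example; the point is only that the resulting Unicode text differs with the assumed source character set.

```python
# Illustrative only: convert a byte string to Unicode (UTF-8) under an
# assumed source character set. A wrong assumption yields wrong text.

def to_utf8(data: bytes, assumed_charset: str) -> bytes:
    """Decode `data` under `assumed_charset`, then re-encode as UTF-8.

    `to_utf8` is a hypothetical helper introduced for this sketch.
    """
    return data.decode(assumed_charset).encode("utf-8")

# Sample bytes produced by encoding a Greek word in ISO 8859-7.
data = "Δελτα".encode("iso8859_7")

as_greek = data.decode("iso8859_7")   # correct assumption
as_latin1 = data.decode("iso8859_1")  # wrong assumption, no error raised

utf8_greek = to_utf8(data, "iso8859_7")
utf8_latin1 = to_utf8(data, "iso8859_1")

print(as_greek, as_latin1)
```

Both conversions succeed without raising an error, so a wrong assumption cannot be detected from the decoding step alone.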