1. Field of the Invention
This invention relates to the validation of character code sequences.
2. Discussion of the Related Art
Double-byte character encoding is commonly used for a number of purposes, among them encoding large character sets such as GB 2312-80, the simplified Chinese character set used in mainland China. GB 2312-80 contains 7,445 characters, of which 6,763 are Chinese characters, each represented as a pair of bytes wherein each byte is a number from 161 to 254. This allows the mixing of the Chinese characters with conventional ASCII text, which is represented by byte values in the range of 0 to 127. Technically, the simultaneous representation of GB 2312-80 with ASCII is called EUC-CN encoding, though we refer to it as GB 2312-80 throughout this specification for simplicity. Because every double-byte character occupies exactly two bytes, bytes in the range of 161 to 254 must come in pairs, and any run of such bytes between two single-byte ASCII characters must contain an even number of bytes. Byte values in the range of 128 to 160 are invalid for GB 2312-80. Despite these rules, invalid characters and sequences, collectively referred to as “noise”, are found to occur in 5% to 10% of Chinese webpages and newswire texts. The origins of this noise are obscure.
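By way of illustration only, the byte-range rules described above can be sketched in Python as follows. The function name find_noise is hypothetical and is not part of any existing converter. The sketch checks only the byte ranges stated above and does not check whether a given pair corresponds to an assigned GB 2312-80 character.

```python
def find_noise(data: bytes) -> list:
    """Return the offsets of bytes that break the byte-range rules.

    Illustrative sketch only: it checks byte ranges, not whether each
    pair is an assigned GB 2312-80 character.
    """
    bad_offsets = []
    i = 0
    n = len(data)
    while i < n:
        b = data[i]
        if b <= 127:
            # Single-byte ASCII character.
            i += 1
        elif 161 <= b <= 254:
            # First byte of a pair: the next byte must also be in 161..254.
            if i + 1 < n and 161 <= data[i + 1] <= 254:
                i += 2
            else:
                bad_offsets.append(i)
                i += 1
        else:
            # 128..160 is invalid; 255 also lies outside both valid ranges.
            bad_offsets.append(i)
            i += 1
    return bad_offsets


if __name__ == "__main__":
    sample = bytes([0x41, 0xD6, 0xD0, 0x80, 0xCE, 0x42])
    print(find_noise(sample))
```

Because the sketch pairs bytes greedily from left to right, the offset it reports for an odd-length run only marks where the pairing fails. It does not identify which byte of the run is actually the noise.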
Applications currently available for the processing of double-byte encodings are inadequate to cope with noise. For example, GB-to-Unicode converters simply crash on the first invalid byte sequence, and all information following the noise is lost.
Repairing such noise presents a problem of ambiguity. For example, consider a nine-byte sequence in which every byte is in the range of 161 to 254: which “half character” is the noise to be discarded? Discarding any one of the bytes will likely leave four perfectly valid Chinese characters, but in most cases they will form an incomprehensible sequence. In all probability, only one of the nine bytes can be discarded so as to produce an intelligible string of characters.
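The ambiguity can be made concrete with a short Python sketch. The function name candidate_repairs and the sample byte values are assumptions chosen for illustration. The sketch enumerates every way of discarding one byte from an odd-length run and shows that each choice leaves a structurally valid sequence of byte pairs, so the byte ranges alone cannot identify the noise.

```python
def candidate_repairs(run: bytes) -> list:
    """List every pairing obtained by discarding exactly one byte.

    Illustrative sketch only: `run` is assumed to be an odd-length run
    of bytes that are all in the range 161..254.
    """
    candidates = []
    for i in range(len(run)):
        remaining = run[:i] + run[i + 1:]
        # Split the even-length remainder into two-byte characters.
        pairs = [remaining[j:j + 2] for j in range(0, len(remaining), 2)]
        candidates.append(pairs)
    return candidates


if __name__ == "__main__":
    # Nine arbitrary bytes in the double-byte range (hypothetical data).
    nine_byte_run = bytes(range(0xB0, 0xB9))
    for index, pairs in enumerate(candidate_repairs(nine_byte_run)):
        print(index, [pair.hex() for pair in pairs])
```

For a nine-byte run the sketch yields nine candidates of four pairs each. Choosing among them requires information beyond the byte ranges, such as which candidate reads as intelligible text.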
What is needed is a method of validating strings of double-byte characters to detect and remove such noise.