1. Technical Field
The invention relates to information interchange between data processing systems and, more particularly, to conversion of strings of an N-byte sparse data set to other N-byte strings in another sparse data set. Still more particularly, the invention relates to methods for conversion of double-byte character strings from any 16-bit data code to another 16-bit data code, such as between IBM Host Code and IBM PC Code.
2. Description of the Related Art
Information interchange between data processing systems requires consistency in application of meaning to the symbolic code exchanged. Several interchange codes are in common use, and thus the need to convert between these codes frequently arises. For example, one interchange code, IBM PC Code, is used in IBM PS/2 Model 55 personal computers. IBM PC Code is an extension of the American National Standard Code for Information Interchange ("ASCII") to cover written Japanese. Another interchange code, IBM Host Code, is used in various mid-range and mainframe systems manufactured by International Business Machines Corporation. IBM Host Code is a similar extension of Extended Binary-Coded Decimal Interchange Code ("EBCDIC").
Information interchange codes define character sets for national languages. The necessary symbols or characters are relatively few in number in most languages. English, for example, uses only 26 Roman letters, each of which has an upper case and a lower case representation, for 52 symbols. German requires the addition of only 7 symbols: three vowels receiving diacritics (each in upper case and lower case) and the sharp s (ß), which resembles the Greek lower case beta. Similar minor additions allow French, Dutch, Spanish, Italian, Portuguese and the Scandinavian languages to be written, all with one code set. Ten digits allow expression of all numbers. There is room for a variety of punctuation symbols and other specialized symbols, such as various national monetary symbols (e.g. $). In all, the 256 values of a single 8-bit byte have been found sufficient to express all of these characters. Two standard interchange codes exist for the single-byte languages: the American National Standard Code for Information Interchange ("ASCII") and the Extended Binary-Coded Decimal Interchange Code ("EBCDIC"). Conversion from one interchange code to another for a single-byte language has required provision of only a single 256-byte conversion table.
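The single-table conversion described above can be sketched as follows. This is an illustrative sketch only: Python's `cp037` and `latin-1` codecs are used here as stand-ins for an EBCDIC-based and an ASCII-based 8-bit code, and the names `EBCDIC_TO_ASCII` and `convert_single_byte` are hypothetical, not taken from the text.

```python
# Assumption: cp037 stands in for an EBCDIC code page and latin-1 for an
# ASCII-based 8-bit code. The table holds exactly one output byte for
# each of the 256 possible input bytes.
EBCDIC_TO_ASCII = bytes(range(256)).decode("cp037").encode("latin-1")


def convert_single_byte(data: bytes, table: bytes = EBCDIC_TO_ASCII) -> bytes:
    """Translate every input byte through a 256-byte lookup table."""
    return data.translate(table)


if __name__ == "__main__":
    host_text = "HELLO".encode("cp037")    # bytes as a host system would hold them
    print(convert_single_byte(host_text))  # the same text in the ASCII-based code
```

Because every possible byte value has its own table entry, no validity checks or index calculations are needed; the input byte is itself the index.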
Unlike most languages, Chinese, Japanese, and Korean contain more than 256 characters. Traditional written Chinese utilizes in excess of 13,000 ideographs. Japanese utilizes between 3,000 and 8,000 ideographs (kanji characters) and several hundred other symbols for the numerical, hiragana and katakana characters. Conversion between interchange codes is further complicated by the fact that the ideograph sets for Japanese, Korean, Traditional Chinese and Simplified Chinese differ in content and size. To handle such large character bases, the interchange code sets for these languages use a double byte (16 bits) for each character. This allows the expression of up to 65,536 characters (2^16). The double-byte character sets have been organized in several different ways. For Japanese, four different character sets are in current use in various computers. Korean uses five sets, and Traditional Chinese uses two sets. By way of example, one kanji symbol, the one for "LARGE", is `455B` in IBM Host Code, `91E5` in IBM PC Code, `1371` in Kanji Assigned Number Code, and `3471` in JIS (Japanese Industrial Standard) Code. The numbers are all hexadecimal representations.
In theory, a method similar to that used for conversion between single-byte interchange codes could be used in double-byte conversion. Unfortunately, in application, such an approach would take up considerably more memory than for a single-byte language. A conversion table for a single-byte language is only 256 bytes in length. A conversion table for just one double-byte language would be on the order of 128 kilobytes in length, since it needs 65,536 entries of two bytes each.
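The figure of roughly 128 kilobytes follows from simple arithmetic, shown here as a small sketch. The helper name `table_size_bytes` is hypothetical, and the calculation assumes a directly indexed table with one entry for every possible code, for one direction of conversion only.

```python
def table_size_bytes(code_bits: int, entry_bytes: int) -> int:
    """Size of a directly indexed table with one entry per possible code."""
    return (2 ** code_bits) * entry_bytes


single_byte_table = table_size_bytes(8, 1)    # 256 entries of 1 byte
double_byte_table = table_size_bytes(16, 2)   # 65,536 entries of 2 bytes

if __name__ == "__main__":
    print(single_byte_table, "bytes")
    print(double_byte_table // 1024, "kilobytes")
```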
The traditional methods for double-byte character set ("DBCS") conversion utilize a series of language-dependent comparisons against the individual double-byte characters to determine their validity. These steps are in essence a series of range comparisons. Once the validity of the character has been established, a language-dependent method is utilized to derive array indices for the character into a 256-entry by 256-entry conversion array of double bytes. The array indices are then utilized to recover the appropriate character from the conversion array.
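The traditional approach can be sketched roughly as below. This is a simplified illustration under stated assumptions, not the method of any particular system: the valid byte ranges in `VALID_FIRST` and `VALID_SECOND` are hypothetical, the index derivation is reduced to using the two bytes directly, and the only table entry filled in is the "LARGE" example given earlier (`455B` in IBM Host Code, `91E5` in IBM PC Code).

```python
from typing import List, Optional, Tuple

Ranges = List[Tuple[int, int]]

# Hypothetical language-dependent validity ranges (illustrative values only).
VALID_FIRST: Ranges = [(0x41, 0x7F)]
VALID_SECOND: Ranges = [(0x41, 0xFE)]

# 256-entry by 256-entry conversion array, stored flat; 0 marks "no mapping".
CONVERSION = [0] * (256 * 256)
CONVERSION[0x45 * 256 + 0x5B] = 0x91E5  # "LARGE": Host 455B -> PC 91E5


def in_ranges(value: int, ranges: Ranges) -> bool:
    """Range comparison used to establish validity of one byte."""
    return any(low <= value <= high for low, high in ranges)


def convert_dbcs(code: int) -> Optional[int]:
    """Convert one 16-bit code, or return None if invalid or unmapped."""
    first, second = code >> 8, code & 0xFF
    # Step 1: validity check by range comparisons.
    if not (in_ranges(first, VALID_FIRST) and in_ranges(second, VALID_SECOND)):
        return None
    # Step 2: derive array indices (here simply the two bytes themselves).
    # Step 3: recover the converted character from the conversion array.
    return CONVERSION[first * 256 + second] or None


if __name__ == "__main__":
    print(format(convert_dbcs(0x455B), "04X"))
```

The sketch makes the cost visible: the conversion array reserves a slot for every one of the 65,536 possible codes even though only a fraction of them are ever assigned.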
Although not directed to conversion between interchange codes, U.S. Pat. No. 4,701,746 to Ueda et al. teaches a method which has applicability to simplification of the range comparison and index generating portions of the traditional method. Ueda et al. teach a code converting device for converting code signals of 16 bits to code signals of 13 bits. Ueda et al. basically allow for the conversion of a series of sparse data items into consecutive references. The teaching exploits the fact that, notwithstanding the large character set size for Japanese, a complete table of double-byte characters requires nowhere near 128 Kbytes of memory space. However, Ueda et al. can deal with a maximum of about 8,000 data items. Therefore, Ueda et al. cannot be applied to either Traditional or Simplified Chinese. Nor is Ueda et al. of general application to multiple-byte systems. Ueda et al. also provide limited error checking capacity compared to the present invention.