1. Technical Field
The present invention relates to a method and system for data processing in general and, in particular, to a method and system for exchanging information between data-processing systems. Still more particularly, the present invention relates to a method and system for converting a multi-byte dataword in a first interchange code to a multi-byte dataword in a second interchange code.
2. Description of the Prior Art
Within a data-processing system, all the letters (or characters) of a national language can be encoded by a set of character codes known as interchange codes. For most languages, the number of character codes required within a set of interchange codes is relatively small. For example, only 52 character codes are required to represent all 26 letters of the English alphabet in both upper and lower case. Only seven more character codes are required to represent the German alphabet: three vowels bearing a diaeresis (in both upper and lower case) and the ß symbol. Similar minor additions of character codes enable French, Dutch, Spanish, Italian, Portuguese, and several Scandinavian languages to be encoded separately, each within an individual interchange code set. Hence, one byte (eight bits) is sufficient for encoding all the letters of any one of the above-mentioned national languages. In addition to letters, single-byte encoding also accommodates the ten digits for expressing numbers, a variety of punctuation symbols, and other specialized symbols such as national monetary symbols (e.g., $, £, ¥). Currently, there are two standard sets of interchange codes for each single-byte encodable national language, namely, the American National Standard Code for Information Interchange (ASCII) and the Extended Binary-Coded Decimal Interchange Code (EBCDIC).
Not surprisingly, there are also national languages that contain more than 256 characters. For example, the traditional written Chinese language comprises more than 13,000 characters, and the Japanese language comprises 3,000 to 8,000 ideographs (known as kanji characters) and several hundred other symbols for the numerical, hiragana, and katakana characters. Thus, each set of interchange codes for any one of the national languages that have a large character base must utilize two bytes (or 16 bits) for encoding one character. This allows for the expression of a maximum of 2^16 = 65,536 characters, excluding digits, punctuation symbols, etc. As with single-byte interchange code sets, there is also more than one standard set of interchange codes for double-byte encodable languages. For example, four different interchange code sets are currently utilized for the Japanese language: IBM Host Code, IBM PC Code, Kanji Assigned Number Code, and JIS (Japanese Industrial Standard) Code. Similarly, there are five different interchange code sets for the Korean language and two different interchange code sets for the Traditional Chinese language.
Because several sets of interchange codes are in common usage today, the need for code conversion among different interchange code sets frequently arises. For example, the IBM PC Code, which is an extension of ASCII for encoding Japanese characters, is typically utilized in IBM-compatible personal computers, while the IBM Host Code, which is a similar extension of EBCDIC, is typically utilized in various midrange and mainframe computers. In order to provide information interchange between any two of the above-mentioned data-processing systems, each having a different set of interchange codes, the meaning assigned to each exchanged code must be applied consistently on both systems.
For a single-byte encodable language, conversion from one set of interchange codes to another requires a single conversion table. In theory, the concept of a conversion table may also be applied to double-byte conversion. In practice, unfortunately, such an approach would take up considerably more memory than it does for a single-byte encodable language. For example, a typical conversion table for a single-byte encodable language is 256 bytes long, while a typical conversion table for a double-byte encodable language would be in excess of 128 kilobytes long.
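The table-lookup idea and the memory comparison above can be illustrated with a minimal Python sketch. The function names and the two-entry mapping are hypothetical and serve only as illustration; the table is an identity mapping apart from the two ASCII-to-EBCDIC code points shown, not a complete conversion table. The size arithmetic assumes a fully populated 256-by-256 table of two-byte entries, which comes to 131,072 bytes.

```python
# Illustrative sketch only: names and the tiny mapping are assumptions,
# not part of the original disclosure.

def build_single_byte_table(overrides):
    """Return a 256-byte conversion table; unmapped codes map to themselves."""
    table = bytearray(range(256))
    for source_code, target_code in overrides.items():
        table[source_code] = target_code
    return bytes(table)

def convert_single_byte(data, table):
    """Convert each byte of `data` with one table lookup per byte."""
    return data.translate(table)

# Two well-known code points: ASCII 'A' (0x41) and 'B' (0x42) correspond to
# EBCDIC 0xC1 and 0xC2. A real table would define all 256 entries.
ascii_to_ebcdic = build_single_byte_table({0x41: 0xC1, 0x42: 0xC2})

# Memory needed for a full table in each case.
SINGLE_BYTE_TABLE_BYTES = 256 * 1        # 256 entries, 1 byte each
DOUBLE_BYTE_TABLE_BYTES = 256 * 256 * 2  # 65,536 entries, 2 bytes each

converted = convert_single_byte(b"AB", ascii_to_ebcdic)
```

The double-byte table is 512 times the size of the single-byte table, which is the memory cost that motivates looking for alternatives to a flat lookup.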
Further, the traditional methods for double-byte code set conversion utilize a series of language-dependent comparisons against the individual double-byte characters to determine their validity. These steps are, in essence, a series of range comparisons. Once the validity of the character has been established, a language-dependent method is utilized to derive array indices for the character into a 256-entry-by-256-entry conversion array of double-byte entries. The array indices are then utilized to retrieve the appropriate character from the conversion array. U.S. Pat. No. 5,309,358 to Andrews et al. teaches a method that simplifies the range-comparison and index-generating portions of the traditional method. The present disclosure provides additional improvements to the method taught by Andrews in U.S. Pat. No. 5,309,358, such that conversion time can be further reduced.
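The traditional approach described above (range comparisons for validity, then index derivation, then array lookup) might be sketched as follows in Python. This is a rough illustration of the traditional method only, not of the Andrews method or of the present invention. The byte ranges, the substitute value, the identity index derivation, and the single table entry are all placeholder assumptions and do not reproduce any of the interchange code sets named above.

```python
# Illustrative sketch only: every constant below is a placeholder assumption.

# Hypothetical valid ranges for the first and second byte of a character.
VALID_FIRST_BYTE_RANGES = [(0x81, 0x9F), (0xE0, 0xFC)]
VALID_SECOND_BYTE_RANGES = [(0x40, 0x7E), (0x80, 0xFC)]

# Hypothetical value returned for invalid or unmapped characters.
SUBSTITUTE = (0xFF, 0xFF)

def in_ranges(value, ranges):
    """Range comparison: True if value falls inside any (low, high) pair."""
    return any(low <= value <= high for low, high in ranges)

def make_conversion_array():
    """Build a 256-by-256 array of double-byte entries, all unmapped."""
    return [[SUBSTITUTE] * 256 for _ in range(256)]

def convert_double_byte(first, second, conversion_array):
    """Validate a double-byte character, derive indices, and look it up."""
    # Step 1: language-dependent validity check via range comparisons.
    if not (in_ranges(first, VALID_FIRST_BYTE_RANGES)
            and in_ranges(second, VALID_SECOND_BYTE_RANGES)):
        return SUBSTITUTE
    # Step 2: language-dependent index derivation (identity assumed here).
    row, column = first, second
    # Step 3: retrieve the converted character from the conversion array.
    return conversion_array[row][column]

conversion_array = make_conversion_array()
conversion_array[0x88][0x9F] = (0x12, 0x34)  # hypothetical mapping

valid_result = convert_double_byte(0x88, 0x9F, conversion_array)
invalid_result = convert_double_byte(0x20, 0x41, conversion_array)
```

Each converted character costs several comparisons before the lookup even begins, which is the per-character overhead that simplified range checks and index generation aim to reduce.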