With the exponential growth of the internet and computer networks in general, it has become increasingly important that the computers connected by these networks, and their operating systems, be able to work with information represented in a variety of character sets. For example, in North America the majority of computer systems represent information in Latin characters, while in Russia information is primarily represented in Cyrillic characters and in Japan information is often represented in Katakana characters.
Within computer systems and networks, characters are represented by numeric values, the specific assignments being defined within a coded character set. A common encoding scheme is ASCII which, for example, specifies that the value 97 (hexadecimal 0x61) represents the lower case "a" character, the value 65 (0x41) represents the upper case "A" character, etc. Another common encoding scheme is EBCDIC which, for example, specifies that the value 129 (hexadecimal 0x81) represents the lower case "a" character and the value 193 (hexadecimal 0xC1) represents the upper case "A" character.
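The difference between the two assignments can be illustrated with a minimal Python sketch. The helper name `code_value` is hypothetical, and the choice of Python's `cp037` codec to stand in for EBCDIC is an assumption; it is only one of many EBCDIC code pages.

```python
# Look up the octet assigned to a character in two coded character sets.
# "cp037" is assumed here as a representative EBCDIC code page.

def code_value(ch: str, encoding: str) -> int:
    """Return the single-octet value assigned to ch in the given encoding."""
    encoded = ch.encode(encoding)
    if len(encoded) != 1:
        raise ValueError(f"{ch!r} is not a single octet in {encoding}")
    return encoded[0]


for ch in ("a", "A"):
    ascii_value = code_value(ch, "ascii")
    ebcdic_value = code_value(ch, "cp037")
    print(f"{ch!r}: ASCII 0x{ascii_value:02X}, EBCDIC 0x{ebcdic_value:02X}")
```

The same character thus maps to different octets depending on the coded character set in force, which is why data cannot be exchanged between such systems without conversion.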
While different character sets exist in both of these encoding schemes, other encoding schemes are employed in a variety of other countries and/or languages. Accordingly, using information represented in a different coded character set, or changing the representation of information from one coded character set to another, is often required but can be difficult to accomplish. In particular, ASCII and EBCDIC encoded character sets are single octet (byte) systems and thus can only encode a maximum of 256 characters, including all control characters, etc. When it is desired to encode a larger set of characters, a multi-octet encoding scheme is required.
In an attempt to define a universal method of representing characters which allows all, or substantially all, desired characters to be represented, the International Organization for Standardization (ISO/IEC) has published the Universal Multiple-Octet Coded Character Set (UCS) as standard ISO/IEC 10646-1, and its amendments. UCS can define over two billion characters and includes a two-byte (UCS-2) form and a four-byte (UCS-4) form. These forms and the UCS standard are described in ISO/IEC 10646-1:1993, Universal Multiple-Octet Coded Character Set, and in the corresponding Unicode Standard V2.0 from the Unicode Consortium, which employs the UCS-2 form of UCS. The contents of these publications are incorporated by reference herein.
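As a rough illustration of the two forms, the following Python sketch writes a character's UCS value as a fixed-width two-octet or four-octet quantity. The function names `ucs2` and `ucs4` are hypothetical, and big-endian octet order is an assumption made for the example.

```python
# Fixed-width UCS forms: every character occupies the same number of octets.
# Big-endian octet order is assumed for illustration.

def ucs2(ch: str) -> bytes:
    """Two-octet form; only values up to 0xFFFF fit."""
    value = ord(ch)
    if value > 0xFFFF:
        raise ValueError("value does not fit in the two-octet form")
    return value.to_bytes(2, "big")


def ucs4(ch: str) -> bytes:
    """Four-octet form."""
    return ord(ch).to_bytes(4, "big")


# Latin small a, Cyrillic capital YA, Katakana letter A
for ch in ("a", "\u042f", "\u30a2"):
    print(f"U+{ord(ch):04X}: UCS-2 {ucs2(ch).hex()}  UCS-4 {ucs4(ch).hex()}")
```

Note that even the lower case "a" occupies two or four octets in these forms, so the result is not directly usable by a system expecting single-octet ASCII or EBCDIC data.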
While UCS provides a reasonable framework within which to handle large numbers of characters, many pre-UCS computer systems and much pre-UCS information/data employ other character sets, and it is necessary to transform information represented in UCS to a form compatible with these other character sets, or vice versa. In particular, information represented in UCS cannot be directly processed by systems using ASCII encoding or EBCDIC encoding, and therefore transformation to and from UCS is required. Due to the widespread use of ASCII (7 bit, 128 characters), it was particularly desired to have a transform from UCS to ASCII and back.
A transform referred to as UTF-8 has been developed to transform UCS-2 to and from ASCII. This transform is defined in Amendment No. 2 to ISO/IEC 10646-1 and the contents of this publication are incorporated herein by reference. UTF-8 transforms UCS-2 information to an ASCII-friendly format, and vice versa, maintaining the ASCII characters (assigned to values 0 through 127) as single octets and representing the rest of the UCS-2 character set as sequences of two to six octets (a multi-octet sequence). In particular, the 94 graphic characters and the SPACE, commonly referred to as the graphic characters of the ASCII set, or the G0 set, and the 33 control characters (the 32 control characters from 0x00 to 0x1F, referred to as the C0 set, and the DELETE character at 0x7F) are maintained as single octets.
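The one-to-six octet layout can be sketched in Python as follows. This is an illustrative encoder written for this discussion, not text from the standard; the function name `utf8_encode` is hypothetical, and the sketch performs no validation beyond a range check.

```python
# Sketch of the original UTF-8 layout: values 0..127 stay as one octet,
# larger values become a lead octet followed by 10xxxxxx trailing octets.

def utf8_encode(value: int) -> bytes:
    if not 0 <= value <= 0x7FFFFFFF:
        raise ValueError("value outside the 31-bit UCS range")
    if value < 0x80:
        return bytes([value])  # ASCII range: single octet, unchanged
    # (exclusive upper limit, total octets, lead-octet marker bits)
    layouts = (
        (0x800, 2, 0xC0),
        (0x10000, 3, 0xE0),
        (0x200000, 4, 0xF0),
        (0x4000000, 5, 0xF8),
        (0x80000000, 6, 0xFC),
    )
    for limit, length, lead in layouts:
        if value < limit:
            break
    trailing = []
    for _ in range(length - 1):
        trailing.append(0x80 | (value & 0x3F))  # low six bits per octet
        value >>= 6
    return bytes([lead | value] + trailing[::-1])


for value in (0x61, 0xE9, 0x30A2, 0x7FFFFFFF):
    print(f"0x{value:X} -> {utf8_encode(value).hex()}")
```

Every octet of a multi-octet sequence has its high bit set, so such a sequence can never be mistaken for one of the 128 single-octet ASCII values.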
The international standard ISO/IEC 4873 defines an 8-bit code structure on which other ISO/IEC 8-bit character sets, such as ISO/IEC 8859-1 (Latin Alphabet No. 1), are based. This encoding, referred to herein as ISO-8 bit encoding, employs the same assignment of ASCII characters to values 0x00 through 0x7F but also includes additional characters. In particular, ISO/IEC 4873 includes a second set of control characters from 0x80 to 0x9F, referred to as the C1 set, and these latter control characters are mirrored as two-octet escape sequences, i.e., ESC xx, where "xx" is a suitable value. The UTF-8 transform does not maintain the C1 set as single-octet characters.
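The treatment of the C1 set can be checked with a short Python sketch. It assumes that Python's `latin-1` codec is an adequate stand-in for the ISO-8 bit encoding, since that codec maps the values 0x80 through 0x9F to the corresponding control characters; the helper name `octet_lengths` is hypothetical.

```python
# Compare how many octets each C1 control character needs in an 8-bit ISO
# encoding (Python's "latin-1", assumed here) and after the UTF-8 transform.

def octet_lengths(ch: str) -> tuple:
    """Return (octets in the 8-bit encoding, octets in UTF-8)."""
    return len(ch.encode("latin-1")), len(ch.encode("utf-8"))


c1_set = [chr(value) for value in range(0x80, 0xA0)]
for ch in c1_set[:3]:
    print(f"0x{ord(ch):02X}: {octet_lengths(ch)}, UTF-8 {ch.encode('utf-8').hex()}")
```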
It should be emphasised that UTF-8 transforms information between UCS and ASCII-friendly representations, not ASCII representations. In particular, UTF-8 transforms UCS to an ASCII-friendly form wherein the characters assigned to the first 128 values of the ASCII character set are maintained as single octet strings, but all other characters are represented as variable-length multiple-octet strings of between two and six octets.
While UTF-8 is generally useful in many circumstances, it is not useful for transforming information represented in UCS to an EBCDIC representation and/or vice versa. Unlike ASCII, EBCDIC includes 65 control codes, in addition to the 191 letter, number and punctuation characters (generally referred to as "graphic characters"). The graphic characters are located at 64 (0x40), for the SPACE character, and from 65 (0x41) to 254 (0xFE). The control characters are assigned to the values from 0x00 to 0x3F, together with the EO ("eight ones") character, which is assigned to 0xFF.
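The partition of the 256 EBCDIC octet values can be sketched in Python as follows. The function name `ebcdic_class` is hypothetical, and the sketch takes the control characters to be 0x00 through 0x3F plus 0xFF, which is the assignment implied by the count of 65 control codes.

```python
from collections import Counter

# Classify an EBCDIC octet as control or graphic. Controls are assumed to be
# 0x00..0x3F plus EO at 0xFF; SPACE (0x40) through 0xFE are graphic.

def ebcdic_class(octet: int) -> str:
    if not 0 <= octet <= 0xFF:
        raise ValueError("not a single-octet value")
    if octet <= 0x3F or octet == 0xFF:
        return "control"
    return "graphic"


counts = Counter(ebcdic_class(octet) for octet in range(256))
print(counts["control"], counts["graphic"])
```

The totals agree with the figures given above: 65 control codes and 191 graphic characters, accounting for all 256 values.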
Thus, for compatibility with computer systems using EBCDIC and/or with information represented in EBCDIC, a transform between UCS and an EBCDIC-friendly form must ensure that the 65 control characters are transformed to single-octet representations.
Further, despite the fact that there exist many EBCDIC coded character sets (commonly referred to as "code pages"), a subset of these characters, which is often referred to as the "invariant" character set, is assigned to common values in most or all code pages. For example, the upper case Latin characters "A" through "Z" appear at values 193 (0xC1) to 233 (0xE9) in all EBCDIC code pages. Similarly, the numerals "0" through "9" appear at values 240 (0xF0) through 249 (0xF9) in all EBCDIC code pages. While there are some exceptions to these invariant assignments, as discussed below in more detail, it is desired that a transform maintain the single-octet values to which the invariant character set is assigned.
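The invariant assignments mentioned above can be spot-checked with the EBCDIC codecs that ship with Python. The three code pages chosen here (`cp037`, `cp500`, `cp1140`) are an arbitrary sample, not an exhaustive list, and the helper name `values_in` is hypothetical.

```python
import string

# A small, arbitrary sample of EBCDIC code pages available as Python codecs.
CODE_PAGES = ("cp037", "cp500", "cp1140")


def values_in(code_page: str, chars: str) -> list:
    """Return the octet values assigned to chars in the given code page."""
    return list(chars.encode(code_page))


letters = {cp: values_in(cp, string.ascii_uppercase) for cp in CODE_PAGES}
digits = {cp: values_in(cp, string.digits) for cp in CODE_PAGES}

for cp in CODE_PAGES:
    print(cp, f"A=0x{letters[cp][0]:02X}", f"Z=0x{letters[cp][-1]:02X}",
          f"0=0x{digits[cp][0]:02X}", f"9=0x{digits[cp][-1]:02X}")
```

Although "A" through "Z" span 0xC1 to 0xE9, the 26 letters do not occupy every value in that range; they fall in three runs (0xC1 to 0xC9, 0xD1 to 0xD9, and 0xE2 to 0xE9), a detail any transform preserving these single-octet values has to respect.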