People communicate using characters. To encode these characters in a computer, encoded representations of each character are defined as standards. Sending and receiving computers using the same standards are thus able to communicate with each other: the receiving computer can display a copy of what was sent by the sending computer. The ASCII character set is an example of a standard character set used in a computer, although other standards such as EBCDIC have been used.
Different languages may be represented using different character sets. The ASCII character set is used to represent the English language. Other Latin-based languages may also be represented by the ASCII character set along with one or more standard character set extensions that provide a complete set of characters for the language. For example, French uses the ASCII characters together with other characters such as `ç`, but `ç` is not part of the ASCII character set. The standard Latin-1 Supplement character set contains `ç` and thus supplements the ASCII character set for the French language.
Other languages are represented by a completely different character set. For example, the Hebrew language is written in characters different from those in the ASCII character set, and various Hebrew character set standards have been developed for communicating in Hebrew.
ISO, the International Organization for Standardization, has mapped many character sets, including symbol character sets, into two global character supersets, UCS-2 and UCS-4. The UCS-2 character superset uses two bytes to encode a wide variety of characters which might otherwise require multiple character sets. The upper bits of the two bytes of a UCS-2 character generally, though not always, define a group of related characters, such as Basic Latin characters, Hebrew basic and extended characters or Thai characters, with the lower bits defining the character within the group. A "byte" is any group of 8 bits, and is sometimes referred to in the art as an "octet". If both sender and recipient software use UCS-2, communication is possible in a wide variety of languages using a wide variety of characters from different groups. Some languages would otherwise require the use of two or more character sets, so the UCS-2 superset standard of encoding may be used to represent the characters in each set.
The UCS-4 superset uses four bytes to represent characters, presently in the same manner as UCS-2. The UCS-4 representation of any character is equal to the UCS-2 representation in the lower two bytes and 0x0000 in the upper two bytes.
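The relationship between the two supersets can be illustrated with a minimal Python sketch. The function name `ucs2_to_ucs4` is hypothetical, and big-endian byte order is an assumption, since the text does not specify one.

```python
import struct

def ucs2_to_ucs4(code_unit: int) -> bytes:
    """Widen a 16-bit UCS-2 value to four bytes (big-endian assumed)."""
    if not 0 <= code_unit <= 0xFFFF:
        raise ValueError("not a UCS-2 value")
    # The upper two bytes are 0x0000; the lower two carry the UCS-2 value.
    return struct.pack(">I", code_unit)

ucs4 = ucs2_to_ucs4(0x0421)
```

Here `ucs4` holds the bytes 0x00 0x00 0x04 0x21, that is, the UCS-2 value preceded by two zero bytes.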
The notation "0x" implies that the adjacent digits to the right of the 0x are hexadecimal digits. Thus, 0x10 is 16 in decimal, which is the same as 00010000 in binary.
Because the sender and the recipient of a message usually communicate using the same one or two character sets, and because most character sets use no more than 7 or 8 bits per character, the UCS-2 and UCS-4 character sets are inefficient: transmission times and storage space are larger than necessary to represent the few character sets used. For example, the ASCII character set uses 8 bits with the Most Significant Bit (MSB) equal to 0. When UCS-2 is used, every character requires 16 bits, doubling the length of information transmitted and stored.
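The doubling can be checked with a short Python sketch. It uses Python's `utf-16-be` codec as a stand-in for UCS-2, which is an assumption that holds for the ASCII-range characters used here; the sample string is arbitrary.

```python
text = "Hello, world"

# One byte per character, MSB always 0.
ascii_bytes = text.encode("ascii")

# Two bytes per character; for these characters UTF-16 matches UCS-2.
ucs2_bytes = text.encode("utf-16-be")

size_ratio = len(ucs2_bytes) / len(ascii_bytes)
```

For this string `ascii_bytes` is 12 bytes long and `ucs2_bytes` is 24 bytes long, so `size_ratio` is 2.0.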
More efficient transformations of the UCS-2 character set have been employed that provide the benefits of a character superset using fewer bytes per character than UCS-2, by adding special characters known as "shift" or "shift lock" characters that indicate how a subsequent character or subsequent characters should be interpreted.
Shift and shift lock characters are used to change the interpretation of one or more characters which follow them. The "shift" character changes the interpretation of the character immediately following it, similar to the shift key on a typewriter. A "shift lock" character changes the interpretation of the characters between the shift lock character and a "shift unlock" character.
Two character transformations are known as UTF-8 and UTF-7. UTF-8 uses 8-bit characters and encodes ASCII characters identically with their ASCII encoding. UCS-2 characters outside the ASCII character set and up to and including 0x7FF use a two-byte encoding per character, and those higher than 0x7FF use three bytes per character. UCS-4 characters outside the UCS-2 range require additional bytes. Thus a two-byte UCS-2 character outside the ASCII character set will result in a character of two or three bytes under UTF-8. The high order bits of each encoded byte of each character are similar to shift characters, indicating any transformation required to convert the current character back to its UCS-2 form.
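The UTF-8 byte counts can be made concrete with a minimal Python sketch that covers only the UCS-2 range. The function name `utf8_encode_bmp` is hypothetical; the bit patterns are those of standard UTF-8.

```python
def utf8_encode_bmp(code_point: int) -> bytes:
    """Encode one UCS-2 value (0x0000 to 0xFFFF) as UTF-8."""
    if not 0 <= code_point <= 0xFFFF:
        raise ValueError("outside the UCS-2 range")
    if code_point <= 0x7F:
        # ASCII: one byte, identical to the ASCII encoding.
        return bytes([code_point])
    if code_point <= 0x7FF:
        # Two bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    # Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
    return bytes([0xE0 | (code_point >> 12),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])
```

The leading bits of the first byte (0, 110 or 1110) tell the decoder how many bytes the character occupies, and the leading 10 marks each continuation byte. For example, `utf8_encode_bmp(0x0421)` yields the two bytes 0xD0 0xA1.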
UTF-7 also encodes ASCII characters using their ASCII encodings. UCS-2 characters outside the ASCII character set are transformed using a shift lock character. A shift lock character sequence has a special character at the beginning, a special character at the end, and a transformation rule for all characters in between the special character at the beginning and the special character at the end. To encode in UTF-7, the special character at the beginning is an ASCII `+` and the special character at the end is an ASCII `-` or any character that is not in the ASCII set `A` through `Z`, `a` through `z`, `0` through `9`, `+`, `/` and `=`. Characters in between the special characters are encoded using "base 64 encoding".
"Base 64 encoding" is a process that converts a string of one or more bytes into a string of a limited set of ASCII characters. The bits in the unencoded string are encoded by first grouping them into sets of 6, starting with the most significant bit, and encoding each set of 6 bits into a byte. The encoding process uses Table 1 attached hereto, encoding each six bit set into the value of an ASCII character in a limited ASCII character set. The limited character set is used to ensure that other ASCII characters, which can cause uncertain results in some mail applications, are not used. If the string of unencoded characters does not have a number of bits that is evenly divisible by 6, i.e. with no remainder, the unused bits in the last set of 6 may be zero filled.
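The grouping step can be sketched in Python as follows. Table 1 is not reproduced in this section, so the sketch assumes it is the conventional base 64 alphabet (`A` to `Z`, `a` to `z`, `0` to `9`, `+`, `/`); the names `ALPHABET` and `base64_encode_bits` are hypothetical.

```python
import string

# Assumed stand-in for Table 1: value 0 maps to 'A', 1 to 'B', and so on.
ALPHABET = (string.ascii_uppercase + string.ascii_lowercase
            + string.digits + "+/")

def base64_encode_bits(data: bytes) -> str:
    """Encode bytes six bits at a time, most significant bit first."""
    bits = "".join(f"{byte:08b}" for byte in data)
    # Zero fill the last group when the bit count is not a multiple of 6.
    bits += "0" * (-len(bits) % 6)
    return "".join(ALPHABET[int(bits[i:i + 6], 2)]
                   for i in range(0, len(bits), 6))
```

With this assumed table, the two bytes 0x04 0x21 encode to the three characters `BCE`. The sketch omits the `=` padding characters that general-purpose base 64 encoders append.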
For example, a two-byte UCS-2 character, Cyrillic `С` or 0x0421, is not an ASCII character and therefore is encoded by UTF-7 using a shift lock character of `+` and a shift unlock character of `-`. The bit stream 00000100 00100001 is base 64 encoded by grouping the sixteen bits into three groups of six bits per group, 000001 000010 000100, with the two trailing zeroes added as fill. Using Table 1, the groups translate into `B` `C` `E`, which are encoded in ASCII as 0x42 0x43 0x45 and, with the addition of the shift lock and unlock characters, read 0x2B 0x42 0x43 0x45 0x2D. Thus two bytes that are not ASCII encoded will result in 5 bytes under UTF-7. UTF-7 has an exception for encoding the ASCII `+` for situations when it is not being used as a shift lock character. The ASCII `+` is encoded in UTF-7 as the shift lock and unlock characters with no characters in between, `+-`.
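The worked example can be reproduced with a minimal Python sketch that encodes one character at a time. The function name `utf7_encode_char` is hypothetical, and the sketch is a simplification: it closes the shift lock after every character and always terminates with `-`, whereas a practical encoder would keep a run of non-ASCII characters inside one shift lock sequence.

```python
import base64

def utf7_encode_char(ch: str) -> bytes:
    """Encode a single UCS-2 character in a simplified UTF-7 form."""
    code_point = ord(ch)
    if code_point > 0xFFFF:
        raise ValueError("outside the UCS-2 range")
    if ch == "+":
        # Literal plus sign: shift lock and unlock with nothing between.
        return b"+-"
    if code_point < 0x80:
        # ASCII characters keep their ASCII encoding.
        return bytes([code_point])
    raw = code_point.to_bytes(2, "big")
    # Standard base 64 of the two bytes, without the '=' padding.
    body = base64.b64encode(raw).rstrip(b"=")
    return b"+" + body + b"-"

encoded = utf7_encode_char("\u0421")
```

Here `encoded` is the five bytes 0x2B 0x42 0x43 0x45 0x2D, which read as `+BCE-`.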
"Base 64 decoding" operates in reverse of base 64 encoding. Each byte of an encoded character is decoded using Table 1 to create a 6 bit result, and the results from a single character are concatenated to create the decoded character, removing any zero filled least significant bits, identified by the size of the resulting concatenated bits. For example, UCS-2 characters have 16 bits, and base 64 encoded UCS-2 characters use 3 bytes to produce 3 sets of 6 bits or 18 bits, so the 2 least significant bits are zero fill and are removed.
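A minimal Python sketch of the decoding direction for a single character follows. As before, the conventional base 64 alphabet is assumed in place of Table 1, and the names `ALPHABET` and `base64_decode_bits` are hypothetical.

```python
import string

# Assumed stand-in for Table 1: 'A' maps to 0, 'B' to 1, and so on.
ALPHABET = (string.ascii_uppercase + string.ascii_lowercase
            + string.digits + "+/")

def base64_decode_bits(encoded: str, bit_length: int = 16) -> bytes:
    """Decode one character's worth of base 64 text back to bytes."""
    bits = "".join(f"{ALPHABET.index(ch):06b}" for ch in encoded)
    # Keep the character's own bits and drop the trailing zero fill.
    bits = bits[:bit_length]
    return int(bits, 2).to_bytes(bit_length // 8, "big")

decoded = base64_decode_bits("BCE")
```

The three encoded bytes produce 18 bits; truncating to 16 removes the two fill bits, and `decoded` holds 0x04 0x21.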
UTF-7 is a "mail safe" encoding. A "mail safe" encoding is one which never uses the MSB of each byte, because some electronic mail applications use the MSB for other purposes. In addition, a mail safe encoding restricts the use of some ASCII codes which have special meaning to certain e-mail systems.
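The first half of this definition, that the MSB of every byte is unused, can be checked mechanically. The Python sketch below is a partial check only: the function name `uses_only_seven_bits` is hypothetical, and it does not test for the restricted ASCII codes, which this section does not enumerate.

```python
def uses_only_seven_bits(data: bytes) -> bool:
    """Return True when no byte in data has its most significant bit set."""
    return all((byte & 0x80) == 0 for byte in data)

utf7_ok = uses_only_seven_bits(b"+BCE-")                   # UTF-7 form of 0x0421
utf8_ok = uses_only_seven_bits("\u0421".encode("utf-8"))   # bytes 0xD0 0xA1
```

The UTF-7 bytes pass the check, while the UTF-8 encoding of the same character does not, because both of its bytes have the MSB set.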
An encoding is more "efficient" than another encoding when it represents characters using a smaller number of bits than the other encoding. Because characters in a message may be transmitted and/or stored, it is desirable to encode characters in a manner that minimizes the number of bytes required to represent most messages. Therefore, a method and system are needed to create a transformation of UCS-2 and/or UCS-4 that is mail safe, more efficient than UTF-7 and at least as efficient as UTF-8.