The growing trend toward multinational organizations has given rise to a corresponding need for fast, efficient, and accurate data conversion between various computer character sets generally corresponding to different human languages.
Typically, in computing systems, the internal representation of characters is designed for one alphabet. For example, a computing system may be designed to represent western European characters corresponding to the languages that use this alphabet (e.g., English, French, German, etc.), but would not be able to represent languages using other characters (Cyrillic, Arabic, Japanese, Chinese, etc.).
Computer representation of characters typically assigns every character of the alphabet a unique numeric value. This means that a character set that represents each character using 8 bits can have at most 256 characters. A 256-character character set is sufficient to represent the western European alphabet or Cyrillic (though not concurrently), but is insufficient for languages that employ more characters (e.g., Japanese, Chinese, etc.). Languages having large character sets have employed a two-byte (16-bit) representation of characters. Such character sets may employ a multi-byte encoding, with, for example, the first byte indicating the number of bytes used to represent the character. Such encodings did not provide the capability to combine character sets. So, for example, it was not possible to combine western European and Japanese, or Japanese and Chinese, character sets.
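The 256-character limit and the "not concurrently" restriction can be illustrated with a minimal sketch. It assumes Python's built-in `latin-1` and `cp1251` codecs as stand-ins for a western European and a Cyrillic 8-bit character set; the helper name `can_encode` is hypothetical.

```python
WESTERN = "latin-1"   # ISO 8859-1, a western European 8-bit set
CYRILLIC = "cp1251"   # a Cyrillic 8-bit set


def can_encode(text: str, charset: str) -> bool:
    """Return True if every character of text has a value in charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


french = "café"
russian = "кафе"

print(can_encode(french, WESTERN))    # True
print(can_encode(russian, CYRILLIC))  # True
print(can_encode(russian, WESTERN))   # False: Latin-1 has no Cyrillic letters
print(can_encode(french, CYRILLIC))   # False: cp1251 has no "é"

# The same 8-bit value stands for a different character in each set,
# which is why the two sets cannot be used concurrently.
print(bytes([0xE9]).decode(WESTERN))   # é
print(bytes([0xE9]).decode(CYRILLIC))  # й
```

Each set fits within 256 values only by leaving out the other's characters, so a byte is meaningless until the receiver knows which set produced it.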
Unicode was developed to cover all major languages and character sets. Unicode, as originally designed, represents each character using 16 bits and can therefore uniquely identify up to 65,536 characters. This means that Unicode acts as a superset of the existing character sets for the various languages and alphabets.
However, the majority of extant systems are not Unicode-based, and there is therefore a substantial need for conversion between various character sets. A computing system using Unicode can communicate with external computing systems employing various character sets, but there must be a conversion between Unicode and the character set of the external computing system.
A character set conversion problem may occur when converting data between Unicode and other character sets. For example, consider a multinational organization having a Unicode-based central computing system communicating data between two external computing systems, the first using a Chinese character set and the second using a Japanese character set. There may be a need to send data from the first external computing system to the second external computing system. Chinese character set characters converted to Unicode may not be entirely convertible to Japanese character set characters. So, when the Japanese external computing system attempts to convert the Unicode data, it will not be able to represent some of the Chinese character set characters. This will cause a character conversion failure.
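The Chinese-to-Japanese scenario can be reproduced with a short sketch. It assumes GB2312 as the Chinese character set and Shift_JIS as the Japanese one (the text does not name specific sets), and the helper `find_unconvertible` is a hypothetical name.

```python
# Data as held by the first external system (assumed here to use GB2312).
chinese_bytes = "中国我们".encode("gb2312")

# Step 1: the central system converts the incoming data to Unicode.
unicode_text = chinese_bytes.decode("gb2312")


def find_unconvertible(text: str, charset: str) -> list[str]:
    """List the characters of text that have no mapping in charset."""
    failed = []
    for ch in text:
        try:
            ch.encode(charset)
        except UnicodeEncodeError:
            failed.append(ch)
    return failed


# Step 2: the second external system (assumed here to use Shift_JIS)
# tries to convert the Unicode data to its own character set.
print(find_unconvertible(unicode_text, "shift_jis"))  # ['们']
```

The first three characters exist in both sets, but the simplified form 们 has no Shift_JIS value, so the second conversion fails even though the first succeeded.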
Typically, one of two strategies is employed when a character conversion failure occurs. The first is to use a replacement character for the unknown character. The replacement character indicates that proper conversion did not take place for the particular character, but the conversion continues. The second, known as a “hard error,” means that the conversion is halted; that is, the data is not converted.
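Both strategies correspond to standard error-handling modes of Python's codecs, which makes for a compact illustration. The sample string and the choice of Shift_JIS as the target set are assumptions made for the example.

```python
text = "中国我们"  # the last character has no Shift_JIS mapping

# Strategy 1: substitute a replacement character and keep converting.
substituted = text.encode("shift_jis", errors="replace")
print(substituted.decode("shift_jis"))  # 中国我?

# Strategy 2: hard error. The conversion halts and no data is produced.
try:
    text.encode("shift_jis", errors="strict")
    halted = False
except UnicodeEncodeError as exc:
    halted = True
    print(f"conversion halted at position {exc.start}")
```

With `errors="replace"` the codec emits "?" for the failed character; with `errors="strict"` (the default) it raises an exception and returns nothing.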
Depending on the situation, either strategy may be preferable. For example, for display data it may be better to provide as much valid data as possible and provide a substitute character where proper conversion has not occurred. On the other hand, invalid conversion of a financial transaction may warrant termination of the conversion and other corrective action. In either case it may be prudent to provide a notification of the failed conversion. This could take the form of an entry to a log file or event log and may precipitate manual corrective intervention.
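One way to picture a per-situation policy with notification is the sketch below. The function `convert`, its `critical` flag, and the logger name are all hypothetical; the point is only that the notification is written in both cases, while the choice between substitution and hard error depends on the kind of data.

```python
import logging

logger = logging.getLogger("charset_conversion")


def convert(text: str, charset: str, *, critical: bool) -> bytes:
    """Convert text to charset, logging any character that fails.

    critical=True  -> hard error (e.g., a financial transaction)
    critical=False -> substitute "?" (e.g., display data)
    """
    try:
        return text.encode(charset)
    except UnicodeEncodeError as exc:
        # Notify in either case so that corrective action can follow.
        logger.warning(
            "cannot convert %r at position %d to %s",
            text[exc.start:exc.end], exc.start, charset,
        )
        if critical:
            raise
        return text.encode(charset, errors="replace")
```

For example, `convert("中国我们", "shift_jis", critical=False)` returns the data with a substitute character, whereas the same call with `critical=True` re-raises the `UnicodeEncodeError` after logging it.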
The character conversion can be done in one of two places: at the central Unicode-based computing system, or at the external computing system. If the conversion is done at the central computing system, there are usually specified error-handling procedures and corrective mechanisms. That is, the central computing system has complete control of the conversion. However, if the data conduit is a Unicode-enabled communication stream (e.g., XML), then the Unicode-based system simply transmits the Unicode data and the external computing system must complete the character conversion. That is, the outgoing data format must match the format supported by the data communication stream.
Typically, data sent across a communication stream may consist of three types: textual data content, non-textual data content, and message format and control information. The textual data content is that portion of the data that is encoded using the character set of the communication stream and is converted to the character set of the external computing system upon receipt. The non-textual data content (e.g., image data), on the other hand, is not subject to character conversion. The format and control information, which may or may not be textual, is not part of the data content and subject to character conversion when received by the external system. External computing systems use a variety of format and control information, which may employ a hierarchical or other structured format. A central computing system may use a single hierarchical format when storing data internally. Such a central computing system may store metadata, which is data that describes how to convert the internally stored data into the format and control information of an external system. This metadata, which is not transmitted with the message data, describes how to add any necessary format and control data required for the particular message format being transmitted. That is, as data is received from various systems over various communication streams, the particular message format and control information is replaced with common message format and control information. When the data is transferred over a given communication stream, the common message format and control information is replaced with message format and control information corresponding to that communication stream. The outgoing communication stream often supports Unicode, necessitating character conversion at the external computing system.
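The distinction between textual and non-textual data content can be sketched as follows. The `Message` class, its field names, and `to_external` are hypothetical and greatly simplified; format and control information is deliberately left out, and the sketch only shows that character conversion applies to the textual content while non-textual content passes through untouched.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    """Hypothetical internal message holding data content only."""
    text_fields: dict[str, str] = field(default_factory=dict)      # textual content
    binary_fields: dict[str, bytes] = field(default_factory=dict)  # e.g., image data


def to_external(msg: Message, charset: str) -> dict[str, bytes]:
    """Convert textual content to charset; pass binary content through."""
    out = {name: value.encode(charset) for name, value in msg.text_fields.items()}
    out.update(msg.binary_fields)  # not subject to character conversion
    return out


msg = Message(
    text_fields={"name": "café"},
    binary_fields={"logo": b"\x89PNG"},
)
external = to_external(msg, "latin-1")
print(external["name"])  # b'caf\xe9'
print(external["logo"])  # b'\x89PNG'
```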
The problem with performing the character conversion on the external computing system is that many of these external computing systems, using various character sets, provide only limited or constrained error-handling capabilities. For example, some database systems provide character substitution only, with no hard-error or notification capabilities. This can be extremely problematic where a hard error or notification is required. If the external computing system ignores conversion errors, the data on the external computing system may be corrupted. This may cause the external computing system to behave incorrectly.
Moreover, at a later time, the corrupted data may be sent back to the central computing system. Typically, a standard substitution character is a common character (e.g., “?”) and may go unnoticed by the central computing system. Because the central computing system is unaware of the data corruption, the central computing system stores the corrupt data in its database, overwriting the valid data. This is known as a round-trip error and can significantly compound the problem of corrupt data as other internal and external applications access the corrupt data. The problem is compounded further if the now-corrupt data is subsequently sent to other external computing systems.
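A round-trip error of this kind is easy to reproduce. The sketch below again assumes Shift_JIS as the external character set and silent "?" substitution on the external system; the variable names are illustrative only.

```python
original = "我们"  # valid data held by the central system

# The external system converts with silent character substitution.
stored_externally = original.encode("shift_jis", errors="replace")

# Later, the external system sends its copy back to the central system.
returned = stored_externally.decode("shift_jis")

print(returned)              # 我?
print(returned == original)  # False
```

Because "?" is an ordinary character, the returned text decodes without any error, and nothing in the data itself tells the central system that it differs from what was originally sent.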
Even if the external computing system handles the character conversion errors in an appropriate manner, the external computing system does not provide notification of the error to the central computing system. This means that corrective action (e.g., validation and retransmission, updating a log file, etc.) does not occur.