1. Field of the Invention
The present invention relates to a method and system for exchanging data between programs using different encoding schemes, especially for exchanging data between different platforms using different encoding schemes or codepages.
2. Description of the Related Art
Many client/server applications exchange and share data between different platforms. The platforms may use different codepages, either because they use different encoding schemes (ASCII, EBCDIC, Unicode) or because of national language settings. ASCII stands for American Standard Code for Information Interchange, a code in which each alphanumeric character is represented as a 7-bit binary code, usually stored in an 8-bit byte. ASCII is used by most microcomputers and printers and on the Internet, and because of this, text-only files can be transferred easily between different kinds of computers. For the representation of national language characters, a set of different 8-bit codepages that extend ASCII is defined.
EBCDIC stands for Extended Binary Coded Decimal Interchange Code, an 8-bit binary code for larger IBM computers in which each byte represents one alphanumeric character. Different EBCDIC codepages are defined as well to represent national language characters.
Unicode is a character set that, in its 16-bit form, uses 16 bits (two bytes) for each character, and therefore is able to include more characters than ASCII or EBCDIC. A 16-bit encoding can represent 65,536 characters, and therefore can be used to encode almost all the languages of the world. Unicode includes the ASCII character set within it.
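The difference between the three encoding schemes can be illustrated with a minimal sketch. It assumes Python's built-in codec names, with "cp037" (EBCDIC US/Canada) and "utf-16-be" chosen here only as stand-ins for an EBCDIC codepage and a 16-bit Unicode encoding; the function name `encode_all` is hypothetical.

```python
# Illustrative only: "cp037" and "utf-16-be" are assumed stand-ins for
# an EBCDIC codepage and a 16-bit Unicode encoding.
def encode_all(text: str) -> dict[str, bytes]:
    """Encode the same text under three different encoding schemes."""
    return {
        "ascii": text.encode("ascii"),
        "ebcdic": text.encode("cp037"),
        "unicode16": text.encode("utf-16-be"),
    }


encoded = encode_all("A1")
for name, raw in encoded.items():
    # Print each byte sequence in hexadecimal for comparison.
    print(f"{name:10s} {raw.hex()}")

# Reading the EBCDIC bytes as if they were Latin-1 (an ASCII superset)
# produces different characters, which is the problem a receiving
# program faces when it does not know the sender's codepage.
misread = encoded["ebcdic"].decode("latin-1")
```

The same two characters produce three different byte sequences, so a program that assumes the wrong scheme silently reads different text.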
The burden of detecting and managing different codepages is currently left to the application. Applications which have been developed for one platform (e.g. ASCII UNIX) cannot easily be extended to run in a heterogeneous environment and share data there, for example between AIX/6000 (ASCII) and OS/390 UNIX (EBCDIC). Supporting a heterogeneous environment goes far beyond porting the application.
Furthermore, many applications depend on one encoding scheme (e.g. ASCII) while utilities provided by the operating system require that files contain the data in their native encoding scheme (e.g. OS/390 UNIX System Services expects EBCDIC files).
Porting applications from an ASCII-based platform to an EBCDIC-based platform, such as OS/390, often involves a time-consuming analysis of any character set encoding used within the program itself and in data passed to the program from the user or a file. For data passed into an application from a file, methods are required to recognize whether the file contains encoded characters, and if so, which coded character set was used.
U.S. Pat. No. 5,784,544 describes a data type detection facility for determining the data type of an incoming stream of data. The characters of the data stream are first tested to determine if they are valid characters of one data type (e.g., EBCDIC). A count of the valid characters is obtained. Then, the data stream is assumed to be of another data type (e.g., ASCII), and the characters of the data stream are translated from that data type to the first data type. After the translation, the same test for valid characters is made and another count is obtained. The two counts are then compared to determine the data type of the data stream.
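The counting approach described above can be sketched roughly as follows. This is not the implementation of the cited patent, only a minimal illustration under stated assumptions: "cp037" stands in for EBCDIC, Latin-1 stands in for ASCII, the set of "valid" characters is assumed to be printable ASCII text plus space, tab and newline, and the names `count_valid` and `guess_data_type` are hypothetical.

```python
import string

EBCDIC = "cp037"  # assumed EBCDIC codepage for this sketch

# Assumption: "valid" means letters, digits, punctuation, space, tab, newline.
PRINTABLE = string.ascii_letters + string.digits + string.punctuation + " \t\n"
VALID_EBCDIC = frozenset(PRINTABLE.encode(EBCDIC))


def count_valid(data: bytes) -> int:
    """Count the bytes that are valid characters when read as EBCDIC."""
    return sum(1 for b in data if b in VALID_EBCDIC)


def guess_data_type(data: bytes) -> str:
    """Compare valid-character counts before and after translation."""
    # First count: test the data as-is against the EBCDIC character set.
    count_as_ebcdic = count_valid(data)
    # Second count: assume the data is ASCII, translate it to EBCDIC,
    # and apply the same test again.
    translated = data.decode("latin-1").encode(EBCDIC, errors="replace")
    count_as_ascii = count_valid(translated)
    # Ties go to EBCDIC; this tie-break is an arbitrary choice of the sketch.
    return "EBCDIC" if count_as_ebcdic >= count_as_ascii else "ASCII"


print(guess_data_type("Hello, world".encode("ascii")))
print(guess_data_type("Hello, world".encode(EBCDIC)))
```

With very short input the two counts can tie. For example, the ASCII bytes of "lol" are also valid EBCDIC punctuation, so this sketch would misclassify them.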
This assumption technique may cause the following problems:
   1. The assumption may be incorrect, which would result in a wrong conversion. This is not critical if the data is presented to a human being who is able to ascertain its correctness. For example, if the data is displayed or printed, an incorrect conversion results in an unreadable presentation which can be detected easily. Indeed, printing is mentioned as an implementation example in this patent. The assumption technique is unacceptable if relevant business data is to be processed by another program, because it could result in lost or wrong data. Furthermore, the assumption technique is only applicable if the language or language group (e.g. Latin1=Western European Languages) is known. The described method would not be applicable to distinguish between codepages belonging to the same encoding scheme, for example, between EBCDIC French and EBCDIC Czech. Finally, the assumption technique also requires that a reasonable amount of data is available to be tested. Some implementations check the first 256 characters before making a decision. If only a few characters are available, the method may fail.
   2. Performance: Because a reasonable amount of data has to be inspected before the data can be processed, this method causes some processing overhead.
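The difficulty of distinguishing codepages within the same encoding scheme can be illustrated with a small sketch. It uses two EBCDIC codepages that ship with Python, "cp037" (US/Canada) and "cp500" (International), as assumed stand-ins for the French and Czech codepages mentioned above; the function name `decode_both` is hypothetical.

```python
def decode_both(raw: bytes) -> tuple[str, str]:
    """Decode the same bytes under two different EBCDIC codepages."""
    return raw.decode("cp037"), raw.decode("cp500")


# Three bytes that are valid, printable text in both codepages.
raw = bytes([0xC8, 0x89, 0x4F])
us_text, intl_text = decode_both(raw)

# Both results are printable, so a valid-character count cannot tell
# which codepage the sender actually used.
print(us_text, intl_text)
print(us_text.isprintable(), intl_text.isprintable())
```

Because every byte is a valid character under either codepage, counting valid characters gives the same score for both, and the last character is silently read differently.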
It is therefore an object of the present invention to provide a system and method allowing an improved exchange of data or files coded in different encoding schemes between different programs, each of which uses only one encoding scheme.
It is a further object of the present invention to provide a system and method allowing an improved exchange of data or files within a heterogeneous environment.
Finally, it is an object of the present invention to provide a system and method allowing an improved exchange of data or files without requiring adaptations to the data, to the files, or to the program code itself.