1. Technical Field
This invention relates to the characterization and conversion of data for interprocess exchange. More particularly, the invention relates to establishing the context in which data exchanged between dissimilar (heterogeneous) relational database management systems can be mutually understood and preserved, and data conversions minimized.
2. Description of the Prior Art
Currently there is great interest in joining together multiple database management sites to form a distributed system which provides any user at any site with access to data stored at any other site; see for example, AN INTRODUCTION TO DATABASE SYSTEMS, Vol. 1, by C. J. Date (4th Edition, 1986), at pp. 587-622. Date envisions that each site would constitute an entire database system with its own database management system (DBMS), terminals, users, storage, and CPU.
In a distributed database system such as the type described by Date, the DBMS at any site may operate on a machine type which is different from the machine type of another site. Indeed, there may be as many different machine types as sites. For example, the IBM Corporation (Assignee of this patent application) has DBMSs which operate on System/370 machines, AS/400 machines, and PS/2 machines.
The machines upon which the DBMSs of a heterogeneous database system run all represent information in different internal formats. For example, numeric information on PS/2 machines is stored with the bytes in low order to high order sequence. On other machines, such information may be stored in high order to low order sequence. For floating point information, there are IEEE floating point machines and hexadecimal floating point machines. Character information is processed in many different code representations, the choice of which reflects historical or cultural roots.
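The representation differences described above can be illustrated with a minimal Python sketch. The sample value, the sample text, and the choice of EBCDIC code page 037 (Python codec `cp037`) as the second character code are assumptions made for illustration only; they are not taken from any particular DBMS.

```python
import struct

value = 1000  # 0x000003E8 as a 4-byte integer

# Low-order-to-high-order byte sequence (little-endian).
little = struct.pack("<i", value)
# High-order-to-low-order byte sequence (big-endian).
big = struct.pack(">i", value)

print(little.hex())  # e8030000
print(big.hex())     # 000003e8

# Interpreting the bytes with the wrong byte order yields a different number.
misread = struct.unpack(">i", little)[0]
print(misread != value)  # True

# The same text in two character codes: ASCII and an EBCDIC code page.
text = "A1"
ascii_bytes = text.encode("ascii")
ebcdic_bytes = text.encode("cp037")  # cp037 chosen only as an example
print(ascii_bytes.hex())   # 4131
print(ebcdic_bytes.hex())  # c1f1
```

The byte strings differ in every case even though the logical value is the same, which is why a receiving site must know the representation used by the sender.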
As DBMSs grow and evolve over time, they may be embodied in a series of versions or releases. Each of these may require additional information to be exchanged in a distributed database system. When these changes are introduced, all sites must be informed.
When a database program is written, compiled and executed entirely in one environment (machine and DBMS), it rarely is sensitive to the exact representation of the data which it processes. The data compiled into the program and the data stored in database structures are all represented identically so the operations behave as expected. Thus, a COMPARE command executed in a single database environment can always be made to manipulate data correctly, just by using the high level language operations of the system.
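By way of contrast, the following hedged sketch suggests why a comparison becomes sensitive to representation once data crosses environments. It orders the same hypothetical keys by their encoded byte values under ASCII and under an EBCDIC code page (`cp037`, assumed here purely as an example); the keys and the byte-wise ordering rule are illustrative assumptions.

```python
keys = ["a1", "A1", "11"]  # hypothetical key values

# Order the keys by the bytes each environment would actually compare.
ascii_order = sorted(keys, key=lambda s: s.encode("ascii"))
ebcdic_order = sorted(keys, key=lambda s: s.encode("cp037"))

print(ascii_order)   # ['11', 'A1', 'a1']
print(ebcdic_order)  # ['a1', 'A1', '11']
```

Within either environment alone the ordering is self-consistent; the results disagree only when the two environments are combined.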
Thus, given disparity in machine types and the ever-evolving nature of DBMSs, it is inevitable that a distributed database system will often be heterogeneous in the sense that any site may manage a database by means of a combination of machine and DBMS which is different from the combination at another site.
Provision is made in the prior art for solving the problems of machine and system incompatibility in a distributed, heterogeneous database system. Three solutions are of interest.
The earliest solution may be termed "application beware". This solution usually starts as a connection between identical database systems which grows over time to incorporate some machines which differ slightly from the original. In these solutions, there is no way for the system to automatically handle the differences, with the result that the application program is given this responsibility. If access to heterogeneous databases is needed badly enough, the application is written to make any necessary accommodations.
The second solution utilizes a canonical representation of data. This approach calls for conversion of data into a single, generic (canonical) representation before transport from one database site to another. Superficially, this solves the problem of automating the system to handle differences between differing databases. However, this approach requires many extra conversions which are inefficient, and introduces many conversion errors, making the approach inaccurate. For example, conversion of a floating point number always requires rounding off, with a concomitant loss of accuracy. When converting from one floating point representation to another, say, from IEEE to hexadecimal, precision is lost. In changing from hexadecimal to IEEE, scale is lost. Where character translations are performed, many of the special characters are lost because of lack of equivalence between character codes. In this solution, conversion errors which do occur are introduced at a point in the process far removed from the application. This increases the difficulty of identifying and responding to errors.
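The kinds of loss attributed above to a canonical representation can be sketched in Python under stated assumptions. Here ASCII stands in for a hypothetical canonical character code, EBCDIC code page 037 (`cp037`) for a site's native code, and a 4-byte IEEE format for a hypothetical canonical floating point form narrower than the site's 8-byte form. Hexadecimal floating point is not modeled; the narrowing merely illustrates rounding.

```python
import struct

# Character data: a site holds text containing a special character.
source_bytes = "5¢".encode("cp037")
text = source_bytes.decode("cp037")

# Conversion to the assumed canonical code; the cent sign has no ASCII
# equivalent, so it is replaced with "?".
canonical = text.encode("ascii", errors="replace")
print(canonical)  # b'5?'

# The receiving site converts back to its native code, but the special
# character is already gone.
received_bytes = canonical.decode("ascii").encode("cp037")
print(received_bytes == source_bytes)  # False

# Floating point data: passing through a narrower canonical form rounds.
original = 0.1
through_canonical = struct.unpack(">f", struct.pack(">f", original))[0]
print(through_canonical == original)  # False
```

In this sketch the sending and receiving sites use the same native formats, so no conversion was needed at all; the loss arises solely from the detour through the canonical form.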
The last solution employs a gateway conversion in which a central facility is responsible for matching any database representation to any other. Ideally this reduces the inefficiency, inaccuracy, and error propensity of the canonical representation since conversions can be avoided when they are not needed. However, inter-site communication is lengthy, slow, and expensive. The gateway is a single node to which all inter-site paths connect for all interactions. Instead of a request and response between the two participating sites, there are two requests and two responses for every data transfer. When conversions are required, they are still done in a part of the distributed system which is remote from the application.
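The communications overhead of the gateway approach can be made concrete with a small sketch that simply lists the messages for one data transfer. The site names and the helper functions `direct_exchange` and `gateway_exchange` are hypothetical and serve only to count messages.

```python
def direct_exchange(requester, server):
    """Messages for one transfer directly between two sites."""
    return [
        (requester, server, "request"),
        (server, requester, "response"),
    ]


def gateway_exchange(requester, server, gateway="gateway"):
    """Messages for one transfer when every path runs through a gateway."""
    return [
        (requester, gateway, "request"),
        (gateway, server, "request"),
        (server, gateway, "response"),
        (gateway, requester, "response"),
    ]


direct = direct_exchange("site A", "site B")
via_gateway = gateway_exchange("site A", "site B")
print(len(direct))       # 2
print(len(via_gateway))  # 4
```

Every transfer through the gateway doubles the message count, regardless of whether any conversion is actually required.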
Thus, there is an evident need in distributed, heterogeneous database systems to support effective and accurate exchange of data, while reducing both the number of conversions and the communications overhead. It is also desirable to perform any needed conversion at the site where the data to be converted will be processed.