Data are not universally transportable from one computer to any other computer. Different computers, operating systems, programming languages, and application software often use different native forms or formats for processing and representing data in memory or other stored form. At least four different data representation problems are known: format representation; layout representation; alignment representation; and inheritance representation.
For example, several different formats can be used to represent numbers in a computer processor, floating-point unit, or memory. Some processors represent a numeric value in memory as a sequence of bytes in which the least significant byte is at the lowest memory location. Other processors represent values with the most significant byte at the lowest memory location. One type of processor cannot properly interpret, or directly access and use, values stored in a memory that were created by the other type of processor. This is known as a format representation problem. Examples of such incompatible processors are the SPARC® and VAX® processors. SPARC processors store multiple-byte quantities with the most significant byte first, whereas VAX processors store the least significant byte first.
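The byte-order mismatch can be illustrated with a short sketch. It uses Python's standard `struct` module; the sample value and the variable names are chosen here for illustration and do not come from the text.

```python
import struct

value = 0x0A0B0C0D  # an arbitrary 32-bit unsigned integer

# Most significant byte first (the SPARC-style order described above).
big_endian = struct.pack(">I", value)
# Least significant byte first (the VAX-style order described above).
little_endian = struct.pack("<I", value)

# Reading the big-endian bytes under the little-endian convention
# silently yields a different number rather than an error.
misread = struct.unpack("<I", big_endian)[0]

print(big_endian.hex(), little_endian.hex(), hex(misread))
```

The two images contain the same four bytes in opposite order, so a receiver that assumes the wrong convention decodes a valid but incorrect value.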
Incompatibilities also exist in code generated from different host programming languages, even when generated for the same processor. For example, programming languages such as C and Pascal enable a programmer to express a set of information in a complex abstract data type such as a record or structure, but there is no universal protocol for representing such abstract data types in a computer memory. Thus, in general, a C language program cannot directly read, write, or use data represented by a Pascal language abstract data type. This incompatibility increases the complexity of computer systems and makes data interchange difficult and inefficient. In addition, such abstract data types may include pointers or addresses that direct a compiler or processor to another memory location where a portion of the data of the abstract data type is located. Not all programming languages use or understand pointers. Some programming languages permit a pointer to reference the same abstract data type that contains the pointer. Such "circular references" are not compatible with all languages or platforms and cannot easily be transported over a network.
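One common way to make pointer-linked data transportable is to replace each pointer with the position of its target in a flat list. The sketch below shows that idea for a two-node cycle. The `Node` class and the `linearize` and `rebuild` helpers are hypothetical names introduced only for this illustration, and the technique is a generic one, not a method described in the text.

```python
class Node:
    """A record holding a name and a pointer to another Node."""
    def __init__(self, name):
        self.name = name
        self.next = None

def linearize(root):
    """Flatten a chain of nodes into (name, next_index) records.

    A next_index of -1 stands for a null pointer. Nodes already visited
    are not walked again, so a cycle terminates.
    """
    index_of = {}  # id(node) -> position in `order`
    order = []
    node = root
    while node is not None and id(node) not in index_of:
        index_of[id(node)] = len(order)
        order.append(node)
        node = node.next
    return [(n.name, index_of[id(n.next)] if n.next is not None else -1)
            for n in order]

def rebuild(records):
    """Recreate the nodes and restore pointers from the stored indices."""
    nodes = [Node(name) for name, _ in records]
    for node, (_, nxt) in zip(nodes, records):
        node.next = nodes[nxt] if nxt >= 0 else None
    return nodes[0] if nodes else None

# A circular reference: a -> b -> a
a, b = Node("a"), Node("b")
a.next, b.next = b, a
records = linearize(a)
copy = rebuild(records)
```

The records contain only names and small integers, so they can be sent as a simple linear sequence, and the receiver restores the cycle without ever seeing a memory address.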
Further, different processors may represent a data type of a programming language in different ways. For example, the C language defines a floating-point numeric data type called "float". One processor may represent a float in four bytes while another processor may represent a float in eight bytes. Thus, data created in memory by the same program running on different processors is not necessarily interchangeable. This is known as a layout representation incompatibility.
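A brief sketch of the layout problem follows. It assumes, for illustration only, that the four-byte and eight-byte forms are IEEE 754 single and double precision and that both use little-endian byte order; the text does not specify either detail.

```python
import struct

reading = 1.5

four_byte = struct.pack("<f", reading)   # 4-byte representation
eight_byte = struct.pack("<d", reading)  # 8-byte representation

# A receiver expecting four bytes that is handed the eight-byte image
# reads only the first four bytes and obtains an unrelated value.
misread = struct.unpack("<f", eight_byte[:4])[0]

print(len(four_byte), len(eight_byte), misread)
```

Both images encode the same number, yet neither can be substituted for the other without an explicit conversion step.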
Alignment representation presents yet another problem in data interchange. With some processors, particular values or data types must be aligned at a particular memory location. Certain platforms require that scalar types of certain sizes reside at an address that is a multiple of the scalar size. For example, with SPARC processors, a four-byte quantity (such as a float created by a C language program) must reside at a memory address that leaves no remainder when divided by four, i.e., an address that is a multiple of four. When data is interchanged, there is no assurance that the inbound information uses the alignment required by the computer receiving the information.
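The alignment rule can be expressed as simple arithmetic on offsets. In the sketch below, `align_up` and `is_aligned` are hypothetical helpers, and the record (a one-byte flag followed by a four-byte float) is an assumed example, not one taken from the text.

```python
import struct

def align_up(offset: int, alignment: int) -> int:
    """Return the smallest multiple of `alignment` that is >= `offset`."""
    return (offset + alignment - 1) // alignment * alignment

def is_aligned(address: int, size: int) -> bool:
    """True when `address` leaves no remainder on division by `size`."""
    return address % size == 0

# Assumed record: a one-byte flag followed by a four-byte float.
packed_offset = 1                            # float placed directly after the flag
aligned_offset = align_up(packed_offset, 4)  # float moved to the next multiple of four
padding = aligned_offset - packed_offset     # filler bytes the aligned layout needs

# Python's standard (non-native) struct layout inserts no padding,
# so the packed form of this record occupies five bytes.
packed_size = struct.calcsize("<bf")

print(packed_offset, aligned_offset, padding, packed_size)
```

A sender that uses the packed layout and a receiver that requires the aligned layout therefore disagree about where the float begins, even though both agree on its size and byte order.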
Still another problem is inheritance representation. Certain object-oriented programming languages, such as C++, support the concept of inheritance, whereby an abstract data type may inherit properties of a previously defined abstract data type. Languages that support inheritance provide extra pointer fields in memory representations of abstract data types or classes that use virtual base classes or virtual functions. Such "inheritance pointers" are instantiated by the host language into a class definition at runtime; they are neither an explicit attribute of an abstract data type ("ADT") nor persistent. There is no standard for the representation of these pointers within a class. Because the base classes and functions are defined at runtime, the value of an inheritance pointer is not known until runtime. Therefore, transmission from one system to another of an instance of an abstract data type that inherits properties from another abstract data type is not generally practical.
Character representation is another problem. Computers used in different nations of the world also may use incompatible character sets. For example, in the United States a string of bits may represent the letter "A" whereas in Japan the same string of bits may represent a Katakana character or pictograph. Thus, the same string of bits may have an entirely different meaning depending on the character set in use. Data formatted in one character set cannot be directly used or interpreted by a system that uses a different character set.
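The character set problem can be shown with a single byte. The two encodings used below, EBCDIC code page 037 and Shift JIS, are assumptions chosen for illustration; the text does not name specific character sets.

```python
raw = bytes([0xC1])  # one byte, identical on both systems

# Under EBCDIC code page 037 this byte is the Latin capital letter A.
as_ebcdic = raw.decode("cp037")

# Under Shift JIS the same byte is a half-width Katakana character.
as_shift_jis = raw.decode("shift_jis")

print(as_ebcdic, as_shift_jis)
```

Nothing in the byte itself records which interpretation was intended, so the character set must be conveyed or converted as part of the interchange.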
In a networked computer environment, these problems are more acute. A network may be a "heterogeneous environment" that comprises several different types of computers, platforms, or application programs. A programmer writing software for use in a widely distributed network has no assurance that a destination computer can understand information sent from a source machine. Moreover, many network communication protocols are best suited to the transmission of simple, linear strings of values or characters. Complex abstract data types, especially those with pointers, generally cannot be transmitted reliably over such a network in the same form used to represent the data types in memory. Also, when a pointer points to a large or complex collection of data values, such as a table of a database system, it may be impractical or inefficient to convert the entire table to a universal form for transmission over the network.
The general process of transforming data from a source representation to a uniform target representation is known as "pickling" data. An apparatus, process, or computer program product that can carry out pickling is known as a "pickler."
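As a rough illustration of the term, the sketch below converts a small record into one fixed, platform-independent byte layout and back again. The field set, the big-endian layout, the UTF-8 text encoding, and the function names `pickle_record` and `unpickle_record` are all assumptions made for this example; they do not describe any particular pickler.

```python
import struct

HEADER = ">idH"  # big-endian: 4-byte int, 8-byte float, 2-byte text length

def pickle_record(ident: int, ratio: float, label: str) -> bytes:
    """Convert a record to a uniform target representation."""
    text = label.encode("utf-8")
    return struct.pack(HEADER, ident, ratio, len(text)) + text

def unpickle_record(image: bytes):
    """Rebuild the record from the uniform representation."""
    ident, ratio, length = struct.unpack_from(HEADER, image, 0)
    start = struct.calcsize(HEADER)
    label = image[start:start + length].decode("utf-8")
    return ident, ratio, label

image = pickle_record(7, 0.25, "naïve")
restored = unpickle_record(image)
```

Because the layout fixes byte order, field sizes, padding, and character encoding in advance, the same image can be decoded on any machine that knows the layout, regardless of its native representation.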
One approach to pickling data is provided in D. Craft, "A Study of Pickling Emphasizing C++" (Olivetti Software Technology Laboratory, paper STL 89-2, 1989). However, the approach proposed by Craft has several disadvantages. For example, in the Craft approach the external representation of an abstract data type is identical to its internal representation, unless an application programmer writes encode and decode methods that specify how to use a different external representation. Thus, an application programmer writing code used in a heterogeneous environment is forced to write encode and decode methods for every abstract data type defined by the programmer and for every platform. This is impractical in a highly networked environment or in a complex application program.
Further, Craft provides no way to adapt his pickling process to new types of external platforms. Instead, Craft requires the application programmer to account for differences in format, layout, and alignment representation when writing the encode and decode methods. In short, the Craft approach cannot be adapted easily to new or different platforms.
In addition, the Craft approach fails to efficiently pickle abstract data types that include large data collections, such as database tables, or large arrays.
Another approach is to copy data from an object to an image, byte by byte, including all physical pointers. In this approach, the operating system of the host platform is modified so that when the object is reconstituted from the image, the pointers are valid. However, this approach requires complicated memory page mapping or other adjustment of the operating system, which is undesirable because it may adversely affect other application programs.
There is a need for an arrangement that provides rapid and efficient conversion or transformation of a data object from representation in one or more complex abstract data types to a linearized representation that is efficiently interchangeable among networked computer systems.
There is also a need for an arrangement that can convert a linearized representation of a data object into the original memory representation of the data object, including any complex abstract data types that form part of the object.
There is also a need for an arrangement that provides such data transformation and resolves incompatibilities in format representation, layout representation, alignment representation, and inheritance representation between the original data object and the platform used for transport. There is also a need for an arrangement that provides such data transformation while resolving any circular references in the source data.
There is also a need for an arrangement that provides such data transformation while handling character set transformation.
There is also a need for an arrangement that provides such data transformation while efficiently handling transformation of nested tables and large arrays that form a part of an abstract data type.
There is also a need for an arrangement that provides such data transformation and allows an attribute of a linearized image of a data object to be read or written using the linearized image and without reconstructing the data object.
There is also a need for an arrangement that provides such data transformation in a way that is integrated with and takes advantage of a database system.
There is also a need for an arrangement that provides such data transformation using efficient, powerful and flexible metadata to describe the type, format, location, and other attributes of data that is to be transformed.
There is also a need for an arrangement that provides such data transformation and provides version control, so that the arrangement can convert a data object from one version to another when a linearized image of the data object is reconstituted into the data object.
There is also a need for an arrangement that efficiently provides data transformation for instances of abstract data types that include complex collections of data.
There is also a need for an arrangement that allows the linearized format to be re-configured so that an object can be converted into a format that is native to a new or heterogeneous machine.
There is also a need for an arrangement that can carry out such data transformation on portions of a large data object or portions of a large linearized image of a data object.
There is also a need for an arrangement that can carry out such data transformation in a way that results in an image that is easily stored in a high-performance database system and efficiently transported and used in a complex network of different machines.
There is also a need for an arrangement that can carry out such data transformation efficiently in a homogeneous networked environment, so that unnecessary data conversions are avoided.