Serialization can be defined as the process of storing the state of an object instance to a storage medium. During this process, the public and private fields of an object, together with the name of the class, are converted to a stream of bytes, which is then written to a data stream. When an object is subsequently deserialized, an exact clone of the original object may be created.
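The round trip described above can be illustrated with a minimal sketch. It assumes a hypothetical Person class with three fields and uses JSON as the byte encoding purely for illustration; the names Person, serialize, and deserialize are not taken from any particular system.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Person:
    # Hypothetical class used only for this illustration.
    name: str
    height_cm: int
    weight_kg: int


def serialize(obj) -> bytes:
    """Encode the class name and the field values as a stream of bytes."""
    payload = {"class": type(obj).__name__, "fields": asdict(obj)}
    return json.dumps(payload).encode("utf-8")


def deserialize(data: bytes) -> Person:
    """Rebuild a Person from bytes produced by serialize()."""
    payload = json.loads(data.decode("utf-8"))
    if payload["class"] != "Person":
        raise ValueError("unexpected class: " + payload["class"])
    return Person(**payload["fields"])


original = Person(name="Pat", height_cm=170, weight_kg=60)
stored = serialize(original)   # bytes that could be written to disk or sent
clone = deserialize(stored)    # a distinct object with the same state
```

Here clone compares equal to original but is a separate object in memory, which is the sense in which deserialization yields a clone.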
Consider an object in active computer memory, for example, an object with data describing a person. The person object has a number of subcomponent members, such as name, address, social security number, phone numbers, spouse, height and weight. While the person's name may be important for a particular application, the height and weight may not be. Thus, the name may remain in active memory where it may be modified, while other fields such as height and weight are evicted from active memory to make room for other data. Ultimately, the person object may no longer be needed by the application, and it may be persisted or transmitted to another computer. To persist or transmit an object, the object must be serialized, which refers to formatting an object in a useful, retrievable way.
In the example above, the members of an object, such as the person object, are generally uniform for all objects of the same class. Each person object, for example, has the name, address, social security number, phone numbers, spouse, height and weight members. The information changes from person to person, and for some people the information may be unavailable (“null”), but the same member fields are generally present for all person objects of the person class. As such, a person class may be thought of as the generic person object. A person object is one instance of a person class. This concept of a class and an instance of a class exists in many programming languages. Regardless of the programming language involved, serialization is typically performed on instances of a class, generating serialized objects.
Objects may comprise members with various types of data. The members may be primitive or complex. Examples of primitive members are “string” such as the name member from the person object, which is a string of letters; and “integer,” such as the social security number from the person object, which is an integer. Examples of complex members are “collection,” such as the phone numbers member, which comprises more than one primitive—in this case, more than one integer; “nested,” which is a member that has some structure beyond a simple primitive member, e.g., the collection of phone numbers, or the spouse member, which refers to another person object; and “subtype,” such as a hypothetical “United States address” type that would be a subtype of an address type, and therefore presumably declares additional members such as a U.S. region or U.S. Post Office Box. Members may be described in many different ways, and relate to each other in any number of patterns. Therefore serializing objects such as the person object involves effectively dealing with the various members and the relationships of those members that may be included in the object.
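The member kinds described above can be sketched as follows. The class names, field names, and the member_kind helper are hypothetical and chosen only to mirror the person example; the values are placeholders.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Address:
    street: str
    city: str


@dataclass
class USAddress(Address):
    # Subtype: declares a member beyond those of Address.
    po_box: Optional[str] = None


@dataclass
class Person:
    name: str                                               # primitive: string
    ssn: int                                                # primitive: integer
    phone_numbers: List[int] = field(default_factory=list)  # collection
    address: Optional[Address] = None                       # nested, may be a subtype
    spouse: Optional["Person"] = None                       # nested: another Person


def member_kind(value) -> str:
    """Classify a member value using the categories from the text."""
    if value is None:
        return "null"
    if isinstance(value, (str, int)):
        return "primitive"
    if isinstance(value, list):
        return "collection"
    return "nested"


home = USAddress(street="1 Main St", city="Springfield", po_box="42")
pat = Person("Pat", 123456789, [5550100, 5550101], address=home)
sam = Person("Sam", 987654321, spouse=pat)
```

A serializer for Person has to handle each of these cases, including the fact that the address member may hold a USAddress rather than a plain Address.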
Serialization of objects presents a number of challenges in the industry. Serialized objects should consume as little storage space as possible. If the size of an object is greatly increased when it is serialized, then the storage cost of the object may be too high. Therefore, compact representation is an important aspect of a serialization format.
Serialized objects should also be efficiently instantiated into active memory. If the processing cost of finding and assimilating the various members of a serialized object is high, it will drain valuable processor resources. Likewise, serialization should allow for instantiation and updating of members of an object without the need to instantiate the entire object. Instantiating an entire person object for an operation that involves only the social security number, for example, wastes the active memory resources needed to store the name, phone number, address, etc. when those members are not involved in the operation.
Serialization formats should also support all data types that may be contained in an object. A very basic serialization format might only support primitives, but more sophisticated formats should support complex members such as the nested members, collection members, and subtype members described above. While a serialization format should be optimal for objects with few levels of nesting and inheritance, because most objects have this characteristic, it should also support many levels of nesting and inheritance to ensure that the serialization can be flexibly used for a broad range of classes. A serialization format should also be flexible in handling very large members. Some members may be, for example, a music file, a photograph, or a movie, and such large members pose a challenge in serialization that will be explained in greater detail below.
Previous serialization formats have several notable deficiencies. One such format is known as XML Serialization. XML serialization provides a token for each member. The token comprises metadata that identifies a member, usually a member immediately following the token. Therefore, XML serialization may be visualized as follows: (token 1) Member 1; (token 2) Member 2; (token 3) Member 3; etc.
The problems with such a serialization format are, first, verbosity: the storage of metadata tokens with each and every member consumes a large amount of disk space. Second, retrieval is impaired in such a format, because in order to find a desired member, the tokens must be searched. This may involve a high active memory cost, because the most effective way to read or update an object that is serialized in this manner may be to instantiate the entire object.
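Both problems can be seen in a small sketch of a token-per-member encoding. It uses the standard xml.etree.ElementTree module; the element names and the helper functions to_tagged and read_member are assumptions made for illustration, not the layout of any specific XML serializer.

```python
import xml.etree.ElementTree as ET


def to_tagged(members: dict) -> bytes:
    """Wrap every member value in its own named token (XML element)."""
    root = ET.Element("Person")
    for name, value in members.items():
        ET.SubElement(root, name).text = str(value)
    return ET.tostring(root)


def read_member(data: bytes, name: str) -> str:
    """Find one member; the whole document is parsed before the lookup."""
    root = ET.fromstring(data)   # materializes every member in memory
    node = root.find(name)       # scans the tokens for a matching name
    if node is None:
        raise KeyError(name)
    return node.text


members = {"name": "Pat", "height": 170, "weight": 60}
data = to_tagged(members)

# Bytes occupied by the member values alone, without any tokens.
payload_bytes = sum(len(str(v)) for v in members.values())
```

In this sketch the values occupy 8 bytes, while the tagged form is several times larger, and reading the weight member alone still requires parsing the name and height members.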
Another serialization format is the “Storage Engine record” format, also referred to as the “SE record,” or simply “record” format. This is a typical database system record format. In this serialization format, members for objects of a given class are stored in uniformly formatted records. Instead of providing metadata that describes each and every member, there is metadata that describes the contents of all the records for objects of a particular class. This can be visualized as provided in FIG. 10.
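The idea of class-level metadata shared by uniformly formatted records can be sketched as follows. The layout table, field widths, and function names are hypothetical and are not taken from FIG. 10; the sketch uses Python's standard struct module with fixed-width integer members only.

```python
import struct

# Hypothetical class-level metadata: one layout shared by every Person record.
# Each entry is (member name, struct format code): I = 4 bytes, H = 2 bytes.
PERSON_LAYOUT = [("ssn", "I"), ("height", "H"), ("weight", "H")]
RECORD_FORMAT = "<" + "".join(code for _, code in PERSON_LAYOUT)


def pack_record(values: dict) -> bytes:
    """Store member values back to back, with no per-member metadata."""
    ordered = [values[name] for name, _ in PERSON_LAYOUT]
    return struct.pack(RECORD_FORMAT, *ordered)


def read_member(record: bytes, wanted: str) -> int:
    """Read one member by computing its position from the shared layout."""
    offset = 0
    for name, code in PERSON_LAYOUT:
        if name == wanted:
            return struct.unpack_from("<" + code, record, offset)[0]
        offset += struct.calcsize("<" + code)
    raise KeyError(wanted)


record = pack_record({"ssn": 123456789, "height": 170, "weight": 60})
```

The record holds only the 8 bytes of member data, and a single member can be read without decoding the others, but only by a reader that has access to the layout metadata.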
The SE record serialization format does not require metadata with each individual member, so it is a more compact serialization technique. Instead, it requires access to metadata describing the layout of the members on disk, such as the Metadata for Person Objects table of FIG. 10. A weakness of the SE record format is that it is inflexible in handling members of variable length, such as many of the music files, movies, and images that are stored with objects today. More accurately, flexibility in the SE record serialization comes at a high processing cost. Members of variable length can be stored in such a format, if an offset table is used to identify the locations of variable length data in the record. The consequence of storing an offset table is that whenever a variable length member is updated, the positions of all variable length data that follows it must be adjusted. This can be compared to inserting bytes in the middle of an array—everything to the right of an insert point must be shifted right to make space for inserted new bytes.
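The update cost of an offset table can be sketched as follows. This is a simplified illustration under the assumption that a record is a list of start offsets plus a contiguous byte body; build_record and update_member are hypothetical helpers, not part of any storage engine.

```python
from typing import List, Tuple


def build_record(members: List[bytes]) -> Tuple[List[int], bytes]:
    """Concatenate variable-length members and note where each one starts."""
    offsets: List[int] = []
    body = b""
    for member in members:
        offsets.append(len(body))
        body += member
    return offsets, body


def update_member(offsets: List[int], body: bytes, index: int,
                  new_value: bytes) -> Tuple[List[int], bytes]:
    """Replace one member; every later member and its offset must move."""
    start = offsets[index]
    end = offsets[index + 1] if index + 1 < len(offsets) else len(body)
    delta = len(new_value) - (end - start)
    new_body = body[:start] + new_value + body[end:]   # shifts the tail
    new_offsets = offsets[:index + 1] + [o + delta for o in offsets[index + 1:]]
    return new_offsets, new_body


offsets, body = build_record([b"Pat", b"12 Elm St", b"photo-bytes"])
new_offsets, new_body = update_member(offsets, body, 0, b"Patricia")
```

Growing the first member by five bytes changes the offsets from [0, 3, 12] to [0, 8, 17], so the two members that follow it are both relocated even though their contents did not change.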
Further, various storage formats have been designed to allow users of databases to efficiently store objects within a database. These storage formats can be better supported with a more flexible serialization format, but they should be distinguished from the serialization format provided herein. For example, U.S. patent application Ser. No. 10/692,225, titled “system and method for object persistence in a database store,” is directed to allowing a user to ‘import’ classes and methods written in an object oriented language like C# into a database. It further allows a user to store C# objects in a database and to invoke methods on the objects. It provides multiple flavors of persistence to a user. A user can define his own serialization format, use Common Language Runtime (“CLR”) serialization (provided by the C# language itself), or let the SQL server store an object in its own format. These options, particularly the latter, provide a performance advantage, as MICROSOFT SQL SERVER® can retrieve or update some fields of an object without actually instantiating a C# object. Of course, some operations, such as method invocation, still require instantiation of a C# object.
Similar background and related technology descriptions may be found in U.S. patent application Ser. No. 10/692,227, titled “System and Method for Storing and Retrieving a Field of a User Defined Type Outside of a Database Store.” This application discusses filestreams in UDTs, which may be serialized according to the techniques described herein. Such advanced database technologies can benefit from a more flexible and higher performance serialization format. Likewise, improved techniques for performing operations on serialized objects would better support such advanced database technologies.
The trade-offs involved in serialization formats are thus the on-disk storage overhead of the format's metadata, versus the active memory overhead of locating a member, versus the processing cost of locating a member, versus the cost of performing an update, versus flexibility in handling large fields. In light of these trade-offs, there is an ongoing and heretofore unaddressed need in the industry to raise the bar with respect to serialization techniques.