Serialization of data structures (also called linearization or marshalling) means converting a more or less arbitrary data structure to a string of bytes (or words), such that the bytes can be, for example, written to a file, stored in a database, sent over a network to another computer, migrated, or shared in a distributed object system. The bytes contain an encoding of the data structure such that it can later be read in (possibly in a different computer or a different program) and the original data structure restored.
Serialization is readily available in some programming languages or run-time libraries, including Java and C#. Many serialization implementations only support non-cyclic data structures; however, some support arbitrary cyclic or shared data structures and preserve any sharing. The serialization result can be either text (e.g., XML) or binary data. Clearly, the serialization result (typically a file) can also be compressed using any known compression algorithm suitable for compressing files or data streams.
The term serialization is also frequently used to refer to the synchronization of operations in concurrent programs; that meaning is completely different from the one used herein. The term linearization has a separate meaning in garbage collection, where it refers to relocating objects that reference each other so that they reside at nearby memory addresses, in order to improve cache locality.
A fairly detailed example of serializing cyclic objects is provided in Dongmei Gao: A Java Implementation of the Simple Object Access Protocol, MSc Thesis, Florida State University, 2001, where an algorithm for serializing a cyclic data structure into XML format for use in an RPC implementation is described in Chapter 3, which is hereby incorporated herein by reference.
General information on various data compression methods can be found in the book I. Witten et al: Managing Gigabytes: Compressing and Indexing Documents and Images, 2nd ed., Morgan Kaufmann, 1999.
Huffman coding (Huffman, D.: A method for the construction of minimum-redundancy codes, Proc. Inst. Radio Engineers 40(9):1098-1101, 1952) is an old and widely known data compression method. In general-purpose compression applications it has long since been surpassed by more modern compression techniques, such as arithmetic coding, Lempel-Ziv, Lempel-Ziv-Welch, LZ-Renau, and many other methods.
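As an illustration only, classic static Huffman coding can be sketched in Python as follows. This is a minimal sketch of the textbook two-pass method, not of any of the dynamic variants cited herein; the function names huffman_code, encode, and decode are hypothetical, and bit strings are represented as Python text strings for readability rather than as packed bits.

```python
import heapq
from collections import Counter

def huffman_code(data):
    """Return a dict mapping each byte value in `data` to a bit string."""
    freq = Counter(data)
    if not freq:
        return {}
    if len(freq) == 1:
        # A single distinct symbol still needs a one-bit code.
        return {next(iter(freq)): "0"}
    # Heap entries are (weight, tiebreak, tree); a tree is either a byte
    # value (leaf) or a (left, right) tuple. The unique tiebreak ensures
    # that trees themselves are never compared.
    heap = [(w, i, sym) for i, (sym, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)  # two least frequent subtrees
        w2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, tiebreak, (t1, t2)))
        tiebreak += 1
    codes = {}

    def walk(tree, prefix):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            codes[tree] = prefix

    walk(heap[0][2], "")
    return codes

def encode(data, codes):
    """Concatenate the code of every byte in `data`."""
    return "".join(codes[b] for b in data)

def decode(bits, codes):
    """Invert `encode`; relies on the code being prefix-free."""
    inverse = {code: sym for sym, code in codes.items()}
    out, current = bytearray(), ""
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return bytes(out)
```

Because the code table is derived from symbol frequencies over the whole input, this static form needs the frequency distribution in advance, which is the limitation that dynamic Huffman coding schemes address.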
Several variations of Huffman coding exist for compressing dynamic data streams, i.e., data streams where the frequency distribution of the various tokens to be compressed is not known a priori or may change dynamically with time, even during the compression of a single data stream. Examples of dynamic Huffman coding schemes include J. Vitter: Design and Analysis of Dynamic Huffman Codes, J. ACM, 34(4):825-845, 1987; Y. Okada et al: Self-Organized Dynamic Huffman Coding without Frequency Counts, Proceedings of the Data Compression Conference (DCC '95), IEEE, 1995, p. 473; D. Knuth: Dynamic Huffman coding, J. Algorithms 6:163-180, 1985; and R. Gallager: Variations on a theme by Huffman, IEEE Trans. Inform. Theory, IT-24:668-674, 1978. A particularly fast dynamic Huffman coding method that uses periodic regeneration of the Huffman coding trees and precomputation to speed up operations was disclosed in the U.S. patent application Ser. No. 12/354,835 by the same inventor, which is incorporated herein by reference.
There are a number of known data formats and protocols that support compression. Compressed image and video formats, such as TIFF, GIF, PNG, JPEG, and MPEG, encode many rows of pixels using compression, sometimes combining information from multiple rows (e.g. by using the discrete cosine transform) or multiple frames. IP Compression (RFC3173), PPP Compression Control Protocol (RFC1962), and IP header compression (RFC5225) are examples from the data communications field. The ITU-T X.691 (ASN.1 Packed Encoding Rules) is a standard for compact encoding of data.
HDF5 (Hierarchical Data Format 5) is a data format for large data sets; it is widely used in, for example, scientific computing and visualization, and for other large data sets such as stock market data or network monitoring data. It supports filters on data fields, including filters that perform compression. The HDF5 User's Guide mentions compression in many places, and pages 115-154 specifically discuss filters and compression.
A number of companies have developed compact encodings for stock market data. The NxCore product from DTN/Nanex is one; however, no description of its data format is available. A detailed description of one stock market data encoding method is given in US patent application 20060269148.
In database systems, individual records or individual fields can be compressed, and the best type of compression to apply to a field can be detected automatically. An example is described in U.S. Pat. No. 5,546,575. A similar feature is called “Row Compression” in the DB2 database system.
In some Lisp systems and other programming environments, packed representations are used for data structures in memory. D. Bobrow and D. Clark: Compact Encodings of List Structure, ACM Transactions on Programming Languages and Systems, 1(2):266-286, 1979, describes several space-efficient encodings for list data structures (it should be noted that they use the term linearization in its garbage collection meaning). Other references to compactly representing data structures include P. Sipala: Compact Storage of Binary Trees, ACM Transactions on Programming Languages and Systems, 4(3):345-361, 1982; Jon White: Address/memory management for a gigantic LISP environment or, GC considered harmful, Conference on LISP and Functional Programming, ACM, 1980, pp. 119-127; Z. Shao et al: Unrolling lists, Conference on LISP and Functional Programming, ACM, 1994, pp. 185-195; Martin Elsman: Type-Specialized Serialization with Sharing, in Sixth Symposium on Trends in Functional Programming (TFP '05), Tallinn, Estonia, September 2005; R. van Engelen et al: Toward Remote Object Coherence with Compiled Object Serialization for Distributed Computing with XML Web Services, Workshop on Compilers for Parallel Computing (CPC), 2006, pp. 441-455; and M. Philippsen and B. Haumacher: More Efficient Object Serialization, in Parallel and Distributed Processing, Springer, 1999, pp. 718-732.
These applications are, however, different from serializing arbitrary cyclic and/or shared data structures of a program into an external representation from which the same data structures can later be read back into memory. Serialization as used here is an automatic process, where the application generally only specifies the data structure to be serialized and gets back a string of bytes (possibly written directly to a file or a communications socket). While some languages, such as Java, allow specifying custom serialization functions for object classes, the process is still driven automatically, with the serialization system traversing the object graph, detecting cycles and sharing, and encoding the objects such that they can be restored. The function for deserializing an object graph is generally given a string (or a file or communications socket from which the data is read), and returns a data structure without requiring further interaction with the application. Serialization as used here is thus a rather different operation from image or video compression, the compression of IP packets, the compression of stock market data, or improving locality during garbage collection.
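The automatic traversal described above, including detection of cycles and sharing, can be illustrated with a minimal Python sketch. It is a generic illustration under stated assumptions, not a method taken from any of the cited references: the graph is assumed to consist only of dicts with string keys, lists, and JSON-compatible scalars; the serialize and deserialize names and the "ref"/"val" node layout are hypothetical; and the recursive traversal would need to be made iterative for very deep structures.

```python
import json

def serialize(root):
    """Encode a graph of dicts/lists/scalars into bytes, keeping sharing."""
    ids = {}    # id(container) -> index assigned on first visit
    nodes = []  # encoded containers, in order of first visit

    def visit(obj):
        if isinstance(obj, (dict, list)):
            key = id(obj)
            if key in ids:
                # Already seen: a cycle or a shared object. Emit a reference.
                return {"ref": ids[key]}
            index = len(nodes)
            ids[key] = index
            nodes.append(None)  # reserve the slot before descending
            if isinstance(obj, dict):
                body = {"dict": [[k, visit(v)] for k, v in obj.items()]}
            else:
                body = {"list": [visit(v) for v in obj]}
            nodes[index] = body
            return {"ref": index}
        return {"val": obj}

    top = visit(root)
    return json.dumps({"root": top, "nodes": nodes}).encode("utf-8")

def deserialize(data):
    """Rebuild the graph; each encoded container is created exactly once."""
    doc = json.loads(data.decode("utf-8"))
    nodes = doc["nodes"]
    # Pass 1: allocate empty containers so references can point at them.
    objs = [{} if "dict" in node else [] for node in nodes]

    def resolve(enc):
        return objs[enc["ref"]] if "ref" in enc else enc["val"]

    # Pass 2: fill the containers, resolving references to shared objects.
    for obj, node in zip(objs, nodes):
        if "dict" in node:
            for k, v in node["dict"]:
                obj[k] = resolve(v)
        else:
            obj.extend(resolve(v) for v in node["list"])
    return resolve(doc["root"])

# Usage: a two-object cycle plus a list shared by two fields.
a = {"name": "a"}
b = {"name": "b", "peer": a}
a["peer"] = b
shared = [1, 2]
restored = deserialize(serialize({"first": a, "x": shared, "y": shared}))
```

In the restored graph, following "peer" twice from the first object leads back to the same object, and the "x" and "y" fields refer to one list rather than two copies. The application supplies only the root object; the traversal and the reference bookkeeping happen without further interaction.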
There are applications, such as large knowledge-based systems, where the data structures to be serialized are extremely large, and may grow to billions of objects in the near future. Such data structures also tend to be cyclic and have extensive sharing. Very fast and memory-efficient serialization methods will be needed for serializing such data structures.
For example, consider loading knowledge into a future knowledge-intensive application during startup. Such applications may use knowledge bases of several terabytes, and may run on computers with tens or hundreds of gigabytes of main memory, and may require tens or hundreds of gigabytes of data to be loaded into main memory before they can operate at full performance.
Loading such amounts of data from persistent storage into an application's memory can be quite time-consuming, especially if the loading is done over a communications network. For example, consider a computing cluster with a thousand computational nodes, each node loading 100 gigabytes of knowledge into its memory. The aggregate amount of data is 100 terabytes; transmitting this over a network or out of a database at 10 gigabits per second would take 80,000 seconds, or over 22 hours, just for the system to start up. Even just reading 100 gigabytes from current disks takes many minutes.
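The transfer-time estimate above can be reproduced with a short calculation. The following Python sketch assumes decimal units (1 gigabyte = 10^9 bytes) and a single shared 10 gigabit per second link with no protocol overhead; the function name transfer_seconds is hypothetical.

```python
def transfer_seconds(nodes, gigabytes_per_node, link_gigabits_per_second):
    """Time to push the aggregate data through one shared link."""
    total_bits = nodes * gigabytes_per_node * 10**9 * 8
    return total_bits / (link_gigabits_per_second * 10**9)

seconds = transfer_seconds(1000, 100, 10)
hours = seconds / 3600
print(f"{seconds:.0f} s, about {hours:.1f} h")
```

With these figures the aggregate is 8 x 10^14 bits, so the result is 80,000 seconds, a little over 22 hours.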
In such systems, it is important to compress the data, but since every node will need to also decompress the 100 gigabytes of data, decompression will need to be extremely fast.
If the 100 gigabytes represents 5 billion objects, at a mere 100 nanoseconds per object (which is probably highly optimistic) the decoding would take 500 seconds of CPU time, which is a long time to start up the application. It is thus important to be able to decode data very quickly even if the communication bottleneck is solved using other approaches. No known encoding/compression method is fast enough.
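The decoding-time estimate above follows from a simple product, sketched here in Python. The per-object cost of 100 nanoseconds and the object count are the figures from the example; the function name decode_seconds is hypothetical, and the average object size is derived assuming 1 gigabyte = 10^9 bytes.

```python
def decode_seconds(objects, nanoseconds_per_object):
    """CPU time to decode all objects at a fixed per-object cost."""
    return objects * nanoseconds_per_object / 10**9

objects = 5 * 10**9
cpu_seconds = decode_seconds(objects, 100)
bytes_per_object = 100 * 10**9 / objects  # average size implied by the example
print(f"{cpu_seconds:.0f} s of CPU time, {bytes_per_object:.0f} bytes per object")
```

The example thus implies an average of 20 bytes per object, and 500 seconds of CPU time on each node even if the data is already available locally.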
The example illustrates how important it is for such applications to be able to serialize large data structures into a compact format that can be decoded extremely quickly. Furthermore, since the data sets also need to be updated regularly, generating such compressed data sets must be fast.