General information on various data compression methods can be found in the book I. Witten et al: Managing Gigabytes: Compressing and Indexing Documents and Images, 2nd ed., Morgan Kaufmann, 1999.
Huffman coding (Huffman, D.: A method for the construction of minimum-redundancy codes, Proc. Inst. Radio Engineers 40(9):1098-1101, 1952) is an old and widely known data compression method. In general-purpose compression applications it has long since been surpassed by more modern compression techniques, such as arithmetic coding, Lempel-Ziv, Lempel-Ziv-Welch, LZ-Renau, and many other systems.
Several variations of Huffman coding exist for compressing dynamic data streams, i.e., data streams where the frequency distribution of the various tokens to be compressed is not known a priori or may change dynamically with time, even during the compression of a single data stream. Examples of dynamic Huffman coding schemes include J. Vitter: Design and Analysis of Dynamic Huffman Codes, J. ACM, 34(4):825-845, 1987; Y. Okada et al.: Self-Organized Dynamic Huffman Coding without Frequency Counts, Proceedings of the Data Compression Conference (DCC '95), IEEE, 1995, p. 473; D. Knuth: Dynamic Huffman coding, J. Algorithms 6:163-180, 1985; and R. Gallager: Variations on a theme by Huffman, IEEE Trans. Inform. Theory, IT-24:668-674, 1978.
Splay trees (D. Jones: Application of splay trees to data compression, Communications of the ACM, 31(8):996-1007, 1988) have also been tried as an alternative to dynamic Huffman coding, but according to Okada et al. they do not achieve a good compression ratio.
A problem with existing dynamic Huffman coding schemes is that they modify the coding tree every time a token is encoded or decoded. Since dynamic Huffman coding schemes typically utilize tree data structures, modifying them is fairly expensive. Because the codes change constantly, decoding optimizations are also difficult, more or less forcing the decoder to operate one bit at a time.
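The instability of the codes can be illustrated with a minimal sketch. It is not any of the cited algorithms: as a simplifying assumption it rebuilds a static Huffman code from the running token counts after every token and tracks only code lengths, not the cost of updating a tree in place. The names `huffman_code_lengths` and `count_code_changes` are hypothetical.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_code_lengths(freqs):
    """Return {token: code length in bits} for a static Huffman code."""
    if len(freqs) == 1:
        # A single token still needs one bit.
        return {tok: 1 for tok in freqs}
    tiebreak = count()  # unique counter so the heap never compares dicts
    heap = [(f, next(tiebreak), {tok: 0}) for tok, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, d1 = heapq.heappop(heap)
        f2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every token in them one level deeper.
        merged = {tok: depth + 1 for tok, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (f1 + f2, next(tiebreak), merged))
    return heap[0][2]


def count_code_changes(stream):
    """Count the tokens after which some already-seen token's code length changed."""
    freqs = Counter()
    previous = {}
    changes = 0
    for tok in stream:
        freqs[tok] += 1
        current = huffman_code_lengths(freqs)
        if any(previous[t] != current[t] for t in previous):
            changes += 1
        previous = current
    return changes


if __name__ == "__main__":
    print(count_code_changes("abracadabra"))
```

Any lookup table that a decoder precomputes from the current codes becomes stale at each such change, which is why table-driven decoding is hard to combine with per-token updates.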
There are applications requiring a dynamic compressor with very high speed. One example is loading knowledge into a knowledge-intensive application during startup. Such applications may use knowledge bases of several terabytes and run on computers with tens or hundreds of gigabytes of main memory, and they may require tens or hundreds of gigabytes of data to be loaded into main memory before they can operate at full performance. Loading such amounts of data from persistent storage into the application's memory can be quite time-consuming, especially if the loading is done over a communications network. For example, consider a computing cluster with a thousand computational nodes, each node loading 100 gigabytes of knowledge into its memory. The aggregate amount of data is 100 terabytes; transmitting this over a network or out of a database at 10 gigabits per second would take 80000 seconds, or over 22 hours, just for the system to start up. Even just reading 100 gigabytes from current disks takes many minutes. In such systems, it is important to compress the data, but since every node will also need to decompress its 100 gigabytes of data, decompression will need to be extremely fast.
If the 100 gigabytes represent 5 billion objects, then at a mere 100 nanoseconds per object (which is probably highly optimistic) the decoding would take 500 seconds of CPU time, which is a long time to start up the application. Likewise, since the data set will likely need to be updated frequently, encoding speed is important; otherwise it will take hours or days to write a new data set.
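The startup figures quoted above can be checked with a short calculation. This is only a sketch: it assumes decimal units (1 gigabyte = 10^9 bytes, 1 gigabit = 10^9 bits) and ignores protocol overhead, and all constant names are hypothetical.

```python
# Cluster-wide transfer time for the example above.
NODES = 1_000
BYTES_PER_NODE = 100 * 10**9          # 100 gigabytes per node (decimal units assumed)
LINK_BITS_PER_SECOND = 10 * 10**9     # 10 gigabits per second

aggregate_bytes = NODES * BYTES_PER_NODE              # 100 terabytes
transfer_seconds = aggregate_bytes * 8 / LINK_BITS_PER_SECOND
transfer_hours = transfer_seconds / 3600

# Per-node decoding time at a fixed cost per object.
OBJECTS_PER_NODE = 5 * 10**9
NANOSECONDS_PER_OBJECT = 100
decode_seconds = OBJECTS_PER_NODE * NANOSECONDS_PER_OBJECT / 10**9

# Average object size implied by the two figures.
bytes_per_object = BYTES_PER_NODE / OBJECTS_PER_NODE

if __name__ == "__main__":
    print(f"transfer: {transfer_seconds:.0f} s ({transfer_hours:.1f} h)")
    print(f"decode:   {decode_seconds:.0f} s per node")
    print(f"average object size: {bytes_per_object:.0f} bytes")
```

Under these assumptions the average object is only 20 bytes, so a budget of 100 nanoseconds per object leaves about 5 nanoseconds per byte of uncompressed data.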
Such data sets need to be processed in a single pass, and their token frequency distributions are not known a priori. A dynamic compression scheme is thus needed.
No known compression scheme meets these requirements with regard to performance.