The present invention relates to the compression of computer data readable by a computer system, and in particular methods of compressing a collection of data blocks that are predefined, relatively independent portions of data. More particularly still, the present invention relates to compressing a collection of data blocks such that the blocks remain independently decompressible, i.e., randomly accessible, while utilizing similarities among the collection of data blocks to achieve better compression.
In many distributed environments, many data files, and hence significant amounts of data, are regularly transferred between different computer systems. In general, some transfer connections are faster than others, but it is well known that it takes less time to transfer smaller portions, blocks or files of data than larger ones. For this reason and others, many efforts have been made to reduce file sizes, such as through the use of compression techniques.
Compression relates to the conversion of one set or string of data to another, smaller set or string of data. Importantly however, the second, shorter set of data represents the same content as the first, longer set such that the shorter string may be generally xe2x80x9cdecompressedxe2x80x9d back into the original string. Typical compression algorithms capitalize on redundancy to compress data. In essence, these algorithms replace redundant or often occurring phrases or portions of data with other, shorter versions. In order to accommodate such replacements, some portions of data that are not used often may be replaced with somewhat longer versions of data. However, this tradeoff is acceptable in many scenarios due to the overall or net size savings in reducing redundancy.
One common technique in compressing data relates to the use of Huffman coding. Huffman coding requires an analysis of data to determine probabilities of redundancy for the data. Upon determining relative probabilities for the data, variable length codes may be assigned to the various data phrases. Often, the variable length code assignments may be represented in the form of a xe2x80x9ccontext,xe2x80x9d which, in this case, the context is a Huffman tree. Contexts essentially provide a decompressor with the necessary information to decode, or decompress a compressed stream. Huffman trees and other variable-length contexts may be used in combination with other compression methods as well, such as xe2x80x9cLZ-77xe2x80x9d compression techniques, where literals and pointers may be assigned variable length codes as dictated by the context or Huffman tree.
In operation, there are essentially two ways in which contexts are used. One method relates to the creation of a generic, or stock context that may be used with all sets of data. For instance, a stock, English-alphabet context may be created which assigns variable length codes to letters of the alphabet based on a generic cross-sectional analysis of various documents. Such an analysis may provide short codes to letters xe2x80x9cexe2x80x9d and xe2x80x9caxe2x80x9d while assigning longer codes to letters xe2x80x9czxe2x80x9d and xe2x80x9cqxe2x80x9d since the former symbols tend to occur more often than the latter symbols. These stock contexts may then be incorporated into a decompressor module and used to decompress any document having English letters. A small identifier sequence may be provided with the compressed data indicating that the data is in English such that the decompressor recognizes which stock context to use. One benefit to the stock context is that only a small portion of code is attached to the compressed code itself such that little overhead is lost in providing information to the decompressor about the type of context, e.g., which Huffman tree to use in decompressing the data. Further, since the decompressor is familiar with the stock context no time or space is lost in communicating either the context itself or how to use the context to the decompressor.
Unfortunately however, one drawback in using stock contexts or Huffman trees relates to the fact that each stock context is based on a generic analysis of many different documents or streams of data. These other documents are typically significantly different from the actual document being compressed and decompressed using the stock context. Consequently, the actual compression ratio is far from optimal, i.e., the resulting compressed data stream is actually larger than it needs to be, and often, the resulting compressed data stream is unsatisfactorily large.
One solution to the problems associated with using stock contexts or Huffman trees relates to using customized contexts for each stream of data being compressed. Customized contexts are created by a method that analyzes the actual data stream or document to be compressed and determines the best use of variable length codes to provide the smallest compressed stream of data for that data stream. For instance, a document written in the English language may frequently use the letter xe2x80x9czxe2x80x9d and infrequently use the letter xe2x80x9caxe2x80x9d such that the customized context, e.g., Huffman tree may assign a relatively small code to the letter xe2x80x9czxe2x80x9d and relatively large code to the letter xe2x80x9ca.xe2x80x9d In such a case, the resulting compressed stream would be smaller than if the stock context described above was used.
Unfortunately however, customized contexts suffer some drawbacks as well. In particular, when a customized context is created and used, the context itself must be transferred to the decompressor. Typically, the context is appended to the front or back of the compressed stream of data. Attaching the context in this manner increases the overall size of the data to be transmitted, which may increase transmission time. This phenomenon is particularly important when the original stream of data is relatively small. In such a case, the combination of a customized context and the compressed data stream may actually be larger than the original data stream. Stated another way, the compression ratio, where the compression ratio equals the size of the output data stream divided by the size of the input data stream, may be greater than 1 when the customized context is appended to an output data stream for a relatively small input data stream.
Typically, when compressing many smaller blocks of data, concatenating the smaller blocks or objects and compressing them as one decreases the compression ratio. Although compressing the blocks as a single unit decreases the compression ratio, this technique suffers a significant drawback. In particular, in order to access a single one of the blocks of data, the entire concatenated set of blocks must be decompressed. Hence, compression, transfer and decompression time may take too long. Stated another way, there is no way to randomly access a block. For instance, if the eleventh block is needed, the first ten blocks must be decompressed before accessing the eleventh block. This situation is particularly true when the compression algorithm uses an LZ-77-type technique involving interleaved pointers and literals, where the pointers point to redundant elements located in the separate blocks. Consequently, all blocks must be decompressed prior to accessing any one block, i.e., there is no random accessibility.
It is with respect to these and other considerations that the present invention has been made.
The present invention relates to a system and method that provides a unique or custom context for a plurality of blocks of data, yet compresses the blocks independently from the others, such that each block is independently decompressible. The method analyzes a collection of blocks for compression, and computes a unique context, such as a Huffman tree given the distribution of symbols or phrases across all the blocks in the collection. The unique context is used to decompress the blocks and may be referred to as a xe2x80x9cshared decompression context.xe2x80x9d Consequently, rather than the decompressor having a fixed or known stock context for all blocks, the collection-specific, shared decompression context is included in an initial transfer and this shared decompression context is used to decompress the blocks. Each block in the collection can be independently decompressed, in any order, using the shared decompression context. Providing a context based in part on the specific symbol-distribution characteristics of each block improves compression ratios over the use of a single, global, stock context because the context codes are computed across a distribution of symbols unique to a particular collection, not a generic cross-section. Additionally, creating a shared context is generally more efficient than computing and encoding a separate context for each independent block, especially when the blocks are relatively small.
The present invention involves any data compressor where a collection of blocks needs to be independently decompressible, i.e., decompressible in any order which is often referred to as providing xe2x80x9crandom access.xe2x80x9d The blocks of the collection share some commonality in their composition such that a common decompression context, whether using Huffman trees or some other coding provides useful compression characteristics. Therefore, in accordance with certain aspects, the present invention may be implemented using Huffman coding with or without LZ-77 compression methods. Alternatively, the present invention may use other coding techniques.
In accordance with certain aspects, the present invention relates to a system and method that compresses a plurality of blocks of data, wherein the blocks are first analyzed to determine unique characteristics of the set of blocks. For instance, the blocks are evaluated based on symbol distribution and probabilities of occurrence, e.g., using variable-size coding such as Huffman coding and/or LZ-77 type analysis using literals and pointers. Upon completion of the analysis phase, a shared context is dynamically created for the plurality of blocks. Next, each block is compressed in a manner such that each compressed representation of a block, i.e., each compressed block, can be decompressed independently of the other blocks. The blocks may be combined into a collection and the shared context would therefore be associated with the blocks of the collection. The shared context is generally separable and independently transferable from the blocks.
In accordance with other aspects, the present invention relates to a system and method for decompressing compressed blocks of data wherein the compressed blocks of data are associated with one or more other blocks of data in a set or collection of data blocks. Besides accessing the one or more blocks, a shared context is also accessed, wherein the shared context has decompression information for the one or more blocks of compressed data as well as one or more other associated blocks of data. The one or more blocks of compressed data are then decompressed. However, not all of the associated blocks that relate to the same shared context need to be decompressed to decompress any one block of data. The decompression may occur on client computer system, remote from a server computer system and the shared context may be transferred to the client computer system independently of the blocks of data.
In accordance with other aspects, the present invention relates to a computer system having a compression module for compressing a set uncompressed data blocks. The compression module has an analysis module that analyzes the blocks for compression characteristics and dynamically creates a shared context. The compression module further has a build module for compressing the blocks using the shared context. The computer system may further have a receive module for receiving the uncompressed data and a send module for sending information to a second computer system. The send module sends either the whole shared context or just a shared decompression context to the second computer system and may also send one of the compressed blocks of a data to the second computer system for decompression. In an embodiment, the second computer system decompresses at least one block of compressed data independently of the other blocks in the collection.
The invention may be implemented as a computer process, a computing system or as an article of manufacture such as a computer program product. The computer program product may be a computer storage medium readable by a computer system and encoding a computer program of instructions for executing a computer process. The computer program product may also be a propagated signal on a carrier readable by a computing system and encoding a computer program of instructions for executing a computer process.
A more complete appreciation of the present invention and its improvements can be obtained by reference to the accompanying drawings, which are briefly summarized below, to the following detail description of presently preferred embodiments of the invention, and to the appended claims.