The present disclosure relates to the fields of data compression and/or system management, and more specifically to a method and system of content-based dynamic data compression.
A data set may be compressed before transmission across one or more networks. Data compression generally reduces the size of the data set and may therefore also reduce the transmission time and the amount of network bandwidth that is used. As a non-limiting example, the data set may be a data file created by an application program.
Data compression generally works by reading one or more uncompressed symbols from an uncompressed data set and encoding the one or more uncompressed symbols into one or more compressed symbols in a compressed data set. The compressed data set may be smaller in terms of the total number of bits required to store the compressed data set in comparison to the total number of bits required to store the uncompressed data set.
The compressed data set may be decoded to reproduce the uncompressed data set. If the decoding results in a perfect reproduction of the uncompressed data set, then the compression technique is said to be ‘lossless’. If the decoding results in a non-perfect reproduction of the uncompressed data set, then the compression technique is said to be ‘lossy’. As a non-limiting example, lossy compression may be desirable if the imperfections introduced by compression are acceptable and result in an additional size reduction. The JPEG standard used to compress images is an example of a lossy compression technique where some loss of image quality may be unnoticeable and lossy compression may result in a smaller compressed data set.
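As a non-limiting illustration of the distinction, the following minimal Python sketch applies a simple lossy scheme to 8-bit sample values. The function names `quantize` and `dequantize` and the step size of 16 are assumptions chosen for illustration and are not taken from the JPEG standard. Each sample is reduced from 8 bits to 4 bits, and decoding reproduces each sample only approximately.

```python
def quantize(samples, step=16):
    """Lossy encode: keep only which step-wide bucket each sample falls in."""
    return [s // step for s in samples]


def dequantize(levels, step=16):
    """Decode: reconstruct each sample as the midpoint of its bucket."""
    return [q * step + step // 2 for q in levels]


samples = [3, 18, 200, 255]        # 8-bit values, 0..255
levels = quantize(samples)         # [0, 1, 12, 15], each fits in 4 bits
restored = dequantize(levels)      # [8, 24, 200, 248], close but not identical
```

Because `restored` differs from `samples`, this scheme is lossy. A lossless technique would reproduce `samples` exactly.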
The one or more compressed symbols in the compressed data set may represent an individual uncompressed symbol or one or more control symbols. As non-limiting examples, the one or more control symbols may be a dictionary reference or a decoder instruction. The dictionary reference may point to an entry in a dictionary that is built during the compression process, during the decompression process, or both. As non-limiting examples, the dictionary may track the individual uncompressed symbols and/or sequences of such symbols that have appeared earlier in the uncompressed data set. The individual uncompressed symbols and/or sequences of such symbols appearing in the dictionary may then be represented in the compressed data set by the dictionary reference.
The decoder instruction may be a direction for the decoder. As non-limiting examples, the decoder instruction may direct the decoder to repeat a symbol for a number of occurrences, to insert a symbol that is located at a specific offset from a reference symbol, to change the reference symbol, to reset the dictionary and start building it over again, to place a symbol into the dictionary, or to mark the end of the data set.
In terms of the number of bits used, the one or more compressed symbols used in the compressed data set may be smaller than the individual uncompressed symbols that they replace, may be the same size as the individual uncompressed symbols, may be larger than the individual uncompressed symbols, or may be variable width. It may seem counterintuitive that the one or more compressed symbols may be larger than the individual uncompressed symbols. However, an overall reduction in size may still result when the one or more compressed symbols replace a sequence of the individual uncompressed symbols that occupies more bits than the one or more compressed symbols that replace it. Where variable length symbols are used, the compression algorithm may rely on the fact that the length of symbols is tracked and determined in the same way by both the compression encoder and the compression decoder, such that both change the symbol length at the same point in the data stream.
Data compression techniques are known in the art. Non-limiting examples include Run Length Encoding (RLE), which is a form of lossless encoding where sequences of repeating symbols in the uncompressed data set are replaced by an individual control symbol and the individual uncompressed symbol in the compressed data set. As a non-limiting example, using RLE a sequence of 37 repetitions of the symbol ‘$’ may be replaced by the individual control symbol meaning ‘repeat the following symbol 37 times’ followed by the individual uncompressed symbol ‘$’.
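As a non-limiting illustration, RLE may be sketched in Python as follows. The function names `rle_encode` and `rle_decode`, and the choice to represent each control symbol and its uncompressed symbol as a (count, symbol) pair, are assumptions made for illustration only.

```python
def rle_encode(data):
    """Collapse each run of identical symbols into a (count, symbol) pair."""
    runs = []
    for symbol in data:
        if runs and runs[-1][1] == symbol:
            # Same symbol as the current run: extend the run by one.
            runs[-1] = (runs[-1][0] + 1, symbol)
        else:
            # A different symbol starts a new run.
            runs.append((1, symbol))
    return runs


def rle_decode(runs):
    """Expand each (count, symbol) pair back into the original run."""
    return "".join(symbol * count for count, symbol in runs)


encoded = rle_encode("$" * 37 + "AB")
# encoded == [(37, '$'), (1, 'A'), (1, 'B')]
```

In this sketch a symbol that does not repeat is still stored as a pair with a count of 1, so data with few repeated runs may grow rather than shrink.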
Differential Pulse Code Modulation (DPCM) is a form of lossless encoding where each subsequent symbol in the uncompressed data set is compared to a reference symbol and a distance between their code points is encoded into the compressed data set if it is below a distance threshold. DPCM takes advantage of the fact that the symbols in the uncompressed data set may cluster within localized portions of a data space and therefore the distance between the reference symbol and the individual uncompressed symbol may be represented using fewer bits than it would take to represent the individual uncompressed symbol. As a non-limiting example, the distance between their code points may be the difference obtained by subtracting one code point from the other code point. The distance may be a signed value and may therefore select a next symbol that is within a range of symbols established by the reference symbol. If the distance is greater than the distance threshold, then the reference symbol may be changed using the one or more control symbols to establish a new range. The reference symbol may remain constant until the distance threshold would be exceeded, or the reference symbol may be adjusted after each of the individual compressed symbols is produced, in an attempt to bring the reference symbol to the center of the range. As a non-limiting example, using DPCM the sequence ‘ABBECCADWYAG’ may be replaced by ‘A1142203W2A6’, where the letters ‘A’ and ‘W’ represent the individual uncompressed symbols from the uncompressed data set and the digits ‘0’, ‘1’, ‘2’, ‘3’, ‘4’, and ‘6’ represent the individual control symbols specifying the distance to the next symbol from the most recently established reference symbol. The letters ‘A’ and ‘W’ appear in the compressed data set to establish the reference symbol, either initially or because the distance to the next uncompressed symbol exceeds the distance threshold.
The digits ‘0’, ‘1’, ‘2’, ‘3’, ‘4’, and ‘6’ in this non-limiting example can be represented using only 4 bits, for an offset of +7 to −8, versus 8 bits or 16 bits required to represent the individual uncompressed symbol.
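The DPCM example above may be reproduced with the following minimal Python sketch. It assumes the variant in which the reference symbol remains constant until the distance falls outside a signed 4-bit range of −8 to +7. The function names `dpcm_encode` and `dpcm_decode`, and the use of a Python list in which strings stand for reference symbols and integers stand for distances, are assumptions made for illustration only.

```python
def dpcm_encode(data, lo=-8, hi=7):
    """Emit a distance when it fits in [lo, hi]; otherwise emit a new reference."""
    tokens = []
    reference = None
    for symbol in data:
        if reference is not None:
            distance = ord(symbol) - ord(reference)
            if lo <= distance <= hi:
                tokens.append(distance)   # small signed offset from the reference
                continue
        # First symbol, or distance out of range: establish a new reference.
        reference = symbol
        tokens.append(symbol)
    return tokens


def dpcm_decode(tokens):
    """Rebuild the symbols by adding each distance to the current reference."""
    out = []
    reference = None
    for token in tokens:
        if isinstance(token, str):
            reference = token             # a literal symbol resets the reference
            out.append(token)
        else:
            out.append(chr(ord(reference) + token))
    return "".join(out)


tokens = dpcm_encode("ABBECCADWYAG")
# tokens == ['A', 1, 1, 4, 2, 2, 0, 3, 'W', 2, 'A', 6]
```

A practical encoder would also need a control symbol that distinguishes a new reference symbol from a distance in the bit stream; the sketch sidesteps this by using distinct Python types.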
Lempel-Ziv-Welch (LZW) is a lossless compression algorithm that builds a dictionary that tracks sequences of symbols. As symbols are read from the uncompressed data set, any identical sequence of symbols that is already in the dictionary is found up to the point where the dictionary pattern and the input pattern diverge. At that point, a code representing the matching portion of the pattern is passed to the compressed data set and the divergent symbol is added to the dictionary as an extension of the pattern that preceded it. LZW may be implemented using variable length codes to allow the dictionary to grow until the individual control symbol to reset the dictionary and start over is placed into the compressed data set. Under LZW, the decoder builds the same dictionary that the encoder built as the compressed data set was produced and is therefore able to interpret the symbols in the compressed data set that represent sequences.
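As a non-limiting illustration, the dictionary-building behavior of LZW may be sketched in Python as follows. The sketch assumes 8-bit input symbols, emits codes as plain integers rather than variable length bit fields, and omits the dictionary reset control symbol. The function names `lzw_encode` and `lzw_decode` are assumptions made for illustration only.

```python
def lzw_encode(data):
    """Emit one code per longest dictionary match, extending the dictionary."""
    dictionary = {chr(i): i for i in range(256)}   # all single 8-bit symbols
    next_code = 256
    current = ""
    codes = []
    for symbol in data:
        candidate = current + symbol
        if candidate in dictionary:
            current = candidate                    # keep extending the match
        else:
            codes.append(dictionary[current])      # emit the matching portion
            dictionary[candidate] = next_code      # add match + divergent symbol
            next_code += 1
            current = symbol
    if current:
        codes.append(dictionary[current])
    return codes


def lzw_decode(codes):
    """Rebuild the encoder's dictionary while expanding the codes."""
    if not codes:
        return ""
    dictionary = {i: chr(i) for i in range(256)}
    next_code = 256
    previous = dictionary[codes[0]]
    out = [previous]
    for code in codes[1:]:
        if code in dictionary:
            entry = dictionary[code]
        elif code == next_code:
            # The code refers to the entry the encoder had only just created.
            entry = previous + previous[0]
        else:
            raise ValueError("invalid LZW code")
        out.append(entry)
        dictionary[next_code] = previous + entry[0]
        next_code += 1
        previous = entry
    return "".join(out)


codes = lzw_encode("TOBEORNOTTOBEORTOBEORNOT")
# 24 input symbols become 16 codes; codes of 256 and above refer to sequences
```

The decoder never receives the dictionary itself. It reconstructs each entry one step behind the encoder, which is why the `code == next_code` case must be handled separately.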
A Huffman code is an optimal, variable-length prefix code that is commonly used for lossless compression. In a prefix code, no whole code word is a prefix of any of the other code words. During Huffman coding, a tree is constructed based upon the frequency of occurrence of each symbol such that the least commonly occurring symbols are deepest in the tree. The symbols are then replaced with codes such that the bits in the code represent the path through the tree from the root node to the node representing the symbol. The most commonly occurring symbols have the shortest paths and therefore the shortest codes.
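As a non-limiting illustration, the construction of a Huffman code may be sketched in Python as follows. Rather than building explicit tree nodes, the sketch keeps, for each subtree, a mapping from symbol to the code bits accumulated so far, and prepends a bit each time two subtrees are merged. The function names and the tie-breaking counter are assumptions made for illustration only.

```python
import heapq
from collections import Counter


def huffman_codes(data):
    """Return a {symbol: bit string} prefix code built from symbol frequencies."""
    freq = Counter(data)
    if not freq:
        return {}
    if len(freq) == 1:
        return {next(iter(freq)): "0"}     # a lone symbol still needs one bit
    # Heap entries: (weight, unique tie-breaker, {symbol: code so far}).
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)  # the two least frequent subtrees
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]


def huffman_decode(bits, codes):
    """Scan the bit string, emitting a symbol whenever a full code word is seen."""
    lookup = {code: symbol for symbol, code in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in lookup:              # unambiguous because of the prefix property
            out.append(lookup[current])
            current = ""
    return "".join(out)


text = "aaaabbc"
codes = huffman_codes(text)                # 'a' receives the shortest code
bits = "".join(codes[s] for s in text)     # 10 bits instead of 7 * 8 = 56
```

The decoder needs the same code table as the encoder, so in practice the table or the symbol frequencies are typically conveyed along with the compressed data set.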
GZIP refers to one of a number of implementations of file compression and decompression based upon Lempel-Ziv and Huffman codes. Like LZW, GZIP is effective at identifying previously occurring sequences of arbitrary length and encoding one or more uncompressed symbols as individual control symbols that reference previously observed sequences.
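As a non-limiting illustration, one such implementation is the `gzip` module in the Python standard library, which may be exercised as follows. The sample data is an assumption chosen for illustration because it is highly repetitive and therefore compresses well.

```python
import gzip

# Repetitive input gives the Lempel-Ziv stage many earlier sequences to reference.
original = b"the quick brown fox " * 200

compressed = gzip.compress(original)
restored = gzip.decompress(compressed)

# Lossless: decoding reproduces the uncompressed data set exactly.
is_lossless = restored == original
is_smaller = len(compressed) < len(original)
```

Data with little repetition, such as data that is already compressed or encrypted, may shrink very little or may grow slightly because of the header and trailer that the format adds.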
Throughout this document, the terms ‘code’ and ‘symbol’ may be used interchangeably to refer to a value that appears in a data set. Throughout this document, the terms ‘data set’ and ‘data file’ may be used interchangeably to refer to a collection of codes or symbols.