1. Field of the Invention
This invention relates to data transformation and, more particularly, to lossless data compression.
2. Background of the Invention
Existing compression technology focuses on finding and removing redundancy in the input binary data. Early compression approaches focused on the format of the data. These format approaches utilize run-length encoding (RLE) and variations of frequency mapping methods. These pattern-coding approaches work well for ASCII character data, but never reached the compression potential for other data formats.
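As an illustration of the run-length approach mentioned above, the following is a minimal sketch in Python. The function names and the (count, value) pair representation are assumptions made for this example; they are not taken from any particular prior-art format.

```python
from itertools import groupby

def rle_encode(data: bytes) -> list[tuple[int, int]]:
    """Collapse each run of identical bytes into a (count, byte value) pair."""
    return [(len(list(run)), value) for value, run in groupby(data)]

def rle_decode(pairs: list[tuple[int, int]]) -> bytes:
    """Expand (count, byte value) pairs back into the original bytes."""
    return b"".join(bytes([value]) * count for count, value in pairs)

sample = b"AAAABBBCCD"
encoded = rle_encode(sample)   # four runs: A x4, B x3, C x2, D x1
assert rle_decode(encoded) == sample
```

Data with long runs shrinks under this scheme, while data with few repeats grows, because every run, even of length one, costs a full pair.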
Advances in compression technology evolved from information theory, particularly Claude Shannon's work on information entropy. The bulk of this work is statistical in nature. Shannon-Fano and Huffman encoding build probability trees of symbols in descending order of their occurrence in the source data, allowing the generation of "good" variable-size codes. This is often referred to as entropy coding. Compression is accomplished because more frequently occurring binary patterns are assigned shorter codes, allowing for a reduction in the overall average of bits required for a message.
Shannon-Fano and Huffman encoding are optimal only when the probability of a pattern's occurrence is a negative power of 2. These methods engendered a number of adaptive versions that optimize the probability trees as the data varies.
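The negative-power-of-2 condition can be checked numerically. The sketch below, with hypothetical helper names, computes Huffman code lengths from symbol counts and compares the average code length with the Shannon entropy. It is an illustrative calculation only, not a complete encoder.

```python
import heapq
from math import log2

def huffman_code_lengths(freqs: dict) -> dict:
    """Return the Huffman code length in bits for each symbol."""
    if len(freqs) == 1:
        return {s: 1 for s in freqs}
    # Heap entries: (weight, unique tie-breaker, {symbol: depth so far}).
    heap = [(w, i, {s: 0}) for i, (s, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, a = heapq.heappop(heap)
        w2, _, b = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: d + 1 for s, d in {**a, **b}.items()}
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]

def average_bits(freqs: dict) -> float:
    """Average Huffman code length per symbol."""
    total = sum(freqs.values())
    lengths = huffman_code_lengths(freqs)
    return sum(freqs[s] * lengths[s] for s in freqs) / total

def entropy(freqs: dict) -> float:
    """Shannon entropy in bits per symbol."""
    total = sum(freqs.values())
    return -sum((w / total) * log2(w / total) for w in freqs.values())

dyadic = {"a": 4, "b": 2, "c": 1, "d": 1}   # probabilities 1/2, 1/4, 1/8, 1/8
skewed = {"a": 9, "b": 1}                   # probabilities 0.9, 0.1
```

For `dyadic`, both the average code length and the entropy are 1.75 bits per symbol. For `skewed`, Huffman coding still spends 1 bit per symbol although the entropy is only about 0.47 bits, which is the gap that arithmetic coding addresses.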
Arithmetic Coding overcame the negative power of 2 probabilities problem by assigning one (normally long) code to the entire data. This method reads the data, symbol by symbol, and appends bits to the output code each time more patterns are recognized.
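The idea of assigning a single code to the entire data can be sketched with exact rational arithmetic. The example below is an idealized illustration: it narrows an interval symbol by symbol and returns the interval rather than emitting bits. The function names and the fixed probability table are assumptions made for this example. Practical coders use finite-precision integers and output bits incrementally.

```python
from fractions import Fraction

def build_ranges(probs: dict) -> dict:
    """Map each symbol to its cumulative probability sub-interval [start, end)."""
    ranges, start = {}, Fraction(0)
    for sym, p in probs.items():
        ranges[sym] = (start, start + p)
        start += p
    return ranges

def encode(message: str, probs: dict) -> tuple:
    """Narrow [0, 1) once per symbol; any number in the result identifies the message."""
    ranges = build_ranges(probs)
    low, width = Fraction(0), Fraction(1)
    for sym in message:
        start, end = ranges[sym]
        low, width = low + width * start, width * (end - start)
    return low, low + width

def decode(value: Fraction, length: int, probs: dict) -> str:
    """Recover `length` symbols from a number inside the encoded interval."""
    ranges = build_ranges(probs)
    out = []
    for _ in range(length):
        for sym, (start, end) in ranges.items():
            if start <= value < end:
                out.append(sym)
                # Rescale so the chosen sub-interval becomes [0, 1) again.
                value = (value - start) / (end - start)
                break
    return "".join(out)

probs = {"a": Fraction(2, 3), "b": Fraction(1, 3)}
low, high = encode("aab", probs)      # interval [8/27, 12/27)
assert decode(low, 3, probs) == "aab"
```

The final interval has width equal to the product of the symbol probabilities, so likely messages yield wide intervals that can be identified with few bits, without requiring each probability to be a negative power of 2.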
The need for more efficiency in text encoding led to the development and evolution of dictionary encoding, typified by the LZ family of algorithms developed by J. Ziv and A. Lempel. These methods spawned numerous variations. In these methods, strings of symbols (a dictionary) are built up as they are encountered, and then coded as tokens. Output is then a mix of an index and raw data.
As with entropy coding, dictionary methods can be static or adaptive. Variants of the LZ family make use of different techniques to optimize the dictionary and its index. These techniques include search buffers, look-ahead buffers, history buffers, sliding windows, hash tables, pointers, and circular queues. These techniques serve to reduce the bloat of seldom-used dictionary entries. The popularity of these methods is due to their simplicity, speed, reasonable compression rates, and low memory requirements.
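The mix of an index and raw data described above corresponds most directly to the LZ78 variant, in which each token pairs a dictionary index with one literal symbol. The following is a minimal sketch under that assumption; the function names are illustrative, and the dictionary is allowed to grow without bound, unlike practical implementations.

```python
def lz78_encode(text: str) -> list[tuple[int, str]]:
    """Emit (dictionary index, next symbol) tokens; index 0 is the empty phrase."""
    dictionary = {"": 0}
    tokens = []
    phrase = ""
    for ch in text:
        if phrase + ch in dictionary:
            phrase += ch                      # keep extending a known phrase
        else:
            tokens.append((dictionary[phrase], ch))
            dictionary[phrase + ch] = len(dictionary)
            phrase = ""
    if phrase:
        # Input ended inside a known phrase: emit its prefix plus its last symbol.
        tokens.append((dictionary[phrase[:-1]], phrase[-1]))
    return tokens

def lz78_decode(tokens: list[tuple[int, str]]) -> str:
    """Rebuild the same dictionary from the tokens and concatenate the phrases."""
    phrases = [""]
    out = []
    for index, ch in tokens:
        entry = phrases[index] + ch
        phrases.append(entry)
        out.append(entry)
    return "".join(out)

tokens = lz78_encode("abababab")
assert lz78_decode(tokens) == "abababab"
```

The decoder needs no transmitted dictionary, because it reconstructs the same entries in the same order as the encoder.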
Different types of information tend to create specific binary patterns. Redundancy or entropy compression methods are directly dependent upon symbolic data, and the inherent patterns that can be recognized, mapped, and reduced. As a result, different methods must be optimized for different types of information. The compression is as efficient as the method of modeling the underlying data. However, there are limits to the structures that can be mapped and reduced.
The redundancy-based methodologies are limited in application and/or performance. In general, entropy coding either compromises speed or compression when addressing the limited redundancy that can be efficiently removed. Typically, these methods have very low compression gain. The primary advantage is that entropy coding can be implemented to remain lossless.
Lossy compression can often be applied to diffuse data such as data representing speech, audio, image, and video. Lossy compression implies that the data cannot be reconstructed exactly. Certain applications can afford to lose data during compression and reconstitution because of the limitations of human auditory and visual systems in interpreting the information. Perceptual coding techniques are used to exploit these limitations of the human eyes and ears. A perceptual coding model followed by entropy encoding, which uses one of the previously discussed techniques, produces effective compression. However, a unique model (and entropy coder) is needed for each type of data because the requirements are so different. Further, the lossy nature of such compression techniques means that the results lose some fidelity relative to the original, at times noticeably, which makes them unsuitable for many purposes.
Thus, a method for compression that is both lossless and capable of high compression gain is needed.
The present invention compresses binary data. The data is split into segments. Each of these segments has a numerical value. A transform, along with state information for that transform, is selected for each segment. The numerical value of the transform with its state information is equal to the numerical value of the segment. The transform, state information and packet overhead are packaged into a transform packet. The bit-length of the transform packet is compared to the bit-length of a segment packet that includes the raw segment and any necessary packet overhead. The packet with the smaller bit-length is chosen and stored or transmitted. After reception of the packets, or retrieval of the packets from storage, the numerical value of each segment is recalculated from the transform and state information, if necessary. The segments are recombined to reconstitute the original binary data.
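The packet-selection step described above can be sketched as follows. This is an illustrative sketch only: the 64-bit segment size, the single-bit packet overhead, the 8-bit field widths, and the use of an integer power (base raised to an exponent) as the transform are all assumptions chosen for this example and are not specified in this section.

```python
SEGMENT_BITS = 64   # hypothetical segment size
BASE_BITS = 8       # hypothetical width of the base field (state information)
EXP_BITS = 8        # hypothetical width of the exponent field (state information)
FLAG_BITS = 1       # hypothetical packet overhead: marks raw vs. transform

def find_power(value: int):
    """Return (base, exponent) with base**exponent == value, or None."""
    for base in range(2, 1 << BASE_BITS):
        power, exponent = base, 1
        while power < value and exponent < (1 << EXP_BITS) - 1:
            power *= base
            exponent += 1
        if power == value:
            return base, exponent
    return None

def make_packet(value: int) -> tuple:
    """Build both candidate packets and keep the one with the smaller bit-length."""
    raw = ("raw", value, FLAG_BITS + SEGMENT_BITS)
    state = find_power(value)
    if state is not None:
        transform = ("power", state, FLAG_BITS + BASE_BITS + EXP_BITS)
        if transform[2] < raw[2]:
            return transform
    return raw

def recover(packet: tuple) -> int:
    """Recalculate the segment's numerical value from a packet."""
    kind, payload, _ = packet
    if kind == "power":
        base, exponent = payload
        return base ** exponent
    return payload

def compress(data: bytes) -> list:
    """Split data into segments (length assumed to be a multiple of 8 bytes)."""
    step = SEGMENT_BITS // 8
    return [make_packet(int.from_bytes(data[i:i + step], "big"))
            for i in range(0, len(data), step)]

def decompress(packets: list) -> bytes:
    """Recombine the recovered segments into the original binary data."""
    return b"".join(recover(p).to_bytes(SEGMENT_BITS // 8, "big") for p in packets)

data = (3 ** 40).to_bytes(8, "big") + bytes.fromhex("0123456789abcdef")
packets = compress(data)   # first segment becomes a 17-bit transform packet
assert decompress(packets) == data
```

In this sketch the first segment equals 3 raised to the 40th power, so it is stored as a 17-bit transform packet, while the second segment has no such representation and is stored as a 65-bit raw packet. Reconstruction is exact in both cases, so the scheme remains lossless.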