Field of the Invention
The present invention generally relates to compression algorithms and more particularly to compression algorithms whose design accounts for the memory hardware used by the algorithm.
Background Description
Lempel-Ziv (LZ) based compression encoders replace a repeating string in the input data stream with a pointer to a previous copy of that string, and write the pointer to the compressed output stream in place of the string. Pointers typically use fewer bits than the strings themselves, which is how the data compression is achieved (i.e., the output becomes smaller than the input). Compression algorithms typically retain the most recently processed input data in order to discover the repeated strings. The LZ algorithm and its ALDC (Adaptive Lossless Data Compression) and ELDC (Embedded Lossless Data Compression) implementations are described in the following references: J. Ziv and A. Lempel, “A universal algorithm for sequential data compression,” IEEE Trans. Inform. Theory, vol. IT-23, no. 3, pp. 337-343, 1977; D. J. Craft, “A fast hardware data compression algorithm and some algorithmic extensions,” IBM Journal of Research and Development, vol. 42, no. 6, pp. 733-745, November 1998; M. J. Slattery and F. A. Kampf, “Design considerations for the ALDC cores,” IBM Journal of Research and Development, vol. 42, no. 6, pp. 747-752, November 1998; and ECMA standard 222, “Adaptive Lossless Data Compression Algorithm,” which specifies a lossless compression algorithm to reduce the number of bytes required to represent data.
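The pointer substitution described above can be illustrated with a minimal decoder sketch. The token layout used here (a plain integer for a literal byte, a `(distance, length)` tuple for a pointer) and the function name `lz_decode` are assumptions made for illustration only; they are not the ALDC or ELDC bit-level format.

```python
def lz_decode(tokens):
    """Expand a list of LZ-style tokens back into the original bytes.

    Hypothetical token layout (not the ALDC/ELDC encoding):
      int               -> literal byte
      (distance, length) -> copy `length` bytes starting `distance` bytes back
    """
    out = bytearray()
    for tok in tokens:
        if isinstance(tok, int):
            out.append(tok)  # literal: emitted as-is
        else:
            distance, length = tok
            # Copy byte by byte so that a pointer may overlap the bytes
            # it is producing (distance < length).
            for _ in range(length):
                out.append(out[-distance])
    return bytes(out)


# Three literals followed by one pointer reproduce a 12-byte input.
tokens = [ord("a"), ord("b"), ord("c"), (3, 9)]
restored = lz_decode(tokens)
```

In this sketch four tokens stand in for twelve input bytes, which is the source of the size reduction: the single pointer replaces nine bytes of repeated data.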
For example, the ALDC and ELDC implementations use a 16 KB history buffer; that is, they retain the most recent 16 kilobytes of input data to search for repetitions in the input. Both algorithms are used in tape data storage systems. The history buffer may be referred to as the “window” or “sliding window” in the literature. While we use ELDC, ALDC and a 16 KB history in the exemplary embodiment, the invention is applicable to all LZ based embodiments of data compression and to history buffers of any size. The term “dictionary” refers to the information retained by the compression encoder while it searches for repetitions in the input; for example, a dictionary may contain the first few bytes of an input string and a pointer or some other indication of where that string might be located in the history of an input stream. Different compression encoder implementations may use history buffers of different sizes.
Lempel-Ziv (LZ) based compression encoders implemented in hardware commonly use a Content Addressable Memory (CAM) for remembering the history of input phrases entered into the compression dictionary. For data compression purposes, a CAM detects whether the current input byte matches any other bytes in the history buffer. The CAM provides all possible matches in the history buffer and their distances from the current input. As more input bytes arrive, the input stream eventually stops matching the history buffer. The encoder then chooses the longest matching string in the history and replaces the current input string with a pointer to that previous copy. Thus, a CAM is advantageous in finding the longest matching strings in the input. However, a CAM is typically very hardware intensive in terms of silicon area and power consumption, thereby increasing the cost and complexity of hardware compression encoders.
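The CAM behavior described above can be approximated in software as follows. This is only a sketch: a real CAM compares all history locations in parallel in hardware, whereas the list comprehension below does so sequentially. The function name `cam_longest_match` and the tie-breaking rule (prefer the nearest copy) are assumptions made for illustration.

```python
def cam_longest_match(data: bytes, pos: int, window: int = 16 * 1024):
    """Return (length, distance) of the longest history match for data[pos:].

    Software stand-in for a CAM: every history position is compared
    against the current input byte, and the set of still-matching
    positions shrinks as further input bytes arrive.
    """
    if pos >= len(data):
        return (0, 0)
    lo = max(0, pos - window)
    # "CAM lookup": all history positions holding the current input byte.
    alive = [j for j in range(lo, pos) if data[j] == data[pos]]
    length = 0
    last_alive = []
    while alive:
        length += 1            # every entry in `alive` matches `length` bytes
        last_alive = alive
        nxt = pos + length
        if nxt >= len(data):   # input exhausted
            break
        # Keep only the candidates that also match the next input byte.
        alive = [j for j in alive if data[j + length] == data[nxt]]
    if not last_alive:
        return (0, 0)
    # Assumed tie-break: choose the nearest of the longest matches.
    return (length, pos - max(last_alive))


match = cam_longest_match(b"abcXabcdYabcd", 9)
```

For the input above, both earlier occurrences of "abc" match the first three bytes at position 9, but only the copy at position 4 also matches the fourth byte, so the sketch reports a match of length 4 at distance 5.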
An alternative to a CAM is a Static Random Access Memory (SRAM) based dictionary, which uses hashing to store previously seen input phrases. An SRAM based dictionary is more efficient than a CAM in terms of silicon area and power. However, unlike a CAM, an SRAM based dictionary cannot detect all the matches in the input stream. Typically, only the most recent references to phrases in the history buffer are retained in an SRAM based dictionary. Older references may be discarded from the dictionary due to lack of space or due to hash collisions (i.e., other phrases competing for the same location in the dictionary).
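The trade-off described above can be sketched with a small hashed dictionary in which each slot holds a single history position. The class name `HashDictionary`, the three-byte key, the hash function, and the one-entry-per-slot policy are all assumptions chosen for illustration; they do not describe any particular hardware design.

```python
class HashDictionary:
    """Hypothetical SRAM-style dictionary: one history position per slot."""

    def __init__(self, slots: int = 4096, key_len: int = 3):
        self.slots = slots
        self.key_len = key_len
        self.table = [None] * slots   # each slot stores one position or None

    def _index(self, key: bytes) -> int:
        # Simple multiplicative hash (an illustrative choice).
        h = 0
        for b in key:
            h = (h * 31 + b) % self.slots
        return h

    def insert(self, data: bytes, pos: int) -> None:
        key = data[pos:pos + self.key_len]
        if len(key) == self.key_len:
            # Overwrites whatever was there: an older reference to the same
            # phrase, or a different phrase that hashed to the same slot.
            self.table[self._index(key)] = pos

    def lookup(self, data: bytes, pos: int):
        key = data[pos:pos + self.key_len]
        if len(key) < self.key_len:
            return None
        cand = self.table[self._index(key)]
        # Verify against the history, since the slot may hold a colliding phrase.
        if cand is not None and data[cand:cand + self.key_len] == key:
            return cand
        return None


data = b"abcXabcYabc"
d = HashDictionary()
for p in range(8):
    d.insert(data, p)
recent = d.lookup(data, 8)   # only the most recent "abc" (position 4) survives
```

In this sketch the reference to "abc" at position 0 is lost once position 4 is inserted, so a longer match that might start at the older position can no longer be found. With a very small table, a colliding phrase can evict the entry entirely and the lookup returns `None`.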