1. Technical Field
The present invention relates in general to computer-aided data compression, and in particular to a high throughput lossless compression technique.
2. Description of the Related Art
The ever increasing processing capacity of modern data processing systems has resulted in a corresponding need for improved data compression techniques. Data compression methods, most of which exploit the redundancy that is characteristic of data streams, have become an essential element of many high speed data communications and storage systems. Data compression is useful in computer-aided communications because it enables devices to transmit the same amount of data in fewer bits.
In communications, a transmitter can compress data before transmitting it, and a receiver can decompress the data after receiving it, thus increasing the effective data rate of the communications channel. In storage applications, data can be compressed before storage and decompressed upon retrieval, thereby increasing the effective capacity of a storage device. Data compression is widely utilized to reduce required storage capacity in backup utilities, spreadsheet applications, and database management systems. Other applications of data compression include magnetic and optical disk interfaces, satellite communications, computer network communications, interconnections between computers and attached devices, tape drive interfaces, and write-once media, where storage capacity is critical.
Compression techniques can be generally categorized as being either lossless or lossy. Lossless data compression transforms a body of data into a smaller body of data from which it is possible to exactly and uniquely recover the original data. Lossy data compression transforms a body of data into a smaller body of data from which an acceptable approximation of the originalxe2x80x94as defined by a selected fidelity criterionxe2x80x94can be reconstructed.
Lossless compression is required for applications in which all information in the original stream must be preserved. Such applications include transmission or storage of textual data, such as a printed language, programming language source code or object code, database information, numerical data, and electronic mail. Lossless compression is also utilized for devices such as disk drive controllers that must provide exact preservation and restoration of the object data.
Among the most common approaches to lossless data compression are textual substitution methods, in which frequently-appearing data strings are replaced by shorter indices or pointers that correspond to the data strings in a dictionary or a recent history of the input data stream. For dictionary-based implementations, an encoder module and decoder module are typically utilized to maintain identical copies of a dictionary containing data strings that have appeared in the input stream. The encoder finds matches between portions of the input stream and the previously-encountered strings stored in the dictionary. The encoder then transmits the dictionary index or pointer corresponding to the string in place of the matched string.
The encoder can also update the dictionary with an entry based on the current match and the current contents of the dictionary. If insufficient space is available in the dictionary, space is created by deleting strings from the dictionary. The decoder, operating in a converse manner, receives at its input a set of indices, retrieves each corresponding dictionary entry as a xe2x80x9ccurrent matchxe2x80x9d, and updates its dictionary. Because the encoder and decoder operate in a xe2x80x9clock-stepxe2x80x9d fashion to maintain identical dictionaries, no additional communication is necessary between the encoder and decoder. Thus, the input to the encoder is a stream of bytes or characters, and the output is a sequence of pointers. Conversely, the input to the decoder is a stream of pointers and the output is a stream of bytes or characters.
Implementing conventional dictionary-based compression techniques introduces additional system complexity required for handling the data storage requirements and switching among more than one dictionary. In addition, the efficiency of dictionary-based compression techniques suffers in contexts involving variable data structures and segment sizes.
As an alternative approach to dictionary-based lossless compression techniques, a recent history of the input data stream may be utilized to process and point to matching sequences within the history. The IBM adaptive lossless data compression (ALDC) family of products utilizes a derivative of Lempel-Ziv encoding to compress data. As a general compression technique, the Lempel-Ziv algorithm integrates well into systems required to handle many different data types. This algorithm processes a sequence of bytes by keeping a recent history of the bytes processed and pointing to matching sequences within the history. Compression is achieved by replacing matching byte sequences with a copy pointer and length code that together are smaller in size than the replaced byte sequence.
The performance of any given compression method can be evaluated in terms of compression efficiency, throughput, and latency. Factors affecting the performance of compression techniques such as ALDC in terms of compression efficiency and latency include data stream content and the size of the history buffer with respect to the amount of data to be compressed. As previously explained, ALDC and other LZ1-based compression algorithms rely upon a history buffer to perform compression by finding redundant strings. When the data being compressed is partitioned into small segments, and each compressed segment is required to be independently decompressible, a history buffer reset is required. Resetting the history buffer significantly reduces the compression ratio of a segmented data stream. As the segmentation of a given data stream increases (resulting in smaller segments) the compression ratio correspondingly decreases.
The size of the history buffer relative to the amount of data to be compressed is a limiting factor on the compression ratio of an ALDC compressor. Unmatched symbols are processed in raw form as literals with no compression technique applied thus reducing compression efficiency.
In addition to the above-mentioned performance affecting factors (i.e., data stream content and the size of the history buffer with respect to the amount of data to be compressed), the performance of ALDC in terms of throughput is ultimately limited by the bandwidth of the compression core device. The bandwidth limitation of conventional ALDC cores is one byte per cycle. This one byte per cycle limitation is set in accordance with the compression character increment, which for reasons of compression efficiency, is set to one character (ASCII, for example) per byte. As implemented within current application specific integrated circuit (ASIC) technology, the operational frequency of the core is limited to the system clock. Thus, regardless of data content and history buffer capacity, the maximum output throughput realizable for a conventional ALDC is dependent on the operational frequency of the system in which it is incorporated.
From the foregoing, it can be appreciated that a need exists for an improved adaptive lossless compression technique that reduces the loss of compression incident to resetting the history buffer, enables compression of character sequences not included within the history buffer, and has a compression processing throughput greater than one byte per cycle of the system clock. The present invention addresses such a need by providing a compression system that would improve lossless data compression throughput while minimally impacting compression efficiency.
A method and system for increasing compression efficiency of a lossless data compression utility are disclosed herein. The data compression utility compresses a segmented input data stream into independently decompressible data blocks, and includes a history buffer that maintains a history of the data stream. In accordance with the method of the present invention, a data segment is compressed utilizing a history buffer to identify repeated character sequences within the data segment. The method of processing a data segment entails updating the history buffer to include the data of the data segment while maintaining a preselected data set. The method of the present invention further includes compressing a data segment utilizing a history buffer to identify repeated character sequences within said data segment. As part of the compressing step an adaptable cache is searched for a matching character in response to identifying a next non-repeating character within the data segment. If a matching character is not found within the adaptable cache, a shadow cache is searched for a matching character. Characters found within the adaptable cache are identified in the compressed data stream through a cache reference. A further method to improve compression bandwidth processes data and searches the history buffer with two byte pairs.