This invention relates generally to data encoding and decoding techniques, and, more particularly, to systems and methods for performing data compression and decompression.
Data compression generally involves converting data to a format that requires fewer data units (e.g., bits). Data compression is widely used to compress various types of computer files. For example, ARC and ZIP are two popular file compression formats. Data compression is also used in backup utilities, spreadsheet applications, and database management systems.
Data compression is particularly useful in communications because it enables devices to transmit the same amount of information using fewer data units. For example, common facsimile (fax) machines translate pictures and text into bitmaps, and transmit the bitmaps via standard telephone lines. The CCITT (Comitxc3xa9 Consultatif International Txc3xa9lxc3xa9phonique et Txc3xa9lxc3xa9graphique), an organization that sets international communications standards, has defined the data compression technique used for transmitting faxes (the Group 3 standard CCITT T.4). The CCITT has also defined several data compression standards for data communications via modems, including the CCITT V.42bis standard, a data compression protocol that enables modems to achieve a data transfer rate of 34,000 bits per second.
In 1977, Abraham Lempel and Jacob Ziv proposed a class of data compression algorithms called the Lempel-Ziv 1 (LZxe2x80x941) class. (See xe2x80x9cA Universal Algorithm for Sequential Data Compressionxe2x80x9d by J. Ziv and A. Lempel, IEEE Trans. Inform. Theory, Vol. IT-23, No. 3, pp. 337-343, 1977). The LZxe2x80x941 class of compression algorithms is widely used for data such as executable program instructions (i.e., code) modules, electronic documents, financial data, and other types of data in situations where the original information must be recovered exactly.
In general, LZxe2x80x941 algorithms achieve compression by building and maintaining a data structure called a history buffer. An encoder implements an LZxe2x80x941 encoding or compression process, and a decoder implements an LZxe2x80x941 decoding or decompression process. The encoder and the decoder each have substantially identical history buffers. The history buffers are first initialized to the same known state (e.g., a cleared empty state), then updated in a similar fashion. The encoder receives an input data stream for compression, and uses the input data stream to update the encoder history buffer. The decoder generates an identical data stream as output, and also uses this output data stream to update the decoder history buffer. As a result, the contents of the two history buffers again become identical.
As the encoder history buffer fills, it becomes increasingly possible for the encoder to represent incoming data by encoding reference information identifying a sequence of data (i.e., a xe2x80x9cmatching stringxe2x80x9d) already present in the encoder history buffer. In general, the encoder attempts to find the longest possible sequences of data (i.e., matching strings) stored in the encoder history buffer. When the reference information can be encoded in fewer bits of information than the data string it identifies, compression is achieved. This is the principal mechanism by which LZxe2x80x941 algorithms achieve compression.
An LZxe2x80x941 encoder typically processes incoming data one byte at a time, and copies each byte into the encoder history buffer as it is encoded. When the encoder history buffer is full, the next input data byte typically displaces (i.e., overwrites) the oldest data byte in the encoder history buffer. The encoder history buffer may thus be thought of as a xe2x80x9csliding windowxe2x80x9d passing through the input data stream. For this reason, the LZxe2x80x941 class of data compression algorithms is sometimes referred to as xe2x80x9csliding windowxe2x80x9d data compression.
Detailed encoding strategies vary, but in general, if the input data matches a 2-byte or longer string found in the encoder history buffer, the encoder will encode the input as a xe2x80x9ccopy pointer.xe2x80x9d Typically, a copy pointer includes a leading xe2x80x981xe2x80x99 identifier bit, followed by a count field indicating the length of the data match, followed by a displacement field indicating the current location of the string in the encoder history buffer.
If an input data byte does not match a data byte in the encoder history buffer, the encoder encodes the data byte as a literal. A literal typically includes a leading xe2x80x980xe2x80x99 identifier bit, followed by the data (e.g., 8 bits of a data byte). When a given input data byte matches a data byte in the encoder history buffer, but the next input data byte does not match the next data byte in the encoder history buffer, the encoder also encodes the given input data byte as a literal.
The compressed data stream produced by the encoder thus includes an arbitrary sequence of literals and copy pointers, distinguished by their respective identifier bits. An LZxe2x80x941 decoder receives a compressed data stream produced by an encoder. The decoder typically decodes literals by simply removing the leading xe2x80x9c0xe2x80x9d bit, producing the data as output, and storing a copy of the data in the decoder history buffer. The decoder decodes copy pointers by extracting a count value from the count field and a displacement value from the displacement field. The decoder uses the count value and the displacement value to identify the corresponding string in the decoder history buffer. The decoder copies the corresponding string from the decoder history buffer to the output, and also stores the corresponding string byte sequentially in the decoder history buffer, so when the operation is complete the history buffers are in step once more.
The Adaptive Lossless Data Compression (ALDC) data compression algorithm is one variant of the LZxe2x80x941 class of data compression algorithms. The ALDC algorithm is an internationally accepted ANSI (American National Standards Institute), ECMA (formerly known as the European Computer Manufacturers Association), and QIC (Quarter-Inch Cartridge Drive Standards, Inc.) standard. (See, for example, standard QIC-154, xe2x80x9cAdaptive Lossless Data Compression (ALDC)xe2x80x9d, currently available on the World Wide Web at www.qic.org).
The ALDC algorithm uses a variable-length count field. For example, a 2-bit count field can indicate matching data string lengths of 2 or 3 bytes. A binary displacement field is used to encode the displacement value, 10 bits being needed for a 1024-byte history buffer size. Encoding a 2-byte (16 bit) matching string using a 13-bit copy pointer (with a leading xe2x80x981xe2x80x99 bit, a 2-bit count filed, and a 10-bit displacement field) achieves a compression of 16/13 to 1. Encoding a 3-byte (24 bit) matching string using the 13-bit copy pointer compresses 24/13 to 1. Encoding longer strings requires larger count fields, but even so, such encodings are capable of producing even greater compressions.
An encoder for compressing data is described, including a history buffer having multiple storage locations for storing target data units. The encoder is configured to input a target data string including multiple target data units, and additional data (e.g., security data associated with the target data string). In the event the history buffer contains multiple matching strings that match the target data string, the encoder is configured to select a displacement value of one of the matching strings dependent on a portion of the additional data to be included, and to produce a copy pointer corresponding to the target data string and including the selected displacement such that the selected displacement value in the copy pointer conveys the portion of the additional data.
A decoder for decompressing data is also described, including a history buffer having multiple storage locations for storing data units. The decoder is configured to input a copy pointer including a displacement value indicative of a location of a first data unit of a data string in the history buffer, to use the displacement value to determine the data string, and to use the displacement value to determine a value of a quantity of additional data conveyed by the displacement value in the event multiple copies of the data string exist in the history buffer. A system is described including the encoder and the decoder.
A method for compressing data is described including inputting a target data string including multiple target data units, and additional data. The following are performed in the event multiple matching strings of data units stored in a history buffer match the target data string: selecting a displacement value of one of the multiple matching strings dependent on a portion of the additional data, and producing a copy pointer corresponding to the target data string and including the selected displacement value such that the displacement value conveys the portion of the additional data.
A method for decompressing data is described including inputting a copy pointer having a displacement value indicative of a location of a first data unit of a data string in a history buffer. The displacement value is used to determine the data string. In the event multiple copies of the data string exist in the history buffer, the displacement value is used to determine a value of a quantity of additional data conveyed by the displacement value.