The present disclosure relates to data compression using a data transformation, such as a Lempel-Ziv transformation, followed by encoding. Furthermore, the present disclosure relates to methods for generating an encoding table for symbols obtained by the data transformation.
Important data is usually safeguarded by a data backup. To keep a historical record of the backups, the data is generally backed up on removable data storage items, such as tape cartridges or the like. Data backed up onto a storage medium is usually compressed to reduce backup time and to save storage medium capacity.
Data compression is applied in various fields of information technology. For example, data compression is often applied for permanently storing data on a tape drive or the like. One established standard for tape drives, the Linear Tape Open (LTO) standard, provides a hardware compression scheme consisting of a Lempel-Ziv front end and a variable-length encoder back end. The front end performs the data transformation, which generates symbols that can be used to reconstruct the original data stream. The back end encoder generates variable-length code words that are primarily used to encode the length of the matched strings in the history buffer.
The Linear Tape Open standard refers to the ECMA-321 standard for streaming lossless data compression. According thereto, the back end encoder allows a specific extension of a particular source symbol to be used as a control symbol. In particular, according to the ECMA-321 specification, the control symbol is incorporated into the compressed data stream to provide a command or a marker for controlling the decompression of the compressed data stream.
In the ECMA-321 specification, the control symbol may correspond to a scheme 1 symbol, which indicates that the following data symbols are encoded according to a compression scheme. The compression scheme provides literals, which correspond to unmatched data bytes, and copy pointers, which are an addressing representation of a data byte sequence that matches a data byte sequence in a history buffer.
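The split into literals and copy pointers can be illustrated with a minimal sketch of a greedy Lempel-Ziv style transformation. This is not the ECMA-321 algorithm itself: the names `transform` and `reconstruct`, the tuple representation of symbols, the constants `HISTORY_SIZE`, `MIN_MATCH` and `MAX_MATCH`, and the use of a backward displacement as the addressing representation are all assumptions made for illustration.

```python
from typing import List, Tuple, Union

Literal = Tuple[str, int]            # ("lit", byte_value)
CopyPointer = Tuple[str, int, int]   # ("copy", displacement, match_length)
Symbol = Union[Literal, CopyPointer]

HISTORY_SIZE = 1024  # assumed history buffer size, for illustration only
MIN_MATCH = 2        # assumed shortest match worth a copy pointer
MAX_MATCH = 271      # assumed upper bound on the match length


def transform(data: bytes) -> List[Symbol]:
    """Greedily turn a byte stream into literals and copy pointers."""
    symbols: List[Symbol] = []
    pos = 0
    while pos < len(data):
        best_len, best_disp = 0, 0
        # Search the history buffer (the bytes preceding pos) for the longest match.
        for cand in range(max(0, pos - HISTORY_SIZE), pos):
            length = 0
            while (length < MAX_MATCH
                   and pos + length < len(data)
                   and data[cand + length] == data[pos + length]):
                length += 1
            if length > best_len:
                best_len, best_disp = length, pos - cand
        if best_len >= MIN_MATCH:
            # Matched byte sequence: emit a copy pointer into the history buffer.
            symbols.append(("copy", best_disp, best_len))
            pos += best_len
        else:
            # Unmatched data byte: emit a literal.
            symbols.append(("lit", data[pos]))
            pos += 1
    return symbols


def reconstruct(symbols: List[Symbol]) -> bytes:
    """Rebuild the original byte stream from the symbols."""
    out = bytearray()
    for sym in symbols:
        if sym[0] == "lit":
            out.append(sym[1])
        else:
            _, disp, length = sym
            # Copy byte by byte so that overlapping matches work.
            for _ in range(length):
                out.append(out[-disp])
    return bytes(out)
```

For the input `b"abcabcabc"`, this sketch emits three literals followed by a single copy pointer with displacement 3 and match length 6, and `reconstruct` recovers the original bytes from those four symbols.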
Furthermore, the control symbol may represent a scheme 2 symbol, which indicates that the following data sequence does not contain encoded data. The latter scheme is useful if the data stream to be compressed has a high entropy, such that an efficient transformation to a set of copy pointers cannot be performed, i.e., if the encoded data stream obtained by transforming the data stream would be longer than the original data sequence.
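The choice between the two schemes can be sketched as a simple cost comparison. The function names, the per-symbol bit costs `LITERAL_BITS` and `RAW_BYTE_BITS`, and the idea of passing the copy pointer lengths in as a list are assumptions for illustration; the sketch ignores control symbol overhead and any escaping of raw bytes, and it does not reproduce the selection logic of any particular device.

```python
from typing import Iterable

LITERAL_BITS = 9   # assumed: one flag bit plus eight data bits per unmatched byte
RAW_BYTE_BITS = 8  # assumed: cost per byte when the data is passed through unencoded


def scheme1_cost_bits(num_literals: int, copy_pointer_bits: Iterable[int]) -> int:
    """Estimated size in bits of the data encoded as literals and copy pointers."""
    return num_literals * LITERAL_BITS + sum(copy_pointer_bits)


def choose_scheme(raw_length: int, num_literals: int,
                  copy_pointer_bits: Iterable[int]) -> int:
    """Return 1 if encoding is estimated to be shorter than the raw data, else 2."""
    encoded = scheme1_cost_bits(num_literals, copy_pointer_bits)
    passthrough = raw_length * RAW_BYTE_BITS
    return 1 if encoded < passthrough else 2
```

Under these assumed costs, 100 bytes with no matches at all would take 900 bits as literals against 800 bits raw, so `choose_scheme(100, 100, [])` selects scheme 2, whereas data that is mostly covered by copy pointers selects scheme 1.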
The back end encoding is usually performed as a kind of entropy encoding, wherein control symbols are encoded by maximum-length code words. According to the ECMA-321 specification, the total length of control symbols including a leading literal flag is 13 bits.
As history buffers tend to become larger in order to increase the compression ratio, efficient encoding of matched strings in the history buffer requires variable-length code words that are longer than the control symbols. However, to provide downward compatibility, the same control symbols that have been used up to now in devices applying data compression according to the ECMA-321 specification must be kept. Therefore, there is a need for compression schemes that incorporate control symbols of a given length.