Contemporary data processing activities often produce, manipulate, or consume large quantities of data. Storing and transferring this data can be challenging. One approach that is frequently productive is to compress the data so that it consumes less space. Data compression algorithms identify redundant or inefficiently coded information in an input data stream and re-encode it to be smaller (i.e., to be represented by fewer bits). Different types of input data have different characteristics, so a compression algorithm that works well for one type of data may not achieve a comparable compression ratio (the ratio of the uncompressed data size to the compressed data size) when processing another type of data.
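The dependence of compression ratio on the type of input can be illustrated with a short sketch. The example below is only an illustration under stated assumptions: it uses the DEFLATE-based `zlib` module from the Python standard library as a stand-in compressor (it is not the algorithm of FIG. 2), and the two sample inputs and the helper name `compression_ratio` are hypothetical.

```python
import random
import zlib


def compression_ratio(data: bytes) -> float:
    """Return uncompressed size divided by compressed size (higher is better)."""
    return len(data) / len(zlib.compress(data))


# Assumption: a highly repetitive input, chosen only for illustration.
repetitive = b"acacia " * 512

# Assumption: pseudo-random bytes of the same length, seeded so the run is repeatable.
rng = random.Random(0)
noisy = bytes(rng.getrandbits(8) for _ in range(len(repetitive)))

print(f"repetitive input ratio: {compression_ratio(repetitive):.1f}")
print(f"noisy input ratio:      {compression_ratio(noisy):.2f}")
```

The same compressor yields a high ratio on the repetitive input and a ratio of about one (or slightly below one) on the pseudo-random input, which contains little redundancy for the compressor to remove.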
No known compression algorithm achieves the best results for every data type. For any given lossless algorithm there is always some input data stream that the algorithm cannot make any smaller (there are fewer possible short outputs than possible inputs, so not every input can be mapped to a shorter output), though there is often a different algorithm that could re-encode the same data stream in a smaller number of bits. Sometimes, an algorithm operates in a way that both compresses a data stream and exposes additional redundancy or inefficient coding, so that a second compression stage could shrink the information even further. The design of an effective, general-purpose data compressor often involves trade-offs between the compression ratio and the number of stages (more stages typically increase compression and decompression processing time).
FIG. 2 shows how a popular and effective data compression algorithm works. The LZSS (Lempel-Ziv-Storer-Szymanski) algorithm, developed by James Storer and Thomas Szymanski building on work by Abraham Lempel and Jacob Ziv, compresses a sequence of data symbols (e.g., data bytes) by identifying repeated sequences of symbols in the input, and replacing each repeated sequence with a shorter reference to its earlier occurrence. To compress the word “acacia,” 210, an LZSS encoder 220 proceeds symbol by symbol (i.e., letter by letter), and produces the compressed sequence shown at 230. Reading from top to bottom, the compressed sequence contains a flag 231 that indicates what sort of information follows the flag. In the version of the LZSS algorithm depicted here, a flag value of 0 means that the following element 232 is a “literal,” that is, it is exactly the same as the corresponding input symbol 212. The next flag 233 is also 0, and is followed by literal 234 (corresponding to input symbol 214). At this point, after processing two input symbols, the output stream is two bits larger than the input it represents (because of flag bits 231 and 233). However, LZSS encoder 220 next encounters symbols 218, the letters “ac,” which are the same as the first two letters. Consequently, the encoder emits flag 235 (value 1), followed by an offset-length pair 236 that indicates a repetition of the two symbols located at offset 0. Compression is achieved if the offset and length information, plus the three flag bits, occupy less space than the two repeated symbols that the offset-length pair replaces; equivalently, if the output so far (three flag bits, two literals, and the offset-length pair) is smaller than the first four input symbols. An LZSS implementation can adjust the number of bits allocated to offsets and lengths (among other parameters) to obtain satisfactory compression performance. (Typically, compression algorithms perform poorly on very short input streams, so the example discussed here should not be taken as indicative of LZSS's potential performance, but only of its general operational principles.)
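The flag, literal, and offset-length scheme described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the encoder of FIG. 2: offsets are measured from the start of the input (consistent with "offset 0" in the example), no limit is placed on the search window, a match must cover at least two symbols, and the names `lzss_encode`, `lzss_decode`, `encoded_bits`, and the bit widths used in `encoded_bits` are hypothetical.

```python
from typing import List, Tuple, Union

# Token shapes (assumed for illustration):
#   (0, symbol)          flag 0 followed by a literal symbol
#   (1, offset, length)  flag 1 followed by an offset-length pair
Token = Union[Tuple[int, str], Tuple[int, int, int]]


def lzss_encode(data: str, min_match: int = 2) -> List[Token]:
    """Greedy encoder: emit the longest earlier match, else a literal."""
    tokens: List[Token] = []
    pos = 0
    while pos < len(data):
        best_offset, best_length = 0, 0
        # Try every earlier starting position as a candidate match source.
        for start in range(pos):
            length = 0
            while (pos + length < len(data)
                   and data[start + length] == data[pos + length]):
                length += 1
            if length > best_length:
                best_offset, best_length = start, length
        if best_length >= min_match:
            tokens.append((1, best_offset, best_length))
            pos += best_length
        else:
            tokens.append((0, data[pos]))
            pos += 1
    return tokens


def lzss_decode(tokens: List[Token]) -> str:
    """Rebuild the input by copying literals and earlier output symbols."""
    out: List[str] = []
    for token in tokens:
        if token[0] == 0:
            out.append(token[1])
        else:
            _, offset, length = token
            for k in range(length):
                out.append(out[offset + k])
    return "".join(out)


def encoded_bits(tokens: List[Token], symbol_bits: int = 8,
                 offset_bits: int = 4, length_bits: int = 4) -> int:
    """Output size in bits: one flag bit per token plus its payload."""
    total = 0
    for token in tokens:
        total += 1 + (symbol_bits if token[0] == 0
                      else offset_bits + length_bits)
    return total


tokens = lzss_encode("acacia")
print(tokens)
print(lzss_decode(tokens), encoded_bits(tokens), "bits vs", 8 * len("acacia"))
```

For the input "acacia" this sketch emits two literals, one offset-length pair (offset 0, length 2), and two more literals. With the assumed widths (8-bit symbols, 4-bit offsets, 4-bit lengths), the pair and the three flag bits preceding and including it total 11 bits, which is less than the 16 bits of the two symbols the pair replaces.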
Improvements to the generic LZSS algorithm described with reference to FIG. 2 may be useful and widely applicable.