Data compression is used in a variety of programming contexts to speed up data transfers. Typical compression algorithms examine the entire data set and attempt to remove redundancy from it. Redundancy can take the form of repeated bit sequences, repeated byte sequences, and other kinds of repeated sequences.
Data compression algorithms can generally be characterized as lossless or lossy. Lossless compression transforms a data set such that an exact reproduction of the data set can be retrieved by applying a decompression transformation. Lossless compression is most often used to compact data where an exact replica is required. Lossy compression cannot be used to generate an exact reproduction, but can be used to generate a fair representation of the original data set through decompression. Lossy compression techniques are often used for images, sound files, and video, where the loss errors are generally imperceptible to human observers.
White-space compression is one type of commonly used lossy compression scheme. An example white-space compression scheme is to remove all of the indentation and vertical spacing in an HTML document that is destined for a web browser. Since the HTML document is destined for a web browser, formatting of the document is handled by the browser and removal of the white space has no noticeable effect. After white-space compression, the HTML document can be transmitted faster and occupies less storage space.
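The white-space scheme described above can be sketched in a few lines of Python. This is only an illustration: the function name `strip_html_whitespace` and the sample page are hypothetical, and the sketch assumes the document contains no elements such as `<pre>` whose white space is significant.

```python
def strip_html_whitespace(html: str) -> str:
    """Remove indentation and blank lines from an HTML string.

    Assumption: no white-space-sensitive elements (e.g. <pre>) are present.
    """
    # Strip leading and trailing white space from every line.
    stripped = (line.strip() for line in html.splitlines())
    # Drop lines that are empty after stripping (vertical spacing).
    return "\n".join(line for line in stripped if line)


page = """<html>
    <body>

        <p>Hello</p>

    </body>
</html>
"""

compact = strip_html_whitespace(page)
print(len(page), "->", len(compact), "characters")
```

The original indentation cannot be recovered from `compact`, which is why the scheme counts as lossy even though the rendered page looks the same.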
Run-length encoding (RLE) is a simple lossless compression technique. The main idea in RLE compression is that many data representations consist largely of strings that repeat. Each run is encoded as a number giving how many times the string repeats, followed by the string itself. For data sets that contain many repeated characters, RLE compression provides acceptable performance.
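As a minimal sketch of the idea, the following Python functions encode runs of single characters as (count, character) pairs and decode them again. The function names and the pair representation are illustrative choices, not a standard RLE file format.

```python
from itertools import groupby


def rle_encode(text: str) -> list[tuple[int, str]]:
    """Encode each run of identical characters as a (count, character) pair."""
    return [(len(list(run)), ch) for ch, run in groupby(text)]


def rle_decode(pairs: list[tuple[int, str]]) -> str:
    """Rebuild the original text by repeating each character count times."""
    return "".join(ch * count for count, ch in pairs)


encoded = rle_encode("AAAABBBCCD")
print(encoded)  # [(4, 'A'), (3, 'B'), (2, 'C'), (1, 'D')]
print(rle_decode(encoded) == "AAAABBBCCD")  # True
```

Note that input with few repeats grows under this scheme, since every single character still costs one pair.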
Huffman encoding is a lossless compression technique that takes a block of input characters of a fixed length and produces a block of output bits of variable length. The basic idea of Huffman coding is to assign short code words to input blocks that have high probabilities and long code words to input blocks that have low probabilities. Huffman coding is accomplished by creating a symbol table for a data set, determining the frequency of occurrence of each symbol, and assigning a code value to each symbol in the data set based on that frequency. Although the coding process is slower, the decoding process for Huffman coded data is of a similar speed to RLE decompression.
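The three steps above (build a symbol table, count frequencies, assign codes) can be sketched as follows. This is an illustrative implementation that treats each character as one symbol and represents the output bits as a string of "0" and "1" characters for readability; the function names are hypothetical.

```python
import heapq
from collections import Counter


def huffman_codes(data: str) -> dict[str, str]:
    """Return a symbol -> bit-string table; frequent symbols get shorter codes."""
    freq = Counter(data)
    if not freq:
        return {}
    if len(freq) == 1:
        # A single distinct symbol still needs a one-bit code.
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, unique tie-breaker, partial code table).
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        # Merge the two least frequent subtrees, prefixing their codes.
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


def huffman_encode(data: str, codes: dict[str, str]) -> str:
    return "".join(codes[ch] for ch in data)


def huffman_decode(bits: str, codes: dict[str, str]) -> str:
    inverse = {code: sym for sym, code in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:  # codes are prefix-free, so the first hit is correct
            out.append(inverse[current])
            current = ""
    return "".join(out)


codes = huffman_codes("aaaabbc")
bits = huffman_encode("aaaabbc", codes)
print(codes, bits, huffman_decode(bits, codes))
```

For the sample input, the most frequent symbol `a` receives a one-bit code while `b` and `c` receive two-bit codes, so the seven characters fit in ten bits.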
Lempel-Ziv compression (LZ77) is an adaptive dictionary-based lossless compression algorithm. The compression algorithm maintains a list of all substrings that have been found so far in the data stream. At any given point in the input string, the longest substring starting at that point that matches one of the stored strings is located. The repeated substring can thus be replaced by a pointer to the original, saving space. The data structure is a window that slides through the data set.
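A simplified sketch of the sliding-window idea follows. It is not the exact LZ77 token format: literals are emitted as plain characters, matches as (distance, length) pairs, and the window size and minimum match length are arbitrary values chosen for illustration.

```python
def lz77_compress(data: str, window: int = 32, min_match: int = 3) -> list:
    """Replace repeated substrings with (distance, length) back-references."""
    i, out = 0, []
    while i < len(data):
        best_len, best_dist = 0, 0
        # Search only the sliding window that precedes position i.
        for j in range(max(0, i - window), i):
            length = 0
            while i + length < len(data) and data[j + length] == data[i + length]:
                length += 1
            if length > best_len:
                best_len, best_dist = length, i - j
        if best_len >= min_match:
            out.append((best_dist, best_len))
            i += best_len
        else:
            out.append(data[i])  # no useful match: emit a literal
            i += 1
    return out


def lz77_decompress(tokens: list) -> str:
    out = []
    for tok in tokens:
        if isinstance(tok, tuple):
            dist, length = tok
            # Copy one character at a time so overlapping matches work.
            for _ in range(length):
                out.append(out[-dist])
        else:
            out.append(tok)
    return "".join(out)


tokens = lz77_compress("abcabcabcabcxyz")
print(tokens)  # ['a', 'b', 'c', (3, 9), 'x', 'y', 'z']
print(lz77_decompress(tokens))  # abcabcabcabcxyz
```

The pair (3, 9) refers back three characters and copies nine, which is longer than the distance; the copy overlaps its own output, which is how a short pattern can describe a long repetition.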
Hypertext Markup Language (HTML) is used in most Web pages, and forms the framework where the rest of the page appears (e.g., images, objects, etc.). Unlike images such as GIF, JPEG, and PNG, which are already compressed, HTML is ASCII text. Since ASCII text is highly compressible, compressing HTML can have a major impact on the data transfer rate for web pages. A compressed HTML page appears to pop onto the screen, especially over slower transfer media such as dial-up modems.
Most modern web browsers include a built-in decompression algorithm that is compatible with GZIP. GZIP is a command-line file compression utility distributed under the GNU General Public License. GZIP is also a lossless compressed data format that is available as an open-source variant of Lempel-Ziv compression. GZIP compression is accomplished by finding duplicate strings in the input data stream. The second occurrence of a string is replaced by a pointer to the previous string, in the form of a pair (distance, length). Distances are limited to 32 K bytes, and lengths are limited to 258 bytes. When a string does not occur anywhere in the previous 32 K bytes, it is emitted as a sequence of literal bytes.
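To illustrate how well repetitive markup compresses, the sketch below uses the `gzip` module from the Python standard library, which reads and writes the GZIP format. The sample HTML fragment is made up for the example; actual savings depend on the page.

```python
import gzip

# A made-up, highly repetitive HTML fragment (200 identical table rows).
html = b"<tr><td>cell</td></tr>\n" * 200

compressed = gzip.compress(html)
restored = gzip.decompress(compressed)

print(len(html), "bytes before,", len(compressed), "bytes after")
print("lossless round trip:", restored == html)
```

Because every row after the first duplicates earlier data, most of the input can be expressed as back-references rather than literal bytes.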