Data compression is an essential component of many applications, and therefore much attention has been given to the problem of improving data compression. The internet requires data compression to make it practical to put images, audio and video on websites. Digital TV, satellite TV and the recording of movies on DVD likewise require compression. The JPEG and MPEG standards use lossy data compression to represent an image or video. Recently announced programs by Google to scan the entire research libraries of Harvard, Stanford and the New York Public Library show a trend toward huge image and text databases that will require compression if they are to be maintained on reasonable amounts of physical storage media or to be efficiently accessed. Transmission of image files between computers using email attachments or between cell phones also benefits from compression of the files. Such programs as ZIP provide lossless compression for the individual user.
Compression systems may be divided into lossy and lossless systems, the lossless systems being those where the original file can be exactly reconstructed when compression is reversed. The most common methods of compression are Huffman coding, Arithmetic coding, PPM (Prediction with Partial Match), Markov coding, RLE (Run Length Encoding), and multimedia compression schemes such as JPEG/MPEG. In this context, coding includes the assignment of binary sequences to the elements being encoded.
Huffman coding is a lossless entropy encoding algorithm that finds the optimal system of encoding strings based on the relative frequency of each character. Huffman coding uses a specific method for choosing the representations for each symbol, resulting in a prefix-free code (that is, no bit string of any symbol is a prefix of the bit string of any other symbol) that expresses the most common characters in the shortest way possible. It has been proven that when the actual symbol frequencies agree with those used to create the code, Huffman coding is the most effective compression method of this type: no other mapping of source symbols to strings of bits will produce a smaller output.
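The construction described above can be sketched in a few lines of Python. This is a minimal illustration rather than a production encoder: the function name `huffman_code` and the dictionary-of-codewords representation are assumptions made for this example, and ties between equal frequencies are broken arbitrarily.

```python
import heapq
from collections import Counter

def huffman_code(text):
    """Return a dict mapping each symbol in text to a prefix-free bit string."""
    freq = Counter(text)
    if not freq:
        return {}
    if len(freq) == 1:
        # A single distinct symbol still needs a one-bit codeword.
        return {next(iter(freq)): "0"}
    # Heap entries are (weight, tiebreak, partial code table). The unique
    # tiebreak integer keeps Python from ever comparing the dicts.
    heap = [(count, i, {sym: ""}) for i, (sym, count) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        # Merge the two least frequent subtrees, prefixing their codewords.
        w1, _, codes1 = heapq.heappop(heap)
        w2, _, codes2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in codes1.items()}
        merged.update({s: "1" + c for s, c in codes2.items()})
        heapq.heappush(heap, (w1 + w2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

code = huffman_code("aaaabbc")
encoded = "".join(code[s] for s in "aaaabbc")
```

In this example the most frequent symbol `a` receives a one-bit codeword and the two rarer symbols receive two-bit codewords, so the seven-symbol input is encoded in ten bits.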
Arithmetic encoding avoids a problem with Huffman coding, namely the need for codewords for all possible sequences of a given length in order to encode a particular sequence of that length. Arithmetic encoding assigns a unique tag to each distinct sequence of symbols by using the cumulative distribution function to map the sequence of symbols into values in the unit interval.
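As a sketch of how the cumulative distribution function maps a sequence into the unit interval, the following Python fragment narrows an interval one symbol at a time. The function name `arithmetic_interval`, the use of exact fractions, and the alphabetical ordering of symbols in the cumulative distribution are assumptions made for this illustration; a practical coder would use finite-precision integer arithmetic and emit bits incrementally.

```python
from fractions import Fraction

def arithmetic_interval(sequence, probabilities):
    """Return the [low, high) subinterval of [0, 1) that identifies sequence."""
    # Cumulative distribution: cdf[s] is the total probability of all
    # symbols ordered before s (symbols are ordered by sorting).
    cdf = {}
    running = Fraction(0)
    for s in sorted(probabilities):
        cdf[s] = running
        running += Fraction(probabilities[s])
    low, width = Fraction(0), Fraction(1)
    for s in sequence:
        # Shrink the current interval to the slice assigned to symbol s.
        low += width * cdf[s]
        width *= Fraction(probabilities[s])
    return low, low + width

probs = {"a": Fraction(1, 2), "b": Fraction(1, 4), "c": Fraction(1, 4)}
low, high = arithmetic_interval("ab", probs)
tag = (low + high) / 2  # any value inside the interval can serve as the tag
```

Each distinct sequence of a given length lands in its own disjoint subinterval, so no table of codewords for all possible sequences is required.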
PPM is a context-based algorithm which uses the context of a symbol to estimate the probability of its value. The probability is estimated as the coding proceeds as opposed to estimating and storing a large number of conditional probabilities in advance of coding. One parameter in a PPM encoding scheme is the maximum length for a context. Another parameter is the count assigned to the escape symbol that indicates that a symbol to be encoded has not previously been encountered with a context.
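The roles of the two parameters can be illustrated with a simplified probability estimator. This sketch assumes a fixed escape count added to the total for each context (in the style of the variant often called PPM method A), omits the exclusion mechanism used by practical PPM coders, and falls back to a uniform distribution over a hypothetical 256-symbol alphabet; the names `ppm_probability`, `max_order`, `escape_count` and `alphabet_size` are introduced here for illustration only.

```python
from collections import Counter

def ppm_probability(history, symbol, max_order=2, escape_count=1, alphabet_size=256):
    """Estimate P(symbol | history), backing off from long to short contexts."""
    prob = 1.0
    for order in range(min(max_order, len(history)), -1, -1):
        context = history[len(history) - order:]
        # Count the symbols that followed this context earlier in the history.
        counts = Counter()
        for i in range(len(history) - order):
            if history[i:i + order] == context:
                counts[history[i + order]] += 1
        total = sum(counts.values())
        if total == 0:
            continue  # context never seen before: try a shorter one
        if symbol in counts:
            return prob * counts[symbol] / (total + escape_count)
        # Symbol not seen in this context: pay for an escape and back off.
        prob *= escape_count / (total + escape_count)
    # Symbol never seen at all: uniform over the assumed alphabet.
    return prob / alphabet_size

p_seen = ppm_probability("abracadabra", "c")     # predicted by the context "ra"
p_backoff = ppm_probability("abracadabra", "b")  # requires one escape
```

Raising `max_order` lets longer contexts sharpen the prediction, while `escape_count` controls how much probability is reserved for symbols not yet seen in a context.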
Markov encoding relies upon the last few samples of a process to predict the probability of the next symbol and is a form of predictive encoding. One parameter in such encoding is the number of samples that are considered sufficient for purposes of the prediction.
RLE encoding codes the lengths of runs of a particular pixel value rather than coding individual values.
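A minimal sketch of run-length encoding follows. The (value, run length) pair representation and the function names `rle_encode` and `rle_decode` are choices made for this example, not a reference to any particular file format.

```python
from itertools import groupby

def rle_encode(pixels):
    """Collapse each run of equal values into a (value, run_length) pair."""
    return [(value, len(list(run))) for value, run in groupby(pixels)]

def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original sequence."""
    return [value for value, count in runs for _ in range(count)]

row = [0, 0, 0, 1, 1, 0]
runs = rle_encode(row)       # [(0, 3), (1, 2), (0, 1)]
restored = rle_decode(runs)  # identical to row, so the scheme is lossless
```

The encoding only saves space when the input contains long runs; data with few repeated neighbors can grow under this representation.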
MPEG achieves high compression rates for video by storing only the changes from one frame to another, instead of each entire frame. In both MPEG and JPEG, the image information is encoded using a technique called the DCT (Discrete Cosine Transform). MPEG and JPEG use a type of lossy compression, since some data is removed, but the loss of data is generally imperceptible to the human eye. The data that remains is then encoded with a Huffman encoding scheme. The overall result is lossy compression.
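The way a DCT-based scheme discards data can be sketched in one dimension. The fragment below is an illustration only: the image standards apply a two-dimensional transform to blocks of pixels with per-coefficient quantization tables, whereas this example assumes a single uniform `step` and an unnormalized one-dimensional DCT; the function names are hypothetical.

```python
import math

def dct(x):
    """Unnormalized DCT-II of a list of samples."""
    n = len(x)
    return [sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
            for k in range(n)]

def idct(c):
    """Inverse of dct() above (a scaled DCT-III)."""
    n = len(c)
    return [(c[0] / 2 + sum(c[k] * math.cos(math.pi * (i + 0.5) * k / n)
                            for k in range(1, n))) * 2 / n
            for i in range(n)]

def quantize(coefficients, step):
    """Round each coefficient to the nearest multiple of step (the lossy part)."""
    return [round(c / step) * step for c in coefficients]

samples = [52, 55, 61, 66, 70, 61, 64, 73]
exact = idct(dct(samples))                 # matches samples up to rounding error
lossy = idct(quantize(dct(samples), 10))   # close to samples, but not identical
```

The transform itself is invertible; the loss is introduced only by the quantization step, after which the quantized coefficients can be passed to a lossless entropy coder.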
Lossy systems make deliberate sacrifices of data which are deemed not essential. A lossy system for audio transmission of music may, for example, dispense with data that records frequencies of sound beyond the ability of the intended reproduction medium. A lossy system for video transmission of images may, for example, dispense with data that records color differences too subtle for an intended reproduction medium. Lossless systems are generally more desirable because they enable complete reproduction of the original without the losses that some programmer assumed would be tolerable; lossy systems are in effect merely a compromise to permit effective compression.
There are criteria that should be met in order for any of these compression methods to work efficiently. Typically, there need to be successive runs of similar data elements, or elements that have been mapped to a different code source. Lossy compression techniques truncate information by using association, quantization, or simply by only encoding information within a set boundary. In most cases this may be acceptable because the data is not imperative to the application or source and can therefore be cut out. Each of the compression methods discussed makes assumptions about the data to be encoded and has parametric values that may be adjusted to define a specific implementation of the encoding algorithm more suitable to particular data.
In general it is known to evaluate the efficiency of a compression scheme by comparing the bit length required to encode data with the entropy of the data in the particular scheme. If $\{X_1, X_2, \ldots, X_n\}$ is a sequence of length $n$ from a source having $m$ different characters, let
$$G_n = -\sum P(X_1, X_2, \ldots, X_n) \log P(X_1, X_2, \ldots, X_n)$$
where $P(X_1, X_2, \ldots, X_n)$ represents the probability of finding in the data particular values for $X_1$, $X_2$, etc. and the sum is over all possible particular values. Then the entropy $H$ is defined as
$$H = \lim_{n \to \infty} \frac{1}{n} G_n,$$
where the limit is as $n$ approaches infinity.
For independent, identically distributed elements in the sequence this is the same as
$$H = -\sum P(X_1) \log P(X_1).$$
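For the independent, identically distributed case, the entropy can be estimated directly from symbol counts. The sketch below assumes base-2 logarithms, so the result is in bits per symbol (the text above does not fix the base), and it substitutes observed relative frequencies for the true probabilities; the function name `empirical_entropy` is introduced for this example.

```python
import math
from collections import Counter

def empirical_entropy(data):
    """Estimate entropy in bits per symbol from observed symbol frequencies."""
    counts = Counter(data)
    n = len(data)
    # H = -sum p * log2(p), with p taken as the relative frequency.
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

h = empirical_entropy("aaaabbcc")  # probabilities 1/2, 1/4, 1/4
bound_bits = h * len("aaaabbcc")   # estimated minimum total bits for this input
```

Comparing `bound_bits` with the number of bits an encoder actually produces gives a simple measure of how close that encoder comes to the entropy under this model.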
It would be desirable to attempt different transformations of particular data on the fly by adjusting the parameters that define a particular encoding scheme or transformation and selecting the one that is optimal. Lossy compression transformations may be characterized by arbitrary parameters, which are usually chosen to achieve desired compression ratios. Arbitrarily small changes of these parameters are permitted by lossy compression algorithms. Thus lossy encoding schemes allow continuous variation of their defining parameters, with continuously varying results in efficiency. However, such flexibility does not exist for known lossless encoding schemes, which do not produce comparable slowly varying coding efficiencies when their defining parameters are slowly varied. The requirement that compression be lossless usually leads to very strict limitations, which do not permit one to use continuously adjustable parameters. For example, the LZW algorithm does not permit any continuously variable parameters.
What is desired are methods for lossless data compression which allow the adjustment of parameters during encoding and thus the optimization of compression. In particular, such a method is desired for the encoding of raster images on a device having a digital processor, and for their subsequent decoding.
What is also needed are methods for combining lossy and lossless data compression into overall lossless methods that have continuously variable parameters that permit the improvement or optimization of the compression process based upon trials with the particular data being compressed.