An ideal transmission of an image over a digital network consists of the image being reduced to a minimum amount of information and faithfully reproduced at the receiving end without loss of detail. Although the image data can be compressed for transmission efficiency, the amount of compression is limited by practical concerns and by a theoretical limit. Source coding theory sets the limit for lossless data compression at the entropy rate, “S.” It is not possible to compress data—without data loss—using a compression rate that exceeds S. If some distortion can be tolerated, however, then “lossy” data compression using a rate-distortion function can provide a data compression rate that exceeds S, but the decompressed data is not exactly the same as the original data. In the case of an image, the tradeoff between a desirable data compression rate and the introduction of some distortion in the transmitted image may be acceptable as the human brain can compensate for many types of visual artifacts introduced into images by compression techniques.
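The entropy limit described above can be illustrated with a minimal sketch. It assumes a memoryless source, so it computes only the empirical zeroth-order entropy in bits per symbol, which is an estimate of the limit rather than the true entropy rate of real image data. The function name and the sample byte strings are hypothetical.

```python
import math
from collections import Counter


def entropy_bits_per_symbol(data: bytes) -> float:
    """Empirical zeroth-order Shannon entropy, in bits per symbol.

    Assumption: symbols are treated as independent, so this is only an
    estimate of the lossless limit for sources with memory.
    """
    counts = Counter(data)
    total = len(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# Hypothetical sample data: four equally likely symbols vs. one dominant symbol.
uniform = bytes([0, 1, 2, 3] * 4)
skewed = bytes([0] * 14 + [1, 2])

print(entropy_bits_per_symbol(uniform))
print(entropy_bits_per_symbol(skewed))
```

The skewed data has the lower entropy, so under this model it can be losslessly coded with fewer bits per symbol than the uniform data.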
The Moving Picture Experts Group (MPEG) has adopted various algorithms and standards for single image and video sequence digital data compression. MPEG compression is versatile because it is a composite or toolkit of compression techniques that work together to compress different aspects of an image or a video sequence. For example, a frequency transform known as the discrete cosine transform (DCT) performs transform coding, a spatial compression on each 8×8 pixel matrix composing an image; motion compensation performs a temporal compression on macroblocks consisting of four 8×8 pixel matrices; entropy coding performs statistical compression of coefficients resulting from the DCT; and quantization performs subjective compression of the DCT coefficients.
Consecutive frames of video are often very similar and hence contain approximately the same information, albeit with slight changes that often result from motion being portrayed in the video sequence. As the number of frames or samples used to portray motion increases per unit time, the amount of change between frames decreases. Motion compensation attempts to find matched or unchanged areas common between frames. These “matches” are encoded via translation vectors. Since their composition is known, matched areas between a first frame and a second frame being predicted from the first frame are each allocated a pointer, the translation vector, and removed from further prediction calculations. Once the matches have been removed, the frame that the encoder is attempting to predict and/or encode is often left with little or no information. What remains is called the residual frame. In macroblocks where prediction is being applied, the DCT is performed on the prediction errors instead of on the image itself.
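The matching step described above can be sketched as an exhaustive block search. This is an illustration under stated assumptions, not the search used by any particular MPEG encoder: the cost measure (sum of absolute differences), the 4×4 block size, the ±2 pixel search range, and all function names are hypothetical choices made to keep the example small.

```python
def sad(ref, cur, ry, rx, cy, cx, size):
    """Sum of absolute differences between a reference block and a current block."""
    return sum(
        abs(ref[ry + i][rx + j] - cur[cy + i][cx + j])
        for i in range(size)
        for j in range(size)
    )


def find_motion_vector(ref, cur, cy, cx, size=4, search=2):
    """Return (dy, dx, cost): the offset into ref that best predicts the block
    of cur whose top-left corner is (cy, cx), found by exhaustive search."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            ry, rx = cy + dy, cx + dx
            if 0 <= ry <= height - size and 0 <= rx <= width - size:
                cost = sad(ref, cur, ry, rx, cy, cx, size)
                if best is None or cost < best[0]:
                    best = (cost, dy, dx)
    return best[1], best[2], best[0]


def residual_block(ref, cur, cy, cx, dy, dx, size=4):
    """Prediction error left over after motion compensation."""
    return [
        [cur[cy + i][cx + j] - ref[cy + dy + i][cx + dx + j] for j in range(size)]
        for i in range(size)
    ]


# Hypothetical 8x8 frames: the current frame is the reference shifted right by one pixel.
SIZE = 8
ref = [[7 * y * y + 3 * x * x + x * y for x in range(SIZE)] for y in range(SIZE)]
cur = [[row[max(x - 1, 0)] for x in range(SIZE)] for row in ref]

dy, dx, cost = find_motion_vector(ref, cur, cy=2, cx=2)
residual = residual_block(ref, cur, 2, 2, dy, dx)
print((dy, dx), cost, residual)
```

Because the content moved one pixel to the right, the best match for the block lies one pixel to the left in the reference frame, and the residual for that block is empty. With real video the match is rarely exact, and the residual carries the prediction error.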
Most video compression techniques rely heavily on motion compensation and on encoding of the residual frame. Often, the aforementioned matches are not exact and there is “leftover” information in the predicted frame (the one that the encoder is encoding) that still needs to be encoded. A typical residual frame looks “almost blank,” with pockets of energy that represent the “errors” in the matches (prediction error). During transform coding, these errors are operated on by the DCT, which converts them into the frequency domain. The frequency information is then compressed via a form of entropy coding called variable length coding or Huffman encoding. Huffman codes are widely used to convert a string of data to tokens, each having a length that decreases as the frequency of use of the encoded character increases. For example, to transmit Huffman-encoded English language text, a token for the letter “e” is allotted very few bits, because “e” is the most common letter in English text. In MPEG compression, the Huffman type entropy coding usually includes several variable length code tables available to a decoder.
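The variable length coding idea can be sketched with a small Huffman coder for text. This is an illustration only: it builds a code table from the sample text itself, whereas the passage above describes MPEG decoders as having variable length code tables already available to them. The function names and the sample sentence are hypothetical.

```python
import heapq
from collections import Counter


def huffman_code(text):
    """Build a prefix code (symbol -> bit string) from symbol frequencies."""
    counts = Counter(text)
    if len(counts) == 1:
        return {next(iter(counts)): "0"}
    # Heap entries: (weight, tie_breaker, {symbol: partial code}).
    # The unique tie_breaker keeps the dictionaries from ever being compared.
    heap = [(n, i, {sym: ""}) for i, (sym, n) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w0, _, low = heapq.heappop(heap)
        w1, _, high = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in low.items()}
        merged.update({s: "1" + c for s, c in high.items()})
        heapq.heappush(heap, (w0 + w1, tie, merged))
        tie += 1
    return heap[0][2]


def encode(text, code):
    return "".join(code[ch] for ch in text)


def decode(bits, code):
    inverse = {c: s for s, c in code.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:  # prefix-free, so the first hit is the symbol
            out.append(inverse[current])
            current = ""
    return "".join(out)


text = "the cheese seller sees eleven geese"
code = huffman_code(text)
bits = encode(text, code)
print(code["e"], len(bits), 8 * len(text))
```

In this sample the letter “e” is the most frequent symbol and therefore receives one of the shortest tokens, and the encoded bit string is shorter than a fixed 8 bits per character.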
Before Huffman entropy coding, prediction errors are first passed through the DCT transform coding stage in order to reduce the number of non-zero terms. Even though energy pockets (the visual information that did not exactly match during prediction between frames) are found throughout the residual frame, their frequency content is limited. Hence, by converting the residual frame into the frequency domain, an encoder can reduce the number of non-zero elements, which leads to better packing, i.e., compression.
A complete frame of an image is typically divided into 8×8 “blocks” for transform coding. The DCT converts small blocks of an image (transforming the entire image at once would be too complex) from the spatial domain into the frequency domain, as mentioned. The DCT represents a visual block of image pixels as a matrix of coefficients. For example, the color values used in an image are approximated by a sum of cosine functions weighted by the coefficients. Thus, instead of representing visual data spatially as a set of 64 values arrayed in an 8×8 matrix, transform coding using DCT represents the visual data as a varying signal approximated by a set of 64 cosine functions with respective amplitudes. Desirable compression rates result if many of these 64 amplitudes equal zero.
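The 8×8 transform can be sketched directly from its definition. The sketch below assumes the orthonormal form of the type-II DCT and uses a slow direct summation for clarity; practical encoders use fast algorithms, and the function names and sample blocks are hypothetical.

```python
import math

N = 8


def _scale(k):
    # Normalization that makes the transform orthonormal (an assumed convention).
    return math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N)


# BASIS[k][n]: sample n of the k-th cosine basis function.
BASIS = [
    [_scale(k) * math.cos((2 * n + 1) * k * math.pi / (2 * N)) for n in range(N)]
    for k in range(N)
]


def dct2(block):
    """8x8 block of pixel values -> 8x8 matrix of cosine amplitudes."""
    return [
        [
            sum(BASIS[u][y] * BASIS[v][x] * block[y][x]
                for y in range(N) for x in range(N))
            for v in range(N)
        ]
        for u in range(N)
    ]


def idct2(coeffs):
    """Inverse transform: 8x8 matrix of amplitudes -> 8x8 block of pixel values."""
    return [
        [
            sum(BASIS[u][y] * BASIS[v][x] * coeffs[u][v]
                for u in range(N) for v in range(N))
            for x in range(N)
        ]
        for y in range(N)
    ]


def count_nonzero(coeffs, tol=1e-6):
    return sum(abs(c) > tol for row in coeffs for c in row)


flat = [[100] * N for _ in range(N)]                    # uniform block
ramp = [[10 * x for x in range(N)] for _ in range(N)]   # smooth left-to-right ramp

print(count_nonzero(dct2(flat)), count_nonzero(dct2(ramp)))
```

A uniform block needs only a single non-zero amplitude, and a smooth ramp needs only a handful of the 64, which is the situation in which the transform leads to good compression.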
The DCT coefficients in the first row of a matrix describe horizontal spatial frequencies, those in the first column describe vertical spatial frequencies, and the other DCT coefficients in the matrix describe diagonal components. Since different spatial frequencies have a different impact on human perception of an image, the DCT is important for applying subjective compression as well as purely spatial compression.
DCT coded blocks are excellent starting material for an MPEG quantization compression step because, after DCT coefficients are coarsely quantized, an inverse DCT of the quantized coefficients does not noticeably degrade the resulting image. Coarse quantization discards image detail information: the compression is accomplished by reducing the number of bits used to describe each pixel, rather than reducing the number of pixels as in sub-sampling techniques. Each pixel is reassigned an alternative value, and the number of allowed or possible alternative values is less than the number present in the original image. In a grey-scale image, for example, the number of shades of grey that pixels can have is reduced, i.e., fewer greys are used and the greys have wider ranges into which each pixel must be fitted. Quantization where the number of ranges is small is known as coarse quantization.
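The effect of coarse quantization on a set of coefficients can be sketched as follows. The sketch assumes a single uniform step size for every coefficient, which is a simplification: MPEG encoders typically weight the step by spatial frequency. The coefficient values and function names are hypothetical.

```python
def quantize(coeffs, step):
    """Map each coefficient to an integer level; a larger step is coarser."""
    return [[round(c / step) for c in row] for row in coeffs]


def dequantize(levels, step):
    """Reconstruct approximate coefficients from the integer levels."""
    return [[level * step for level in row] for row in levels]


def count_zeros(levels):
    return sum(level == 0 for row in levels for level in row)


# Hypothetical 4x4 corner of a DCT coefficient matrix (illustrative values only).
coeffs = [
    [812.0, -31.5, 6.2, 1.1],
    [-24.3, 9.8, -2.7, 0.4],
    [5.1, -3.3, 1.2, -0.6],
    [1.9, 0.7, -0.3, 0.2],
]

fine = quantize(coeffs, step=2)
coarse = quantize(coeffs, step=16)
restored = dequantize(coarse, step=16)

print(count_zeros(fine), count_zeros(coarse))
```

The coarser step leaves fewer distinct levels and more zeros, which is what helps the later entropy coding stage. The price is a reconstruction error of up to half a step per coefficient.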
The DCT, which provides frequency information for the Huffman coding and the quantization, works well (i.e., takes a large image and outputs a relatively small set of numbers that can represent the image in the frequency domain) if the residual image is “smooth.” The smoothness of an image is important to data compression. Since human perception notices a large object more than tiny details within the large object, low spatial frequency information is more important to retain during data compression than high spatial frequency information. Several steps of an MPEG set of compression techniques may filter and discard the high spatial frequency information as required by bandwidth limitations.
Cosine functions as used in the DCT are inherently smooth periodic functions, deriving from properties of smoothly changing periodic (circular or oscillatory) motion. Thus, DCT techniques work best with images that have smooth color and brightness changes between and/or across small areas, that is, across adjacent pixels. In other words, images with many sharp edges (a large quantity of sharp, small-scale detail that is not redundant across the image) are more difficult to compress: there is simply more visual information represented in the image, and proportionately more data needed to faithfully represent the image. These small, sharp visual details are difficult to “fit” to an inherently smooth cosine function. Fortunately, in many video sequences, much of this type of detail is extraneous, random noise that is not part of the video sequence and can be removed.
Artifacts can be unwittingly introduced in a video sequence when the camera moves, when the focus changes, and when other “mistakes” occur, such as subtle changes in the lighting of a scene over time. Since these artifacts are subtle, they appear as high variance noise in the residual frame that is the starting material for the DCT, and result in a great deal of high frequency energy in the DCT output. This high frequency energy is undesirable for attaining favorable data compression.
Even when high spatial frequency detail is not present as noise (the image may just have a lot of detail, movement, and resulting high frequency error), the high spatial frequency detail can often be left out without noticeable degradation. A visual presentation is often improved by removing “molecularly” precise detail, i.e., a too small-scale faithfulness to detail can appear flawed to the eye. Thus, in the quantization compression step or when an image is decompressed, a filter may be used to remove some of the detail. Once high spatial frequency information has been discarded in favor of a higher data compression rate, however, the original detail cannot be recovered: if an image is smoothed by having detail discarded and then compressed and transmitted, a decoder at the receiving end cannot regenerate the detail, since it has been irreversibly discarded.