Data compression occurs in a number of contexts. It is very commonly used in communications and computer networking to store, transmit, and reproduce information efficiently. It finds particular application in the encoding of images, audio and video. Video presents a significant challenge to data compression because of the large amount of data required for each video frame and the speed with which encoding and decoding often need to occur. The current state of the art for video encoding is the ITU-T H.264/AVC video coding standard. It defines a number of different profiles for different applications, including the Main profile, Baseline profile and others. A next-generation video encoding standard, termed High Efficiency Video Coding (HEVC), is currently under development through a joint initiative of MPEG-ITU.
There are a number of standards for encoding/decoding images and videos, including H.264, that use block-based coding processes. In these processes, the image or frame is divided into blocks, typically 4×4 or 8×8, although non-square blocks may be used in some cases, and the blocks are spectrally transformed into coefficients, quantized, and entropy encoded. In many cases, the data being transformed is not the actual pixel data, but is residual data following a prediction operation. In video coding, predictions can be intra, i.e. based on one or more reconstructed pixels within the same frame/image, or inter (also called motion prediction), i.e. based on reconstructed pixels of a previously coded picture or image.
After a prediction block is generated, it is subtracted from the original block, leaving a residual block. The residual block is transformed to the frequency domain (often using a discrete cosine transform, or DCT) to produce a block of transform domain coefficients, which are then quantized. The quantized transform domain coefficients are entropy coded and output as a bitstream of encoded data.
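By way of illustration only, the residual, transform and quantization steps described above may be sketched as follows. The sketch assumes a floating-point orthonormal DCT-II and a single uniform step size; practical codecs such as H.264 use integer approximations of the transform, and the function names, block values and step size here are hypothetical. Entropy coding is omitted.

```python
import math


def residual_block(original, prediction):
    """Subtract the prediction from the original, sample by sample."""
    return [[o - p for o, p in zip(orig_row, pred_row)]
            for orig_row, pred_row in zip(original, prediction)]


def dct_2d(block):
    """Orthonormal 2-D DCT-II of a square block given as a list of rows."""
    n = len(block)

    def scale(k):
        # Normalization factors that make the transform orthonormal.
        return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)

    coeffs = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            total = 0.0
            for x in range(n):
                for y in range(n):
                    total += (block[x][y]
                              * math.cos((2 * x + 1) * u * math.pi / (2 * n))
                              * math.cos((2 * y + 1) * v * math.pi / (2 * n)))
            coeffs[u][v] = scale(u) * scale(v) * total
    return coeffs


def quantize_block(coeffs, step):
    """Uniform quantization: divide by the step size and round to an integer."""
    return [[int(round(c / step)) for c in row] for row in coeffs]


# Hypothetical 4x4 block whose prediction is off by a constant 4.
original = [[104] * 4 for _ in range(4)]
prediction = [[100] * 4 for _ in range(4)]

levels = quantize_block(dct_2d(residual_block(original, prediction)), step=8)
```

In this example the residual is flat, so all of its energy falls in the single DC coefficient and only one nonzero quantized level remains to be entropy coded.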
Most coding schemes attempt to balance distortion in a reconstructed picture with the bit rate. The quantization operation introduces distortion. With larger quantization step sizes comes larger distortion, but conversely larger quantization step sizes lead to smaller quantized coefficients and, as a result, a lower bit rate. The simplest quantizer uses the same quantization step size for all coefficients in a picture or image.
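The trade-off described above may be illustrated with a minimal sketch that applies the same step size to every coefficient. The coefficient values are hypothetical, and the sum of absolute quantized levels is used only as a rough stand-in for bit rate, which in practice depends on the entropy coder.

```python
def quantize(coeffs, step):
    """Map each coefficient to an integer level using one common step size."""
    # Python's round() sends exact halves to the nearest even integer.
    return [int(round(c / step)) for c in coeffs]


def dequantize(levels, step):
    """Reconstruct coefficient values from their integer levels."""
    return [level * step for level in levels]


def mean_squared_error(a, b):
    """Average squared difference between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Hypothetical transform domain coefficients for one block.
coeffs = [52.0, -31.0, 14.0, -7.0, 3.0, -2.0, 1.0, 0.0]

results = {}
for step in (2, 8, 32):
    levels = quantize(coeffs, step)
    distortion = mean_squared_error(coeffs, dequantize(levels, step))
    magnitude = sum(abs(level) for level in levels)  # crude bit-rate proxy
    results[step] = (distortion, magnitude)
    print(step, levels, distortion, magnitude)
```

For these values, each increase in step size raises the distortion and lowers the total magnitude of the levels to be coded.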
The human visual system does not have the same sensitivity to all distortion. For example, humans are more sensitive to distortion in lower frequency components than to distortion in higher frequency components. The measure of distortion most commonly used is peak signal-to-noise ratio (PSNR), which is derived from the mean squared error between spatial domain pixels in the reconstructed picture and those in the original picture. This is not necessarily an accurate representation of human sensitivity to distortion.
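For reference, PSNR is conventionally computed as 10·log10(peak² / MSE), where peak is the maximum possible sample value. The following is a minimal sketch assuming 8-bit samples (peak of 255) and pictures stored as lists of rows; the function names are hypothetical.

```python
import math


def mean_squared_error(original, reconstructed):
    """Mean squared error over all spatial domain samples of two pictures."""
    total = 0.0
    count = 0
    for orig_row, recon_row in zip(original, reconstructed):
        for o, r in zip(orig_row, recon_row):
            total += (o - r) ** 2
            count += 1
    return total / count


def psnr(original, reconstructed, peak=255):
    """Peak signal-to-noise ratio in decibels; infinite for identical pictures."""
    err = mean_squared_error(original, reconstructed)
    if err == 0:
        return math.inf
    return 10.0 * math.log10(peak * peak / err)
```

Because every sample error is weighted equally, two reconstructions with the same mean squared error receive the same PSNR regardless of where in the picture, or at which frequencies, the errors occur.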
Work on human perception of video distortion has led to the development of various measurements of “structural similarity” (SSIM) between an original picture and its reconstruction, which may be a better representation of human perception of error than PSNR. A structural similarity metric may take into account the mean values of the two pictures (or a window or block of pixels), the variance within each of those pictures and the covariance of those two pictures. SSIM may, therefore, be useful in making coding decisions, including the level of quantization to apply to a particular set of pixel data. Actual structural similarity metrics may be complex to calculate and may require multiple passes due to the necessity of calculating mean and variance values for a whole picture or grouping of pixels. This may introduce unacceptable delay and/or computational burden. Nonetheless, it would be advantageous to be able to adapt the quantization of coefficients to local statistics of the data.
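As one concrete illustration, a widely cited form of the structural similarity index for a single window combines the means, variances and covariance as shown in the sketch below. This is an assumed example rather than a metric defined in the present description; the stabilizing constants k1 = 0.01 and k2 = 0.03 and the 8-bit peak value are conventional choices, and the function name is hypothetical.

```python
def ssim_window(x, y, peak=255, k1=0.01, k2=0.03):
    """Structural similarity of two equal-length windows of samples."""
    n = len(x)

    # First pass: mean of each window.
    mu_x = sum(x) / n
    mu_y = sum(y) / n

    # Second pass: variances and covariance, which depend on the means.
    var_x = sum((a - mu_x) ** 2 for a in x) / (n - 1)
    var_y = sum((b - mu_y) ** 2 for b in y) / (n - 1)
    cov_xy = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / (n - 1)

    # Small constants that keep the ratio stable when the statistics are near zero.
    c1 = (k1 * peak) ** 2
    c2 = (k2 * peak) ** 2

    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator
```

The two passes over the samples in this sketch reflect the dependency noted above: the variance and covariance terms cannot be formed until the means are known.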
Similar reference numerals may have been used in different figures to denote similar components.