1. General Background
In image and video coding, compression efficiency is achieved by exploiting spatial redundancies in input images to produce a file or bit-stream representing the images using fewer bits than the uncompressed images. In video coding, both spatial and temporal redundancies are used. Intra coding refers to coding techniques that use only spatial redundancy, i.e., pixels in a block of the current image exploit redundancy from previously-decoded pixels in the same image. Temporal redundancy is leveraged in inter coding, where pixel data from previously-decoded images, other than the current image, are used in the compression process.
In the High Efficiency Video Coding (HEVC) standard and other common standards, the encoder or decoder makes decisions on modes and parameters based on a cost function or metric. The cost typically considers the rate, or number of bits needed to code the image or video using a given mode or parameter, and a distortion metric, typically a mean-square error (MSE) or a mean-absolute error.
The distortion or error metric is typically determined on a pixel-by-pixel basis, e.g., a difference of pixels between two blocks of data. Such distortion metrics, however, do not fully consider aspects of human perception that affect the visibility of the distortion. For example, the human visual system is less sensitive to distortion in very bright or very dark areas of an image as compared to distortion in regions of an image that have a medium level of brightness. Taking the mean-square error between two blocks of pixels does not take advantage of these aspects of human perception.
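The pixel-by-pixel nature of these metrics can be illustrated with a minimal sketch. The function names and the sample blocks are hypothetical and chosen only for illustration; the sketch shows that the same pixel error yields the same MSE and mean-absolute error in a dark block and in a medium-brightness block, even though the text above notes that the visibility differs.

```python
def mse(block_a, block_b):
    """Mean-square error between two equally sized 2-D blocks (lists of rows)."""
    diffs = [a - b for ra, rb in zip(block_a, block_b) for a, b in zip(ra, rb)]
    return sum(d * d for d in diffs) / len(diffs)


def mae(block_a, block_b):
    """Mean-absolute error between two equally sized 2-D blocks."""
    diffs = [a - b for ra, rb in zip(block_a, block_b) for a, b in zip(ra, rb)]
    return sum(abs(d) for d in diffs) / len(diffs)


# Hypothetical 2x2 blocks: the same +10 error applied to a dark block
# and to a medium-brightness block.
dark = [[5, 5], [5, 5]]
dark_coded = [[15, 15], [15, 15]]
mid = [[128, 128], [128, 128]]
mid_coded = [[138, 138], [138, 138]]

# Both metrics are blind to the brightness of the region.
print(mse(dark, dark_coded), mse(mid, mid_coded))
print(mae(dark, dark_coded), mae(mid, mid_coded))
```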
Image and video coding schemes that incorporate human perception have typically adjusted how quantization is applied after a block of the image undergoes a transformation. For example, a Discrete Cosine Transform (DCT) may be applied to a block of pixels, and then the resulting transform coefficients are quantized using quantizers with different levels of coarseness. Variations in very high frequencies are often less perceptible to the human visual system than are the lower frequencies, so high-frequency coefficients may be quantized more coarsely than coefficients corresponding to smooth areas of a block. Similarly, very bright or dark areas may be quantized more coarsely than medium-intensity areas of an image.
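A minimal sketch of frequency-dependent quantization follows. It assumes the transform coefficients are already available, and the step-size rule (a base step multiplied by 1 + u + v, where u and v are the row and column frequency indices) is a hypothetical choice made only to show coarser quantization at higher frequencies; it is not taken from any standard.

```python
def quantize(coeffs, base_step=4.0):
    """Quantize a 2-D block of transform coefficients.

    The step size grows with the frequency index (u + v); this rule is an
    illustrative assumption, not a standardized quantization matrix.
    """
    return [[round(c / (base_step * (1 + u + v))) for v, c in enumerate(row)]
            for u, row in enumerate(coeffs)]


def dequantize(levels, base_step=4.0):
    """Reconstruct coefficient values from quantization levels."""
    return [[q * base_step * (1 + u + v) for v, q in enumerate(row)]
            for u, row in enumerate(levels)]


# Hypothetical 2x2 coefficient block: DC term at [0][0], higher frequencies elsewhere.
coeffs = [[100.0, 22.0], [22.0, 22.0]]
levels = quantize(coeffs)
recon = dequantize(levels)
print(levels)
print(recon)
```

In this sketch the low-frequency coefficient is reconstructed exactly, while the coefficients quantized with larger steps are not.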
HEVC and similar coding systems use prediction to achieve high compression efficiency. Currently, the various prediction modes available to the codec are selected based on minimizing a cost that includes mean-square or absolute errors between blocks of pixels.
To the best of our knowledge, perceptual techniques have not been used for computing a prediction.
The result of this non-perceptual prediction is that the predicted block or image does not necessarily appear to resemble the block or image that is being predicted. For example, the predicted image may appear very blurred, or the image may appear to be a combination of disjoint blocks with various directional features or varying intensities.
Because metrics such as mean-square or mean-absolute error do not always correlate with perceptual quality, there is a need for metrics that better incorporate human perception throughout the coding system. In addition to perceptual techniques being used during quantization, there is also a need for such metrics to be used in the prediction process. Methods for jointly optimizing the prediction and quantization process do not presently exist, and are therefore needed as well.
In addition to perceptually-based metrics, the prediction process itself can be performed in a way that considers human perception. For example, to predict a given block of pixels, the current predictor in HEVC interpolates or copies adjacent pixels from neighboring blocks. The standard does not consider pixels that are not immediately adjacent to the current block, so perceptual or structural similarities among the current block and previously-decoded blocks cannot be leveraged.
2. Detailed Background
Image and video coding is one of the most critical techniques in modern multimedia signal processing. State-of-the-art image codecs such as JPEG 2000 and JPEG XR, and video codecs such as H.264/AVC and the proposed High Efficiency Video Coding (HEVC) standard, can compress images or video at considerably low bit rates with good quality. However, the quality and mode-dependent decisions inside the codecs are typically measured using mean-square error, which is related to the peak signal-to-noise ratio (PSNR), or mean-absolute error. These are well-known metrics that do not necessarily relate to the ultimate signal receiver, i.e., the human visual system (HVS). For example, a viewer may tolerate more distortion on a dark object in a dark background region than on a brighter object on a medium-intensity background. By exploiting the characteristics of the human perceptual system, redundant information can be discarded without noticeable distortion, and thus the bit rate of the coded signal can be significantly reduced. Spatial or frequency-component correlations are widely exploited with techniques such as motion compensation, template matching, and adaptive prediction, but these techniques are not perceptually optimized. Additionally, the distortion that cannot be perceived by the HVS is usually not measured objectively.
To study the HVS, subjective tests using simple excitations such as uniform luminance blocks, sinusoidal gratings, and Gabor patches can be used to determine the detection threshold of distortion or of the signal. These experimental results are related to Just-Noticeable-Distortion (JND), and the results are modeled mathematically in order to be used in image and video codecs. Theoretically, as long as the distortion or signal level is below the JND, it should not be perceived by the HVS (perceptually lossless). In an image or video coding context, ideally the coder only allocates bits for signaling portions of the image for which the distortion is greater than or equal to the JND, the so-called Supra-threshold Distortion (StD). JND as used herein is a term of art related to the work of Ernst Heinrich Weber and Gustav Fechner (circa 1850).
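The distinction between sub-threshold and supra-threshold distortion can be sketched as a simple per-pixel comparison. The function name, the sample values, and the per-pixel JND thresholds below are hypothetical; in practice the thresholds would come from a JND model.

```python
def supra_threshold_mask(original, coded, jnd):
    """Flag samples whose absolute coding error is at or above the JND.

    original, coded, and jnd are equally long lists; the JND values are
    assumed to be supplied by some perceptual model.
    """
    return [abs(o - c) >= t for o, c, t in zip(original, coded, jnd)]


# Hypothetical pixel values and a uniform JND of 3 gray levels.
original = [100, 100, 100]
coded = [101, 104, 97]
jnd = [3, 3, 3]
mask = supra_threshold_mask(original, coded, jnd)
print(mask)
```

Only the samples flagged True would, under this idealized view, need bits to be spent on them.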
Human Vision System and Just Noticeable Distortion
The HVS is very complex, and much of it is not understood. At a lower level, the HVS is known to perform a sub-band decomposition. Also, the HVS does not consider different visual information, e.g., intensity and frequency, as having the same importance. Psychophysical studies show that four aspects affect the detection threshold of distortion (or signal) in the HVS. They are luminance adaptation, contrast sensitivity, contrast masking, and temporal masking.
Luminance adaptation indicates the nonlinear relationship between perceived luminance and true luminance displayed. Luminance adaptation is also called luminance masking because the luminance of the signal masks the distortion. The luminance adaptation is usually tested by showing a patch of uniform luminance as the excitation against the background which has different luminance. The detection sensitivity is modeled by the Weber-Fechner Law, such that when the excitation is just noticeable, the luminance difference between the patch and background divided by luminance of background is a constant. In other words, the brighter the background is, the higher the detection threshold will be, meaning that the sensitivity to distortion is lower. However, due to the ambient illumination of many display devices, the masking in very dark regions is stronger than that in very bright regions.
Contrast sensitivity refers to the reciprocal of the contrast detection threshold, which is the lowest contrast at which the viewer can just barely detect the difference between a single frequency component, i.e., a sinusoidal grating, and the uniform luminance background. Here, contrast means the peak-to-peak amplitude of the sinusoidal grating. The sensitivity varies depending upon the background luminance. In experiments, sinusoids of light of different wavelengths (red, green and blue) and different luminance levels have been shown to human viewers. When the background luminance is relatively low, e.g., <300 photopic Trolands (Td), the detection threshold obeys the de Vries-Rose law with respect to frequency, in which the threshold increases in proportion to the reciprocal of the square root of the luminance. When the background luminance is high (>300 Td), the detection threshold follows the Weber-Fechner law. The contrast sensitivity function has an important impact on perceptual image or video coding.
Contrast masking is the effect of reducing the perceivability of the distortion (or signal) by the presence of a masking signal. For example, many coding artifacts in complex regions, such as tree leaves and sand, are less visible than those in uniform regions such as the sky. In this case, the high spatial-frequency components in complex regions mask the high spatial-frequency components in the artifacts. The masker usually has a similar spatial location and similar spatial-frequency components as the distortion (or signal). Therefore, contrast masking is sometimes called texture masking. When contrast masking is analyzed with sinusoidal gratings having different frequencies and grating widths, the results show that the detection threshold for a high-contrast masker follows a power law, and a low-contrast masker reduces the detection threshold. By quantitatively measuring the detection threshold at different background luminance levels for a variety of subjects, the threshold modulation curves, namely the adjusted contrast sensitivity function (CSF), can be plotted.
Temporal masking in video coding refers to the reduction in the perception of distortion (or signal) in the presence of high temporal frequencies. When the motion is fast, details in individual video frames are more difficult to detect. In addition to depending upon the temporal frequency, temporal masking also depends upon spatial frequency. The sensitivity is modeled as a band-pass filter at low spatial frequencies and a low-pass filter at high spatial frequencies.
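The two luminance regimes described above can be sketched as a piecewise function of background luminance. The 300 Td transition comes from the text; the Weber fraction of 0.02 and the scaling that makes the two branches meet at the transition are illustrative assumptions, not measured values.

```python
import math

TRANSITION_TD = 300.0  # transition luminance from the text, in photopic Trolands


def contrast_threshold(luminance_td, weber_fraction=0.02):
    """Contrast detection threshold (delta-L / L) versus background luminance.

    weber_fraction is a hypothetical constant chosen for illustration.
    """
    if luminance_td <= 0:
        raise ValueError("luminance must be positive")
    if luminance_td >= TRANSITION_TD:
        # Weber-Fechner regime: the contrast threshold is constant.
        return weber_fraction
    # de Vries-Rose regime: threshold proportional to 1 / sqrt(luminance),
    # scaled (an assumption) so the curve is continuous at the transition.
    return weber_fraction * math.sqrt(TRANSITION_TD / luminance_td)


def contrast_sensitivity(luminance_td):
    """Contrast sensitivity is the reciprocal of the detection threshold."""
    return 1.0 / contrast_threshold(luminance_td)


for lum in (75.0, 300.0, 1000.0):
    print(lum, contrast_threshold(lum), contrast_sensitivity(lum))
```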
Perceptual Quality Metrics
In addition to JND, another important technique that can improve perceptual image or video coding is a quality metric that approximates subjective characteristics of a viewer. Instead of considering quality only from the Signal-to-Noise Ratio (SNR) point of view, many metrics based on the nature of the HVS are known. Model-based perceptual quality metrics use filter banks similar to those of the HVS, such as the Cortex Transform, Gabor filters, and steerable filters, to decompose the image into sub-bands and analyze the perceptual characteristics to quantitatively determine the quality. A good example of a model-based metric is the visible difference predictor (VDP). Because model-based metrics are computationally complex, signal-driven metrics, which do not try to build an HVS model, are preferred. Signal-driven metrics extract only the characteristics or statistics of a signal for evaluation purposes.
Within the category of signal-driven metrics, structural approaches show great success in image processing. The quality is measured in the sense of structural similarity between the original image and the coded version. A good coder can reconstruct images or image blocks that are totally different in the MSE sense, without affecting the viewing quality. The issue with the MSE or PSNR is that images are strongly locally correlated 2D signals, which carry information beyond single pixel intensity levels. In fact, the information includes shapes, patterns, colors, edges, and so forth. A good metric should maximize similarities, and be invariant to translation, scaling and rotation. Moreover, the metric should also be invariant to light intensity and chroma changes.
A well-known structural similarity metric is SSIM, see Appendix for definitions used herein. SSIM can be used for the spatial domain and applied to sub-bands. The metric SSIM(x, y) between two blocks of pixels or values x and y is defined as
$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x\mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}, \qquad (1)$$
where $\mu_x$ is the mean value of block x, $\mu_y$ is the mean value of block y, $\sigma_x^2$ is the variance of block x, $\sigma_y^2$ is the variance of block y, $\sigma_{xy}$ is the covariance between blocks x and y, and $C_1$ and $C_2$ are constants.
The input image x (or block) and the predicted output image y (or block) are decomposed into S levels and L orientations using an orientation-selective convolution kernel (steerable filter). As a result, there are S×L sub-bands plus one low-pass band and one high-pass band. The local mean, variance of corresponding sub-bands, and the covariance between x and y in each sub-band are determined using a small sliding window. For each subwindow in each sub-band, the local SSIM score is determined using (1). The overall SSIM score is a floating-point number between 0 and 1, which is an arithmetic average of all scores over all sub-bands and all subwindows. A more complex and accurate metric called structure texture similarity (STSIM) improves on SSIM by using statistics between sub-bands with different orientations and scale. STSIM also discards the σxy term from SSIM. Herein, we use SSIM for simplicity.
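The local score of equation (1) and the averaging over sliding windows can be sketched as follows. This is a simplified, spatial-domain illustration only: it omits the steerable-filter sub-band decomposition described above, uses population (divide-by-N) statistics, and assumes values for C1 and C2, namely (0.01·255)² and (0.03·255)² for 8-bit data, because the text only states that they are constants. The function names and sample images are hypothetical.

```python
def local_ssim(x, y, c1=6.5025, c2=58.5225):
    """SSIM of equation (1) for two equally long lists of pixel values.

    c1 and c2 are assumed constants for 8-bit data, not values from the text.
    """
    n = len(x)
    mu_x, mu_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov_xy = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    num = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return num / den


def mean_ssim(img_x, img_y, win=2):
    """Arithmetic average of local SSIM over all win x win sliding windows."""
    h, w = len(img_x), len(img_x[0])
    scores = []
    for r in range(h - win + 1):
        for c in range(w - win + 1):
            px = [img_x[r + i][c + j] for i in range(win) for j in range(win)]
            py = [img_y[r + i][c + j] for i in range(win) for j in range(win)]
            scores.append(local_ssim(px, py))
    return sum(scores) / len(scores)


# Hypothetical 3x3 images: an original and a distorted version.
original = [[10, 20, 30], [20, 30, 40], [30, 40, 50]]
distorted = [[18, 22, 28], [22, 30, 38], [28, 38, 42]]
print(mean_ssim(original, original))
print(mean_ssim(original, distorted))
```

An image compared with itself scores 1 (up to rounding), and the distorted version scores lower.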
Related Work
Many researchers have studied the perceptual distortion visibility model or equivalent CSF model since the 1960s. Data for JND models generally come from psychophysical experiments. Using these models, researchers have proposed variants of coding algorithms. In general the models are classified into spatial domain models, which use local pixel values to determine the detection threshold of distortion, and sub-band domain models which usually adjust the CSF to determine the distortion tolerance in different sub-bands.
Sub-Band Domain JND Model and Perceptual Coding
One model is a contrast masking model in a generalized quadrature mirror filter (GQMF) sub-band domain given a uniform background gray level of 127. The baseline sub-band sensitivity and sensitivity adjustment for different luminance values are measured subjectively and tabulated. The texture masking is determined by the texture energy of each of the sub-bands. The overall sensitivity is the product of the baseline sensitivity, luminance adjustment and texture masking. Each sub-band is then DPCM-coded and quantized using the overall sensitivity as the quantization steps.
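The product rule for the overall sensitivity, and its use as a quantization step, can be sketched as follows. The function names and numeric values are hypothetical, the DPCM stage is omitted for brevity, and in the model described above the baseline and luminance-adjustment terms would come from subjectively measured tables.

```python
def overall_sensitivity(baseline, luminance_adjust, texture_masking):
    """Product of the three per-sub-band terms described in the text."""
    return baseline * luminance_adjust * texture_masking


def quantize_subband(samples, step):
    """Uniformly quantize sub-band samples with the given step size."""
    return [round(s / step) for s in samples]


# Hypothetical values for one sub-band.
step = overall_sensitivity(baseline=2.0, luminance_adjust=1.5, texture_masking=2.0)
levels = quantize_subband([13.0, -7.0, 2.0], step)
print(step, levels)
```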
A subjective experiment on sensitivity for different colors (RGB) with 8×8 DCT basis functions has shown that the DC sensitivity plot has a “U-shape” vs. the background luminance, and the AC sensitivity logarithmically increases with respect to the basis function magnitude. Based on the experiments, a quantization matrix can be generated that can be used in a visually lossless coding scheme.
Parabolic fitting of 1D contrast sensitivity experimental results can be used to build a 2D CSF by orthogonally combining two 1D CSFs with consideration of the correlation between different dimensions. The CSF is a function of luminance which can be estimated from the image. A parabolic model can be used to determine the CSF in DCT sub-bands and was applied to the quantization matrix, see Ahumada et al., “Luminance-model-based DCT quantization for color image compression,” SPIE, Human Vision, Visual Processing, and Digital Display III, vol. 1666, p. 365, September 1992.
Another model is a contrast masking model for the HVS. The response of the HVS depends on the excitation of the receptive field and on the inhibitory inputs. The relationship is modeled as excitation divided by inhibition, called the target threshold vs. masker contrast (TvC) function.
A perceptual coder called DCTune uses luminance adaptation and contrast masking to adjust the quantization matrix in DCT sub-bands.
A Visible Difference Predictor (VDP) is an image quality metric considering luminance adaptation, contrast sensitivity and contrast masking. VDP assumes the luminance adaptation occurs locally around a fixation area. A 2D CSF model in the frequency domain is known. A Cortex Transform is used to decompose the image.
The JND model can be modified for a 16 sub-band GQMF decomposition. The model considers the luminance adaptation and inter or intra masking. The base luminance sensitivity is measured numerically without a closed form. The JND measurement is adapted locally without the need for transmitting side information.
A DCT-domain JND model for video coding considers contrast sensitivity, a luminance adaptation model, and texture contrast masking. Those aspects are assisted using edge detection and temporal contrast sensitivity, along with temporal frequency associated with eye movement.
A spatio-temporal JND model considers the eye movement for video in the DCT sub-bands. The model incorporates spatio-temporal CSF, eye movement, luminance adaptation and contrast.
Spatial Domain JND Model and Perceptual Coding
A spatial domain JND model can be formulated as the maximum of spatial masking and luminance adaptation. The model can be relaxed by minimally noticeable distortion (MND), which can be used to control the rate-distortion optimizer. In addition to being used on a single image, JND models can also be applied to sequences.
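A minimal sketch of this spatial-domain model follows. The per-pixel spatial-masking and luminance-adaptation maps are assumed to be given, and the relaxation to MND is modeled here as a simple multiplicative factor, which is an illustrative assumption rather than a definition from the text.

```python
def jnd_map(spatial_masking, luminance_adaptation):
    """Per-pixel JND threshold as the maximum of the two masking terms."""
    return [[max(s, l) for s, l in zip(row_s, row_l)]
            for row_s, row_l in zip(spatial_masking, luminance_adaptation)]


def mnd_map(jnd, relax=1.5):
    """Relax the JND thresholds by a hypothetical scale factor (assumption)."""
    return [[relax * t for t in row] for row in jnd]


# Hypothetical 2x2 threshold maps, in gray levels.
spatial = [[3, 8], [5, 2]]
luminance = [[4, 6], [5, 9]]
jnd = jnd_map(spatial, luminance)
mnd = mnd_map(jnd)
print(jnd)
print(mnd)
```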
Perceptual Based Rate-Distortion Optimization
In the HEVC and H.264/AVC standards, the encoder uses rate-distortion optimization (RDO) to output a bit-stream with the best picture quality with a rate R less than a given rate constraint Rc. This process can be expressed as
$$\min\{D\} \quad \text{subject to} \quad R \le R_c, \qquad (2)$$
where D is the distortion measurement, usually based on MSE. In order to incorporate a perceptual quality measurement, the distortion can be set to D=1−SSIM in an H.264/AVC encoder, where SSIM is the structural similarity metric given above in (1). This produces a better perceptual quality than conventional H.264/AVC.
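The constrained selection in (2) with D=1−SSIM can be sketched as a search over candidate coding choices. The candidate names, rates, and SSIM scores below are made-up numbers for illustration, and practical encoders typically solve this problem differently (for example, with a Lagrangian cost), so this is only a direct reading of the constrained form.

```python
def select_mode(candidates, rate_limit):
    """Pick the candidate minimizing D = 1 - SSIM subject to rate <= rate_limit.

    candidates is a list of (name, rate_in_bits, ssim) tuples.
    Returns None when no candidate satisfies the rate constraint.
    """
    feasible = [c for c in candidates if c[1] <= rate_limit]
    if not feasible:
        return None
    return min(feasible, key=lambda c: 1.0 - c[2])


# Hypothetical candidates with illustrative rates and SSIM scores.
candidates = [
    ("dc", 40, 0.80),
    ("planar", 60, 0.90),
    ("angular", 120, 0.97),
]
best = select_mode(candidates, rate_limit=100)
print(best)
```

Here the highest-quality candidate is excluded because its rate exceeds the constraint, so the best feasible candidate is chosen instead.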
Perceptual Based Template Matching
Template-matching techniques generally use the MSE or Mean Absolute Error (MAE) as the matching criterion. Sparse reconstruction can be used as the constraint to solve the matching problem, or template matching can be used to determine candidates in the reconstructed image, and the candidates can then be used to train a Karhunen-Loeve Transform (KLT) adaptively for transform coding.
FIG. 1 shows the modules or processing steps in a conventional H.264/AVC or HEVC encoder. Prediction 110 is applied to an input signal, e.g., a next (target) block to be coded in an image or video, followed by a transformation 120, quantization 130, and coding 140 to produce an output signal 109, e.g., a compressed image or video. Feedback from RDO 150 is used for the prediction and quantization.
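Conventional template matching with an MSE or MAE criterion can be sketched as an exhaustive search. For brevity the sketch searches a 1-D list of previously reconstructed samples rather than a 2-D image region; the function name and the sample data are hypothetical.

```python
def best_match(signal, template, metric="mae"):
    """Return (position, cost) of the window in signal closest to template.

    metric is "mae" (mean absolute error) or "mse" (mean-square error).
    """
    n = len(template)
    best_pos, best_cost = None, float("inf")
    for pos in range(len(signal) - n + 1):
        diffs = [signal[pos + i] - template[i] for i in range(n)]
        if metric == "mse":
            cost = sum(d * d for d in diffs) / n
        else:
            cost = sum(abs(d) for d in diffs) / n
        if cost < best_cost:
            best_pos, best_cost = pos, cost
    return best_pos, best_cost


# Hypothetical reconstructed samples and a template to locate.
reconstructed = [10, 12, 50, 52, 54, 11, 9]
template = [51, 52, 53]
pos, cost = best_match(reconstructed, template)
print(pos, cost)
```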
H.264/AVC and HEVC iteratively use RDO to determine the optimal prediction and quantization. Conventional perceptual based image coding methods either focus on adjusting the quantization matrix based on JND metrics, or on using perceptual-based metrics to perform RDO.
Hence, it is desired to develop models and associated coding techniques that maintain the perceptual quality of processed images. It is also desired to develop models and associated coding techniques that jointly optimize multiple elements of the encoder or decoder such as prediction and quantization, using perceptual metrics.