High Dynamic Range (HDR) with Wide Color Gamut (WCG) has become an increasingly hot topic within the TV and multimedia industry in the last couple of years. While screens capable of displaying the HDR video signal are emerging on the consumer market, over-the-top (OTT) players such as Netflix have announced that HDR content will be delivered to the end user. Standardization bodies are working on specifying the requirements for HDR. For instance, in the roadmap for DVB, UHDTV1 phase 2 will include HDR support. MPEG is currently exploring how HDR video could be compressed.
HDR imaging is a set of techniques within photography that allows for a greater dynamic range of luminosity compared to standard digital imaging. Dynamic range in digital cameras is typically measured in f-stops, where one f-stop means a doubling of the amount of light. A standard LCD HDTV using Standard Dynamic Range (SDR) can display at most 10 stops. HDR is defined by MPEG to have a dynamic range of over 16 f-stops. WCG increases the color fidelity from ITU-R BT.709 towards ITU-R BT.2020 such that more of the visible colors can be captured and displayed.
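Since one f-stop corresponds to a doubling of light, the dynamic range in stops is the base-2 logarithm of the ratio between the brightest and darkest representable luminance. The following is a minimal sketch of that arithmetic; the function name and the luminance figures are hypothetical and chosen only for illustration.

```python
import math


def dynamic_range_stops(max_luminance, min_luminance):
    """Dynamic range in f-stops; each stop is one doubling of light."""
    if max_luminance <= 0 or min_luminance <= 0:
        raise ValueError("luminance values must be positive")
    return math.log2(max_luminance / min_luminance)


# Hypothetical luminance figures in cd/m^2, not taken from any standard.
sdr_stops = dynamic_range_stops(100.0, 0.1)     # contrast ratio 1000:1
hdr_stops = dynamic_range_stops(1000.0, 0.005)  # contrast ratio 200000:1

print(f"SDR example: {sdr_stops:.2f} stops")
print(f"HDR example: {hdr_stops:.2f} stops")
```

With these example figures the first range falls just under 10 stops and the second exceeds 16 stops, matching the SDR and HDR bounds quoted above.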
HDR is defined for UHDTV in ITU-R Recommendation BT.2020 while SDR is defined for HDTV in ITU-R Recommendation BT.709.
A color model is a mathematical model that defines the possible colors that can be presented using a predefined number of components. Examples of color models are RGB, Y′CbCr 4:2:0 (also called YUV 4:2:0), CIE1931 etc.
A picture element (pixel for short) is the smallest element of a digital image and holds the luminance and color information of that element. The luminance and color can be expressed in different ways depending on the use case. Displays usually have three color elements, red, green and blue, which are lit at different intensities depending on what color and luminance is to be displayed. It is therefore convenient to send the pixel information in RGB pixel format to the display. Since the signal is digital, the intensity of each component of the pixel must be represented with a fixed number of bits, referred to as the bit depth of the component. A bit depth of n can represent 2^n different values, e.g. 256 values per component for 8 bits and 1024 values per component for 10 bits.
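The relation between bit depth and the number of representable values can be sketched as follows. The helper names are hypothetical, and the mapping assumes full-range coding where code 0 is the lowest intensity and the highest code is the maximum intensity; video signals often use a narrower code range, which is not modeled here.

```python
def num_levels(bit_depth):
    """Number of distinct values a component of the given bit depth can hold."""
    return 1 << bit_depth  # 2 ** bit_depth


def quantize_intensity(value, bit_depth):
    """Map a normalized intensity in [0, 1] to an integer code (full range assumed)."""
    if not 0.0 <= value <= 1.0:
        raise ValueError("value must be in [0, 1]")
    max_code = num_levels(bit_depth) - 1
    return int(value * max_code + 0.5)  # round to nearest code


for depth in (8, 10):
    print(depth, "bits:", num_levels(depth), "levels, max code",
          quantize_intensity(1.0, depth))
```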
When video needs to be compressed it is convenient to express the luminance and color information of the pixel with one luminance component and two color components. This is done since the human visual system (HVS) is more sensitive to luminance than to color, meaning that luminance can be represented with higher accuracy than color. One commonly used format that allows for this separation is Y′CbCr 4:2:0 (also called YUV 4:2:0), where the Cb and Cr components each have a quarter of the resolution of the Y′ component. When encoding video, the non-linear gamma transfer function is typically applied to the linear RGB samples to obtain the non-linear R′G′B′ representation, and then a 3×3 matrix multiplication is applied to get to Y′CbCr. The resulting Y′ component is referred to as luma, which is roughly equal to luminance. The true luminance is instead obtained by converting the linear RGB samples using a 3×3 matrix operation to get to XYZ in the CIE1931 color space. The luminance is the Y coordinate of this XYZ vector. Sometimes a function of the Y coordinate is also referred to as luminance, for instance when a transfer function has been applied to Y. Likewise, the Cb and Cr components of Y′CbCr 4:2:0 together are called chroma, which is similar to but different from chrominance. To get the chrominance, the X and Z coordinates of CIE1931 are also used. One chrominance representation is the coordinates (x,y) where x=X/(X+Y+Z) and y=Y/(X+Y+Z). Y′CbCr is not the only representation that attempts to separate luminance from chrominance; other formats also exist, such as YdZdx, which is based on XYZ. However, Y′CbCr is the most commonly used representation. Before displaying samples, the chroma components are first upsampled to 4:4:4, i.e., the same resolution as the luma, and then the luma and chroma are converted to R′G′B′ and then converted to the linear domain before being displayed.
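The difference between luma and true luminance can be illustrated with a small sketch. It assumes BT.709 primaries, the BT.709 transfer function and the corresponding luma coefficients; the text above does not fix these choices, and all function names are hypothetical. Chroma subsampling is modeled as a plain 2×2 average, which is only one of several possible downsampling filters.

```python
# Luma coefficients for BT.709 (assumed here for illustration).
KR, KG, KB = 0.2126, 0.7152, 0.0722


def oetf_bt709(linear):
    """BT.709 non-linear transfer function for one linear sample in [0, 1]."""
    if linear < 0.018:
        return 4.5 * linear
    return 1.099 * linear ** 0.45 - 0.099


def rgb_to_ycbcr(r, g, b):
    """Linear RGB -> R'G'B' -> Y'CbCr (the 3x3 matrix written out per row)."""
    rp, gp, bp = (oetf_bt709(c) for c in (r, g, b))
    luma = KR * rp + KG * gp + KB * bp
    cb = (bp - luma) / (2.0 * (1.0 - KB))
    cr = (rp - luma) / (2.0 * (1.0 - KR))
    return luma, cb, cr


def rgb_to_xyz(r, g, b):
    """Linear RGB (BT.709 primaries, D65 white) -> CIE 1931 XYZ."""
    x = 0.4124 * r + 0.3576 * g + 0.1805 * b
    y = 0.2126 * r + 0.7152 * g + 0.0722 * b  # true luminance
    z = 0.0193 * r + 0.1192 * g + 0.9505 * b
    return x, y, z


def chromaticity(x, y, z):
    """Chrominance as (x, y) = (X/(X+Y+Z), Y/(X+Y+Z))."""
    total = x + y + z
    if total == 0:
        raise ValueError("chromaticity is undefined for black")
    return x / total, y / total


def subsample_420(plane):
    """Average each 2x2 block of a chroma plane (even width and height)."""
    return [[(plane[r][c] + plane[r][c + 1] +
              plane[r + 1][c] + plane[r + 1][c + 1]) / 4.0
             for c in range(0, len(plane[0]), 2)]
            for r in range(0, len(plane), 2)]


gray = (0.18, 0.18, 0.18)  # linear mid-gray
luma, cb, cr = rgb_to_ycbcr(*gray)
X, Y, Z = rgb_to_xyz(*gray)
print("luma:", round(luma, 3), "luminance:", round(Y, 3))
print("chromaticity:", chromaticity(X, Y, Z))
```

For a neutral gray both chroma components are zero, while luma and luminance differ numerically because luma is computed from the non-linear R′G′B′ samples.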
High Efficiency Video Coding (HEVC) is a block-based video codec standardized by ITU-T and MPEG that utilizes both temporal and spatial prediction. Spatial prediction is achieved using intra (I) prediction within the current frame. Temporal prediction is achieved using inter (P) or bi-directional inter (B) prediction at block level from previously decoded reference pictures. The difference between the original pixel data and the predicted pixel data, referred to as a residual, is transformed into the frequency domain and quantized before being entropy coded and transmitted together with necessary prediction parameters such as mode selections and motion vectors. By quantizing the transformed residuals, the tradeoff between bitrate and quality of the video may be controlled.
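The residual, transform and quantization steps described above can be sketched on a single block. This is a simplified illustration, not the HEVC process: it uses a floating-point orthonormal DCT instead of the integer transforms of the standard, a single hypothetical step size for all coefficients, and it leaves out prediction and entropy coding. All function names are hypothetical.

```python
import math


def dct_matrix(n):
    """Orthonormal DCT-II basis as an n x n matrix (floating point)."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                     for i in range(n)])
    return rows


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(row) for row in zip(*m)]


def encode_block(original, predicted, step):
    """Residual -> 2D DCT -> uniform quantization to integer levels."""
    residual = [[o - p for o, p in zip(ro, rp)]
                for ro, rp in zip(original, predicted)]
    d = dct_matrix(len(residual))
    coeffs = matmul(matmul(d, residual), transpose(d))
    return [[int(round(c / step)) for c in row] for row in coeffs]


def decode_block(levels, predicted, step):
    """Dequantize -> inverse 2D DCT -> add the prediction back."""
    coeffs = [[level * step for level in row] for row in levels]
    d = dct_matrix(len(coeffs))
    residual = matmul(matmul(transpose(d), coeffs), d)
    return [[p + r for p, r in zip(rp, rr)]
            for rp, rr in zip(predicted, residual)]


predicted = [[100.0] * 4 for _ in range(4)]
original = [[108.0] * 4 for _ in range(4)]  # prediction is off by 8 everywhere

fine = encode_block(original, predicted, step=4.0)
coarse = encode_block(original, predicted, step=100.0)
print("fine levels:  ", fine)
print("coarse levels:", coarse)
print("reconstruction (fine):", decode_block(fine, predicted, 4.0)[0])
```

With the small step the constant residual survives as a single non-zero DC level and is reconstructed, whereas the large step quantizes every level to zero and the decoder outputs the prediction unchanged.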
The level of quantization is determined by the quantization parameter (QP). Controlling the QP is a key technique for controlling the quality/bitrate of the residual in video coding. The QP is applied such that it controls the fidelity of the residual (typically transform coefficients) and thus also controls the amount of coding artifacts. When the QP is high, the transform coefficients are quantized coarsely, resulting in fewer bits but also possibly more coding artifacts than when the QP is low, where the transform coefficients are quantized finely. A low QP thus generally results in high quality and a high QP results in low quality. In HEVC v1 (similarly also for H.264/AVC) the quantization parameter can be controlled on picture, slice, or block level. On picture and slice level it can be controlled individually for each color component. In HEVC v2 the quantization parameter for chroma can be individually controlled for the chroma components on a block level.
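The effect of the QP on fidelity can be sketched with the widely cited approximation that the quantization step size doubles for every increase of 6 in QP, Qstep ≈ 2^((QP − 4)/6). This relation is not stated in the text above and is used here as an assumption; the actual standards implement it with integer scaling tables. The coefficient values and function names are hypothetical.

```python
def qp_to_step(qp):
    """Approximate quantization step size; doubles for every +6 in QP."""
    return 2.0 ** ((qp - 4) / 6.0)


def quantize(coeffs, qp):
    step = qp_to_step(qp)
    return [int(round(c / step)) for c in coeffs]


def dequantize(levels, qp):
    step = qp_to_step(qp)
    return [level * step for level in levels]


def max_abs_error(coeffs, qp):
    """Largest reconstruction error over all coefficients at this QP."""
    rec = dequantize(quantize(coeffs, qp), qp)
    return max(abs(c - r) for c, r in zip(coeffs, rec))


# Hypothetical transform coefficients of one residual block.
coeffs = [220.0, -41.0, 17.5, -6.0, 3.0, 1.2]

for qp in (22, 40):
    levels = quantize(coeffs, qp)
    nonzero = sum(1 for level in levels if level != 0)
    print(f"QP {qp}: step {qp_to_step(qp):g}, levels {levels}, "
          f"non-zero {nonzero}, max error {max_abs_error(coeffs, qp):g}")
```

The higher QP leaves fewer non-zero levels to code, at the cost of a larger reconstruction error.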
It is known from the state of the art that the QP can be controlled based on the local luma level such that a finer quantization, i.e., a lower QP, is used for blocks with high local luma levels or small variations in local luma levels than for blocks with low local luma levels or large variations in local luma levels. The reason is that it is better to spend bits in smooth areas, where errors are more visible, than in highly textured areas, where errors are masked. Similarly, it is easier to spot errors at high luminance levels than at low luminance levels, and since luma is often a good predictor for luminance, this works.
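A luma-adaptive QP selection of this kind could look roughly like the sketch below. It is not taken from any particular encoder: the offset range, the variance threshold, the linear brightness mapping and the clamping range of 0 to 51 are all hypothetical choices made only to illustrate the direction of the adjustment.

```python
def block_stats(luma_block):
    """Mean and variance of the luma samples in a block (list of rows)."""
    samples = [s for row in luma_block for s in row]
    mean = sum(samples) / len(samples)
    variance = sum((s - mean) ** 2 for s in samples) / len(samples)
    return mean, variance


def adaptive_qp(base_qp, luma_block, bit_depth=10,
                max_offset=3, variance_threshold=100.0):
    """Lower the QP for bright or smooth blocks, raise it for dark or textured ones."""
    mean, variance = block_stats(luma_block)
    max_code = (1 << bit_depth) - 1
    # Brightness term: +max_offset for the darkest blocks, -max_offset for the brightest.
    brightness_offset = round(max_offset * (1.0 - 2.0 * mean / max_code))
    # Texture term: smooth blocks get finer quantization, textured blocks coarser.
    texture_offset = -1 if variance < variance_threshold else 1
    return max(0, min(51, base_qp + brightness_offset + texture_offset))


bright_smooth = [[900] * 4 for _ in range(4)]
dark_textured = [[20, 100, 20, 100] for _ in range(4)]

print("bright, smooth block QP:", adaptive_qp(30, bright_smooth))
print("dark, textured block QP:", adaptive_qp(30, dark_textured))
```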
HEVC uses by default a uniform reconstruction quantization (URQ) scheme that quantizes all frequencies equally. HEVC has the option of using quantization scaling matrices (also referred to as scaling lists), either default ones or quantization scaling matrices that are signaled as scaling list data in the SPS or PPS. To reduce the memory needed for storage, scaling matrices may only be specified as 4×4 and 8×8 matrices. For the larger transforms of sizes 16×16 and 32×32, the signaled 8×8 matrix is applied by having 2×2 and 4×4 blocks, respectively, share the same scaling value, except at the DC positions.
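The sharing of 8×8 scaling values across 2×2 and 4×4 blocks can be sketched as a simple replication, with the DC position set separately. The function name and the example matrix are hypothetical; the sketch only illustrates the mapping described above and does not parse any bitstream syntax.

```python
def upsample_scaling_matrix(matrix8, size, dc_value):
    """Expand an 8x8 scaling matrix to 16x16 or 32x32 by replication.

    Each 8x8 entry is shared by a (size/8) x (size/8) block of positions;
    the DC position (0, 0) gets its own separately supplied value.
    """
    if size not in (16, 32):
        raise ValueError("size must be 16 or 32")
    factor = size // 8  # 2 for 16x16, 4 for 32x32
    upsampled = [[matrix8[r // factor][c // factor] for c in range(size)]
                 for r in range(size)]
    upsampled[0][0] = dc_value
    return upsampled


# Hypothetical 8x8 matrix whose values grow with frequency.
matrix8 = [[16 + r + c for c in range(8)] for r in range(8)]

up16 = upsample_scaling_matrix(matrix8, 16, dc_value=12)
up32 = upsample_scaling_matrix(matrix8, 32, dc_value=12)
print("16x16 top-left 4x4:", [row[:4] for row in up16[:4]])
```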
A scaling matrix, with an individual scaling factor for each transform coefficient, can be used to quantize the transform coefficients differently by scaling each transform coefficient with its respective scaling factor as part of the quantization. This makes it possible, for example, to quantize higher frequency transform coefficients more strongly than lower frequency transform coefficients. In HEVC, default scaling matrices are defined for each transform size and can be invoked by flags in the Sequence Parameter Set (SPS) and/or the Picture Parameter Set (PPS). Scaling matrices also exist in H.264. In HEVC it is also possible to define custom scaling matrices in the SPS or PPS specifically for each combination of color component, transform size and prediction type (intra or inter mode).
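How a scaling matrix shapes the quantization per coefficient can be sketched as follows. The sketch follows the HEVC convention that a scaling factor of 16 is neutral, so a matrix filled with 16 behaves like flat quantization, but it uses floating-point division rather than the integer arithmetic of the standard. The matrix values, step size and function name are hypothetical.

```python
FLAT = 16  # neutral scaling factor: an all-16 matrix gives flat quantization


def quantize_with_scaling(coeffs, scaling, step):
    """Quantize each coefficient with a step scaled by its matrix entry."""
    return [[int(round(c * FLAT / (m * step))) for c, m in zip(rc, rm)]
            for rc, rm in zip(coeffs, scaling)]


# Hypothetical 4x4 block where every coefficient has the same magnitude.
coeffs = [[40.0] * 4 for _ in range(4)]

flat_matrix = [[FLAT] * 4 for _ in range(4)]
# Hypothetical matrix: neutral for low frequencies, 4x coarser for high ones.
shaped_matrix = [[16 if r + c < 3 else 64 for c in range(4)] for r in range(4)]

levels_flat = quantize_with_scaling(coeffs, flat_matrix, step=8.0)
levels_shaped = quantize_with_scaling(coeffs, shaped_matrix, step=8.0)
print("flat:  ", levels_flat)
print("shaped:", levels_shaped)
```

With the shaped matrix the low-frequency levels are unchanged, while the high-frequency levels shrink because they are quantized with an effectively larger step.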