1. Field of the Invention
The present invention relates to the field of video image coding and decoding. In particular, the present invention relates to encoding/decoding employing filters with variable filter parameters.
2. Description of the Related Art
At present, the majority of standardized video coding algorithms are based on hybrid video coding. Hybrid video coding methods typically combine several different lossless and lossy compression schemes in order to achieve the desired compression gain. Hybrid video coding is also the basis for ITU-T standards (H.26x standards such as H.261, H.263) as well as ISO/IEC standards (MPEG-X standards such as MPEG-1, MPEG-2, and MPEG-4). The most recent and advanced video coding standard is currently the standard denoted as H.264/MPEG-4 advanced video coding (AVC), which is a result of standardization efforts by the Joint Video Team (JVT), a joint team of the ITU-T and ISO/IEC MPEG groups. This codec is being further developed by the Joint Collaborative Team on Video Coding (JCT-VC) under the name High-Efficiency Video Coding (HEVC), aiming in particular at improving the efficiency of high-resolution video coding.
A video signal input to an encoder is a sequence of images called frames, each frame being a two-dimensional matrix of pixels. All the above-mentioned standards based on hybrid video coding include subdividing each individual video frame into smaller blocks consisting of a plurality of pixels. The size of the blocks may vary, for instance, in accordance with the content of the image. The way of coding may typically be varied on a per-block basis. The largest possible size for such a block, for instance in HEVC, is 64×64 pixels. It is then called the largest coding unit (LCU). In H.264/MPEG-4 AVC, a macroblock (usually denoting a block of 16×16 pixels) was the basic image element for which the encoding is performed, with the possibility of further dividing it into smaller subblocks to which some of the coding/decoding steps were applied.
Typically, the encoding steps of a hybrid video coding include a spatial and/or a temporal prediction. Accordingly, each block to be encoded is first predicted using either the blocks in its spatial neighborhood or blocks from its temporal neighborhood, i.e. from previously encoded video frames. A block of differences between the block to be encoded and its prediction, also called a block of prediction residuals, is then calculated. Another encoding step is a transformation of a block of residuals from the spatial (pixel) domain into a frequency domain. The transformation aims at reducing the correlation of the input block. A further encoding step is the quantization of the transform coefficients. In this step the actual lossy (irreversible) compression takes place. Usually, the compressed transform coefficient values are further compacted (losslessly compressed) by means of an entropy coding. In addition, side information necessary for reconstruction of the encoded video signal is encoded and provided together with the encoded video signal. This is, for example, information about the spatial and/or temporal prediction, the amount of quantization, etc.
FIG. 1 is an example of a typical H.264/MPEG-4 AVC and/or HEVC video encoder 100. A subtractor 105 first determines differences e between a current block to be encoded of an input video image (input signal s) and a corresponding prediction block ŝ, which is used as a prediction of the current block to be encoded. The prediction signal may be obtained by a temporal or by a spatial prediction 180. The type of prediction can be varied on a per frame basis or on a per block basis. Blocks and/or frames predicted using temporal prediction are called “inter”-encoded and blocks and/or frames predicted using spatial prediction are called “intra”-encoded. The prediction signal using temporal prediction is derived from the previously encoded images, which are stored in a memory. The prediction signal using spatial prediction is derived from the values of boundary pixels in the neighboring blocks, which have been previously encoded, decoded, and stored in the memory. The difference e between the input signal and the prediction signal, denoted prediction error or residual, is transformed 110, resulting in coefficients, which are quantized 120. An entropy encoder 190 is then applied to the quantized coefficients in order to further reduce, in a lossless way, the amount of data to be stored and/or transmitted. This is mainly achieved by applying a code with code words of variable length, wherein the length of a code word is chosen based on the probability of its occurrence.
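By way of a non-limiting illustration, the principle of a variable-length code whose code word lengths follow the symbol probabilities can be sketched with a textbook Huffman code. This is a swapped-in technique chosen for brevity and is not the entropy coder of H.264/MPEG-4 AVC or HEVC; the function name `huffman_code` and the sample symbol list are assumptions made for this sketch only.

```python
import heapq
from collections import Counter

def huffman_code(symbols):
    """Build a prefix-free code; frequent symbols receive shorter code words."""
    counts = Counter(symbols)
    if len(counts) == 1:
        # Degenerate case: a single distinct symbol still needs one bit.
        return {next(iter(counts)): "0"}
    # Heap entries: (count, unique tie-breaker, {symbol: partial code word}).
    heap = [(n, i, {sym: ""}) for i, (sym, n) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        n0, _, c0 = heapq.heappop(heap)  # least probable subtree
        n1, _, c1 = heapq.heappop(heap)  # second least probable subtree
        merged = {s: "0" + w for s, w in c0.items()}
        merged.update({s: "1" + w for s, w in c1.items()})
        heapq.heappush(heap, (n0 + n1, tie, merged))
        tie += 1
    return heap[0][2]

# Hypothetical quantized coefficient levels: zero dominates, as is typical.
levels = [0] * 8 + [1] * 4 + [2] * 2 + [-1] * 2
code = huffman_code(levels)
total_bits = sum(len(code[v]) for v in levels)
```

With this sample, the most frequent level receives a one-bit code word and the rarest levels three-bit code words, so the 16 levels occupy fewer bits than a fixed two-bit code would need.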
Within the video encoder 100, a decoding unit is incorporated for obtaining a decoded (reconstructed) video signal s′. In compliance with the encoding steps, the decoding steps include dequantization and inverse transformation 130. The prediction error signal e′ so obtained differs from the original prediction error signal due to the quantization error, also called quantization noise. A reconstructed image signal s′ is then obtained by adding 140 the decoded prediction error signal e′ to the prediction signal ŝ. In order to maintain the compatibility between the encoder side and the decoder side, the prediction signal ŝ is obtained based on the encoded and subsequently decoded video signal, which is known at both sides, the encoder and the decoder.
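The role of the decoding unit inside the encoder may be illustrated by a strongly simplified one-dimensional sketch, in which each sample is predicted from the previously reconstructed sample and the transform is omitted. The predictor, the uniform quantizer, and all names (`quantize`, `encode`, `decode`) are assumptions made for this illustration and do not reproduce the processing of any standard.

```python
def quantize(value, step):
    """Uniform scalar quantizer with rounding to the nearest level (lossy)."""
    sign = -1 if value < 0 else 1
    return sign * ((abs(value) + step // 2) // step)

def encode(samples, step):
    """Closed-loop encoder: predicts from the reconstructed signal s'."""
    levels, recon = [], []
    prediction = 0                        # prediction signal (s-hat)
    for s in samples:
        e = s - prediction                # prediction error e
        level = quantize(e, step)         # quantized prediction error
        e_rec = level * step              # dequantized prediction error e'
        prediction = prediction + e_rec   # reconstructed sample s' = s-hat + e'
        levels.append(level)
        recon.append(prediction)
    return levels, recon

def decode(levels, step):
    """Decoder: repeats the same reconstruction from the levels alone."""
    out, prediction = [], 0
    for level in levels:
        prediction = prediction + level * step
        out.append(prediction)
    return out

samples = [10, 13, 19, 18]
step = 4
levels, encoder_recon = encode(samples, step)
decoder_recon = decode(levels, step)
```

Because the encoder predicts from s′ rather than from the original samples, `decoder_recon` equals `encoder_recon`, and the quantization error does not accumulate from sample to sample.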
Due to the quantization, quantization noise is superposed on the reconstructed video signal. Due to the block-wise coding, the superposed noise often has blocking characteristics, which result, in particular for strong quantization, in visible block boundaries in the decoded image. Such blocking artifacts have a negative effect upon human visual perception. In order to reduce these artifacts, a deblocking filter 150 is applied to every reconstructed image block. The deblocking filter is applied to the reconstructed signal s′. For instance, the deblocking filter of H.264/MPEG-4 AVC has the capability of local adaptation. In the case of a high degree of blocking noise, a strong (narrow-band) low pass filter is applied, whereas for a low degree of blocking noise, a weaker (broad-band) low pass filter is applied. The strength of the low pass filter is determined by the prediction signal ŝ and by the quantized prediction error signal e′. The deblocking filter generally smooths the block edges, leading to an improved subjective quality of the decoded images. Moreover, since the filtered part of an image is used for the motion compensated prediction of further images, the filtering also reduces the prediction errors, and thus enables improvement of coding efficiency.
After the deblocking filter, a sample adaptive offset (SAO) 155 and/or an adaptive loop filter (ALF) 160 may be applied to the image including the already deblocked signal s″. Whereas the deblocking filter improves the subjective quality, SAO and ALF aim at improving the pixel-wise fidelity (“objective” quality). In particular, SAO adds an offset in accordance with the immediate neighborhood of a pixel. The ALF is used to compensate image distortion caused by the compression. Typically, the adaptive loop filter is a Wiener filter with filter coefficients determined such that the mean square error (MSE) between the reconstructed images s′ and the source images s is minimized. The coefficients of ALF may be calculated and transmitted on a frame basis. ALF can be applied to the entire frame (image of the video sequence) or to local areas (blocks). Additional side information indicating which areas are to be filtered may be transmitted (block-based, frame-based or quadtree-based).
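The idea of an offset that depends on the immediate neighborhood of a pixel may be sketched in one dimension as follows. The two-category classification (local minimum, local maximum), the estimation of each offset as the rounded mean error of its category, and all names are simplifying assumptions for illustration; they are not the SAO classification or syntax of HEVC.

```python
def classify(left, cur, right):
    """Classify a sample by comparing it with its two neighbors."""
    if cur < left and cur < right:
        return "min"
    if cur > left and cur > right:
        return "max"
    return "none"

def estimate_offsets(source, recon):
    """Encoder side: per category, the mean of (source - reconstructed)."""
    sums = {"min": [0, 0], "max": [0, 0]}  # category -> [error sum, count]
    for i in range(1, len(recon) - 1):
        cat = classify(recon[i - 1], recon[i], recon[i + 1])
        if cat != "none":
            sums[cat][0] += source[i] - recon[i]
            sums[cat][1] += 1
    return {cat: (round(t / n) if n else 0) for cat, (t, n) in sums.items()}

def apply_offsets(recon, offsets):
    """Both sides: add the signaled offset of each sample's category."""
    out = list(recon)
    for i in range(1, len(recon) - 1):
        # Classification always uses the unfiltered reconstructed samples.
        cat = classify(recon[i - 1], recon[i], recon[i + 1])
        out[i] = recon[i] + offsets.get(cat, 0)
    return out

source = [10, 10, 12, 10, 10, 8, 10, 10]
recon = [10, 10, 14, 10, 10, 6, 10, 10]   # peak and valley are exaggerated
offsets = estimate_offsets(source, recon)
corrected = apply_offsets(recon, offsets)
```

In this constructed example the offsets pull the exaggerated peak and valley back onto the source values; only the offsets need to be transmitted, since the decoder can repeat the classification on its own reconstructed samples.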
In order to be decoded, inter-encoded blocks also require storing the previously encoded and subsequently decoded portions of image(s) in the reference frame buffer 170. An inter-encoded block is predicted 180 by employing motion compensated prediction. First, a best-matching block is found for the current block within the previously encoded and decoded video frames by a motion estimator. The best-matching block then becomes a prediction signal, and the relative displacement (motion) between the current block and its best match is then signaled as motion data in the form of three-dimensional motion vectors within the side information provided together with the encoded video data. The three dimensions consist of two spatial dimensions and one temporal dimension. In order to optimize the prediction accuracy, motion vectors may be determined with a spatial sub-pixel resolution, e.g. half pixel or quarter pixel resolution. A motion vector with spatial sub-pixel resolution may point to a spatial position within an already decoded frame where no real pixel value is available, i.e. a sub-pixel position. Hence, spatial interpolation of such pixel values is needed in order to perform motion compensated prediction. This may be achieved by an interpolation filter (in FIG. 1 integrated within Prediction block 180).
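The search for a best-matching block may be sketched as an exhaustive search over integer displacements within a single reference frame. The sum of absolute differences (SAD) as matching criterion, the search radius, and the function names are assumptions for this illustration; sub-pixel interpolation and the temporal component of the motion vector are omitted.

```python
def sad(frame, block, top, left):
    """Sum of absolute differences between block and frame at (top, left)."""
    return sum(abs(frame[top + y][left + x] - block[y][x])
               for y in range(len(block))
               for x in range(len(block[0])))

def full_search(ref, block, top, left, radius):
    """Return (cost, dx, dy) of the best match within +/- radius pixels."""
    h, w = len(block), len(block[0])
    best = None
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            y, x = top + dy, left + dx
            # Skip candidates that would reach outside the reference frame.
            if 0 <= y <= len(ref) - h and 0 <= x <= len(ref[0]) - w:
                cost = sad(ref, block, y, x)
                if best is None or cost < best[0]:
                    best = (cost, dx, dy)
    return best

# Reference frame: a 2x2 object located at row 1, column 1.
ref = [[0, 0, 0, 0],
       [0, 9, 7, 0],
       [0, 5, 3, 0],
       [0, 0, 0, 0]]
# In the current frame the same object appears at row 0, column 0.
block = [[9, 7],
         [5, 3]]
cost, dx, dy = full_search(ref, block, top=0, left=0, radius=2)
```

Here the search returns a displacement of one pixel to the right and one pixel down with zero matching cost, which would be signaled as the spatial part of the motion vector.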
For both the intra- and the inter-encoding modes, the differences e between the current input signal and the prediction signal are transformed 110 and quantized 120, resulting in the quantized coefficients. Generally, an orthogonal transformation such as a two-dimensional discrete cosine transformation (DCT) or an integer version thereof is employed since it reduces the correlation of the natural video images efficiently. After the transformation, lower frequency components are usually more important for image quality than high frequency components, so that more bits can be spent for coding the low frequency components than the high frequency components. In the entropy coder, the two-dimensional matrix of quantized coefficients is converted into a one-dimensional array. Typically, this conversion is performed by a so-called zig-zag scanning, which starts with the DC-coefficient in the upper left corner of the two-dimensional array and scans the two-dimensional array in a predetermined sequence ending with an AC coefficient in the lower right corner. As the energy is typically concentrated in the upper left part of the two-dimensional matrix of coefficients, corresponding to the lower frequencies, the zig-zag scanning results in an array where usually the last values are zero. This allows for efficient encoding using run-length codes as a part of/before the actual entropy coding.
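The zig-zag scanning and the subsequent run-length representation may be sketched as follows for a 4×4 block. The representation as (run of zeros, level) pairs with the trailing zeros left implicit, the sample coefficient values, and the function names are assumptions made for this illustration.

```python
def zigzag_order(n):
    """Positions (row, col) of an n x n block in zig-zag order.

    Anti-diagonals are visited in order of increasing row + col; the
    traversal direction alternates from one diagonal to the next.
    """
    return sorted(((i, j) for i in range(n) for j in range(n)),
                  key=lambda p: (p[0] + p[1],
                                 p[0] if (p[0] + p[1]) % 2 else -p[0]))

def zigzag_scan(block):
    """Convert the two-dimensional coefficient matrix into a 1-D array."""
    return [block[i][j] for i, j in zigzag_order(len(block))]

def run_length(seq):
    """Encode as (zero run, nonzero level) pairs; trailing zeros are dropped."""
    pairs, run = [], 0
    for v in seq:
        if v == 0:
            run += 1
        else:
            pairs.append((run, v))
            run = 0
    return pairs

# Hypothetical quantized coefficients: energy sits in the upper left corner.
coeffs = [[12, 3, 0, 0],
          [-2, 1, 0, 0],
          [0, 0, 0, 0],
          [0, 0, 0, 0]]
scanned = zigzag_scan(coeffs)
pairs = run_length(scanned)
```

For this block the scan yields four nonzero values among the first five entries followed by eleven zeros, so the sixteen coefficients reduce to four (run, level) pairs.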
The H.264/MPEG-4 AVC standard as well as HEVC includes two functional layers, a Video Coding Layer (VCL) and a Network Abstraction Layer (NAL). The VCL provides the encoding functionality as briefly described above. The NAL encapsulates information elements into standardized units called NAL units according to their further application such as transmission over a channel or storing in storage. The information elements are, for instance, the encoded prediction error signal or other information necessary for the decoding of the video signal such as type of prediction, quantization parameter, motion vectors, etc. There are VCL NAL units containing the compressed video data and the related information, as well as non-VCL units encapsulating additional data such as a parameter set relating to an entire video sequence, or Supplemental Enhancement Information (SEI) providing additional information that can be used to improve the decoding performance.
FIG. 2 illustrates an example decoder 200 according to the H.264/MPEG-4 AVC or HEVC video coding standard. The encoded video signal (input signal to the decoder) first passes to an entropy decoder 290, which decodes the quantized coefficients and the information elements necessary for decoding, such as motion data, mode of prediction, etc. The quantized coefficients are inversely scanned in order to obtain a two-dimensional matrix, which is then fed to inverse quantization and inverse transformation 230. After inverse quantization and inverse transformation 230, a decoded (quantized) prediction error signal e′ is obtained, which corresponds to the differences obtained by subtracting the prediction signal from the signal input to the encoder in the case that no quantization noise is introduced and no error has occurred.
The prediction signal is obtained from either a temporal or a spatial prediction 280. The decoded information elements usually further include the information necessary for the prediction such as prediction type in the case of intra-prediction and motion data in the case of motion compensated prediction. The quantized prediction error signal in the spatial domain is then added with an adder 240 to the prediction signal obtained either from the motion compensated prediction or intra-frame prediction 280. The reconstructed image s′ may be passed through a deblocking filter 250, sample adaptive offset processing 255, and an adaptive loop filter 260 and the resulting decoded signal is stored in the memory 270 to be applied for temporal or spatial prediction of the following blocks/images.
The information that is required for correct decoding and reconstruction of a video sequence is usually encoded and transmitted together with the video data in the transmitted bit stream. Information is usually allocated into video slices and different kinds of parameter sets. The particular syntax structures used and respective allocation schemes have a strong influence on coding efficiency as well as on the amount of data transmitted (network abstraction layer NAL).
Basically, there are two types of SAO and ALF filter estimation principles that are applied with standard hybrid coders, such as illustrated in FIG. 1. The first one is called frame-based filter parameter estimation (design). This means that the process of designing (optimizing) filter parameters is performed jointly for all of the pixels of a frame. In other words, in this approach a filter parameter set is designed jointly for all of the Largest Coding Units (LCU) of a frame.
The second type is called LCU-based filter parameter estimation. In this type, the process of designing filter parameters is performed one by one for each LCU in a frame. Usually, no look-ahead is allowed (as opposed to the frame-based method), meaning that the LCUs that follow the current LCU in the coding order are assumed to be unavailable to the filter design process.
Both types of filter estimation have certain advantages and drawbacks.
Frame-based filter estimation is superior to LCU-based estimation with respect to coding gain due to the joint estimation procedure. However, compared to the LCU-based approach, the frame-based approach creates additional delay in the encoder and requires additional external memory access. In view of the additional delay introduced by the frame-based approach, LCU-based ALF and SAO are more suitable for low-delay applications. In correspondence with the two different approaches to filter parameter estimation, two different syntax structures employed for encoding the filter parameter information have been developed.
A first syntax structure is called the frame-based filter parameter set syntax structure. This syntax structure is used to represent the filter parameter set that is designed for a whole frame. A frame-based syntax structure can be generated for each frame, meaning that the smallest unit is a frame. In accordance therewith, a single set of filter parameters for a filter is designed and transmitted corresponding to each frame in a sequence.
A second syntax structure is called the LCU-based filter parameter set syntax structure. The smallest syntax unit is an LCU. A parameter set syntax structure is generated for each LCU. The LCU-based syntax structure supports both frame-based filter parameter estimation and LCU-based filter parameter estimation. In accordance therewith, a filter parameter set for each filter is transmitted (signaled) for each LCU.
Further details regarding said syntax structures have been set forth in standardization documents and will be described in the detailed description section with reference to the respective standardization documents.
Both types of syntax structure have advantages and drawbacks that are closely related to the different types of filter parameter estimation schemes discussed above.
Since frame-based syntax is only applicable to frame-based filter parameter estimation, it creates an additional delay (frame-level encoding delay). Therefore, frame-based syntax is not suitable for low-delay applications such as teleconferencing. Further, the enhanced external memory access requirements in the encoder represent a drawback of frame-based syntax structures.
Therefore, the LCU-based syntax has been adopted to replace the frame-based syntax. The LCU-based syntax supports both LCU-based and frame-based filter estimation. Therefore, it is more flexible compared to frame-based parameter set syntax and can achieve lower encoding delays. However, it is a drawback of LCU-based syntax that an LCU parameter unit must be transmitted (signaled) for each LCU. Therefore, LCU-based syntax causes more parameter signaling overhead compared to the frame-based approach. Due to the higher level of signaling overhead, the LCU-based syntax causes coding loss compared to the frame-based syntax, even in the case of frame-based estimation. Since the filtering control parameters need to be signaled for each and every LCU in a frame, the size of the parameter syntax structure increases with increasing frame size and decreasing LCU size (i.e. increasing number of LCUs per frame).
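The dependence of the number of signaled parameter units on frame size and LCU size may be illustrated by a short calculation. The frame dimensions used below are example values chosen for this sketch, and the function name `lcu_count` is an assumption; the sketch only counts LCUs and makes no statement about the number of bits per parameter unit.

```python
def lcu_count(width, height, lcu_size):
    """Number of LCUs covering a frame; partial LCUs at the borders count."""
    cols = -(-width // lcu_size)    # ceiling division
    rows = -(-height // lcu_size)
    return cols * rows

# With LCU-based syntax, one parameter unit is signaled per LCU,
# whereas frame-based syntax signals a single parameter set per frame.
units_hd_64 = lcu_count(1920, 1080, 64)
units_hd_16 = lcu_count(1920, 1080, 16)
units_uhd_64 = lcu_count(3840, 2160, 64)
```

For a 1920×1080 frame this gives 510 parameter units with 64×64 LCUs and 8160 with 16×16 LCUs, while a 3840×2160 frame with 64×64 LCUs requires 2040, which reflects the growth with increasing frame size and decreasing LCU size.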