The present invention relates to the filtering of images. In particular, the present invention relates to pipelining for filtering of reconstructed images in a decoder and a decoding loop of an encoder.
At present, the majority of standardized video coding algorithms are based on hybrid video coding. Hybrid video coding methods typically combine several different lossless and lossy compression schemes in order to achieve the desired compression gain. Hybrid video coding is also the basis for ITU-T standards (H.26x standards such as H.261, H.263) as well as ISO/IEC standards (MPEG-X standards such as MPEG-1, MPEG-2, and MPEG-4). The most recent and advanced video coding standard is currently the standard denoted as H.264/MPEG-4 Advanced Video Coding (AVC), which is the result of standardization efforts by the Joint Video Team (JVT), a joint team of the ITU-T and ISO/IEC MPEG groups. This codec is being further developed by the Joint Collaborative Team on Video Coding (JCT-VC) under the name High-Efficiency Video Coding (HEVC), aiming in particular at improved efficiency for high-resolution video coding.
A video signal input to an encoder is a sequence of images called frames, each frame being a two-dimensional matrix of pixels. All the above-mentioned standards based on hybrid video coding include subdividing each individual video frame into smaller blocks consisting of a plurality of pixels. The size of the blocks may vary, for instance, in accordance with the content of the image. The coding mode may typically be varied on a per block basis. The largest possible size for such a block, for instance in HEVC, is 64×64 pixels. It is then called the largest coding unit (LCU). In H.264/MPEG-4 AVC, a macroblock (usually denoting a block of 16×16 pixels) was the basic image element for which the encoding is performed, with the possibility of further dividing it into smaller subblocks to which some of the coding/decoding steps were applied.
Typically, the encoding steps of hybrid video coding include a spatial and/or a temporal prediction. Accordingly, each block to be encoded is first predicted using either the blocks in its spatial neighborhood or blocks from its temporal neighborhood, i.e. from previously encoded video frames. A block of differences between the block to be encoded and its prediction, also called a block of prediction residuals, is then calculated. Another encoding step is a transformation of the block of residuals from the spatial (pixel) domain into a frequency domain. The transformation aims at reducing the correlation of the input block. A further encoding step is the quantization of the transform coefficients. In this step the actual lossy (irreversible) compression takes place. Usually, the quantized transform coefficient values are further compacted (losslessly compressed) by means of entropy coding. In addition, side information necessary for reconstruction of the encoded video signal is encoded and provided together with the encoded video signal. This is, for example, information about the spatial and/or temporal prediction, the amount of quantization, etc.
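By way of illustration only, the transformation step mentioned above may be sketched with a one-dimensional orthonormal DCT-II. This is a minimal sketch, not the integer transform of any particular standard; the function name `dct_1d` and the sample values are assumptions made for the example. It shows that a smooth row of samples is represented mostly by its low-frequency coefficients while the total energy is preserved.

```python
import math

def dct_1d(samples):
    """Orthonormal 1-D DCT-II (illustrative; real codecs use 2-D integer variants)."""
    n = len(samples)
    coeffs = []
    for k in range(n):
        # Scaling factors that make the transform orthonormal.
        scale = math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)
        total = sum(x * math.cos(math.pi * (i + 0.5) * k / n)
                    for i, x in enumerate(samples))
        coeffs.append(scale * total)
    return coeffs

# A perfectly flat row: all energy ends up in the DC coefficient.
flat = dct_1d([5, 5, 5, 5])

# A smooth ramp (hypothetical sample values): energy is preserved by the transform.
ramp = [10, 12, 14, 16]
coeffs = dct_1d(ramp)
energy_in = sum(x * x for x in ramp)
energy_out = sum(c * c for c in coeffs)
```

Because the transform is orthonormal, it does not itself lose information; the loss is introduced only by the subsequent quantization.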
FIG. 1 is an example of a typical H.264/MPEG-4 AVC and/or HEVC video encoder 100. A subtractor 105 first determines differences e between a current block to be encoded of an input video image (input signal s) and a corresponding prediction block ŝ, which is used as a prediction of the current block to be encoded. The prediction signal may be obtained by a temporal or by a spatial prediction 180. The type of prediction can be varied on a per frame basis or on a per block basis. Blocks and/or frames predicted using temporal prediction are called “inter”-encoded and blocks and/or frames predicted using spatial prediction are called “intra”-encoded. The prediction signal using temporal prediction is derived from the previously encoded images, which are stored in a memory. The prediction signal using spatial prediction is derived from the values of boundary pixels in the neighboring blocks, which have been previously encoded, decoded, and stored in the memory. The difference e between the input signal and the prediction signal, denoted prediction error or residual, is transformed 110 resulting in coefficients, which are quantized 120. An entropy encoder 190 is then applied to the quantized coefficients in order to further reduce, in a lossless way, the amount of data to be stored and/or transmitted. This is mainly achieved by applying a code with code words of variable length, wherein the length of a code word is chosen based on the probability of its occurrence.
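The principle of choosing the code word length according to the probability of occurrence may be illustrated as follows. This is a minimal sketch that uses Huffman coding only as a familiar example of a variable-length code; it is not asserted to be the entropy coder of H.264/MPEG-4 AVC or HEVC, and the function name `huffman_code_lengths` and the symbol probabilities are assumptions made for the example.

```python
import heapq

def huffman_code_lengths(probabilities):
    """Return {symbol: code word length in bits} for a Huffman code."""
    # Heap entries: (probability, tie-breaker, symbols in this subtree).
    heap = [(p, i, [sym]) for i, (sym, p) in enumerate(probabilities.items())]
    heapq.heapify(heap)
    lengths = {sym: 0 for sym in probabilities}
    counter = len(heap)
    while len(heap) > 1:
        p1, _, syms1 = heapq.heappop(heap)
        p2, _, syms2 = heapq.heappop(heap)
        # Merging two subtrees adds one bit to every symbol below them.
        for sym in syms1 + syms2:
            lengths[sym] += 1
        heapq.heappush(heap, (p1 + p2, counter, syms1 + syms2))
        counter += 1
    return lengths

# Hypothetical probabilities of four quantized coefficient values.
probs = {0: 0.5, 1: 0.25, 2: 0.125, 3: 0.125}
lengths = huffman_code_lengths(probs)
average_bits = sum(probs[s] * lengths[s] for s in probs)
```

With these assumed probabilities the most frequent value receives a one-bit code word, and the average length of 1.75 bits per symbol is below the 2 bits a fixed-length code would need.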
Within the video encoder 100, a decoding unit is incorporated for obtaining a decoded (reconstructed) video signal s′. In compliance with the encoding steps, the decoding steps include dequantization and inverse transformation 130. The prediction error signal e′ obtained in this way differs from the original prediction error signal due to the quantization error, also called quantization noise. A reconstructed image signal s′ is then obtained by adding 140 the decoded prediction error signal e′ to the prediction signal ŝ. In order to maintain the compatibility between the encoder side and the decoder side, the prediction signal ŝ is obtained based on the encoded and subsequently decoded video signal, which is known at both the encoder side and the decoder side.
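The relationship between s, ŝ, e, e′ and s′ described above may be sketched as follows. This is a minimal sketch, assuming a uniform scalar quantizer applied directly to the residual samples (the transform is omitted for brevity); the function names and the step size `qstep` are hypothetical and not taken from any standard.

```python
import math

def quantize(residual, qstep):
    """Lossy step: map each residual sample to an integer level."""
    return [math.floor(e / qstep + 0.5) for e in residual]

def dequantize(levels, qstep):
    """Inverse step performed identically in the encoder loop and the decoder."""
    return [level * qstep for level in levels]

def reconstruct(source, prediction, qstep):
    residual = [s - p for s, p in zip(source, prediction)]          # e = s - ŝ
    levels = quantize(residual, qstep)                              # transmitted data
    decoded_residual = dequantize(levels, qstep)                    # e'
    recon = [p + e for p, e in zip(prediction, decoded_residual)]   # s' = ŝ + e'
    return levels, recon

# Hypothetical sample values for one row of a block.
source = [104, 110, 97, 120]
prediction = [100, 100, 100, 100]
levels, recon = reconstruct(source, prediction, qstep=8)
# The difference between recon and source is the quantization noise.
noise = [r - s for r, s in zip(recon, source)]
```

Since the decoder can only compute s′ and never s, the sketch makes clear why further predictions have to be based on the reconstructed samples rather than on the source samples.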
Due to the quantization, quantization noise is superposed on the reconstructed video signal. Due to the block-wise coding, the superposed noise often has blocking characteristics, which result, in particular for strong quantization, in visible block boundaries in the decoded image. Such blocking artifacts have a negative effect upon human visual perception. In order to reduce these artifacts, a deblocking filter 150 is applied to every reconstructed image block. The deblocking filter is applied to the reconstructed signal s′. For instance, the deblocking filter of H.264/MPEG-4 AVC has the capability of local adaptation. In the case of a high degree of blocking noise, a strong (narrow-band) low pass filter is applied, whereas for a low degree of blocking noise, a weaker (broad-band) low pass filter is applied. The strength of the low pass filter is determined by the prediction signal ŝ and by the quantized prediction error signal e′. The deblocking filter generally smooths the block edges, leading to an improved subjective quality of the decoded images. Moreover, since the filtered part of an image is used for the motion compensated prediction of further images, the filtering also reduces the prediction errors, and thus enables improvement of coding efficiency.
After the deblocking filter, a sample adaptive offset 155 and/or an adaptive loop filter 160 may be applied to the image including the already deblocked signal s″. Whereas the deblocking filter improves the subjective quality, Sample Adaptive Offset (SAO) and the Adaptive Loop Filter (ALF) aim at improving the pixel-wise fidelity (“objective” quality). In particular, SAO adds an offset in accordance with the immediate neighborhood of a pixel. The ALF is used to compensate image distortion caused by the compression. Typically, the adaptive loop filter is a Wiener filter with filter coefficients determined such that the mean square error (MSE) between the reconstructed image s′ and the source image s is minimized. The coefficients of the ALF may be calculated and transmitted on a frame basis. The ALF can be applied to the entire frame (image of the video sequence) or to local areas (blocks). Additional side information indicating which areas are to be filtered may be transmitted (block-based, frame-based or quadtree-based).
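The idea of adding an offset in accordance with the immediate neighborhood of a pixel may be illustrated in one dimension as follows. This is a simplified sketch, not the normative SAO process; the classification by the sum of two sign comparisons, the function names and the offset values are assumptions made for the example.

```python
def sign(x):
    return (x > 0) - (x < 0)

def edge_offset_1d(samples, offsets):
    """Add an offset to each interior sample based on its two neighbors."""
    out = list(samples)
    for i in range(1, len(samples) - 1):
        left, cur, right = samples[i - 1], samples[i], samples[i + 1]
        # -2: local minimum, +2: local maximum, -1/+1: edge shapes, 0: none.
        category = sign(cur - left) + sign(cur - right)
        # Classification always reads the unfiltered input samples.
        out[i] = cur + offsets.get(category, 0)
    # The first and last samples stay unfiltered: their outer neighbor is missing.
    return out

# Hypothetical deblocked row and hypothetical offsets per category.
row = [10, 10, 7, 10, 13, 10, 10]
offsets = {-2: 2, -1: 1, 1: -1, 2: -2}
filtered = edge_offset_1d(row, offsets)
```

The two border samples are left untouched because one of their neighbors is not available, which is the same kind of dependency that arises at the borders of a block when the neighboring block has not yet been processed.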
In order to be decoded, inter-encoded blocks also require that the previously encoded and subsequently decoded portions of image(s) be stored in the reference frame buffer 170. An inter-encoded block is predicted 180 by employing motion compensated prediction. First, a best-matching block is found for the current block within the previously encoded and decoded video frames by a motion estimator. The best-matching block then becomes a prediction signal, and the relative displacement (motion) between the current block and its best match is then signaled as motion data in the form of three-dimensional motion vectors within the side information provided together with the encoded video data. The three dimensions consist of two spatial dimensions and one temporal dimension. In order to optimize the prediction accuracy, motion vectors may be determined with a spatial sub-pixel resolution, e.g. half pixel or quarter pixel resolution. A motion vector with spatial sub-pixel resolution may point to a spatial position within an already decoded frame where no real pixel value is available, i.e. a sub-pixel position. Hence, spatial interpolation of such pixel values is needed in order to perform motion compensated prediction. This may be achieved by an interpolation filter (in FIG. 1 integrated within Prediction block 180).
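The search for a best-matching block may be sketched as follows. This is a minimal sketch, assuming an exhaustive search at integer-pixel positions in a single reference frame with the sum of absolute differences (SAD) as the matching cost; the temporal component of the motion vector and the sub-pixel interpolation are omitted, and all function names and sample values are hypothetical.

```python
def get_block(frame, top, left, size):
    return [row[left:left + size] for row in frame[top:top + size]]

def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(a - b)
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))

def full_search(current, reference, top, left, size, search_range):
    """Return (cost, dx, dy) of the best integer-pixel match."""
    target = get_block(current, top, left, size)
    height, width = len(reference), len(reference[0])
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            y, x = top + dy, left + dx
            # Skip candidates that lie partly outside the reference frame.
            if y < 0 or x < 0 or y + size > height or x + size > width:
                continue
            cost = sad(target, get_block(reference, y, x, size))
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best

# Hypothetical 6x6 reference frame; the current frame is shifted right by one pixel.
reference = [[r * 10 + c for c in range(6)] for r in range(6)]
current = [[row[max(c - 1, 0)] for c in range(6)] for row in reference]
best = full_search(current, reference, top=2, left=2, size=2, search_range=2)
```

In this example the content has moved one pixel to the right, so the best match in the reference frame lies one pixel to the left of the current block position and is found with zero cost.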
For both the intra- and the inter-encoding modes, the differences e between the current input signal and the prediction signal are transformed 110 and quantized 120, resulting in the quantized coefficients. Generally, an orthogonal transformation such as a two-dimensional discrete cosine transformation (DCT) or an integer version thereof is employed, since it reduces the correlation of natural video images efficiently. After the transformation, lower frequency components are usually more important for image quality than high frequency components, so that more bits can be spent for coding the low frequency components than the high frequency components. In the entropy coder, the two-dimensional matrix of quantized coefficients is converted into a one-dimensional array. Typically, this conversion is performed by a so-called zig-zag scanning, which starts with the DC coefficient in the upper left corner of the two-dimensional array and scans the two-dimensional array in a predetermined sequence ending with an AC coefficient in the lower right corner. As the energy is typically concentrated in the upper left part of the two-dimensional matrix of coefficients, corresponding to the lower frequencies, the zig-zag scanning results in an array where usually the last values are zero. This allows for efficient encoding using run-length codes as a part of/before the actual entropy coding.
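The zig-zag scanning and the subsequent run-length representation may be illustrated as follows. This is a minimal sketch, assuming a square block scanned along its anti-diagonals and a run-length code made of (zero run, level) pairs with the trailing zeros left implicit; the function names and the coefficient values are hypothetical.

```python
def zigzag_order(n):
    """Return the (row, column) positions of an n x n block in zig-zag order."""
    order = []
    for d in range(2 * n - 1):                      # d = row + column
        rows = range(max(0, d - n + 1), min(d, n - 1) + 1)
        if d % 2 == 0:
            rows = reversed(rows)                   # even diagonals run upwards
        order.extend((r, d - r) for r in rows)
    return order

def run_length(coeffs):
    """Encode as (number of preceding zeros, non-zero level) pairs."""
    pairs, run = [], 0
    for c in coeffs:
        if c == 0:
            run += 1
        else:
            pairs.append((run, c))
            run = 0
    return pairs                                    # trailing zeros are implied

# Hypothetical 4x4 block of quantized coefficients.
block = [
    [12, 5, 2, 0],
    [-3, 0, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
scanned = [block[r][c] for r, c in zigzag_order(4)]
pairs = run_length(scanned)
```

Here the sixteen coefficients reduce to five pairs, because the scan groups the zero-valued high-frequency coefficients at the end of the array.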
H.264/MPEG-4 AVC as well as HEVC include two functional layers, a Video Coding Layer (VCL) and a Network Abstraction Layer (NAL). The VCL provides the encoding functionality as briefly described above. The NAL encapsulates information elements into standardized units called NAL units according to their further application, such as transmission over a channel or storing in storage. The information elements are, for instance, the encoded prediction error signal or other information necessary for the decoding of the video signal, such as type of prediction, quantization parameter, motion vectors, etc. There are VCL NAL units containing the compressed video data and the related information, as well as non-VCL units encapsulating additional data such as a parameter set relating to an entire video sequence, or Supplemental Enhancement Information (SEI) providing additional information that can be used to improve the decoding performance.
FIG. 2 illustrates an example decoder 200 according to the H.264/MPEG-4 AVC or HEVC video coding standard. The encoded video signal (input signal to the decoder) first passes to entropy decoder 290, which decodes the quantized coefficients, the information elements necessary for decoding such as motion data, mode of prediction etc. The quantized coefficients are inversely scanned in order to obtain a two-dimensional matrix, which is then fed to inverse quantization and inverse transformation 230. After inverse quantization and inverse transformation 230, a decoded (quantized) prediction error signal e′ is obtained, which corresponds to the differences obtained by subtracting the prediction signal from the signal input to the encoder in the case no quantization noise is introduced and no error occurred.
The prediction signal is obtained from either a temporal or a spatial prediction 280. The decoded information elements usually further include the information necessary for the prediction such as prediction type in the case of intra-prediction and motion data in the case of motion compensated prediction. The quantized prediction error signal in the spatial domain is then added with an adder 240 to the prediction signal obtained either from the motion compensated prediction or intra-frame prediction 280. The reconstructed image s′ may be passed through a deblocking filter 250, sample adaptive offset processing 255, and an adaptive loop filter 260 and the resulting decoded signal is stored in the memory 270 to be applied for temporal or spatial prediction of the following blocks/images.
The present invention particularly relates to in-loop filtering processing. State of the art hybrid video encoders such as those illustrated in FIG. 1 and decoders such as those illustrated in FIG. 2 apply in-loop deblocking filter (DF), Sample Adaptive Offset (SAO) and Adaptive Loop Filter (ALF) processing stages before the reconstructed frame is displayed on the screen or stored in the reference frame buffer. In such video encoders/decoders, the filtering regions, i.e. the regions of an image for which a common set of filter parameters is determined and set, are aligned with the boundaries of Largest Coding Units (LCU).
Hardware implementations usually use the pipelining design concept as their backbone. A pipeline is defined as a set of fixed operations that are executed one after another, wherein the output of one operation is the input of the next. Since the pipeline is the backbone of the implementation, simplifications in the pipeline are considered very desirable.
The hardware implementation of the decoder and encoder usually employs LCU-based processing, which means that a single largest coding unit (LCU) or a region comprising a plurality of adjacent LCUs is processed at a time. An alternative hardware implementation, which will however not be further discussed in the framework of the present invention, is a frame-based implementation, which is restrictive since it requires a large amount of on-chip memory.
In the simplest case of processing on a single LCU basis, during the processing of an LCU, the neighboring LCUs on the right and at the bottom are not yet available, since their turn for processing has not yet come. Therefore, the filtering operations of SAO and ALF require special attention at the LCU borders, where the required samples are not yet available.
Thus, state of the art codec designs utilize a set of consecutive filtering operations to be performed one after the other, in a predefined filtering region (a single LCU or a plurality of adjacent LCUs). However, the following problem occurs:
Since the neighboring filtering regions are not available during the processing of a current filtering region, some of the samples at the borders of the filtering region cannot be processed by the filters right away. Instead, filtering operations at the filtering region boundaries are delayed and are performed together with the following filtering region in the decoding order. As a result, the filtering operation during the coding or decoding of a filtering region requires four different sets of filters: one filter set corresponding to the current filtering region, and three filter sets corresponding to the top, left and top-left neighboring filtering regions (for delayed filtering). Therefore, the decoding or encoding pipeline needs to be designed to perform the filtering operation in four different regions with four different filters.
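The origin of the four filter sets may be illustrated as follows. This is a minimal sketch under stated assumptions: filtering regions coincide with single LCUs, the samples within a hypothetical distance `delay` of the right and bottom border of an LCU cannot be filtered until the following LCUs are available, and the special handling of the last row and column of the picture is ignored. The function name `regions_touched` and the numeric values are chosen for the example only.

```python
def regions_touched(lcu_row, lcu_col, lcu_size, delay):
    """Filtering regions (row, col) whose samples are filtered while the
    LCU at (lcu_row, lcu_col) is processed. Assumes 0 < delay < lcu_size."""
    # The window of samples that can be filtered now is the LCU area
    # shifted up and to the left by `delay` samples (clipped at the picture edge).
    y0 = max(lcu_row * lcu_size - delay, 0)
    x0 = max(lcu_col * lcu_size - delay, 0)
    y1 = (lcu_row + 1) * lcu_size - delay           # exclusive
    x1 = (lcu_col + 1) * lcu_size - delay           # exclusive
    rows = range(y0 // lcu_size, (y1 - 1) // lcu_size + 1)
    cols = range(x0 // lcu_size, (x1 - 1) // lcu_size + 1)
    return [(r, c) for r in rows for c in cols]

# Hypothetical values: 64x64 LCUs and a delay of 4 samples.
first = regions_touched(0, 0, lcu_size=64, delay=4)
top_row = regions_touched(0, 1, lcu_size=64, delay=4)
interior = regions_touched(1, 1, lcu_size=64, delay=4)
```

For an interior LCU the shifted window overlaps the top-left, top, left and current regions, so four sets of filter parameters have to be at hand during a single processing step, whereas the first LCU of the picture needs only its own set.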