At present, the majority of standardized video coding algorithms are based on hybrid video coding. Hybrid video coding methods typically combine several different lossless and lossy compression schemes in order to achieve the desired compression gain. Hybrid video coding is also the basis for ITU-T standards (H.26x standards such as H.261, H.263) as well as ISO/IEC standards (MPEG-X standards such as MPEG-1, MPEG-2, and MPEG-4). The most recent and advanced video coding standard is currently the standard denoted as H.264/MPEG-4 Advanced Video Coding (AVC), which is the result of standardization efforts by the Joint Video Team (JVT), a joint team of the ITU-T and ISO/IEC MPEG groups.
A video signal input to an encoder is a sequence of images called frames, each frame being a two-dimensional matrix of pixels. All the above-mentioned standards based on hybrid video coding include subdividing each individual video frame into smaller blocks consisting of a plurality of pixels. Typically, a macroblock (usually denoting a block of 16×16 pixels) is the basic image element, for which the encoding is performed. However, various particular encoding steps may be performed for smaller image elements, denoted submacroblocks or simply blocks and having the size of, for instance, 8×8, 4×4, 16×8, etc.
Typically, the encoding steps of hybrid video coding include a spatial and/or a temporal prediction. Accordingly, each block to be encoded is first predicted using either the blocks in its spatial neighborhood or blocks from its temporal neighborhood, i.e., from previously encoded video frames. A block of differences between the block to be encoded and its prediction, also called a block of prediction residuals, is then calculated. Another encoding step is the transformation of a block of residuals from the spatial (pixel) domain into a frequency domain. The transformation aims at reducing the correlation of the input block. A further encoding step is the quantization of the transform coefficients. In this step, the actual lossy (irreversible) compression takes place. Usually, the compressed transform coefficient values are further compacted (losslessly compressed) by means of entropy coding. In addition, side information necessary for reconstruction of the encoded video signal is encoded and provided together with the encoded video signal. This is, for example, information about the spatial and/or temporal prediction, the amount of quantization, etc.
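The sequence of steps described above (prediction, residual calculation, transformation, quantization, and the corresponding inverse steps) may be illustrated by the following minimal sketch. It assumes a 4×4 block, a floating-point orthonormal DCT, and a single uniform quantization step; the function names `encode_block` and `decode_block` are hypothetical and are not taken from any standard, which in practice uses integer transforms.

```python
import math

N = 4  # block size, chosen only for illustration


def dct_matrix(n):
    """Orthonormal DCT-II basis matrix of size n x n."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                     for i in range(n)])
    return rows


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(col) for col in zip(*m)]


def encode_block(block, prediction, step):
    """Residual -> 2-D transform -> quantization (the lossy step)."""
    residual = [[b - p for b, p in zip(brow, prow)]
                for brow, prow in zip(block, prediction)]
    c = dct_matrix(N)
    coeffs = matmul(matmul(c, residual), transpose(c))
    return [[round(v / step) for v in row] for row in coeffs]


def decode_block(levels, prediction, step):
    """Inverse quantization -> inverse transform -> add prediction."""
    c = dct_matrix(N)
    coeffs = [[level * step for level in row] for row in levels]
    residual = matmul(matmul(transpose(c), coeffs), c)
    return [[round(p + e) for p, e in zip(prow, erow)]
            for prow, erow in zip(prediction, residual)]
```

With a very small step the block is recovered exactly, whereas with a very large step all quantized coefficients become zero and the reconstruction equals the prediction, which illustrates that the loss is introduced by the quantization alone.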
FIG. 1 is an example of a typical H.264/MPEG-4 AVC standard compliant video encoder 100. The H.264/MPEG-4 AVC standard combines all above-mentioned encoding steps. A subtractor 105 first determines differences between a current block to be encoded of an input video image (input signal) and a corresponding prediction block, which is used for the prediction of the current block to be encoded. In H.264/MPEG-4 AVC, the prediction signal is obtained either by a temporal or by a spatial prediction. The type of prediction can be varied on a per frame basis or on a per macroblock basis. Macroblocks predicted using temporal prediction are called inter-encoded and macroblocks predicted using spatial prediction are called intra-encoded. The type of prediction for a video frame can be set by the user or selected by the video encoder so as to achieve as high a compression gain as possible. In accordance with the selected type of prediction, an intra/inter switch 175 provides the corresponding prediction signal to the subtractor 105. The prediction signal using temporal prediction is derived from the previously encoded images, which are stored in a memory 140. The prediction signal using spatial prediction is derived from the values of boundary pixels in the neighboring blocks, which have been previously encoded, decoded, and stored in the memory 140. The memory 140 thus operates as a delay unit that allows a comparison between current signal values to be encoded and the prediction signal values generated from previous signal values. The memory 140 can store a plurality of previously encoded video frames. The difference between the input signal and the prediction signal, denoted prediction error or residual, is transformed, resulting in coefficients, which are quantized 110. An entropy encoder 190 is then applied to the quantized coefficients in order to further reduce the amount of data in a lossless way.
This is mainly achieved by applying a code with code words of variable length wherein the length of a code word is chosen based on the probability of occurrence thereof.
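The principle of assigning shorter code words to more probable values may be sketched with a classical Huffman code. This is an illustration of variable length coding in general and not the entropy coder of H.264/MPEG-4 AVC, which uses context-adaptive schemes (CAVLC and CABAC); the function name `huffman_code` is hypothetical.

```python
import heapq
from collections import Counter


def huffman_code(symbols):
    """Build a prefix-free code; frequent symbols receive short code words."""
    freq = Counter(symbols)
    if len(freq) == 1:  # degenerate case: a single distinct symbol
        return {next(iter(freq)): "0"}
    # Heap entries: (count, unique tie-breaker, {symbol: partial code word})
    heap = [(count, i, {sym: ""}) for i, (sym, count) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        c0, _, t0 = heapq.heappop(heap)  # the two least probable subtrees
        c1, _, t1 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in t0.items()}
        merged.update({s: "1" + code for s, code in t1.items()})
        heapq.heappush(heap, (c0 + c1, next_id, merged))
        next_id += 1
    return heap[0][2]


# Quantized coefficients are typically dominated by zeros.
levels = [0] * 12 + [1] * 3 + [-1] * 2 + [5]
code = huffman_code(levels)
total_bits = sum(len(code[v]) for v in levels)
```

In this example the value 0 receives a one-bit code word and the whole sequence needs fewer bits than a fixed two-bit code per value would require.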
Intra-encoded images (also called I-type images or I frames) consist solely of macroblocks that are intra-encoded, i.e., intra-encoded images can be decoded without reference to any other previously decoded image. The intra-encoded images provide error resilience for the encoded video sequence since they refresh the video sequence from errors possibly propagated from frame to frame due to temporal prediction. Moreover, I frames enable random access within the sequence of encoded video images. Intra-frame prediction uses a predefined set of intra-prediction modes, which basically predict the current block using the boundary pixels of the neighboring blocks already encoded. The different modes of spatial intra-prediction refer to different directions of the applied two-dimensional prediction. This allows efficient spatial intra-prediction in the case of various edge directions. The prediction signal obtained by such an intra-prediction is then subtracted from the input signal by the subtractor 105 as described above. In addition, spatial intra-prediction mode information is entropy encoded and provided together with the encoded video signal.
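A simplified sketch of directional intra-prediction from boundary pixels follows. It covers only three modes (vertical, horizontal, and DC) and selects among them by the sum of absolute differences; the selection criterion and all function names are assumptions made for illustration, and the standardized set of modes is larger.

```python
def predict_vertical(top, left):
    """Each column repeats the boundary pixel above it."""
    return [list(top) for _ in left]


def predict_horizontal(top, left):
    """Each row repeats the boundary pixel to its left."""
    return [[value] * len(top) for value in left]


def predict_dc(top, left):
    """All pixels take the rounded mean of the boundary pixels."""
    count = len(top) + len(left)
    dc = (sum(top) + sum(left) + count // 2) // count
    return [[dc] * len(top) for _ in left]


def sad(a, b):
    """Sum of absolute differences between two blocks."""
    return sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def best_intra_mode(block, top, left):
    """Return (mode name, prediction) with the smallest residual energy proxy."""
    modes = {"vertical": predict_vertical,
             "horizontal": predict_horizontal,
             "dc": predict_dc}
    candidates = {name: fn(top, left) for name, fn in modes.items()}
    best = min(candidates, key=lambda name: sad(block, candidates[name]))
    return best, candidates[best]


# A block with vertical stripes is matched best by the vertical mode.
top = [10, 20, 30, 40]
left = [10, 10, 10, 10]
block = [[10, 20, 30, 40] for _ in range(4)]
mode, prediction = best_intra_mode(block, top, left)
```

Only the name of the selected mode would need to be transmitted as side information, since the decoder can rebuild the same prediction from its own reconstructed boundary pixels.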
Within the video encoder 100, a decoding unit is incorporated for obtaining a decoded video signal. In compliance with the encoding steps, the decoding steps include inverse quantization and inverse transformation 120. The decoded prediction error signal differs from the original prediction error signal due to the quantization error, also called quantization noise. A reconstructed signal is then obtained by adding 125 the decoded prediction error signal to the prediction signal. In order to maintain the compatibility between the encoder side and the decoder side, the prediction signal is obtained based on the encoded and subsequently decoded video signal, which is known at both the encoder and the decoder. Due to the quantization, quantization noise is superimposed on the reconstructed video signal. Due to the block-wise coding, the superimposed noise often has blocking characteristics, which result, in particular for strong quantization, in visible block boundaries in the decoded image. Such blocking artifacts have a negative effect upon human visual perception. In order to reduce these artifacts, a deblocking filter 130 is applied to every reconstructed image block. The deblocking filter is applied to the reconstructed signal, which is the sum of the prediction signal and the quantized prediction error signal. The video signal after deblocking is the decoded signal, which is generally displayed at the decoder side (if no post filtering is applied). The deblocking filter of H.264/MPEG-4 AVC has the capability of local adaptation. In the case of a high degree of blocking noise, a strong (narrow-band) low pass filter is applied, whereas for a low degree of blocking noise, a weaker (broad-band) low pass filter is applied. The strength of the low pass filter is determined by the prediction signal and by the quantized prediction error signal. The deblocking filter generally smooths the block edges, leading to an improved subjective quality of the decoded images.
Moreover, since the filtered part of an image is used for the motion compensated prediction of further images, the filtering also reduces the prediction errors, and thus enables improvement of coding efficiency.
Intra-coded macroblocks are filtered before displaying, but intra prediction is carried out using the unfiltered reconstructed macroblocks.
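The local adaptation of a deblocking filter may be sketched for a single row of samples crossing a vertical block edge. The arithmetic below is only loosely modelled on H.264-style edge filtering; the thresholds `alpha` and `beta`, the clipping bound `tc`, and the way the strong or weak filter is chosen are hypothetical parameters supplied by the caller and are not the normative derivation.

```python
def clip3(lo, hi, value):
    return max(lo, min(hi, value))


def deblock_edge(row, edge, alpha, beta, strong, tc=2):
    """Smooth the two samples adjacent to the block edge at index `edge`.

    row[edge - 1] is the last sample of the left block (p0) and
    row[edge] is the first sample of the right block (q0).
    """
    p1, p0, q0, q1 = row[edge - 2], row[edge - 1], row[edge], row[edge + 1]
    out = list(row)
    # A large step or strong gradients indicate a real image edge: keep it.
    if abs(p0 - q0) >= alpha or abs(p1 - p0) >= beta or abs(q1 - q0) >= beta:
        return out
    if strong:
        # Stronger low pass: each edge sample is replaced by a weighted mean.
        out[edge - 1] = (2 * p1 + p0 + q1 + 2) >> 2
        out[edge] = (2 * q1 + q0 + p1 + 2) >> 2
    else:
        # Weaker low pass: a small correction, limited to +/- tc.
        delta = clip3(-tc, tc, (4 * (q0 - p0) + (p1 - q1) + 4) >> 3)
        out[edge - 1] = clip3(0, 255, p0 + delta)
        out[edge] = clip3(0, 255, q0 - delta)
    return out


blocky = [100, 100, 100, 100, 108, 108, 108, 108]
smoothed = deblock_edge(blocky, edge=4, alpha=20, beta=5, strong=True)
```

A small step of 8 between the two blocks is reduced, whereas a step of 100 at the same position exceeds `alpha` and is left untouched as genuine image content.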
In order to be decoded, inter-encoded images also require the previously encoded and subsequently decoded image(s). Temporal prediction may be performed uni-directionally, i.e., using only video frames ordered in time before the current frame to be encoded, or bi-directionally, i.e., using also video frames following the current frame. Uni-directional temporal prediction results in inter-encoded images called P frames; bi-directional temporal prediction results in inter-encoded images called B frames. In general, an inter-encoded image may comprise any of P-, B-, or even I-type macroblocks. An inter-encoded macroblock (P- or B-macroblock) is predicted by employing motion compensated prediction 160. First, a best-matching block is found for the current block within the previously encoded and decoded video frames by a motion estimator 165. The best-matching block then becomes a prediction signal, and the relative displacement (motion) between the current block and its best match is then signaled as motion data in the form of three-dimensional motion vectors within the side information provided together with the encoded video data. The three dimensions consist of two spatial dimensions and one temporal dimension. In order to optimize the prediction accuracy, motion vectors may be determined with a spatial sub-pixel resolution, e.g., half-pixel or quarter-pixel resolution. A motion vector with spatial sub-pixel resolution may point to a spatial position within an already decoded frame where no real pixel value is available, i.e., a sub-pixel position. Hence, spatial interpolation of such pixel values is needed in order to perform motion compensated prediction. This is achieved by an interpolation filter 150. According to the H.264/MPEG-4 AVC standard, a six-tap Wiener interpolation filter with fixed filter coefficients and a bilinear filter are applied in order to obtain pixel values for sub-pixel positions in vertical and horizontal directions separately.
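The sub-pixel interpolation may be sketched in one dimension as follows. The sketch uses the fixed six-tap coefficients (1, -5, 20, 20, -5, 1)/32 for half-pixel positions and bilinear averaging for quarter-pixel positions; the replication of border samples and the function names are assumptions made for illustration, and a full implementation applies the filter horizontally and vertically.

```python
TAPS = (1, -5, 20, 20, -5, 1)  # fixed six-tap filter, sum of taps is 32


def clip255(value):
    return max(0, min(255, value))


def half_pel(row, i):
    """Half-pixel sample between row[i] and row[i + 1].

    Samples outside the row are replaced by the nearest border sample.
    """
    acc = 0
    for k, tap in enumerate(TAPS):
        j = min(max(i - 2 + k, 0), len(row) - 1)
        acc += tap * row[j]
    return clip255((acc + 16) >> 5)  # rounded division by 32, then clipping


def quarter_pel(row, i):
    """Quarter-pixel sample: bilinear mean of row[i] and the half-pixel sample."""
    return (row[i] + half_pel(row, i) + 1) >> 1


ramp = [0, 10, 20, 30, 40, 50, 60, 70]
edge = [0, 0, 0, 0, 255, 255, 255, 255]
```

For the linear ramp the half-pixel sample between 30 and 40 evaluates to 35, while next to the sharp edge the negative taps produce an undershoot that the clipping brings back to the valid range.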
For both the intra- and the inter-encoding modes, the differences between the current input signal and the prediction signal are transformed and quantized by the unit 110, resulting in the quantized coefficients. Generally, an orthogonal transformation such as a two-dimensional discrete cosine transformation (DCT) or an integer version thereof is employed since it reduces the correlation of the natural video images efficiently. After the transformation, lower frequency components are usually more important for image quality than high frequency components, so that more bits can be spent for coding the low frequency components than the high frequency components. In the entropy coder, the two-dimensional matrix of quantized coefficients is converted into a one-dimensional array. Typically, this conversion is performed by a so-called zig-zag scanning, which starts with the DC coefficient in the upper left corner of the two-dimensional array and scans the two-dimensional array in a predetermined sequence ending with an AC coefficient in the lower right corner. As the energy is typically concentrated in the upper left part of the two-dimensional matrix of coefficients, corresponding to the lower frequencies, the zig-zag scanning results in an array where usually the last values are zero. This allows for efficient encoding using run-length codes as a part of, or before, the actual entropy coding.
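The zig-zag scanning and a subsequent run-length representation may be sketched as follows. The scan order is generated along anti-diagonals in the classical manner; the direction convention of a particular standard may differ, and the (run, level) pairs with an end-of-block marker are one common representation chosen here as an assumption.

```python
def zigzag_order(n):
    """(row, column) positions of an n x n block in zig-zag order."""
    order = []
    for s in range(2 * n - 1):  # s = row + column indexes one anti-diagonal
        diagonal = [(r, s - r) for r in range(n) if 0 <= s - r < n]
        # Alternate the traversal direction from one diagonal to the next.
        order.extend(reversed(diagonal) if s % 2 == 0 else diagonal)
    return order


def zigzag_scan(block):
    return [block[r][c] for r, c in zigzag_order(len(block))]


def run_length(sequence):
    """Encode as (number of preceding zeros, non-zero level) pairs."""
    pairs, run = [], 0
    for value in sequence:
        if value == 0:
            run += 1
        else:
            pairs.append((run, value))
            run = 0
    pairs.append("EOB")  # trailing zeros are implied by the end-of-block marker
    return pairs


quantized = [[12, 3, 0, 0],
             [-2, 4, 0, 0],
             [0, 0, 0, 0],
             [0, 0, 0, 0]]
scanned = zigzag_scan(quantized)
pairs = run_length(scanned)
# pairs == [(0, 12), (0, 3), (0, -2), (1, 4), "EOB"]
```

The eleven trailing zeros of the scanned array are not coded individually, which is the gain that the zig-zag ordering is intended to provide.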
H.264/MPEG-4 AVC employs scalar quantization 110, which can be controlled by a quantization parameter (QP) and a customizable quantization matrix (QM). One of 52 quantizers is selected for each macroblock by the quantization parameter. In addition, the quantization matrix is specifically designed to preserve certain frequencies in the source in order to avoid losing image quality. The quantization matrix in H.264/MPEG-4 AVC can be adapted to the video sequence and signaled together with the video data.
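The relation between the quantization parameter and the quantizer step size may be sketched as follows. The sketch assumes the commonly cited nominal step sizes, which double for every increase of the QP by 6, and models a quantization matrix entry as a weight relative to a flat value of 16; the standard itself specifies integer scaling operations rather than this floating-point form, and the function names are hypothetical.

```python
# Nominal step sizes for QP 0..5; larger QPs repeat them with doubling.
BASE_STEPS = (0.625, 0.6875, 0.8125, 0.875, 1.0, 1.125)


def qstep(qp):
    """Nominal quantizer step size for one of the 52 quantizers (QP 0..51)."""
    if not 0 <= qp <= 51:
        raise ValueError("QP must be in the range 0..51")
    return BASE_STEPS[qp % 6] * (1 << (qp // 6))


def quantize(coeff, qp, weight=16):
    """Scalar quantization; weight 16 corresponds to a flat matrix entry."""
    step = qstep(qp) * weight / 16.0
    sign = -1 if coeff < 0 else 1
    return sign * int(abs(coeff) / step + 0.5)


def dequantize(level, qp, weight=16):
    return level * qstep(qp) * weight / 16.0
```

For instance, a coefficient of 100 at QP 28 (step size 16) is quantized to level 6 and reconstructed as 96, whereas a matrix weight of 32 for the same frequency doubles the step and yields level 3.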
H.264/MPEG-4 AVC includes two functional layers, a Video Coding Layer (VCL) and a Network Abstraction Layer (NAL). The VCL provides the encoding functionality as briefly described above. The NAL encapsulates information elements into standardized units called NAL units according to their further application, such as transmission over a channel or storage in a storage device. The information elements are, for instance, the encoded prediction error signal or other information necessary for the decoding of the video signal such as the type of prediction, quantization parameter, motion vectors, etc. There are VCL NAL units containing the compressed video data and the related information, as well as non-VCL units encapsulating additional data such as a parameter set relating to an entire video sequence, or Supplemental Enhancement Information (SEI) providing additional information that can be used to improve the decoding performance.
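The distinction between VCL and non-VCL NAL units can be read from the one-byte NAL unit header, as sketched below. The field layout (one forbidden bit, two bits of `nal_ref_idc`, five bits of `nal_unit_type`) follows the H.264/MPEG-4 AVC syntax; the table lists only a few unit types, and the function name and the returned dictionary are assumptions made for illustration.

```python
# A small subset of NAL unit types, for illustration only.
NAL_TYPE_NAMES = {
    1: "coded slice of a non-IDR picture",
    5: "coded slice of an IDR picture",
    6: "supplemental enhancement information (SEI)",
    7: "sequence parameter set",
    8: "picture parameter set",
}


def parse_nal_header(first_byte):
    """Split the first byte of a NAL unit into its three header fields."""
    nal_unit_type = first_byte & 0x1F
    return {
        "forbidden_zero_bit": first_byte >> 7,
        "nal_ref_idc": (first_byte >> 5) & 0x03,
        "nal_unit_type": nal_unit_type,
        "name": NAL_TYPE_NAMES.get(nal_unit_type, "other"),
        # Types 1 to 5 carry coded slice data and are therefore VCL units.
        "is_vcl": 1 <= nal_unit_type <= 5,
    }


sei = parse_nal_header(0x06)
idr = parse_nal_header(0x65)
```

In this sketch the byte 0x06 is recognized as a non-VCL SEI unit, which is the kind of unit that may carry the post filter information discussed below, and 0x65 as a VCL unit carrying a slice of an IDR picture.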
In order to improve the image quality, a so-called post filter 280 may be applied at the decoder side 200. The H.264/MPEG-4 AVC standard allows the sending of post filter information for such a post filter via the SEI message. The post filter information is determined at the encoder side by means of a post filter design unit 180, which compares the locally decoded signal and the original input signal. In general, the post filter information is information allowing the decoder to set up an appropriate filter. It may directly include the filter coefficients or other information enabling the filter to be set up, such as cross-correlation information related to the uncompressed signal, for example the cross-correlation between the original input image and the decoded image or between the decoded image and the quantization noise. This cross-correlation information can be used to calculate the filter coefficients. The filter information, which is output by the post filter design unit 180, is also fed to the entropy coding unit 190 in order to be encoded and inserted into the encoded signal. At the decoder, the filter information may be used by a post filter, which is applied to the decoded signal before displaying.
FIG. 2 illustrates an example decoder 200 compliant with the H.264/MPEG-4 AVC video coding standard. The encoded video signal (input signal to the decoder) first passes to entropy decoder 290, which decodes the quantized coefficients, the information elements necessary for decoding such as motion data, mode of prediction etc., and the post filter information. The quantized coefficients are inversely scanned in order to obtain a two-dimensional matrix, which is then fed to inverse quantization and inverse transformation 220. After inverse quantization and inverse transformation, a decoded (quantized) prediction error signal is obtained, which corresponds to the differences obtained by subtracting the prediction signal from the signal input to the encoder in the case no quantization noise is introduced.
The prediction signal is obtained from either a temporal or a spatial prediction 260 and 270, respectively, which are switched 275 in accordance with a received information element signaling the prediction applied at the encoder. The decoded information elements further include the information necessary for the prediction, such as the prediction type in the case of intra-prediction and motion data in the case of motion compensated prediction. Depending on the current value of the motion vector, interpolation of pixel values may be needed in order to perform the motion compensated prediction. This interpolation is performed by an interpolation filter 250. The quantized prediction error signal in the spatial domain is then added by means of an adder 225 to the prediction signal obtained either from the motion compensated prediction 260 or the intra-frame prediction 270. The reconstructed image may be passed through a deblocking filter 230, and the resulting decoded signal is stored in the memory 240 to be applied for temporal or spatial prediction of the following blocks.
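The interplay of prediction, adder, and memory at the decoder, and the reason why the encoder predicts from the same reconstructed values, may be illustrated with a one-dimensional toy model. Each sample is predicted from the previously reconstructed sample held in a memory variable; the start value of 128, the uniform quantizer, and the function names are assumptions, and transformation and entropy coding are omitted.

```python
def clip255(value):
    return max(0, min(255, value))


def quantize(error, step):
    """Uniform quantization of a prediction error to an integer level."""
    sign = -1 if error < 0 else 1
    return sign * ((abs(error) + step // 2) // step)


def encode(samples, step):
    levels, memory = [], 128  # memory holds the last reconstructed sample
    for sample in samples:
        level = quantize(sample - memory, step)  # prediction error, quantized
        levels.append(level)
        memory = clip255(memory + level * step)  # local decoding in the encoder
    return levels


def decode(levels, step):
    decoded, memory = [], 128
    for level in levels:
        memory = clip255(memory + level * step)  # adder: prediction + error
        decoded.append(memory)
    return decoded


samples = [130, 135, 150, 149, 90]
decoded = decode(encode(samples, step=4), step=4)
# decoded == [132, 136, 152, 148, 88]
```

Because the encoder updates its memory with the reconstructed value rather than the original one, both sides form identical predictions, and the reconstruction error of every sample stays within half a quantization step instead of accumulating.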
The post filter information is fed to a post filter 280, which sets up a post filter accordingly. The post filter is then applied to the decoded signal in order to further improve the image quality. Thus, the post filter is capable of adapting to the properties of the video signal entering the encoder.
In summary, in order to reduce noise, several in-loop and post filter schemes are possible within present image and video coding standards. In these filter schemes, a filter may be deployed as a post filter for filtering the decoded signal before outputting it, or as an in-loop filter for filtering any part of the video signal during encoding and decoding, the filtered signal being typically stored in the frame memory in order to be used by the prediction. For instance, in the current H.264/MPEG-4 AVC standard, an interpolation filter and a deblocking filter are employed as in-loop filters. A post filter may also be applied. In general, the suitability of a filter depends on the image to be filtered. The coefficients of the post filter may be designed as Wiener filter coefficients. The Wiener filter is designed to minimize the mean square error between an input signal, which is the desired signal, and the noisy signal after the filter has been applied. The solution of a Wiener filter requires calculating the autocorrelation of the corrupted signal and the cross-correlation between the input signal and the corrupted signal. In video coding, quantization noise is superimposed on the original (input) video signal in the quantization step. Wiener filtering in the context of video coding aims at the reduction of the superimposed quantization noise in order to minimize the mean squared reconstruction error.
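The calculation of Wiener filter coefficients from the autocorrelation and the cross-correlation may be sketched for a one-dimensional signal as follows. The sketch estimates both correlations empirically, solves the resulting linear system by Gaussian elimination, and uses coarse rounding as a stand-in for quantization noise; the three-tap symmetric window, the test signal, and all function names are assumptions made for illustration.

```python
import math


def window(x, n, taps):
    """Samples x[n + half], ..., x[n - half] seen by the filter at position n."""
    half = taps // 2
    return [x[n + half - k] for k in range(taps)]


def solve(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def wiener_coefficients(desired, noisy, taps=3):
    """Autocorrelation matrix R and cross-correlation vector p, then R w = p."""
    half = taps // 2
    r = [[0.0] * taps for _ in range(taps)]
    p = [0.0] * taps
    for n in range(half, len(noisy) - half):
        w = window(noisy, n, taps)
        for i in range(taps):
            p[i] += desired[n] * w[i]       # cross-correlation (unnormalized)
            for j in range(taps):
                r[i][j] += w[i] * w[j]      # autocorrelation (unnormalized)
    return solve(r, p)


def apply_filter(noisy, coeffs):
    taps, half = len(coeffs), len(coeffs) // 2
    out = [float(v) for v in noisy]  # border samples are left unfiltered
    for n in range(half, len(noisy) - half):
        out[n] = sum(c * v for c, v in zip(coeffs, window(noisy, n, taps)))
    return out


def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


original = [128 + 50 * math.sin(n / 8.0) for n in range(200)]
noisy = [16 * round(v / 16) for v in original]  # coarse rounding as noise model
coeffs = wiener_coefficients(original, noisy)
filtered = apply_filter(noisy, coeffs)
```

Since the design requires the original signal, it can only be carried out at the encoder; the decoder needs either the resulting coefficients or the correlation information in order to set up the same filter. The filtered signal is never further from the original, in the mean square sense, than the unfiltered noisy signal, because leaving the signal unchanged is one of the filters the optimization could have chosen.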
Further details on adaptive filter design can be found, for example, in S. Haykin, “Adaptive Filter Theory”, Fourth Edition, Prentice Hall Information and System Sciences Series, Prentice Hall, 2002. Examples of Wiener filter configurations for video coding can be found in EP 1841230. Moreover, European patent application 08020651.9 shows various possible positions and designs of filters in the frequency domain which may be used during image/video encoding and decoding. Variants of noise reducing filters with more than one input are provided, for instance, in EP 2141927. Filters in the frequency domain can be used in order to reduce noise in the frequency domain, whereas filters in the spatial domain can be used in order to reduce noise in the spatial domain.
However, the image and/or video signal may contain various noise components having different statistics. For instance, an image or video may contain the already mentioned quantization noise, blocking noise present due to separate encoding of picture blocks, and/or the noise present in the original image/video sequence caused by the capturing device. These noise components may all have different characteristics, which may further vary in time and also with the spatial location. For instance, video signals acquired by film cameras typically contain additive camera noise, whereas video signals acquired by ultrasound sensors or synthetic aperture radar (SAR) sensors contain multiplicative noise. The blocking noise, on the other hand, is present only at the coding block borders. The quantization noise represents another source of additive noise, which may be inserted either in the spatial or in the frequency domain, depending on the type of prediction error coding applied. The type of prediction error coding may, again, vary within an image on a block basis. Consequently, filtering with a plurality of filters applied at different stages of encoding and decoding may be essential for improving the quality of the output signal.