1. Technical Field
The present invention relates to a video coding method of coding video signals and a video decoding method of decoding the coded video signals, and in particular to methods of coding and decoding signals using filters for filtering the signals resulting from the coding and decoding.
2. Background Art
At present, most standardized video coding algorithms are based on hybrid video coding. Typically, hybrid video coding methods combine several different lossless and lossy compression schemes in order to achieve a desired compression gain. Hybrid video coding is also the basis for the ITU-T standards (the H.26x standards such as H.261 and H.263) as well as the ISO/IEC standards (the MPEG-X standards such as MPEG-1, MPEG-2, and MPEG-4). The most recent and advanced video coding standard is currently the standard denoted as H.264/MPEG-4 Advanced Video Coding (AVC), which is the result of the standardization efforts by the Joint Video Team (JVT), a joint team of the ITU-T and ISO/IEC MPEG groups.
A video signal input to an encoder is a sequence of images called frames. Each frame is a two-dimensional matrix of pixels. All the above-mentioned standards based on hybrid video coding include subdividing each individual video frame into smaller blocks each consisting of a plurality of pixels. Typically, a macroblock (usually denoting a block composed of 16×16 pixels) is an image element as a basic unit of coding. However, various particular coding steps may be performed for smaller image elements which are, for example, submacroblocks having a size of 8×8, 4×4, 16×8, or the like, or other units of blocks.
Typically, the coding steps in hybrid video coding include a spatial and/or a temporal prediction. Accordingly, each current block to be coded is first predicted using either the blocks in its spatial neighborhood or blocks from its temporal neighborhood, that is, from previously coded video frames. Next, the difference between the current block to be coded and its prediction is calculated; this difference block is also referred to as a prediction residual or a prediction error signal. The next coding step is to transform the residual block (the prediction error signal) from the spatial (pixel) domain to a frequency domain. The transform aims at reducing the redundancy of the residual block. The subsequent coding step is to quantize the transform coefficients. In this step, the actual lossy (irreversible) compression is performed. Usually, the compressed transform coefficient values (quantized coefficients) are further compacted (losslessly compressed) by means of entropy coding. In addition, supplementary information necessary to reconstruct the coded video signal is coded and provided together with the coded video signal. This information is, for example, information about the spatial and/or temporal prediction, the amount of quantization, or the like.
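The coding steps described above can be sketched for a single block as follows. This is a minimal illustration only, not the procedure of any particular standard: the 4×4 block size, the floating-point orthonormal DCT, the uniform quantizer, and all function names (encode_block, decode_block, and so on) are assumptions introduced for the example, and the entropy coding step is omitted.

```python
import math

N = 4  # block size assumed for this sketch


def dct_matrix(n):
    """Orthonormal DCT-II basis as an n x n list of rows."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                     for i in range(n)])
    return rows


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


C = dct_matrix(N)
C_T = transpose(C)


def encode_block(current, prediction, step):
    """Prediction residual -> 2-D transform -> uniform quantization."""
    residual = [[c - p for c, p in zip(cur_row, pred_row)]
                for cur_row, pred_row in zip(current, prediction)]
    coeffs = matmul(matmul(C, residual), C_T)
    # Rounding to integer levels is the only lossy step in this sketch.
    return [[round(v / step) for v in row] for row in coeffs]


def decode_block(levels, prediction, step):
    """Inverse quantization -> inverse transform -> add the prediction."""
    coeffs = [[level * step for level in row] for row in levels]
    residual = matmul(matmul(C_T, coeffs), C)
    return [[p + r for p, r in zip(pred_row, res_row)]
            for pred_row, res_row in zip(prediction, residual)]


prediction = [[100] * N for _ in range(N)]
current = [[110] * N for _ in range(N)]
levels = encode_block(current, prediction, step=4)
print(levels[0])  # only the DC level is non-zero for a flat residual
```

Because the transform in this sketch is orthonormal, the reconstruction error stays bounded by the quantization step size, which is why a coarser step trades quality for fewer bits.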
FIG. 1 is a block diagram showing an example of a typical video coding apparatus (encoder) 100 compliant with the H.264/MPEG-4 AVC standard. The H.264/MPEG-4 AVC standard is a combination of all the above-mentioned coding steps. A subtractor 105 first determines differences between a current block to be coded in an input video image (input signal) and a corresponding prediction block (a prediction signal). This difference, rather than the current block itself, is what is subsequently coded. In H.264/MPEG-4 AVC, the prediction signal is generated either by a temporal prediction or by a spatial prediction. The type of prediction can be varied on a per frame basis or on a per macroblock basis. Macroblocks predicted using temporal prediction (inter prediction) are called inter-coded macroblocks, and macroblocks predicted using spatial prediction (intra prediction) are called intra-coded macroblocks. The type of prediction for a video frame can be set by the user or selected by the video coding apparatus 100 so as to achieve a compression gain that is as high as possible. In accordance with the selected prediction type, an intra/inter switch 175 provides a corresponding prediction signal to the subtractor 105. The prediction signal which is generated using temporal prediction is calculated from a reconstructed image (a reconstructed image signal) which is stored in a memory 140. The prediction signal which is generated using spatial prediction is calculated from the value(s) of boundary pixel(s) in the neighboring block(s) which is/are previously coded, decoded, and stored in the memory 140. The memory 140 thus operates as a delay unit that allows a comparison between the current signal value to be coded and the prediction signal value generated from the previous signal value(s). The memory 140 can store a plurality of previously coded video frames. The difference between the input signal and the prediction signal is referred to as a prediction error signal or a residual.
A transform/quantization unit 110 transforms the prediction error signal into coefficients of frequency components, and quantizes the transformed coefficients. An entropy coding unit 190 entropy-codes the quantized coefficients in order to further reduce the amount of data in a lossless way. Such reduction is mainly achieved by applying variable length coding using codewords having variable lengths that are determined based on the occurrence probabilities of the respective codewords.
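As a small illustration of variable length coding, the following sketch implements unsigned Exp-Golomb codes, in which smaller (and, by assumption, more probable) values receive shorter codewords. Codes of this family are used in H.264/MPEG-4 AVC for many syntax elements, but the text above does not name a specific code, so this choice and the function names are assumptions made for the example; the context-adaptive schemes used for coefficient data are not shown.

```python
def exp_golomb_encode(value):
    """Return the unsigned Exp-Golomb codeword for a non-negative integer."""
    if value < 0:
        raise ValueError("value must be non-negative")
    binary = bin(value + 1)[2:]          # binary representation of value + 1
    return "0" * (len(binary) - 1) + binary


def exp_golomb_decode(bits):
    """Decode one codeword from the front of a bit string.

    Returns (value, remaining_bits).
    """
    zeros = 0
    while bits[zeros] == "0":
        zeros += 1
    end = 2 * zeros + 1
    return int(bits[zeros:end], 2) - 1, bits[end:]


for v in range(5):
    print(v, exp_golomb_encode(v))
# prints:
# 0 1
# 1 010
# 2 011
# 3 00100
# 4 00101
```

The codewords are prefix-free, so a decoder can split a concatenated bit string back into values without any separators.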
Intra-coded images (also referred to as I-pictures, I-images or I-frames) consist only of intra-coded macroblocks. Thus, the intra-coded images can be decoded without reference to any other previously decoded image. The intra-coded images provide error resilience for the resulting coded video sequence. This is because the intra-coded images are images for removing (refreshing) errors that otherwise propagate from frame to frame in the video sequence due to temporal prediction. Moreover, each I-frame enables a random access within the resulting coded video sequence. Basically, intra-frame prediction is performed by using a predefined set of intra-prediction modes for predicting a current block based on the boundary pixels in the neighboring blocks already coded. Different spatial intra-prediction modes are performed by applying different two-dimensional prediction directions. This allows efficient spatial intra prediction in the case of various edge directions. The prediction signal generated by such an intra prediction is then subtracted from the input signal by the subtractor 105 as described above. In addition, information indicating a spatial intra prediction mode is entropy-coded and provided together with the coded video signal.
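The idea of directional intra prediction from boundary pixels can be sketched with three simple modes for a 4×4 block. The mode names, the sum-of-absolute-differences selection criterion, and the function names are assumptions made for this illustration; the full set of prediction modes is larger and is not reproduced here.

```python
MODES = ("vertical", "horizontal", "dc")


def predict_intra_4x4(top, left, mode):
    """Predict a 4x4 block from the 4 pixels above and the 4 pixels to the left."""
    if mode == "vertical":      # copy the row above downward
        return [list(top) for _ in range(4)]
    if mode == "horizontal":    # copy the left column across
        return [[left[r]] * 4 for r in range(4)]
    if mode == "dc":            # rounded mean of the 8 boundary pixels
        dc = (sum(top) + sum(left) + 4) >> 3
        return [[dc] * 4 for _ in range(4)]
    raise ValueError("unknown mode: " + mode)


def sad(block_a, block_b):
    """Sum of absolute differences between two blocks."""
    return sum(abs(a - b) for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))


def best_intra_mode(block, top, left):
    """Pick the mode whose prediction is closest to the block (hypothetical criterion)."""
    return min(MODES, key=lambda m: sad(block, predict_intra_4x4(top, left, m)))


# A block with vertical stripes continues the pattern of the row above it.
top = [10, 50, 90, 130]
left = [10, 10, 10, 10]
block = [list(top) for _ in range(4)]
print(best_intra_mode(block, top, left))  # prints "vertical"
```

Only the chosen mode needs to be signaled; the decoder regenerates the same prediction from boundary pixels it has already reconstructed.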
The video coding apparatus 100 includes a decoding unit which generates a decoded video signal. The video coding apparatus 100 further includes an inverse quantization/inverse transform unit 120 which executes the decoding steps corresponding to the coding steps. The inverse quantization/inverse transform unit 120 generates a quantized prediction error signal by inversely quantizing and inversely transforming the quantized coefficients. The quantized prediction error signal differs from the original prediction error signal due to a quantization error that is also referred to as quantization noise. An adder 125 generates a reconstructed signal by adding the quantized prediction error signal to the prediction signal. In order to maintain the compatibility between the encoder side (the video coding apparatus 100) and the decoder side (the video decoding apparatus), a prediction signal known at both the encoder and decoder sides is generated using the reconstructed signal, that is, the video signal coded and then decoded. Due to the quantization, the quantization noise is superimposed on the reconstructed video signal. Due to the block-based coding, the superimposed noise often has blocking characteristics which result in noticeable block boundaries in the decoded image represented by the reconstructed signal, in particular when strong quantization is performed. Such blocking artifacts (block distortions) have a negative effect upon human visual perception.
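Why the encoder must predict from the reconstructed signal, and not from the original, can be seen in a one-dimensional sketch. The previous-sample predictor, the step size of 8, and the function names are assumptions made for the illustration; they stand in for the block-based prediction described above.

```python
def quantize(residual, step):
    """Map a residual to an integer level (the lossy step)."""
    return round(residual / step)


def encode_closed_loop(samples, step):
    """Predict each sample from the previous RECONSTRUCTED sample."""
    levels, reconstructed_prev = [], 0
    for x in samples:
        level = quantize(x - reconstructed_prev, step)
        levels.append(level)
        reconstructed_prev += level * step   # same value the decoder will hold
    return levels


def encode_open_loop(samples, step):
    """Predict each sample from the previous ORIGINAL sample (decoder cannot do this)."""
    levels, original_prev = [], 0
    for x in samples:
        levels.append(quantize(x - original_prev, step))
        original_prev = x
    return levels


def decode(levels, step):
    """The decoder only ever has its own reconstruction to predict from."""
    out, prev = [], 0
    for level in levels:
        prev += level * step
        out.append(prev)
    return out


STEP = 8
samples = [3 * i for i in range(10)]          # a slow ramp: 0, 3, ..., 27
closed = decode(encode_closed_loop(samples, STEP), STEP)
drifting = decode(encode_open_loop(samples, STEP), STEP)
closed_error = max(abs(x - y) for x, y in zip(samples, closed))
open_error = max(abs(x - y) for x, y in zip(samples, drifting))
print(closed_error, open_error)  # prints "4 27"
```

With the closed loop the error never exceeds half the step size, whereas the open-loop variant lets quantization errors accumulate in the decoder.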
In order to reduce these artifacts, a deblocking filter 130 performs deblocking filtering for each block of the decoded image. The deblocking filtering is performed on the reconstructed signal which is the sum of the prediction signal and the quantized prediction error signal. The reconstructed video signal, that is, the reconstructed signal after being subjected to the deblocking filtering, is the decoded signal which is generally displayed at the decoder side (if no post filtering is performed). The deblocking filter in H.264/MPEG-4 AVC can be applied locally. In the case of a high degree of blocking noise, a strong (narrow-band) low-pass filter is applied, whereas in the case of a low degree of blocking noise, a weaker (broad-band) low-pass filter is applied. The strength of the low-pass filter is determined by the prediction signal and by the quantized prediction error signal. A deblocking filter generally smooths the block edges, which leads to an enhanced subjective quality of the decoded image. Moreover, since the filtered part of an image is used for the motion compensation prediction of the following images, the filtering also reduces the prediction errors, and thus enables an increase in the coding efficiency.
Intra-coded macroblocks are filtered before being displayed, but intra prediction is carried out using the macroblocks represented by the reconstructed signal that is not yet filtered.
FIG. 2 is a diagram for illustrating processing performed by the deblocking filter 130. The deblocking filter 130 separates samples p3, p2, p1, and p0 of a first block 301 on its left and samples q3, q2, q1, and q0 of a second block 302 on its right, and performs deblocking filtering at the vertical block boundary 310. A linear deblocking filtering with four coefficients is applied to the input samples p2, p1, p0, q0, q1 and q2, which produces, as the samples already subjected to the deblocking filtering, the following filtered outputs p0,new and q0,new:
$$p_{0,\mathrm{new}} = (p_2 - (p_1 \ll 1) + (p_0 + q_0 + 1) \gg 1) \gg 1$$
$$q_{0,\mathrm{new}} = (q_2 - (q_1 \ll 1) + (q_0 + p_0 + 1) \gg 1) \gg 1$$
The reconstructed video signal is then stored in the memory 140.
In order to be decoded, inter-coded images also require the previously coded and subsequently decoded image(s). Temporal prediction may be performed uni-directionally (that is, using only video frames temporally before the current frame to be coded), or bi-directionally (that is, using also video frames preceding and following the current frame). Uni-directional temporal prediction results in inter-coded images called P-frames (P-pictures), and bi-directional temporal prediction results in inter-coded images called B-frames (B-pictures). In general, an inter-coded image may be composed of any of a P-macroblock and a B-macroblock, and possibly even an I-macroblock.
A motion compensation prediction unit 160 predicts an inter-coded macroblock (a P-macroblock or a B-macroblock). First, a motion estimation unit 165 detects a best-matching block for the current block within one of the previously coded and decoded video frames. The aforementioned prediction signal shows this best-matching block. The motion estimation unit 165 signals the relative displacement (motion) between the current block and its best matching block, as motion data in the form of a three-dimensional motion vector that is included in the supplementary information provided together with the coded video signal. The three dimensions consist of two spatial dimensions and one temporal dimension. In order to optimize the prediction accuracy, a motion vector may be determined with a spatial sub-pixel resolution such as the half pixel or quarter pixel resolution. A motion vector with a spatial sub-pixel resolution may point to a spatial position such as a sub-pixel position which is within an already decoded video frame and at which no real pixel value is available. Hence, spatial interpolation of such pixel values is needed in order to perform motion compensation prediction. The interpolation filter 150 interpolates such spatial pixel values. According to the H.264/MPEG-4 AVC standard, a six-tap Wiener interpolation filter having fixed filter coefficients and a bilinear filter are applied in order to generate pixel values at sub-pixel positions in both the vertical and horizontal directions.
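Sub-pixel interpolation with a fixed six-tap filter followed by bilinear averaging can be sketched along a single row of pixels as follows. The tap values (1, −5, 20, 20, −5, 1) with a divisor of 32 are not stated in the text above; they are the values commonly cited for H.264 luma half-sample interpolation and are used here as an assumption, as are the function names and the restriction to one dimension.

```python
TAPS = (1, -5, 20, 20, -5, 1)   # assumed tap values; they sum to 32


def clip_pixel(value):
    """Clamp to the 8-bit sample range."""
    return max(0, min(255, value))


def half_pel(row, i):
    """Interpolate the sample halfway between row[i] and row[i + 1].

    Requires i - 2 >= 0 and i + 3 < len(row).
    """
    acc = sum(tap * row[i - 2 + k] for k, tap in enumerate(TAPS))
    return clip_pixel((acc + 16) >> 5)   # rounded division by 32


def quarter_pel(row, i):
    """Interpolate a quarter sample to the right of row[i].

    Bilinear average of the integer sample and the neighboring half sample.
    """
    return (row[i] + half_pel(row, i) + 1) >> 1


ramp = [0, 10, 20, 30, 40, 50]
edge = [0, 0, 0, 255, 255, 255]
print(half_pel(ramp, 2), quarter_pel(ramp, 2), half_pel(edge, 2))  # prints "25 23 128"
```

A motion vector with quarter-pixel resolution would select among the integer, half, and quarter positions computed this way.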
In the intra- and inter-coding modes, the transform/quantization unit 110 transforms and quantizes the prediction error signals that are differences between the input signal and the prediction signal to generate quantized coefficients. Generally, an orthogonal transform such as a two-dimensional discrete cosine transform (DCT) or an integer version thereof is employed. The orthogonal transform is performed to reduce the redundancies of the natural video images efficiently. After the transform, lower frequency components are usually more important for the image quality than high frequency components. Thus, more bits can be spent for coding the low frequency components than the high frequency components. An entropy coding unit 190 converts the two-dimensional matrix of quantized coefficients into a one-dimensional array. Typically, this conversion is performed by so-called zig-zag scanning. The zig-zag scanning is performed starting with the DC coefficient in the upper left corner of the two-dimensional matrix and ending with the AC coefficient in the lower right corner according to a predetermined sequential order. The energy is typically concentrated in the lower frequencies corresponding to the upper left part of the two-dimensional matrix of coefficients. Thus, the zig-zag scanning usually results in an array where the last values are sequential zeros. In this way, it is possible to perform efficient coding using run-length codes as a part of or at a pre-stage of the actual entropy coding.
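The zig-zag scan and the subsequent run-length representation can be sketched as follows. The (run, level) pair format and the "EOB" end-of-block marker are assumptions made for the illustration, as are the function names; the actual entropy coding syntax is not reproduced.

```python
def zigzag_order(n):
    """Return (row, column) positions of an n x n block in zig-zag order."""
    order = []
    for s in range(2 * n - 1):                       # s = row + column (anti-diagonal)
        rows = range(max(0, s - n + 1), min(s, n - 1) + 1)
        if s % 2 == 0:                               # even diagonals run upward
            rows = reversed(rows)
        order.extend((r, s - r) for r in rows)
    return order


def zigzag_scan(block):
    """Flatten a square block into a one-dimensional list."""
    return [block[r][c] for r, c in zigzag_order(len(block))]


def run_length(sequence):
    """Encode as (zeros_before, level) pairs followed by an end-of-block marker."""
    pairs, run = [], 0
    for value in sequence:
        if value == 0:
            run += 1
        else:
            pairs.append((run, value))
            run = 0
    pairs.append("EOB")                              # trailing zeros are implied
    return pairs


quantized = [[12, 5, 2, 0],
             [-3, 0, 0, 0],
             [1, 0, 0, 0],
             [0, 0, 0, 0]]
scanned = zigzag_scan(quantized)
print(scanned[:6])          # prints "[12, 5, -3, 1, 0, 2]"
print(run_length(scanned))  # prints "[(0, 12), (0, 5), (0, -3), (0, 1), (1, 2), 'EOB']"
```

The ten trailing zeros of the scanned array collapse into the single end-of-block marker, which is the gain the scan order is designed to produce.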
The H.264/MPEG-4 AVC standard employs scalar quantization which can be controlled by a quantization parameter (QP) and a customizable quantization matrix (QM). For each macroblock, a corresponding one of 52 quantizers is selected by the quantization parameter. In addition, the quantization matrix is specifically designed to keep certain frequencies in the source in order to avoid degradation of image quality. A quantization matrix in H.264/MPEG-4 AVC can be adapted to the video sequence and signaled together with the coded video signal.
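The relation between the quantization parameter and the quantizer step size can be sketched as follows. The step sizes are not given in the text above; the six base values used here are those commonly tabulated for H.264, where the step size doubles for every increase of 6 in QP. The floating-point arithmetic, the per-coefficient weight with a neutral value of 16, and the function names are simplifying assumptions and do not reproduce the integer arithmetic of an actual codec.

```python
BASE_STEPS = (0.625, 0.6875, 0.8125, 0.875, 1.0, 1.125)  # assumed values for QP 0..5


def quantizer_step(qp):
    """Step size for a quantization parameter in the range 0..51 (52 quantizers)."""
    if not 0 <= qp <= 51:
        raise ValueError("QP must be in 0..51")
    return BASE_STEPS[qp % 6] * 2 ** (qp // 6)


def quantize(coefficient, qp, weight=16):
    """Scalar quantization of one transform coefficient.

    `weight` stands in for one entry of a quantization matrix; 16 is taken
    as the neutral value in this sketch, and larger weights quantize the
    corresponding frequency more coarsely.
    """
    step = quantizer_step(qp) * weight / 16
    return round(coefficient / step)


print(quantizer_step(0), quantizer_step(28), quantizer_step(51))  # prints "0.625 16.0 224.0"
print(quantize(100, 28), quantize(100, 28, weight=32))            # prints "6 3"
```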
The H.264/MPEG-4 AVC standard includes two functional layers that are a Video Coding Layer (VCL) and a Network Abstraction Layer (NAL). The VCL provides the coding functionality as briefly described above. The NAL encapsulates the coded prediction error signal together with the supplemental information necessary for the decoding of the video into standardized units called NAL units according to their further application(s) such as transmission over a channel and/or storing in a storage. There are VCL NAL units containing the compressed video data and the related information. There are also non-VCL NAL units that encapsulate additional data. Examples of such additional data include a parameter set relating to an entire video sequence, or recently added Supplemental Enhancement Information (SEI) providing additional information that can be used to increase the decoding performance.
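How a receiver can tell VCL NAL units from non-VCL NAL units may be sketched by parsing the one-byte NAL unit header. The header layout (one forbidden bit, a two-bit reference indicator, a five-bit unit type) and the type numbers listed are not given in the text above; they are taken from the H.264 NAL unit syntax as commonly documented, and the function names and the abbreviated type table are assumptions made for the example.

```python
NAL_TYPE_NAMES = {          # abbreviated table, for illustration only
    1: "non-IDR slice",
    5: "IDR slice",
    6: "SEI",
    7: "sequence parameter set",
    8: "picture parameter set",
}


def parse_nal_header(first_byte):
    """Split the first byte of a NAL unit into its three header fields."""
    return {
        "forbidden_zero_bit": first_byte >> 7,
        "nal_ref_idc": (first_byte >> 5) & 0x3,
        "nal_unit_type": first_byte & 0x1F,
    }


def is_vcl(nal_unit_type):
    """Types 1 to 5 carry coded slice data; other types are non-VCL."""
    return 1 <= nal_unit_type <= 5


for byte in (0x67, 0x65, 0x06):
    header = parse_nal_header(byte)
    kind = NAL_TYPE_NAMES.get(header["nal_unit_type"], "other")
    print(hex(byte), kind, "VCL" if is_vcl(header["nal_unit_type"]) else "non-VCL")
# prints:
# 0x67 sequence parameter set non-VCL
# 0x65 IDR slice VCL
# 0x6 SEI non-VCL
```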
In order to enhance the image quality, a so-called post filter 280 (see FIG. 3) may be applied at the decoder (the video decoding apparatus 200). The H.264/MPEG-4 AVC standard allows the sending of post filter data for such post filtering via the Supplemental Enhancement Information (SEI) message. The post filter design unit 180 identifies the filter data (a so-called filter hint for post filtering) by comparing the locally decoded signal (the reconstructed video signal) and the original input signal. In general, the filter data is information used by a decoder to set up an appropriate filter condition. For example, the filter data may include filter coefficients as they are. However, the filter data may also include other information that enables the setup of the filter. Examples of the other information include cross-correlation information related to the uncompressed signal, cross-correlation information between the original input image and the decoded image, and cross-correlation information between the decoded image and the quantization noise. This cross-correlation information can be used to calculate the filter coefficients. The filter data which is output by the post filter design unit 180 is also transmitted to the entropy coding unit 190 in order to be coded and inserted into the coded video signal.
The decoder may apply the filter data to the decoded signal before display of the decoded signal (the decoded video signal).
FIG. 3 is a block diagram of an exemplary video decoding apparatus (decoder) 200 compliant with the H.264/MPEG-4 AVC video coding standard. The input signal that is the coded video signal is first transmitted to the entropy decoding unit 290. The entropy decoding unit 290 entropy-decodes the input signal. This yields the quantized coefficients, the information elements necessary for decoding motion data, prediction modes, etc., and the filter data. The one-dimensional array of quantized coefficients is inversely scanned to be a two-dimensional matrix, and the two-dimensional matrix is then transmitted to the inverse quantization/inverse transform unit 220. The inverse quantization/inverse transform unit 220 generates a quantized prediction error signal by inversely quantizing and inversely transforming the quantized coefficients of the two-dimensional matrix. This corresponds to the differences generated by subtracting the prediction signal from the input signal input to the encoder in the case where no quantization noise is introduced.
The prediction signal is generated by either a motion compensation prediction unit 260 or an intra prediction unit 270. The intra/inter switch 275 switches prediction signals to be output to the adder 225, according to an information element indicating the type of prediction applied at the encoder. An information element in the case of intra prediction further includes information such as the intra-prediction mode necessary for the intra prediction, and an information element in the case of motion compensation prediction further includes information such as motion data necessary for the motion compensation prediction. Depending on the current value of the motion vector, interpolation of pixel values may be required to perform motion compensation prediction. This interpolation is performed by an interpolation filter 250. The adder 225 generates the reconstructed signal by adding a quantized prediction error signal in the spatial domain to the prediction signal obtainable either from the motion compensation prediction unit 260 or the intra prediction unit 270. Furthermore, the adder 225 transmits the reconstructed signal to a deblocking filter 230. The deblocking filter 230 generates a reconstructed video signal by performing deblocking filtering on the reconstructed signal, and stores the reconstructed video signal in the memory 240. The reconstructed video signal is used for temporal prediction or spatial prediction of the following blocks.
The post filter 280 obtains the filter data entropy-decoded by the entropy decoding unit 290, and sets a filter condition such as a filter coefficient according to the filter data. In order to enhance the image quality, the post filter 280 applies the filtering according to the condition to the reconstructed video signal. In this way, the post filter 280 is capable of adapting to the characteristics of the video signal to be input to the encoder.
In summary, there are three types of filters used in the latest H.264/MPEG-4 AVC standard. The filters are an interpolation filter, a deblocking filter, and a post filter. In general, the suitability of a filter depends on the contents of the image to be filtered. Therefore, a filter design which enables adaptation to the image characteristics is advantageous. The coefficients of such a filter may be designed as Wiener filter coefficients.
FIG. 4 is a diagram illustrating a signal flow using a Wiener filter 400 for noise reduction. A noise n is added to an input signal s, resulting in a noisy signal s′ to be filtered. With the goal of reducing the noise n, the Wiener filter 400 is applied to the signal s′, resulting in the filtered signal s″. The Wiener filter 400 is designed to minimize the mean squared error between the input signal s, which is the desired signal, and the filtered signal s″. This means that the Wiener filter coefficients w correspond to the solution of the optimization problem
$$\arg\min_{w} E[(s - s'')^2],$$
which can be formulated as a system of linear equations referred to as the Wiener-Hopf equations. Here, the operator E[x] indicates the expected value of x. The solution is given by:
$$w = R^{-1} \cdot p$$
Here, w is an M×1 vector containing the optimal coefficients of a Wiener filter of order M, where M is a positive integer. $R^{-1}$ denotes the inverse of the M×M autocorrelation matrix R of the noisy signal s′ to be filtered, and p denotes the M×1 cross-correlation vector between the noisy signal s′ to be filtered and the original signal s. For further details on adaptive filter design, see Non-patent Literature (NPL) 1. NPL 1 is incorporated herein by reference.
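The computation of the coefficients w from R and p can be sketched numerically as follows. This is an illustration under stated assumptions, not a codec implementation: the signal is one-dimensional, the filter window is centered on the sample being estimated, R and p are estimated as sample averages over the same windows, the linear system is solved by plain Gaussian elimination, and all names (wiener_coefficients, apply_filter, and so on) as well as the test signal are hypothetical.

```python
import math
import random


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def windows(signal, order):
    """All length-`order` windows of the signal (order is assumed odd)."""
    half = order // 2
    return [signal[i - half:i + half + 1] for i in range(half, len(signal) - half)]


def wiener_coefficients(noisy, desired, order):
    """Estimate R and p from the samples and return w = R^-1 p."""
    half = order // 2
    wins = windows(noisy, order)
    targets = desired[half:len(desired) - half]
    count = len(wins)
    R = [[sum(w[i] * w[j] for w in wins) / count for j in range(order)]
         for i in range(order)]                       # autocorrelation of s'
    p = [sum(w[i] * t for w, t in zip(wins, targets)) / count
         for i in range(order)]                       # cross correlation of s' and s
    return solve_linear(R, p)


def apply_filter(noisy, w):
    return [sum(c * v for c, v in zip(w, win)) for win in windows(noisy, len(w))]


def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


random.seed(0)
clean = [100 + 50 * math.sin(2 * math.pi * i / 50) for i in range(400)]
noisy = [s + random.gauss(0, 10) for s in clean]
w = wiener_coefficients(noisy, clean, order=5)
filtered = apply_filter(noisy, w)
mse_before = mse(noisy[2:-2], clean[2:-2])
mse_after = mse(filtered, clean[2:-2])
print(mse_before, mse_after)
```

Note that computing p requires the desired signal s, which in video coding is available only at the encoder; the autocorrelation matrix R, in contrast, depends only on the noisy signal.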
Thus, one of the advantageous effects of the Wiener filter 400 is that the filter coefficients can be determined from the autocorrelation of the corrupted (noisy) signal and the cross correlation between the corrupted signal and the desired signal. In video coding, quantization noise is superimposed on the original (input) video signal in the quantization step. Wiener filtering in the context of video coding aims at the reduction of the superimposed quantization noise in order to minimize the mean squared error between the filtered reconstructed video signal and the original signal.
Filter information that is transmitted from the encoder to the decoder can either be the calculated filter coefficients as they are or the cross correlation vector p which is necessary for calculating the Wiener filter and which cannot be determined at the decoder. Transmitting such supplementary information may enhance the quality of filtering. Furthermore, it is possible to further enhance the filtering quality and thereby enhance the video quality by, for example, either (i) increasing the order of the filter or (ii) separately determining filter coefficients for the respective parts of the video signal and/or separately applying filter coefficients to the respective parts of the video signal.