The invention relates to a method and a coder for video coding, and to a decoding device.
Digital video data is generally compressed for storage or transmission in order to significantly reduce its enormous volume. Compression is achieved both by eliminating the signal redundancy contained in the video data and by removing the irrelevant parts of the signal, which cannot be perceived by the human eye. This is normally done with a hybrid coding method: the image to be coded is first predicted temporally, and the remaining prediction error is then transformed into the frequency domain, for example by a discrete cosine transform (DCT), quantized there, and coded with a variable-length code. Finally, the motion information and the quantized spectral coefficients are transmitted.
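The transform and quantization stages of such a hybrid coder can be sketched for a single block. This is a minimal Python sketch, assuming an 8×8 block, an orthonormal DCT-II and a uniform scalar quantizer with a freely chosen step size; motion estimation and the variable-length code are omitted, and all function names (`dct_2d`, `code_block`, `decode_block`, etc.) are hypothetical.

```python
import math

N = 8  # block size (8x8 assumed here)

def _norm(k):
    """Orthonormal DCT scaling factor for frequency index k."""
    return math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N)

def _basis(pos, freq):
    """DCT-II cosine basis value for sample position pos and frequency freq."""
    return math.cos((2 * pos + 1) * freq * math.pi / (2 * N))

def dct_2d(block):
    """Forward orthonormal 2D DCT-II (direct, unoptimized form)."""
    return [[_norm(u) * _norm(v) * sum(block[x][y] * _basis(x, u) * _basis(y, v)
                                       for x in range(N) for y in range(N))
             for v in range(N)] for u in range(N)]

def idct_2d(coeffs):
    """Inverse transform (orthonormal 2D DCT-III)."""
    return [[sum(_norm(u) * _norm(v) * coeffs[u][v] * _basis(x, u) * _basis(y, v)
                 for u in range(N) for v in range(N))
             for y in range(N)] for x in range(N)]

def code_block(current, prediction, step):
    """Encoder side: residual -> DCT -> uniform quantization to integer levels."""
    residual = [[c - p for c, p in zip(rc, rp)] for rc, rp in zip(current, prediction)]
    return [[round(c / step) for c in row] for row in dct_2d(residual)]

def decode_block(levels, prediction, step):
    """Decoder side: dequantize -> inverse DCT -> add back the same prediction."""
    residual = idct_2d([[lvl * step for lvl in row] for row in levels])
    return [[p + r for p, r in zip(rp, rr)] for rp, rr in zip(prediction, residual)]

# Usage: a prediction that is uniformly 2 too dark leaves a flat residual.
current = [[16 * x + y for y in range(N)] for x in range(N)]
prediction = [[v - 2 for v in row] for row in current]
levels = code_block(current, prediction, step=4)
decoded = decode_block(levels, prediction, step=4)
```

Because the residual in this example is flat, only the DC coefficient survives quantization, and the decoder recovers the block by adding the dequantized residual back onto the prediction.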
The better the prediction of the next image information to be transmitted, the smaller the prediction error remaining after the prediction, and the lower the data rate needed to code this error. A key objective in the compression of video data is therefore to obtain as exact a prediction as possible of the image to be coded from the image information that has already been transmitted.
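The link between prediction quality and data rate can be made concrete with a rough proxy. The Python sketch below, an illustration under simplifying assumptions, codes the prediction residual of one row of pixels directly with the order-0 signed Exp-Golomb code (a variable-length code used, for example, for many syntax elements in H.264/AVC) instead of transforming it first; the sample values and function names are hypothetical.

```python
def signed_to_code_num(v):
    """Map residual values 0, 1, -1, 2, -2, ... to code numbers 0, 1, 2, 3, 4, ..."""
    return 2 * v - 1 if v > 0 else -2 * v

def exp_golomb_bits(k):
    """Length in bits of the order-0 Exp-Golomb codeword for code number k >= 0."""
    return 2 * (k + 1).bit_length() - 1

def residual_cost(current, prediction):
    """Return (sum of squared errors, Exp-Golomb bits) for the prediction residual."""
    residual = [c - p for c, p in zip(current, prediction)]
    sse = sum(r * r for r in residual)
    bits = sum(exp_golomb_bits(signed_to_code_num(r)) for r in residual)
    return sse, bits

current = [100, 102, 104, 106, 108, 110, 112, 114]
good = [100, 101, 104, 107, 108, 110, 111, 114]  # close prediction
poor = [90, 95, 100, 105, 110, 115, 120, 125]    # poor prediction
print(residual_cost(current, good))  # (3, 14)
print(residual_cost(current, poor))  # (380, 56)
```

The closer prediction leaves small residuals, which map to short codewords; the poor one needs four times as many bits for the same eight pixels.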
Until now, an image has been predicted by first dividing it into regular parts, for example square blocks of 8×8 or 16×16 pixels (blocks of differing sizes can, however, also be produced), and then using motion compensation to determine, for each of these image blocks, a prediction from the image information already known at the receiver. Such a procedure can be seen in FIG. 1. Two basic prediction scenarios can be distinguished:
- Uni-directional prediction: the motion compensation is based exclusively on the previously transmitted image and leads to so-called “P-frames”.
- Bi-directional prediction: the image is predicted by superimposing two images, one temporally preceding and one temporally following the image to be coded, and leads to so-called “B-frames”. Note that both reference images will already have been transmitted.
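Both scenarios can be illustrated with a minimal full-search block-matching sketch in Python. It assumes frames stored as lists of rows of integer luma values, the sum of absolute differences (SAD) as matching criterion, integer-pixel motion vectors and a small search range; practical coders use faster searches and sub-pixel accuracy, and all names here are hypothetical.

```python
def get_block(frame, top, left, size):
    """Cut a size x size block out of a frame (list of rows)."""
    return [row[left:left + size] for row in frame[top:top + size]]

def sad(a, b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))

def motion_search(cur, ref, top, left, size, search_range):
    """Full search: return ((dy, dx), sad) of the best-matching block in ref."""
    h, w = len(ref), len(ref[0])
    target = get_block(cur, top, left, size)
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + size > h or x + size > w:
                continue  # candidate lies partly outside the reference image
            cost = sad(target, get_block(ref, y, x, size))
            if best is None or cost < best[1]:
                best = ((dy, dx), cost)
    return best

def bidirectional_prediction(prev_ref, next_ref, top, left, size, mv_prev, mv_next):
    """B-frame style prediction: rounded average of two motion-compensated blocks."""
    a = get_block(prev_ref, top + mv_prev[0], left + mv_prev[1], size)
    b = get_block(next_ref, top + mv_next[0], left + mv_next[1], size)
    return [[(x + y + 1) // 2 for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

# Usage: the current frame shows the reference content displaced by (1, 2).
ref = [[16 * y + x for x in range(8)] for y in range(8)]
cur = [[ref[min(y + 1, 7)][min(x + 2, 7)] for x in range(8)] for y in range(8)]
print(motion_search(cur, ref, top=2, left=2, size=4, search_range=2))  # ((1, 2), 0)
nxt = [[v + 2 for v in row] for row in ref]  # next reference, slightly brighter
pred = bidirectional_prediction(ref, nxt, 2, 2, 4, (1, 2), (1, 2))
```

The uni-directional case uses only `motion_search` against the previous image; the bi-directional case superimposes (here: averages) the blocks fetched from both reference images.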
In accordance with these two prediction scenarios, MSRA's method, described in Jizheng Xu et al., “3D subband video coding using Barbell lifting”, ISO/IEC JTC1/SC29/WG11 MPEG 68th Meeting, M10569/s05, Munich, March 2004, produces five directional modes for motion-compensated temporal filtering (MCTF), as can be seen in FIG. 2.
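Temporal filtering of this kind can be pictured as a lifting scheme applied along the motion trajectory. The Python sketch below is plain motion-compensated Haar lifting for two one-dimensional "frames" with a single integer displacement `d` and edge clamping, all simplifying assumptions; it is not the Barbell lifting of the cited paper, and the function names are hypothetical.

```python
def clamp(i, n):
    """Clamp index i to the valid range [0, n - 1] (edge extension)."""
    return min(max(i, 0), n - 1)

def mctf_analysis(a, b, d):
    """One motion-compensated Haar lifting level.
    a, b: two consecutive 1D frames; d: integer displacement, b[i] ~ a[i - d]."""
    n = len(a)
    # Predict step: high-pass = b minus its motion-compensated prediction from a
    h = [b[i] - a[clamp(i - d, n)] for i in range(n)]
    # Update step: low-pass = a plus half the high-pass, fetched along the inverse motion
    low = [a[i] + h[clamp(i + d, n)] / 2 for i in range(n)]
    return low, h

def mctf_synthesis(low, h, d):
    """Undo mctf_analysis by running the lifting steps in reverse order."""
    n = len(low)
    a = [low[i] - h[clamp(i + d, n)] / 2 for i in range(n)]
    b = [h[i] + a[clamp(i - d, n)] for i in range(n)]
    return a, b

a = [10, 20, 30, 40, 50, 60]
b = [10, 10, 20, 30, 40, 50]  # a shifted right by one pixel
low, high = mctf_analysis(a, b, d=1)
print(high)  # [0, 0, 0, 0, 0, 0]
```

When the motion is compensated correctly, the high-pass frame vanishes and the low-pass frame is the motion-aligned average of the two inputs; synthesis inverts analysis exactly for any input, because it repeats the same lifting steps with opposite signs.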
MCTF-based scalable video coding is used to provide good video quality over a very wide range of bit rates and of temporal and spatial resolution levels. However, the MCTF algorithms known today give unacceptable results at reduced bit rates. This is because too little texture (block information) is present relative to the motion information (block structures and motion vectors) of a video defined by an image sequence.
What is therefore needed is a scalable representation of the motion information, so that an optimal balance between texture and motion data can be achieved at every bit rate and resolution. A solution to this end, from MSRA (Microsoft Research Asia), is known from the Jizheng Xu et al. article cited above, which represents the state of the art in MCTF algorithms.
The MSRA solution proposes representing the motion information layer by layer, i.e. resolving it into successively more refined structures. As a result, the MSRA method generally improves image quality at low bit rates.
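A layer-by-layer motion representation can be pictured as a coarse block structure plus refinement layers that split individual blocks into four sub-blocks with their own motion vectors. The Python sketch below assumes such a quadtree-style layout with hypothetical data structures and names; it is not the bitstream syntax of the MSRA method. Decoding fewer layers yields a coarser motion field with fewer motion vectors to transmit.

```python
def split(block):
    """Split a square block (top, left, size) into its four quadrants."""
    top, left, size = block
    half = size // 2
    return [(top, left, half), (top, left + half, half),
            (top + half, left, half), (top + half, left + half, half)]

def motion_field(layers, num_layers):
    """Rebuild the block structure and motion vectors from the first num_layers layers.
    layers[0] maps blocks to motion vectors; each later layer maps a block to be
    split onto the motion vectors of its four sub-blocks (in split() order)."""
    field = dict(layers[0])
    for refinement in layers[1:num_layers]:
        for parent, child_mvs in refinement.items():
            del field[parent]  # the parent is replaced by its sub-blocks
            for child, mv in zip(split(parent), child_mvs):
                field[child] = mv
    return field

layers = [
    {(0, 0, 16): (1, 0), (0, 16, 16): (0, 2)},       # base layer
    {(0, 0, 16): [(1, 0), (1, 1), (2, 0), (1, 0)]},  # refinement layer 1
    {(0, 0, 8): [(1, 0), (1, 0), (0, 0), (1, 0)]},   # refinement layer 2
]
print([len(motion_field(layers, k)) for k in (1, 2, 3)])  # [2, 5, 8]
```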
However, this solution has the disadvantage of producing shifts in the reconstructed image, which can be attributed to a mismatch between the motion information and the texture.
An improvement in this regard is known from German patent application No. 10 2004 038 110.0.
The method described in that application does not transmit the complete motion vector field (temporary block structures MV_QCIF, MV_CIF and MV_4CIF) created at the encoder end as per MSRA; instead, only the most significant part of the motion vector field is transmitted. This most significant part is created by a type of refinement of the block structures: based on structural characteristics, only parts of the structural differences between consecutive block structures are determined and used to create refined block structures.
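The idea of transmitting only part of the structural differences between two consecutive block structures can be sketched as follows. The Python example assumes block structures stored as dictionaries from non-overlapping square blocks (top, left, size) to motion vectors, and uses a hypothetical significance criterion (the largest deviation of a sub-block's motion vector from its parent's) with a hypothetical threshold; the structural characteristics actually used in the cited application are not reproduced here.

```python
def sub_blocks(parent, fine):
    """Leaf blocks of the finer structure that lie strictly inside the parent block."""
    top, left, size = parent
    return {blk: mv for blk, mv in fine.items()
            if top <= blk[0] < top + size and left <= blk[1] < left + size
            and blk[2] < size}

def significance(parent_mv, child_mvs):
    """Hypothetical measure: largest L1 deviation of a sub-block vector from the parent's."""
    return max(abs(mv[0] - parent_mv[0]) + abs(mv[1] - parent_mv[1]) for mv in child_mvs)

def partial_refinement(coarse, fine, threshold):
    """Apply only those refinements of `coarse` found in `fine` that are significant."""
    refined = {}
    for parent, mv in coarse.items():
        kids = sub_blocks(parent, fine)
        if kids and significance(mv, kids.values()) >= threshold:
            refined.update(kids)  # this structural difference is transmitted
        else:
            refined[parent] = mv  # the coarse block is kept as it is
    return refined

coarse = {(0, 0, 16): (1, 0), (0, 16, 16): (0, 2)}
fine = {
    (0, 0, 8): (1, 0), (0, 8, 8): (1, 1), (8, 0, 8): (4, 0), (8, 8, 8): (1, 0),
    (0, 16, 8): (0, 2), (0, 24, 8): (0, 2), (8, 16, 8): (0, 3), (8, 24, 8): (0, 2),
}
refined = partial_refinement(coarse, fine, threshold=2)
print(sorted(refined))  # [(0, 0, 8), (0, 8, 8), (0, 16, 16), (8, 0, 8), (8, 8, 8)]
```

Only the left block, whose sub-blocks move noticeably differently, is refined; the right block's split adds little and is not transmitted.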
A problem here is that the visual quality achieved with a refined block structure and its associated texture is not always better than the visual quality achievable with the corresponding basic structure and its associated texture.