The present invention relates to coding schemes for coding a spatially sampled information signal using sub-division and coding schemes for coding a sub-division or a multitree structure, wherein representative embodiments relate to picture and/or video coding applications.
In image and video coding, the pictures or particular sets of sample arrays for the pictures are usually decomposed into blocks, which are associated with particular coding parameters. The pictures usually consist of multiple sample arrays. In addition, a picture may also be associated with additional auxiliary sample arrays, which may, for example, specify transparency information or depth maps. The sample arrays of a picture (including auxiliary sample arrays) can be grouped into one or more so-called plane groups, where each plane group consists of one or more sample arrays. The plane groups of a picture can be coded independently or, if the picture is associated with more than one plane group, with prediction from other plane groups of the same picture. Each plane group is usually decomposed into blocks. The blocks (or the corresponding blocks of sample arrays) are predicted by either inter-picture prediction or intra-picture prediction. The blocks can have different sizes and can be either square or rectangular. The partitioning of a picture into blocks can be either fixed by the syntax, or it can be (at least partly) signaled inside the bitstream. Often syntax elements are transmitted that signal the subdivision for blocks of predefined sizes. Such syntax elements may specify whether and how a block is subdivided into smaller blocks and associated coding parameters, e.g. for the purpose of prediction. For all samples of a block (or the corresponding blocks of sample arrays) the decoding of the associated coding parameters is specified in a certain way. For example, all samples in a block are predicted using the same set of prediction parameters, such as reference indices (identifying a reference picture in the set of already coded pictures), motion parameters (specifying a measure for the movement of a block between a reference picture and the current picture), parameters for specifying the interpolation filter, intra prediction modes, etc.
The motion parameters can be represented by displacement vectors with a horizontal and a vertical component or by higher order motion parameters such as affine motion parameters consisting of six components. It is also possible that more than one set of particular prediction parameters (such as reference indices and motion parameters) is associated with a single block. In that case, for each set of these particular prediction parameters, a single intermediate prediction signal for the block (or the corresponding blocks of sample arrays) is generated, and the final prediction signal is built by a combination including superimposing the intermediate prediction signals. The corresponding weighting parameters and potentially also a constant offset (which is added to the weighted sum) can either be fixed for a picture, or a reference picture, or a set of reference pictures, or they can be included in the set of prediction parameters for the corresponding block. The difference between the original blocks (or the corresponding blocks of sample arrays) and their prediction signals, also referred to as the residual signal, is usually transformed and quantized. Often, a two-dimensional transform is applied to the residual signal (or the corresponding sample arrays for the residual block). For transform coding, the blocks (or the corresponding blocks of sample arrays), for which a particular set of prediction parameters has been used, can be further split before applying the transform. The transform blocks can be equal to or smaller than the blocks that are used for prediction. It is also possible that a transform block includes more than one of the blocks that are used for prediction. Different transform blocks can have different sizes and the transform blocks can represent square or rectangular blocks. After the transform, the resulting transform coefficients are quantized and so-called transform coefficient levels are obtained.
The transform coefficient levels as well as the prediction parameters and, if present, the subdivision information are entropy coded.
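The superposition of intermediate prediction signals and the formation of the residual signal described above can be illustrated with a minimal sketch. The function names `combine_predictions` and `residual`, the 2×2 block size, and all sample values, weights and offsets are hypothetical and chosen only for illustration; transform, quantization and entropy coding are not modeled.

```python
def combine_predictions(intermediate, weights, offset=0):
    """Build the final prediction as a weighted sum of intermediate
    prediction signals plus a constant offset, sample by sample.

    intermediate: list of 2-D blocks (lists of rows), all the same size.
    weights:      one weighting parameter per intermediate signal.
    offset:       constant added to the weighted sum.
    """
    if len(intermediate) != len(weights):
        raise ValueError("need exactly one weight per intermediate signal")
    height = len(intermediate[0])
    width = len(intermediate[0][0])
    return [
        [
            sum(w * block[y][x] for block, w in zip(intermediate, weights)) + offset
            for x in range(width)
        ]
        for y in range(height)
    ]


def residual(original, prediction):
    """Residual signal: original block minus its prediction signal."""
    return [
        [o - p for o, p in zip(orig_row, pred_row)]
        for orig_row, pred_row in zip(original, prediction)
    ]


# Hypothetical 2x2 example: two intermediate predictions, equal weights.
pred_a = [[100, 104], [96, 100]]
pred_b = [[104, 108], [100, 104]]
original = [[103, 106], [97, 104]]

combined = combine_predictions([pred_a, pred_b], weights=[0.5, 0.5])
res = residual(original, combined)
print(combined)
print(res)
```

In this sketch the weights and the offset are passed per call, which corresponds to the case in which they are part of the prediction parameters of the block; fixing them per picture or per reference picture would simply mean supplying the same values for every block.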
In image and video coding standards, the possibilities for sub-dividing a picture (or a plane group) into blocks that are provided by the syntax are very limited. Usually, it can only be specified whether (and potentially how) a block of a predefined size can be sub-divided into smaller blocks. As an example, the largest block size in H.264 is 16×16. The 16×16 blocks are also referred to as macroblocks and each picture is partitioned into macroblocks in a first step. For each 16×16 macroblock, it can be signaled whether it is coded as one 16×16 block, or as two 16×8 blocks, or as two 8×16 blocks, or as four 8×8 blocks. If a 16×16 block is sub-divided into four 8×8 blocks, each of these 8×8 blocks can be coded either as one 8×8 block, or as two 8×4 blocks, or as two 4×8 blocks, or as four 4×4 blocks. The small set of possibilities for specifying the partitioning into blocks in state-of-the-art image and video coding standards has the advantage that the side information rate for signaling the sub-division information can be kept small, but it has the disadvantage that the bit rate necessitated for transmitting the prediction parameters for the blocks can become significant, as explained in the following. The side information rate for signaling the prediction information usually represents a significant amount of the overall bit rate for a block. The coding efficiency could be increased if this side information were reduced, which, for instance, could be achieved by using larger block sizes. Real images or pictures of a video sequence consist of arbitrarily shaped objects with specific properties. As an example, such objects or parts of the objects are characterized by a unique texture or a unique motion. Usually, the same set of prediction parameters can be applied for such an object or part of an object. However, the object boundaries usually do not coincide with the possible block boundaries for large prediction blocks (e.g., 16×16 macroblocks in H.264).
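To make the size of this limited set concrete, the following minimal sketch enumerates only the macroblock partitionings named above (one 16×16, two 16×8, two 8×16, or four 8×8 blocks, with each 8×8 block optionally split further). The names `MACROBLOCK_MODES`, `SUB_MODES` and `enumerate_partitionings` are hypothetical; the sketch does not model H.264 syntax or any other coding modes of the standard.

```python
from itertools import product

# Partitionings of a 16x16 macroblock that do not use 8x8 sub-blocks.
# Each entry maps a label to the (width, height) of the resulting blocks.
MACROBLOCK_MODES = {
    "16x16": [(16, 16)],
    "16x8": [(16, 8)] * 2,
    "8x16": [(8, 16)] * 2,
}

# Partitionings available for each of the four 8x8 blocks.
SUB_MODES = {
    "8x8": [(8, 8)],
    "8x4": [(8, 4)] * 2,
    "4x8": [(4, 8)] * 2,
    "4x4": [(4, 4)] * 4,
}


def enumerate_partitionings():
    """Return (label, blocks) for every partitioning described in the text."""
    result = [(name, list(blocks)) for name, blocks in MACROBLOCK_MODES.items()]
    # Four 8x8 blocks, each with an independently chosen sub-mode.
    for combo in product(SUB_MODES, repeat=4):
        blocks = [size for mode in combo for size in SUB_MODES[mode]]
        result.append(("8x8:" + "/".join(combo), blocks))
    return result


partitionings = enumerate_partitionings()
block_counts = [len(blocks) for _, blocks in partitionings]

print(len(partitionings))                    # 3 + 4**4 = 259
print(min(block_counts), max(block_counts))  # 1 16
```

Under these assumptions there are 259 distinct partitionings per macroblock, ranging from a single block to sixteen 4×4 blocks, each of which carries its own set of prediction parameters.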
An encoder usually determines the sub-division (among the limited set of possibilities) that results in the minimum of a particular rate-distortion cost measure. For arbitrarily shaped objects this can result in a large number of small blocks. Since each of these small blocks is associated with a set of prediction parameters, which need to be transmitted, the side information rate can become a significant part of the overall bit rate. However, since several of the small blocks still represent areas of the same object or part of an object, the prediction parameters for a number of the obtained blocks are the same or very similar.
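The encoder decision described above can be sketched as follows. The text does not specify the cost measure; the sketch assumes a Lagrangian cost J = D + λ·R, which is one common choice. The function names `rd_cost` and `choose_subdivision`, as well as all distortion and rate figures, are hypothetical and serve only as illustration.

```python
def rd_cost(distortion, rate_bits, lam):
    """Assumed Lagrangian rate-distortion cost: J = D + lambda * R."""
    return distortion + lam * rate_bits


def choose_subdivision(candidates, lam):
    """Pick the candidate sub-division with the minimum cost J."""
    return min(
        candidates,
        key=lambda c: rd_cost(c["distortion"], c["rate_bits"], lam),
    )


# Hypothetical measurements for one macroblock: finer sub-divisions lower
# the distortion but need more bits for prediction parameters.
candidates = [
    {"name": "one 16x16", "distortion": 900, "rate_bits": 20},
    {"name": "four 8x8", "distortion": 400, "rate_bits": 60},
    {"name": "sixteen 4x4", "distortion": 250, "rate_bits": 200},
]

for lam in (1, 5, 20):
    best = choose_subdivision(candidates, lam)
    print(lam, best["name"])
```

With these made-up figures, a small λ (rate is cheap) favors the sixteen 4×4 blocks, while a large λ favors the single 16×16 block, which mirrors the trade-off between prediction accuracy and side information rate discussed above.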
That is, the sub-division or tiling of a picture into smaller portions or tiles or blocks substantially influences the coding efficiency and coding complexity. As outlined above, a sub-division of a picture into a higher number of smaller blocks enables a spatially finer setting of the coding parameters, thereby enabling a better adaptivity of these coding parameters to the picture/video material. On the other hand, setting the coding parameters at a finer granularity places a higher burden on the amount of side information needed to inform the decoder of the settings. Furthermore, it should be noted that any freedom for the encoder to (further) sub-divide the picture/video spatially into blocks tremendously increases the number of possible coding parameter settings and thereby generally renders the search for the coding parameter setting leading to the best rate/distortion compromise even more difficult.