1. Field of the Invention
The present invention relates to video coding systems and particularly to scalable video coding systems, which can be used in connection with the video coding standard H.264/AVC or with new MPEG video coding systems.
2. Description of the Related Art
The standard H.264/AVC is the result of a video standardization project of the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC Moving Picture Experts Group (MPEG). The main goals of this standardization project are to provide a clear video coding concept with very good compression behavior and at the same time to generate a network-friendly video representation, which covers both applications with “conversational character”, such as video telephony, and applications without conversational character (storage, broadcast, stream transmission).
Apart from the above-mentioned standard ISO/IEC 14496-10, there is also a plurality of publications relating to the standard. Merely exemplarily, reference is made to “The Emerging H.264-AVC standard”, Ralf Schäfer, Thomas Wiegand and Heiko Schwarz, EBU Technical Review, January 2003. Additionally, the expert publication “Overview of the H.264/AVC Video Coding Standard”, Thomas Wiegand, Gary J. Sullivan, Gisle Bjøntegaard and Ajay Luthra, IEEE Transactions on Circuits and Systems for Video Technology, July 2003, as well as the expert publication “Context-based Adaptive Binary Arithmetic Coding in the H.264/AVC Video Compression Standard”, Detlev Marpe, Heiko Schwarz and Thomas Wiegand, IEEE Transactions on Circuits and Systems for Video Technology, September 2003, give a detailed overview of different aspects of the video coding standard.
However, for a better understanding, an overview of the video coding/decoding algorithm will be given with reference to FIGS. 9 to 11.
FIG. 9 shows the full structure of a video coder, which generally consists of two different stages. The first stage, which operates on the video signal itself, generates output data, which are then subjected to entropy coding by a second stage, which is designated by 80 in FIG. 9. The data are data 81a, quantized transformation coefficients 81b as well as motion data 81c, wherein these data 81a, 81b, 81c are supplied to the entropy coder 80 to generate a coded video signal at the output of the entropy coder 80.
Specifically, the input video signal is partitioned and split, respectively, into macroblocks, wherein every macroblock has 16×16 pixels. Then, the association of the macroblocks to slice groups and slices is chosen, according to which every macroblock of every slice is processed by the net of operation blocks as illustrated in FIG. 9. It should be noted that an efficient parallel processing of macroblocks is possible when different slices exist in a video picture. The association of macroblocks to slice groups and slices is performed via a block coder control 82 in FIG. 9. There are different slices, which are defined as follows:
I slice: The I slice is a slice wherein all macroblocks of the slice are coded by using an intra prediction.
P slice: In addition to the coder types of the I slice, certain macroblocks of the P slice can also be coded by using an inter prediction with at most one motion-compensated prediction signal per prediction block.
B slice: In addition to the coder types available in the P slice, certain macroblocks of the B slice can also be coded by using an inter prediction with two motion-compensated prediction signals per prediction block.
The above three coder types are very similar to the ones in earlier standards, with the exception of the use of reference pictures, as will be described below. The following two coder types for slices are new in the standard H.264/AVC:
SP slice: It is also referred to as switch P slice, which is coded such that efficient switching between different precoded pictures is made possible.
SI slice: The SI slice is also referred to as switch I slice, which allows an exact adaptation of the macroblocks in an SP slice for a direct random access and for error recovery purposes.
All in all, slices are a sequence of macroblocks, which are processed in the order of a raster scan, unless the flexible macroblock ordering (FMO) feature, which is also defined in the standard, is used. A picture can be partitioned into one or several slices, as illustrated in FIG. 11. Thus, a picture is a collection of one or several slices. In that sense, slices are independent of one another, since their syntax elements can be analyzed (parsed) from the bit stream, wherein the values of the samples can be decoded correctly in the range of the picture represented by the slice, without requiring data from other slices, provided that the reference pictures used are identical both in the coder and in the decoder. However, certain information from other slices can be required to apply the deblocking filter across slice borders.
The FMO characteristic modifies the way in which pictures are partitioned into slices and macroblocks, by using the concept of slice groups. Every slice group is a set of macroblocks defined by a macroblock to slice group mapping, which is specified by the content of a picture parameter set and by certain information from slice headers. This macroblock to slice group mapping consists of a slice group identification number for every macroblock in the picture, which specifies to which slice group the associated macroblock belongs. Every slice group can be partitioned into one or several slices, so that a slice is a sequence of macroblocks within the same slice group, which is processed in the order of a raster scan within the set of macroblocks of a specific slice group.
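The macroblock to slice group mapping described above can be pictured as one identification number per macroblock. The following is only a minimal sketch: the checkerboard pattern and the names `checkerboard_map` and `macroblocks_per_group` are hypothetical choices made for illustration and are not taken from the standard.

```python
def checkerboard_map(mb_cols, mb_rows):
    """Hypothetical mapping: one slice group id per macroblock, in raster-scan order."""
    return [(x + y) % 2 for y in range(mb_rows) for x in range(mb_cols)]


def macroblocks_per_group(mb_to_group):
    """Collect the macroblock addresses of every slice group, kept in raster-scan order."""
    groups = {}
    for mb_addr, group_id in enumerate(mb_to_group):
        groups.setdefault(group_id, []).append(mb_addr)
    return groups


# A picture of 4x2 macroblocks split into two slice groups.
mapping = checkerboard_map(4, 2)
groups = macroblocks_per_group(mapping)
```

Each list in `groups` could then be cut into one or several slices, each slice being a run of consecutive entries of that list.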
Every macroblock can be transmitted in one of several coder types, depending on the slice coder type. In all slice coder types, the following types of intra coding are supported, which are referred to as intra-4×4 or intra-16×16, wherein additionally a chroma prediction mode and an I-PCM prediction mode are supported.
The intra-4×4 mode is based on the prediction of every 4×4 luma block separately and is very well suited for coding parts of a picture with outstanding details. The intra-16×16 mode, on the other hand, performs a prediction of the whole 16×16 luma block and is more suited for coding “soft” regions of a picture.
In addition to these two luma prediction types, a separate chroma prediction is performed. As an alternative to intra-4×4 and intra-16×16, the I-PCM coder type allows the coder to simply skip the prediction as well as the transformation coding and instead to transmit the values of the coded samples directly. The I-PCM mode has the following purposes: It allows the coder to represent the values of the samples precisely. It provides a way to represent the values of very abnormal picture content exactly without data enlargement. Further, it makes it possible to place a hard limit on the number of bits which a coder needs to handle for a macroblock, without loss of coding efficiency.
In contrast to earlier video coding standards (namely H.263 plus and MPEG-4 visual), where the intra prediction has been performed in the transformation domain, the intra prediction in H.264/AVC is always performed in the spatial domain, by referring to adjacent samples of previously coded blocks, which are on the left of and above, respectively, the block to be predicted (FIG. 10). In certain environments, where transmission errors occur, this can cause an error propagation, wherein this error propagation takes place due to the motion compensation in inter coded macroblocks. Thus, a constrained intra coding mode can be signaled, which allows a prediction only from intra coded adjacent macroblocks.
When the intra-4×4 mode is used, every 4×4 block is predicted from spatially adjacent samples. The 16 samples of the 4×4 block are predicted by using previously decoded samples in adjacent blocks. One of 9 prediction modes can be used for every 4×4 block. In addition to the “DC prediction” (where one value is used to predict the whole 4×4 block), 8 directional prediction modes are specified. These modes are suitable to predict directional structures in a picture, such as edges at different angles.
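As an illustration of the “DC prediction” mentioned above, the following sketch predicts a whole 4×4 block with a single value derived from the four samples above and the four samples to the left. It assumes that both neighbors are available and uses a rounded integer mean; the function names and the sample values are hypothetical.

```python
def dc_predict_4x4(above, left):
    """Predict all 16 samples with one value: the rounded mean of the 8 neighboring samples."""
    dc = (sum(above) + sum(left) + 4) >> 3  # +4 rounds to nearest before dividing by 8
    return [[dc] * 4 for _ in range(4)]


def residual(block, prediction):
    """Sample-wise difference between the block to be coded and its prediction."""
    return [[b - p for b, p in zip(brow, prow)] for brow, prow in zip(block, prediction)]


above = [10, 10, 10, 10]   # previously decoded samples above the block
left = [20, 20, 20, 20]    # previously decoded samples to the left of the block
prediction = dc_predict_4x4(above, left)
block = [[16, 15, 15, 14] for _ in range(4)]
res = residual(block, prediction)
```

Only the residual `res` would then have to be transformed and quantized; the directional modes are not covered by this sketch.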
In addition to the intra macroblock coder types, different predictive or motion-compensated coder types are specified as P macroblock types. Every P macroblock type corresponds to a specific partition of the macroblock into the block forms, which are used for a motion-compensated prediction. Partitions with luma block sizes of 16×16, 16×8, 8×8 or 8×16 samples are supported by the syntax. In the case of partitions of 8×8 samples, an additional syntax element is transmitted for every 8×8 partition. This syntax element specifies whether the respective 8×8 partition is further partitioned into partitions of 8×4, 4×8 or 4×4 luma samples and corresponding chroma samples.
The prediction signal for every prediction-coded M×N luma block is obtained by shifting a region of the respective reference picture specified by a translational motion vector and a picture reference index. Thus, if the macroblock is coded by using four 8×8 partitions, and when every 8×8 partition is further partitioned into four 4×4 partitions, a maximum of 16 motion vectors for a single P macroblock can be transmitted within the so-called motion field.
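The shifting of a reference region by a translational motion vector can be sketched as follows. This is a simplified illustration that assumes integer-sample motion vectors and a displaced block lying completely inside the reference picture; the function name `motion_compensate` and the toy reference picture are hypothetical.

```python
def motion_compensate(reference, top, left, height, width, mv):
    """Copy the region of the reference picture displaced by the motion vector mv = (dy, dx)."""
    dy, dx = mv
    return [
        row[left + dx: left + dx + width]
        for row in reference[top + dy: top + dy + height]
    ]


# Toy 8x8 reference picture whose sample value encodes its position.
reference = [[8 * y + x for x in range(8)] for y in range(8)]
prediction = motion_compensate(reference, top=2, left=2, height=2, width=2, mv=(1, -1))

# Four 8x8 partitions, each split into four 4x4 partitions, need one vector each.
max_vectors_per_macroblock = 4 * 4
```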
The quantization parameter slice QP is used to determine the quantization of the transformation coefficients in H.264/AVC. The parameter can assume 52 values. These values are disposed such that an increase of 1 with regard to the quantization parameter means an increase of the quantization step width by about 12%. This means that an increase of the quantization parameter by 6 causes an increase of the quantizer step width by exactly a factor of 2. It should be noted that a change of the step size by about 12% also means a reduction of the bit rate by about 12%.
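The relation between the quantization parameter and the quantizer step width stated above follows from a doubling every 6 steps, so that one step corresponds to a factor of 2^(1/6) ≈ 1.12. A small sketch of this arithmetic, with `step_ratio` as a hypothetical helper name and the parameter range 0 to 51 assumed for the 52 values:

```python
def step_ratio(qp_delta):
    """Factor by which the quantizer step width grows when the parameter rises by qp_delta."""
    return 2.0 ** (qp_delta / 6.0)


one_step = step_ratio(1)      # about 1.12, i.e. an increase of about 12%
six_steps = step_ratio(6)     # a factor of 2
full_range = step_ratio(51)   # ratio between the coarsest and the finest step width
```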
The quantized transformation coefficients of a block are generally sampled in zigzag path and processed by using entropy coding methods. The 2×2 DC coefficients of the chroma component are sampled in raster scan sequence and all inverse transformation operations within H.264/AVC can be implemented by using only additions and shift operations of 16 bit integer values.
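The zigzag path can be illustrated for a 4×4 block as follows. The sketch walks the anti-diagonals of the block in alternating directions, which is one common way to generate such a scan; the function names are hypothetical, and the test block simply holds the raster index of every position.

```python
def zigzag_order(n=4):
    """Return the (row, column) positions of an n x n block in zigzag order."""
    order = []
    for d in range(2 * n - 1):                       # d = row + column of one anti-diagonal
        cells = [(r, d - r) for r in range(n) if 0 <= d - r < n]
        if d % 2 == 0:
            cells.reverse()                          # even diagonals run from bottom-left upwards
        order.extend(cells)
    return order


def zigzag_scan(block):
    """Read the coefficients of a square block along the zigzag path."""
    return [block[r][c] for r, c in zigzag_order(len(block))]


block = [[4 * r + c for c in range(4)] for r in range(4)]
scanned = zigzag_scan(block)
```

Low-frequency coefficients near the top-left corner are thus read first, which tends to group the zero-valued high-frequency coefficients at the end of the sequence.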
With reference to FIG. 9, the input signal is first partitioned, picture by picture in a video sequence, into macroblocks with 16×16 pixels each. Then, every picture is supplied to a subtractor 84, which subtracts from the original picture the prediction picture supplied by a decoder 85, which is contained in the coder. The subtraction result, which means the residual signal in the spatial domain, is now transformed, scaled and quantized (block 86) to obtain the quantized transformation coefficients on line 81b. For generating the subtraction signal, which is fed into the subtractor 84, the quantized transformation coefficients are first scaled again and inverse transformed (block 87), to be supplied to an adder 88, the output of which feeds the deblocking filter 89, wherein the output video signal, as it will be decoded, for example, by a decoder, can be monitored at the output of the deblocking filter, for example for control purposes (output 90).
By using the decoded output signal at output 90, a motion estimation is performed in block 91. For the motion estimation in block 91, a picture of the original video signal is supplied, as seen from FIG. 9. The standard allows two different motion estimations, namely a forward motion estimation and a backward motion estimation. In the forward motion estimation, the motion of the current picture is estimated with regard to the previous picture. In the backward motion estimation, however, the motion of the current picture is estimated by using the future picture.
The results of the motion estimation (block 91) are supplied to a motion compensation block 92, which performs a motion-compensated inter prediction, particularly when a switch 93 is switched to the inter prediction mode, as it is the case in FIG. 9. If, however, the switch 93 is switched to intra frame prediction, an intra frame prediction is performed by using a block 490. Therefore, the motion data are not required, since no motion compensation is performed for an intra frame prediction.
The motion estimation block 91 generates motion data and motion fields, respectively, wherein motion data and motion fields, respectively, which consist of motion vectors, are transmitted to the decoder so that a corresponding inverse prediction, which means reconstruction by using the transformation coefficients and the motion data, can be performed. It should be noted that in the case of a forward prediction, the motion vector can be calculated from the immediately previous picture and from several previous pictures, respectively. Furthermore, it should be noted that in the case of a backward prediction, a current picture can be calculated by using the immediately adjacent future picture and of course also by using further future pictures.
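How a motion vector may be found is not prescribed by the text above. As a hedged illustration, the following sketch uses an exhaustive block matching search with the sum of absolute differences (SAD) as cost, which is a common but by no means the only approach. All names, the search range and the toy pictures are assumptions made for this example.

```python
def sad(cur, ref, top, left, dy, dx, size):
    """Sum of absolute differences between a current block and a displaced reference block."""
    return sum(
        abs(cur[top + y][left + x] - ref[top + dy + y][left + dx + x])
        for y in range(size)
        for x in range(size)
    )


def estimate_motion(cur, ref, top, left, size, search):
    """Return the integer displacement (dy, dx) with the smallest SAD in a +/-search window."""
    height, width = len(ref), len(ref[0])
    best_cost, best_mv = None, (0, 0)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            inside = (0 <= top + dy and top + dy + size <= height
                      and 0 <= left + dx and left + dx + size <= width)
            if not inside:
                continue  # skip candidates that leave the reference picture
            cost = sad(cur, ref, top, left, dy, dx, size)
            if best_cost is None or cost < best_cost:
                best_cost, best_mv = cost, (dy, dx)
    return best_mv


# Toy data: a reference picture and a current picture whose content is displaced.
ref = [[8 * y * y + x * x for x in range(8)] for y in range(8)]
cur = [[ref[min(y + 1, 7)][max(x - 1, 0)] for x in range(8)] for y in range(8)]
mv = estimate_motion(cur, ref, top=3, left=3, size=2, search=2)
```

For a forward estimation, `ref` would be a previous picture; for a backward estimation, a future picture.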
It is a disadvantage of the video coding concept illustrated in FIG. 9 that it provides no simple scalability possibility. As known in the art, the term “scalability” means a coder/decoder concept where the coder provides a scaled data stream. The scaled data stream comprises a base scaling layer as well as one or several enhancement scaling layers. The base scaling layer comprises a representation of the signal to be coded, generally with lower quality, but also with lower data rate. The enhancement scaling layer contains a further representation of the video signal, which provides a representation with improved quality with regard to the base scaling layer, typically together with the representation of the video signal in the base scaling layer. On the other hand, the enhancement scaling layer has, of course, individual bit requirements, so that the number of bits for representing the signal to be coded increases with every enhancement layer.
Depending on design and possibilities, a decoder will decode only the base scaling layer to provide a comparatively low-quality representation of the picture signal represented by the coded signal. With every “addition” of a further scaling layer, however, the decoder can improve the quality of the signal step by step (at the expense of the bit rate).
Depending on the implementation and the transmission channel from a coder to a decoder, at least the base scaling layer is transmitted, since the bit rate of the base scaling layer is typically so low that even a severely limited transmission channel will be sufficient. If the transmission channel allows no more bandwidth for the application, only the base scaling layer but no enhancement scaling layer will be transmitted. As a consequence, the decoder can generate merely a low quality representation of the picture signal. Compared to the unscaled case, where the data rate would have been so high that a transmission would not have been possible at all, the low quality representation is advantageous. If the transmission channel allows the transmission of one or several enhancement layers, the coder will transmit one or several enhancement layers to the decoder, so that it can increase the quality of the output video signal step by step, depending on the request.
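The choice of how many layers to transmit over a given channel can be sketched as a simple budget check. The layer bit rates, the channel rates and the function name `layers_to_send` below are hypothetical and serve only to illustrate the reasoning.

```python
def layers_to_send(layer_rates, channel_rate):
    """Count the transmitted layers: the base layer always, enhancement layers while they fit."""
    sent = 1                       # at least the base scaling layer is transmitted
    used = layer_rates[0]
    for rate in layer_rates[1:]:   # enhancement layers in increasing order of refinement
        if used + rate > channel_rate:
            break
        used += rate
        sent += 1
    return sent


rates_kbit = [64, 128, 256]        # base layer, first and second enhancement layer
narrow = layers_to_send(rates_kbit, 100)
medium = layers_to_send(rates_kbit, 200)
wide = layers_to_send(rates_kbit, 500)
```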
With regard to the coding of video sequences, two different scalings can be distinguished. One scaling is a temporal scaling, in so far that not all video frames of a video sequence are transmitted, but that for reducing the data rate, for example, only every second frame, every third frame, every fourth frame, etc. is transmitted.
The other scaling is the SNR scalability (SNR = signal-to-noise ratio), wherein every scaling layer, i.e. both the base scaling layer and the first, second, third, . . . enhancement scaling layer, comprises all time information, but with varying quality. Thus, the base scaling layer would have a low data rate, but also a low signal-to-noise ratio, wherein this signal-to-noise ratio can then be improved step by step by adding one enhancement scaling layer each.
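The idea of SNR scalability can be sketched with two quantizers: a coarse one for the base layer and a finer one that codes what the base layer missed. The sketch deliberately ignores any prediction loop and works on isolated sample values; the step widths, sample values and function names are assumptions made for illustration.

```python
def encode_layers(samples, base_step, enh_step):
    """Quantize coarsely for the base layer, then quantize the remaining error more finely."""
    base = [round(s / base_step) for s in samples]
    enh = [round((s - b * base_step) / enh_step) for s, b in zip(samples, base)]
    return base, enh


def reconstruct(base, enh, base_step, enh_step, use_enhancement):
    """Rebuild the samples from the base layer alone or from both layers."""
    out = []
    for b, e in zip(base, enh):
        value = b * base_step
        if use_enhancement:
            value += e * enh_step
        out.append(value)
    return out


def max_error(samples, decoded):
    return max(abs(s - d) for s, d in zip(samples, decoded))


samples = [3, 17, -9, 40]
base, enh = encode_layers(samples, base_step=8, enh_step=2)
err_base = max_error(samples, reconstruct(base, enh, 8, 2, use_enhancement=False))
err_full = max_error(samples, reconstruct(base, enh, 8, 2, use_enhancement=True))
```

In this open-loop setting the error is bounded by half the step width of the last layer used, so adding the enhancement layer cannot make the reconstruction worse.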
The coder concept illustrated in FIG. 9 is problematic in that merely residual values are generated by the subtracter 84 and are then processed. These residual values are calculated based on prediction algorithms in the arrangement shown in FIG. 9, which forms a closed loop by using the blocks 86, 87, 88, 89, 93, 94 and 84, wherein a quantization parameter enters the closed loop, namely in blocks 86, 87. If a simple SNR scalability were implemented, for example by quantizing every predicted residual signal first with a coarse quantizer step width and then refining it step by step with finer quantizer step widths by using enhancement layers, this would have the following consequences. Due to the inverse quantization and the prediction, particularly with regard to the motion estimation (block 91) and the motion compensation (block 92), which take place by using the original picture on the one hand and the quantized picture on the other hand, a “diverging” of the quantizer step widths results both in the coder and the decoder. As a result, the generation of the enhancement scaling layers on the coder side becomes very problematic. Further, processing the enhancement scaling layers on the decoder side becomes impossible, at least with regard to the elements defined in the standard H.264/AVC. The reason for this is the closed loop in the video coder illustrated in FIG. 9, which contains the quantization.
In the standardization document JVT-I 032 t1 titled “SNR-Scalable Extension of H.264/AVC”, Heiko Schwarz, Detlev Marpe and Thomas Wiegand, presented in the ninth JVT meeting from 2nd to 5th Dec. 2003 in San Diego, a scalable extension to H.264/AVC is presented, which comprises a scalability both with regard to time and signal-to-noise ratio (with equal or different temporal accuracy). For this purpose, a lifting representation of temporal subband partitions is introduced, which allows the usage of known methods for motion-compensated prediction.
Wavelet based video coder algorithms, wherein lifting implementations are used for the wavelet analysis and for wavelet synthesis, are described in J.-R. Ohm, “Complexity and delay analysis of MCTF interframe wavelet structures”, ISO/IEC JTC1/WG11 Doc. M8520, July 2002. Comments on scalability can also be found in D. Taubman, “Successive refinement of video: fundamental issues, past efforts and new directions”, Proc. of SPIE (VCIP '03), vol. 5150, pp. 649-663, 2003, wherein, however, significant coder structure alterations are required. According to the invention, a coder/decoder concept is achieved, which has, on the one hand, the scalability possibility and can, on the other hand, be based on elements in conformity with the standard, particularly, e.g., for the motion compensation.
Before reference will be made in more detail to a coder/decoder structure with regard to FIG. 3, first, a basic lifting scheme on the side of the coder and an inverse lifting scheme on the side of the decoder, respectively, will be illustrated with regard to FIG. 4. Detailed explanations about the background of the combination of lifting schemes and wavelet transformations can be found in W. Sweldens, “A custom design construction of biorthogonal wavelets”, J. Appl. Comp. Harm. Anal., vol. 3 (no. 2), pp. 186-200, 1996 and I. Daubechies and W. Sweldens, “Factoring wavelet transforms into lifting Steps”, J. Fourier Anal. Appl., vol. 4 (no. 3), pp. 247-269, 1998. Generally, the lifting scheme consists of three steps, the polyphase decomposition step, the prediction step and the update step.
The decomposition step comprises partitioning the input side data stream into an identical first copy for a lower branch 40a as well as an identical copy for an upper branch 40b. Further, the identical copy of the upper branch 40b is delayed by a time stage ($z^{-1}$), so that a sample $s_{2k+1}$ with an odd index passes through a respective decimator and downsampler 42a, 42b, respectively, at the same time as a sample $s_{2k}$ with an even index. The decimator 42a and 42b, respectively, reduces the number of samples in the upper and the lower branch 40b, 40a, respectively, by eliminating every second sample.
The second region II, which relates to the prediction step, comprises a prediction operator 43 as well as a subtracter 44. The third region, which means the update step, comprises an update operator 45 as well as an adder 46. On the output side, two normalizers 47, 48 exist, for normalizing the high-pass signal $h_k$ (normalizer 47) and for normalizing the low-pass signal $l_k$ through the normalizer 48.
Particularly, the polyphase decomposition leads to the partitioning of even and odd samples of a given signal $s[k]$. Since the correlation structure typically shows a local characteristic, the even and odd polyphase components are highly correlated. Thus, in a subsequent step, a prediction (P) of the odd samples is performed by using the even samples. The corresponding prediction operator (P) for every odd sample $s_{\text{odd}}[k] = s[2k+1]$ is a linear combination of the adjacent even samples $s_{\text{even}}[k] = s[2k]$, i.e.
$$P(s_{\text{even}})[k] = \sum_{l} p_l \, s_{\text{even}}[k+l].$$
As a result of the prediction step, the odd samples are replaced by their respective prediction residual values
$$h[k] = s_{\text{odd}}[k] - P(s_{\text{even}})[k].$$
It should be noted that the prediction step is equivalent to performing a high-pass filter of a two channel filter bank, as it is illustrated in I. Daubechies and W. Sweldens, “Factoring wavelet transforms into lifting steps”, J. Fourier Anal. Appl. vol 4 (no. 3), pp. 247-269, 1998.
In the third step of the lifting scheme, low-pass filtering is performed, by replacing the even samples $s_{\text{even}}[k]$ by a linear combination of prediction residual values $h[k]$. The respective update operator U is given by
$$U(h)[k] = \sum_{l} u_l \, h[k+l].$$
By replacing the even samples with
$$l[k] = s_{\text{even}}[k] + U(h)[k]$$
the given signal $s[k]$ can finally be represented by $l[k]$ and $h[k]$, wherein every signal has half the sample rate. Since both the update step and the prediction step are fully invertible, the corresponding transformation can be interpreted as a critically sampled perfect reconstruction filter bank. Indeed, it can be shown that any biorthogonal family of wavelet filters can be realized by a sequence of one or several prediction steps and one or several update steps. For a normalization of low-pass and high-pass components, the normalizers 47 and 48 are supplied with suitably chosen scaling factors $F_l$ and $F_h$, as has been explained.
The inverse lifting scheme, which corresponds to the synthesis filter bank, is shown in FIG. 4 on the right hand side. It consists simply of the application of the prediction and update operator in inverse order and with inverse signs, followed by the reconstruction by using the even and odd polyphase components. Specifically, the decoder shown on the right in FIG. 4 comprises again a first decoder region I, a second decoder region II as well as a third decoder region III. The first decoder region cancels the effect of the update operator 45. This is effected by supplying the high-pass signal, which has been re-normalized by a further normalizer 50, to the update operator 45. Then, the output signal of the decoder side update operator 45 is supplied to a subtracter 52, in contrast to the adder 46 on the coder side in FIG. 4. Correspondingly, the output signal of the predictor 43 is processed, the output signal of which is now supplied to an adder 53 and not to a subtracter as on the coder side. Now, an upsampling of the signal by the factor 2 takes place in every branch (blocks 54a, 54b). Then, the upper branch is shifted by one sample into the future, which is equivalent to delaying the lower branch, to perform then an addition of the data streams on the upper branch and the lower branch in an adder 55, to obtain the reconstructed signal $s_k$ at the output of the synthesis filter bank.
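The invertibility of the lifting scheme does not depend on the particular choice of the prediction and update operators. The following sketch makes this visible with deliberately arbitrary coefficients. It omits the normalizers, assumes an even number of input samples and clamps indices at the signal borders; the coefficient values and all function names are assumptions for illustration only.

```python
from fractions import Fraction


def apply_operator(coeffs, signal, k):
    """Linear combination sum_l coeffs[l] * signal[k + l], with indices clamped at the borders."""
    n = len(signal)
    return sum(w * signal[min(max(k + l, 0), n - 1)] for l, w in coeffs.items())


def analyze(s, p, u):
    """Polyphase split, prediction step and update step (coder side)."""
    even, odd = s[0::2], s[1::2]
    h = [odd[k] - apply_operator(p, even, k) for k in range(len(odd))]
    l = [even[k] + apply_operator(u, h, k) for k in range(len(even))]
    return l, h


def synthesize(l, h, p, u):
    """Same operators in inverse order and with inverse signs (decoder side)."""
    even = [l[k] - apply_operator(u, h, k) for k in range(len(l))]
    odd = [h[k] + apply_operator(p, even, k) for k in range(len(h))]
    s = [None] * (len(even) + len(odd))
    s[0::2], s[1::2] = even, odd          # interleave the polyphase components again
    return s


# Arbitrary operator coefficients {offset l: weight}, exact rational arithmetic.
p = {0: Fraction(3, 4), 1: Fraction(1, 4)}
u = {0: Fraction(1, 3)}
signal = [4, 6, 10, 12, 8, 6, 5, 5]
low, high = analyze(signal, p, u)
restored = synthesize(low, high, p, u)
```

Exact fractions are used so that the reconstruction can be compared for equality without floating-point rounding effects.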
Several wavelets can be implemented by the predictor 43 and the update operator 45, respectively. If the so-called Haar wavelet is to be implemented, the prediction operator and the update operator are given by the following equation:
$$P_{\text{Haar}}(s_{\text{even}})[k] = s[2k] \quad\text{and}\quad U_{\text{Haar}}(h)[k] = \tfrac{1}{2}\,h[k],$$
such that
$$h[k] = s[2k+1] - s[2k] \quad\text{and}\quad l[k] = s[2k] + \tfrac{1}{2}\,h[k] = \tfrac{1}{2}\left(s[2k] + s[2k+1]\right)$$
correspond to the non-normalized high-pass and low-pass (analysis) output signal, respectively, of the Haar filter.
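A short numerical sketch of the non-normalized Haar case: the high-pass output is the difference of each sample pair and the low-pass output is its mean. The sample values and the function name are chosen only for illustration, and an even number of samples is assumed.

```python
def haar_analysis(s):
    """Non-normalized Haar lifting: h[k] = s[2k+1] - s[2k], l[k] = s[2k] + h[k] / 2."""
    pairs = len(s) // 2
    h = [s[2 * k + 1] - s[2 * k] for k in range(pairs)]
    l = [s[2 * k] + h[k] / 2 for k in range(pairs)]
    return l, h


samples = [4, 6, 10, 12, 8, 6]
low, high = haar_analysis(samples)
pair_means = [(samples[2 * k] + samples[2 * k + 1]) / 2 for k in range(len(samples) // 2)]
```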
In the case of the 5/3 biorthogonal spline wavelet, the low-pass and high-pass analysis filter of this wavelet have 5 and 3 filter taps, respectively, wherein the corresponding scaling function is a second order B spline. In coder applications for still pictures, such as JPEG 2000, this wavelet is used for a time subband coder scheme. In a lifting environment, the corresponding prediction and update operators of the 5/3 transformation are given as follows:
$$P_{5/3}(s_{\text{even}})[k] = \tfrac{1}{2}\left(s[2k] + s[2k+2]\right) \quad\text{and}\quad U_{5/3}(h)[k] = \tfrac{1}{4}\left(h[k] + h[k-1]\right).$$
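The effect of the 5/3 operators can be seen on a linear ramp: because every odd sample is predicted from both even neighbors, the high-pass output vanishes wherever both neighbors exist. The sketch below repeats the border sample where a neighbor is missing, which is an assumption of this example and not taken from the text; the function names are hypothetical.

```python
def predict_53(even, k):
    """P(s_even)[k] = (s[2k] + s[2k+2]) / 2, repeating the last even sample at the border."""
    nxt = even[min(k + 1, len(even) - 1)]
    return (even[k] + nxt) / 2


def update_53(h, k):
    """U(h)[k] = (h[k] + h[k-1]) / 4, repeating the first residual at the border."""
    prev = h[max(k - 1, 0)]
    return (h[k] + prev) / 4


def analysis_53(s):
    even, odd = s[0::2], s[1::2]
    h = [odd[k] - predict_53(even, k) for k in range(len(odd))]
    l = [even[k] + update_53(h, k) for k in range(len(even))]
    return l, h


ramp = [0, 1, 2, 3, 4, 5, 6, 7]
low, high = analysis_53(ramp)
```

Only the last high-pass value is non-zero here, and only because of the assumed border handling.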
FIG. 3 shows a block diagram of a coder/decoder structure with exemplary four filter levels both on the side of the coder and on the side of the decoder. From FIG. 3, it can be seen that the first filter level, the second filter level, the third filter level and the fourth filter level are identical with regard to the coder. The filter levels with regard to the decoder are also identical. On the coder side, every filter level comprises a backward predictor Mi0 60 as well as a forward predictor Mi1 61 as central elements. The backward predictor 60 corresponds in principle to the predictor 43 of FIG. 4, while the forward predictor 61 corresponds to the update operator of FIG. 4.
It should be noted that, in contrast to FIG. 3, FIG. 4 relates to a stream of samples, where a sample has an odd index 2k+1, while another sample has an even index 2k. However, as has already been explained with regard to FIG. 1, the notation in FIG. 3 relates to a group of pictures instead of to a group of samples. If a picture has for example a number of samples and pictures, respectively, this picture is fed in fully. Then, the next picture is fed in, etc. Thus, there are no longer odd and even samples, but odd and even pictures. According to the invention, the lifting scheme described for odd and even samples is applied to odd and even pictures, respectively, each of which has a plurality of samples. Now, the sample by sample predictor 43 of FIG. 4 becomes the backward motion compensation prediction 60, while the sample by sample update operator 45 becomes the picture by picture forward motion compensation prediction 61.
It should be noted that the motion filters, which consist of motion vectors and represent coefficients for the blocks 60 and 61, are calculated for two subsequent related pictures and are transmitted as side information from coder to decoder. However, it is a main advantage of the inventive concept that the elements 91, 92, as they are described with reference to FIG. 9 and standardized in standard H.264/AVC, can easily be used to calculate both the motion fields Mi0 and the motion fields Mi1. Thus, no new predictor/update operator has to be used for the inventive concept, but the already existing algorithm mentioned in the video standard, which is examined and checked for functionality and efficiency, can be used for the motion compensation in forward direction or backward direction.
Particularly, the general structure of the used filter bank illustrated in FIG. 3 shows a temporal decomposition of the video signal with a group of 16 pictures, which are fed in at an input 64. The decomposition is a dyadic temporal decomposition of the video signal, wherein in the embodiment shown in FIG. 3 with four levels, 2^4 = 16 pictures, which means a group size of 16 pictures, are required to achieve the representation with the smallest temporal resolution, which means the signals at the output 28a and at the output 28b. Thus, if 16 pictures are grouped, this leads to a delay of 16 pictures, which makes the concept shown in FIG. 3 with four levels rather problematic for interactive applications. Thus, if interactive applications are aimed at, it is preferred to form smaller groups of pictures, such as groups of four or eight pictures. Then, the delay is correspondingly reduced, so that the usage for interactive applications becomes possible. In cases where interactivity is not required, such as for storage purposes, etc., the number of pictures in a group, which means the group size, can be correspondingly increased, such as to 32, 64, etc. pictures.
In that way, an iterative application of the Haar-based motion-compensated lifting scheme is used, which consists of the backward motion compensation prediction (Mi0), as in H.264/AVC, and which further comprises an update step, which comprises a forward motion compensation (Mi1). Both the prediction step and the update step use the motion compensation process, as it is illustrated in H.264/AVC. Further, not only the motion compensation is used, but also the deblocking filter designated with the reference number 89 in FIG. 9.
The second filter level again comprises downsamplers 66a, 66b, a subtracter 69, a backward predictor 67, a forward predictor 68 as well as an adder 70 and a further processing means to output the first and second high-pass picture of the second level at an output of the further processing means, while the first and second low-pass picture of the second level are output at the output of the adder 70.
Additionally, the coder in FIG. 3 comprises a third level as well as a fourth level, wherein a group of 16 pictures is fed into the fourth-level input 64. At a fourth-level high-pass output 72, which is also referred to as HP4, eight high-pass pictures quantized with a quantization parameter Q and correspondingly processed are output. Correspondingly, eight low-pass pictures are output at a low-pass output 73 of the fourth filter level, which are fed into an input 74 of the third filter level. This level, again, is effective to generate four high-pass pictures at a high-pass output 75, which is also referred to as HP3, and to generate four low-pass pictures at a low-pass output 76, which are fed into the input 10 of the second filter level and decomposed.
It should particularly be noted that the group of pictures processed by a filter level does not necessarily have to consist of video pictures originating from an original video sequence, but can also consist of low-pass pictures, which are output by the next higher filter level at the low-pass output of that filter level.
Further, it should be noted that the coder concept shown in FIG. 3 for 16 pictures can easily be reduced to eight pictures simply by omitting the fourth filter level and feeding the group of pictures into the input 74. In the same way, the concept shown in FIG. 3 can also be extended to a group of 32 pictures by adding a fifth filter level, outputting 16 high-pass pictures at a high-pass output of the fifth filter level and feeding the 16 low-pass pictures at the low-pass output of the fifth filter level into the input 64 of the fourth filter level.
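The relationship between group size, number of filter levels and the number of high-pass and low-pass pictures per level can be tabulated with a short sketch. The function name and the tuple layout are hypothetical and serve only to illustrate the dyadic structure described above.

```python
def dyadic_decomposition(group_size):
    """Return (level, high_pass_count, low_pass_count) per filter level,
    starting with the highest level, into which the group is fed."""
    if group_size < 2 or group_size & (group_size - 1) != 0:
        raise ValueError("group size must be a power of two and at least 2")
    level = group_size.bit_length() - 1   # number of levels, e.g. 4 for 16 pictures
    pictures = group_size
    result = []
    while pictures > 1:
        pictures //= 2                    # every level halves the number of pictures
        result.append((level, pictures, pictures))
        level -= 1
    return result


for size in (8, 16, 32):
    print(size, dyadic_decomposition(size))
```

For a group of 16 pictures, the sketch lists eight high-pass and eight low-pass pictures at the fourth level, four of each at the third level, two of each at the second level and one of each at the first level. The coding delay in pictures equals the group size.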
The tree-like concept of the coder side is also applied to the decoder side, where it is, however, no longer traversed from the higher level to the lower level as on the coder side, but from the lower level to the higher level. For this purpose, the data stream is received from a transmission medium, which is schematically referred to as network abstraction layer 100, and the received bit stream is first subjected to an inverse further processing by using the inverse further processing means, to obtain a reconstructed version of the first high-pass picture of the first level at the output of means 30a and a reconstructed version of the first-level low-pass picture at the output of block 30b of FIG. 3. Then, analogous to the right half of FIG. 4, the forward motion compensation prediction is first reversed via the predictor 61, and the output signal of the predictor 61 is then subtracted from the reconstructed version of the low-pass signal (subtracter 101).
The output signal of the subtracter 101 is fed into a backward compensation predictor 60 to generate a prediction result, which is added to the reconstructed version of the high-pass picture in an adder 102. Then, both signals, which means the signals in the lower branch 103a and in the upper branch 103b, are brought to the double sample rate by using the upsamplers 104a, 104b, wherein the signal on the upper branch is then either delayed or “accelerated”, depending on the implementation. It should be noted that the upsampling is performed by the elements 104a, 104b simply by inserting a number of zeros which corresponds to the number of samples for a picture. The shift of the upper branch 103b against the lower branch 103a by the delay of one picture, which is effected by the element shown with z^-1, has the effect that the addition by an adder 106 causes the two second-level low-pass pictures to occur subsequently on the output side of the adder 106.
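The combination of zero-insertion upsampling, a delay of one picture and an addition can be illustrated by the following sketch. It assumes that every branch carries a list of pictures, each picture being a list of samples, and that the second branch is the delayed one; the function names are hypothetical.

```python
def upsample(stream, zero_picture):
    """Double the picture rate by inserting an all-zero picture after every picture."""
    out = []
    for picture in stream:
        out.append(picture)
        out.append(zero_picture)
    return out


def delay_one_picture(stream, zero_picture):
    """Delay element z^-1: shift the stream by one picture position."""
    return [zero_picture] + stream[:-1]


def interleave(first_branch, second_branch):
    """Upsample both branches, delay the second one, and add them picture by picture."""
    zero_picture = [0] * len(first_branch[0])
    a = upsample(first_branch, zero_picture)
    b = delay_one_picture(upsample(second_branch, zero_picture), zero_picture)
    return [[x + y for x, y in zip(p, q)] for p, q in zip(a, b)]


# One reconstructed picture per branch yields two subsequent pictures at the adder output.
print(interleave([[1, 1]], [[2, 2]]))
```

Because every inserted zero picture in one branch coincides with an actual picture in the other branch, the addition merely interleaves the two branches without altering any sample value.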
The reconstructed versions of the first and second second-level low-pass pictures are then fed into the decoder-side inverse filter of the second level, where they are combined again with the transmitted second-level high-pass pictures by the identical implementation of the inverse filter bank to obtain a sequence of four third-level low-pass pictures at an output 101 of the second level. The four third-level low-pass pictures are then combined in an inverse filter level of the third level with the transmitted third-level high-pass pictures to obtain eight fourth-level low-pass pictures in subsequent format at an output 110 of the inverse third-level filter. These eight fourth-level low-pass pictures will then be combined again with the eight fourth-level high-pass pictures received from the transmission medium 100 via the input HP4, in an inverse fourth-level filter, as discussed with regard to the first level, to obtain a reconstructed group of 16 pictures at an output 112 of the inverse fourth-level filter.
Thus, in every stage of the analysis filter bank, two pictures, either original pictures or pictures representing low-pass signals and generated in a next higher level, are decomposed into a low-pass signal and a high-pass signal. The low-pass signal can be considered as representation of the common characteristics of the input pictures, while the high-pass signal can be considered as representation of the differences between the input pictures. In the corresponding stage of the synthesis filter bank, the two input pictures are again reconstructed by using the low-pass signal and the high-pass signal. Since the inverse operations of the analysis step are performed in the synthesis step, the analysis/synthesis filter bank (without quantization, of course) guarantees a perfect reconstruction.
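The perfect reconstruction property can be checked with a small sketch of a multi-level analysis/synthesis filter bank without quantization. As assumptions not taken from the description, the sketch uses zero motion, the Haar update weight 1/2 and pictures modeled as flat lists of samples; all function names are hypothetical.

```python
def analyze_pair(even, odd):
    """Analysis stage: decompose two pictures into a low-pass and a high-pass picture."""
    high = [o - e for e, o in zip(even, odd)]
    low = [e + 0.5 * h for e, h in zip(even, high)]
    return low, high


def synthesize_pair(low, high):
    """Synthesis stage: undo the update first, then undo the prediction."""
    even = [l - 0.5 * h for l, h in zip(low, high)]
    odd = [h + e for h, e in zip(high, even)]
    return even, odd


def analyze_group(pictures):
    """Decompose a group of 2^n pictures; return the final low-pass picture
    and the high-pass pictures of every level, highest level first."""
    high_by_level = []
    current = pictures
    while len(current) > 1:
        lows, highs = [], []
        for i in range(0, len(current), 2):
            low, high = analyze_pair(current[i], current[i + 1])
            lows.append(low)
            highs.append(high)
        high_by_level.append(highs)
        current = lows
    return current[0], high_by_level


def synthesize_group(low, high_by_level):
    """Rebuild the group, proceeding from the lowest level to the highest level."""
    current = [low]
    for highs in reversed(high_by_level):
        rebuilt = []
        for low_picture, high_picture in zip(current, highs):
            even, odd = synthesize_pair(low_picture, high_picture)
            rebuilt.extend([even, odd])
        current = rebuilt
    return current


group = [[float(4 * i + j) for j in range(3)] for i in range(16)]
low, high_by_level = analyze_group(group)
reconstructed = synthesize_group(low, high_by_level)
print(reconstructed == group)
```

The synthesis stage applies the inverse operations of the analysis stage in reverse order, which is why the reconstruction does not depend on what the predictor and the update operator actually compute, as long as coder and decoder use the same operators.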
The only losses occur due to the quantization in the further processing means, such as 26a, 26b, 18. If quantization is performed very finely, a good signal-to-noise ratio is achieved. If, however, quantization is performed very coarsely, a relatively bad signal-to-noise ratio is achieved, but with a low bit rate, which means a low demand.
Without SNR scalability, a time scaling control could be implemented already with the concept shown in FIG. 3. For this purpose, a time scaling control 120 is used, which is formed to receive, at its input side, the high-pass and low-pass outputs, or the outputs of the further processing means (26a, 26b, 18, ...), and to generate from these partial data streams TP1, HP1, HP2, HP3, HP4 a scaled data stream, which has the processed version of the first low-pass picture and the first high-pass picture in a base scaling layer. Then, the processed versions of the second-level high-pass pictures could be accommodated in a first enhancement scaling layer. The processed versions of the third-level high-pass pictures could be accommodated in a second enhancement scaling layer, while the processed versions of the fourth-level high-pass pictures are introduced in a third enhancement scaling layer. Thereby, merely based on the base scaling layer, a decoder could already generate a sequence of lower-level low-pass pictures with a lower temporal quality, which means two first-level low-pass pictures per group of pictures. With the addition of every enhancement scaling layer, the number of reconstructed pictures per group is doubled. The functionality of the decoder is typically controlled by a scaling control, which is formed to detect how many scaling layers are contained in the data stream and how many scaling layers have to be considered by the decoder during decoding, respectively.
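The assignment of the partial data streams to scaling layers and the resulting number of reconstructable pictures per group can be sketched as follows. The table and the function names are hypothetical; only the stream labels TP1 and HP1 to HP4 and the layer assignment are taken from the description.

```python
# Scaling layers in the order in which a decoder may add them.
SCALING_LAYERS = [
    ("base layer", ["TP1", "HP1"]),
    ("first enhancement layer", ["HP2"]),
    ("second enhancement layer", ["HP3"]),
    ("third enhancement layer", ["HP4"]),
]


def streams_to_decode(num_layers):
    """Partial data streams a decoder needs when it considers num_layers scaling layers."""
    if not 1 <= num_layers <= len(SCALING_LAYERS):
        raise ValueError("number of scaling layers out of range")
    streams = []
    for _, layer_streams in SCALING_LAYERS[:num_layers]:
        streams.extend(layer_streams)
    return streams


def pictures_per_group(num_layers):
    """Reconstructed pictures per group: two for the base layer, doubled per added layer."""
    if not 1 <= num_layers <= len(SCALING_LAYERS):
        raise ValueError("number of scaling layers out of range")
    return 2 ** num_layers


for n in range(1, len(SCALING_LAYERS) + 1):
    print(SCALING_LAYERS[n - 1][0], streams_to_decode(n), pictures_per_group(n))
```

With all four layers, the decoder uses every partial data stream and reconstructs the full group of 16 pictures.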
The JVT document JVT-J 035 with the title “SNR-Scalable Extension of H.264/AVC”, Heiko Schwarz, Detlev Marpe and Thomas Wiegand, presented during the tenth JVT meeting in Waikoloa, Hawaii, December 8 to 12, 2003, shows an SNR-scalable extension of the temporal decomposition scheme illustrated in FIGS. 3 and 4. Particularly, a time scaling layer is partitioned into individual “SNR scaling sublayers”, wherein an SNR base layer is obtained in that a certain time scaling layer is quantized with a first, coarser quantizer step width. Then, among other things, an inverse quantization is performed, and the result signal from the inverse quantization is subtracted from the original signal to obtain a difference signal, which is then quantized with a finer quantizer step width to obtain the second scaling layer. The second scaling layer, in turn, is requantized with the finer quantizer step width, and the signal obtained after the requantization is subtracted from the original signal to obtain a further difference signal, which, again after quantization, but now with a still finer quantizer step width, represents a further SNR scaling layer or SNR enhancement layer, respectively.
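The layering principle of coarse quantization followed by quantization of the remaining difference with a finer step width can be illustrated by a minimal sketch. It is not the method of the cited JVT document: it assumes a plain uniform scalar quantizer, computes every difference signal against the accumulated reconstruction of all previous layers, and uses arbitrary example step widths. All names are hypothetical.

```python
def quantize(values, step):
    """Uniform scalar quantizer (an assumption): map every value to an integer index."""
    return [round(v / step) for v in values]


def dequantize(indices, step):
    """Inverse quantization: map the indices back to reconstruction values."""
    return [i * step for i in indices]


def encode_snr_layers(signal, steps):
    """Quantize the signal with the first (coarsest) step width, then quantize
    the remaining difference with every following, finer step width."""
    layers = []
    reconstruction = [0.0] * len(signal)
    for step in steps:
        difference = [s - r for s, r in zip(signal, reconstruction)]
        indices = quantize(difference, step)
        layers.append(indices)
        reconstruction = [r + d for r, d in zip(reconstruction, dequantize(indices, step))]
    return layers


def decode_snr_layers(layers, steps, num_layers):
    """Reconstruct from the base layer plus the first num_layers - 1 enhancement layers."""
    reconstruction = [0.0] * len(layers[0])
    for indices, step in zip(layers[:num_layers], steps[:num_layers]):
        reconstruction = [r + d for r, d in zip(reconstruction, dequantize(indices, step))]
    return reconstruction


signal = [13.0, -7.0, 42.0, 3.0]
steps = [8.0, 2.0]                     # coarse base layer, finer enhancement layer
layers = encode_snr_layers(signal, steps)
for n in (1, 2):
    decoded = decode_snr_layers(layers, steps, n)
    print(n, max(abs(s - d) for s, d in zip(signal, decoded)))
```

In this sketch, the maximum reconstruction error after decoding a given number of layers is bounded by half the step width of the last decoded layer, so that every additional layer improves the signal-to-noise ratio at the cost of additional bits.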
Thus, it has been found that the above-described scalability schemes, which are based on motion-compensated temporal filtering (MCTF), already provide high flexibility with regard to temporal scalability and also SNR scalability. But there is still a problem in that the bit rate of several scaling layers together is still significantly above the bit rate that can be achieved when pictures of the highest quality are coded without scalability. Due to the side information for the different scaling layers, scalable coders might never reach the bit rate of the unscaled case. However, the bit rate of a data stream with several scaling layers should approach the bit rate of the unscaled case as closely as possible.
Further, the scalability concept should provide high flexibility for all scalability types, which means a high flexibility both with regard to time and space and also with regard to SNR.
The high flexibility is particularly important where pictures with low resolution would already be sufficient but a higher temporal resolution is desirable. Such a situation results, for example, when fast changes exist in pictures, such as in videos of team sports, where, in addition to the ball, many persons move at the same time.
A further disadvantage of existing scalability concepts is that they use the identical motion data for all scaling layers, which either limits the flexibility of the scalability or results in a non-optimum motion prediction and an increasing residual signal of the motion prediction, respectively.
On the other hand, a completely different motion data transmission for two different scaling layers leads to a significant overhead, since, particularly when relatively low SNR scaling layers are considered, where quantization is performed relatively coarsely, the portion of motion data in the overall bit stream becomes noticeable. A flexible scalability concept, wherein different motion data for different scaling layers become possible at all, is thus paid for by an additional bit rate, which is particularly disadvantageous in view of the fact that all efforts are directed at reducing the bit rate. Further, the additional bits for the transmission of motion data stand out particularly in the lower scaling layers, compared to the bits for the motion prediction residual values. Exactly there, however, this is particularly undesirable, since in the lower scaling layers the effort is made to obtain a sufficiently acceptable quality, which means to use at least a sufficiently reasonable quantization parameter, and at the same time to obtain a lower bit rate.