The H.264/AVC standard provides excellent coding efficiency, but it does not address scalable video coding (SVC). SVC provides different layers, usually a base layer (BL) and an enhancement layer (EL). The Moving Picture Experts Group (MPEG) works on enhanced functionality of the video codec. Various techniques were proposed, and the Joint Video Team (JVT) started a standard called JSVC, with corresponding reference software (JSVM). SVC provides temporal, SNR and spatial scalability for applications. The BL of JSVM is compatible with H.264, and most components of H.264 are used in JSVM as specified, so that only a few components need to be adjusted according to the subband structure. Among all the scalabilities, spatial scalability is the most challenging and interesting, since it is hard to exploit the redundancy between the two spatially scalable layers.
SVC provides several techniques for spatial scalability, such as IntraBL mode, residual prediction and BLSkip (base layer skip) mode. These modes can be selected at the macroblock (MB) level.
IntraBL mode uses the upsampled reconstructed BL picture to predict an MB in the EL, and encodes only the residual. Residual prediction tries to reduce the energy of the motion compensation (MC) residual of the EL by subtracting the upsampled MC residual of the BL. BLSkip mode uses the upsampled BL motion vector (MV) for an MB in the EL, and requires only the residual to be written into the bit stream if an MB selects this mode. Thus, the BLSkip mode makes use of the redundancy between the MVs of a BL and its EL in the spatial scalability case.
For inter-coded pictures of SVC, including both P pictures and B pictures, residual prediction is used to decrease the energy of the residual and thereby improve coding efficiency. The basic idea is to first obtain the predicted residual by upsampling the residual signal of the corresponding BL picture, using a 2-tap bilinear filter. Then the predicted residual is subtracted from the real residual, which is obtained from the motion estimation in the EL, and the difference is coded using DCT, entropy coding, etc.
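The subtraction step described above can be sketched as follows. This is a minimal illustration, not the JSVM implementation: the function names are hypothetical, the arrays are toy data, and the upsampled BL residual is produced here by plain sample repetition (np.kron) as a stand-in for the 2-tap bilinear filter.

```python
import numpy as np

def predict_residual(el_residual, upsampled_bl_residual):
    """Encoder side: difference between the real EL residual and the predicted one."""
    return np.asarray(el_residual, dtype=np.float64) - upsampled_bl_residual

def reconstruct_residual(coded_difference, upsampled_bl_residual):
    """Decoder side: add the predicted residual back (lossless case, no quantization)."""
    return coded_difference + upsampled_bl_residual

def energy(block):
    """Sum of squared samples, used here as a simple measure of residual energy."""
    return float(np.sum(np.square(block)))

# Toy data (assumption): a 2x2 BL residual and a 4x4 EL residual that resembles it.
bl_residual = np.array([[1.0, -2.0], [3.0, 0.0]])
upsampled = np.kron(bl_residual, np.ones((2, 2)))  # stand-in for bilinear upsampling
noise = np.tile(np.array([[0.0, 1.0], [-1.0, 0.0]]), (2, 2))
el_residual = upsampled + noise

difference = predict_residual(el_residual, upsampled)
print(energy(el_residual), energy(difference))
```

When the two layers are well correlated, as in this toy case, the difference carries much less energy than the original EL residual; when they are not, the subtraction can increase the energy instead.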
Residual upsampling is commonly done MB by MB, and within each MB in 4×4, 8×8 or 16×16 sub-blocks, depending on the MC accuracy. If the MC accuracy is e.g. 16×16, the whole 16×16 MB uses just one motion vector; if the MC accuracy is 8×8, each of the four 8×8 sub-blocks may have a different motion vector. The residuals of different 8×8 sub-blocks have low correlation, so the upsampling process is done separately for the four sub-blocks. SVC uses a simple 2-tap bilinear filter, performing the upsampling first in the horizontal and then in the vertical direction. The filter operates at the MB level, and thus cannot cross the boundary of an 8×8 block.
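A block-wise, separable upsampling of this kind might look like the sketch below. Several points are assumptions made for illustration: dyadic (2×) spatial scalability, so that an 8×8 EL sub-block corresponds to a 4×4 BL block; a simplified 2-tap filter that copies each input sample and averages neighbouring pairs; and edge replication at block borders. The exact filter phases and rounding of JSVM are not reproduced, and all function names are hypothetical.

```python
import numpy as np

def upsample_1d_2x(x, axis):
    """Double the size along one axis with a simplified 2-tap bilinear filter."""
    x = np.moveaxis(np.asarray(x, dtype=np.float64), axis, -1)
    # Right-hand neighbour of each sample; the last sample is replicated so the
    # filter never reads beyond the block.
    right = np.concatenate([x[..., 1:], x[..., -1:]], axis=-1)
    out = np.empty(x.shape[:-1] + (2 * x.shape[-1],))
    out[..., 0::2] = x                  # even positions: copy the input sample
    out[..., 1::2] = 0.5 * (x + right)  # odd positions: average of two neighbours
    return np.moveaxis(out, -1, axis)

def upsample_block_2x(block):
    """Upsample first horizontally, then vertically."""
    return upsample_1d_2x(upsample_1d_2x(block, axis=1), axis=0)

def upsample_residual_blockwise(bl_residual, bl_block):
    """Upsample each bl_block x bl_block BL block independently of its neighbours."""
    bl_residual = np.asarray(bl_residual, dtype=np.float64)
    h, w = bl_residual.shape
    out = np.empty((2 * h, 2 * w))
    for y in range(0, h, bl_block):
        for x in range(0, w, bl_block):
            block = bl_residual[y:y + bl_block, x:x + bl_block]
            out[2 * y:2 * (y + bl_block), 2 * x:2 * (x + bl_block)] = upsample_block_2x(block)
    return out

# Toy BL residual (assumption): two 4x4 blocks side by side with different levels.
bl = np.hstack([np.zeros((8, 4)), np.full((8, 4), 4.0)])
per_block = upsample_residual_blockwise(bl, bl_block=4)  # filter stops at block borders
whole = upsample_residual_blockwise(bl, bl_block=8)      # filter spans the whole area
```

In per_block the two halves stay at their own levels, whereas in whole the column next to the border takes an intermediate value, which shows what restricting the filter to a sub-block prevents.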
Whether or not to use residual prediction can be decided for each particular MB. A mode decision process tries the different modes, each with and without residual prediction. This is called adaptive residual prediction.
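The decision loop can be sketched roughly as below. The cost measure is an assumption chosen for brevity (sum of absolute values of the signal that would be coded); a real encoder would typically use a rate-distortion cost instead. The function name and the mode labels are hypothetical.

```python
import numpy as np

def choose_mode(el_residuals, predicted_residual):
    """Try every candidate mode with and without residual prediction.

    el_residuals: dict mapping a mode label to the EL MC residual for that mode.
    Returns (mode, use_residual_prediction, cost) for the cheapest combination.
    """
    best = None
    for mode, residual in el_residuals.items():
        for use_prediction in (False, True):
            coded = residual - predicted_residual if use_prediction else residual
            cost = float(np.abs(coded).sum())  # simplified cost (assumption)
            if best is None or cost < best[2]:
                best = (mode, use_prediction, cost)
    return best

# Toy example (assumption): one upsampled BL residual and two candidate EL residuals.
predicted = np.full((4, 4), 2.0)
candidates = {"16x16": np.full((4, 4), 2.0), "8x8": np.full((4, 4), 3.0)}
decision = choose_mode(candidates, predicted)
```

The returned flag corresponds to the per-MB choice described above: if no combination with residual prediction is cheaper, the MB is coded without it.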
The typical frame structure employed by H.264/SVC contains two intra-coded reference frames that are used at the receiver for Instantaneous Decoder Refresh (IDR), followed by a number of intra-coded or inter-coded frames, which form several groups of pictures (GOPs). Inter-coded frames can be interpolated or predicted. In wavelet decomposition, the EL of a GOP typically consists of several high-pass frames followed by a low-pass frame. A low-pass frame is used for both the preceding and the following high-pass frames, i.e. for two GOPs.