In the Scalable Video Coding (SVC) standard currently being defined by the Joint Video Team (JVT) of MPEG and ITU, an encoding solution is given for progressive video material. Spatial scalability is considered only for progressive material.
By default, SVC employs hierarchical B (bi-directionally predicted) frame structures during encoding, wherein a predefined temporal pattern is assigned to the frames according to their display order, and this always leads to some default performance of the decoder.
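The assignment of a temporal pattern by display order can be illustrated with a minimal sketch. It assumes a dyadic GOP (size a power of two) in which the key picture sits at temporal level 0 and each further halving of the frame distance adds one level; the function name `temporal_level` and this particular numbering are illustrative assumptions, not taken from the JSVM software.

```python
def temporal_level(display_index: int, gop_size: int) -> int:
    """Temporal level of a frame in a dyadic hierarchical-B GOP.

    Assumption: level 0 holds the key pictures (display_index is a multiple
    of gop_size); the highest level holds the odd-numbered frames.
    """
    if gop_size < 1 or gop_size & (gop_size - 1):
        raise ValueError("gop_size must be a power of two")
    max_level = gop_size.bit_length() - 1      # log2(gop_size)
    pos = display_index % gop_size             # position inside the GOP
    if pos == 0:
        return 0                               # key picture
    trailing_zeros = (pos & -pos).bit_length() - 1
    return max_level - trailing_zeros


# Temporal levels for display positions 0..8 with a GOP size of 8.
levels = [temporal_level(i, 8) for i in range(9)]
print(levels)  # [0, 3, 2, 3, 1, 3, 2, 3, 0]
```

With a GOP size of 8, positions 0 and 8 are key pictures, position 4 is the single level-1 frame, positions 2 and 6 are level 2, and the odd positions form the highest level.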
Currently, JSVM encoders support open-loop encoding and closed-loop encoding.
For open-loop encoding, the B pictures are encoded using the mode and motion information generated by motion estimation and mode decision based on the original references. That is, reference frames for prediction at the encoder are frames that were previously encoded.
Closed-loop encoding uses reconstructed frames as reference frames, which contain a quantization error.
Normally, closed-loop encoding is better at constraining the error and reducing the possible propagation effect caused by quantization and inaccurate motion estimation. Open-loop encoding is more flexible for handling the FGS (Fine Grain Scalability) layer and can more easily support MCTF (Motion-Compensated Temporal Filtering).
Reference frame lists for prediction of P and B pictures that are constructed at the encoder always have the same structure, depending on the GOP size. P and B pictures use one list (list_0) for forward prediction, i.e. prediction from frames with a lower POC (picture_order_count) number. B pictures also use a second list (list_1) for backward prediction, i.e. prediction from frames with a higher POC number. The reference lists are truncated after a specified number of reference frames. The lowest temporal level, or low-pass level, contains the Key pictures. For different spatial layers, such as base layer (BL) and enhancement layer (EL), the reference list construction method is the same.
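The list construction described above can be sketched as follows. This is a simplified illustration, not the JSVM implementation: the function name `build_reference_lists`, its parameters, and the ordering of each list by increasing POC distance from the current picture are assumptions made for the example.

```python
from typing import Iterable, List, Tuple


def build_reference_lists(
    current_poc: int,
    available_pocs: Iterable[int],
    max_refs: int,
    is_b_picture: bool,
) -> Tuple[List[int], List[int]]:
    """Return (list_0, list_1) as lists of POC numbers.

    list_0 holds frames with lower POC (forward prediction), list_1 holds
    frames with higher POC (backward prediction, B pictures only).
    Assumption: nearest frames come first; both lists are truncated
    after max_refs entries.
    """
    pocs = list(available_pocs)
    lower = sorted((p for p in pocs if p < current_poc), reverse=True)
    higher = sorted(p for p in pocs if p > current_poc)
    list_0 = lower[:max_refs]
    list_1 = higher[:max_refs] if is_b_picture else []
    return list_0, list_1


# B picture at POC 4 with references 0, 2, 6, 8 available, one reference per list.
print(build_reference_lists(4, [0, 8, 2, 6], max_refs=1, is_b_picture=True))
# ([2], [6])
```

A P picture calls the same function with `is_b_picture=False` and receives an empty list_1.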
There are some basic rules that known encoders follow: From the previous GOP (group-of-pictures), only the Key picture is used for the encoding process of the next GOP, while other pictures of the previous GOP will not be used and are removed from the short term reference list by MMCO (Memory Management Control Operation) commands.
Frames at the same temporal level do not reference each other, except for Key frames.
For motion estimation (ME) in closed-loop encoding, a frame will only refer to frames with a higher temporal level, because the encoder performs motion estimation for the higher temporal level frames first. In open-loop encoding, however, ME for the lower temporal levels is done first.
Reference lists are generated by RPLR (Reference Picture List Reordering) commands. MMCO commands are used to remove the B frames (or rather: non-key pictures) and unused Key frames of previous GOPs from the short-term list. These commands are invoked in the slice header of the Key pictures.
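The effect of these MMCO commands on the short-term list can be modeled with a small sketch. It does not emit actual MMCO syntax; the `Picture` type and `prune_short_term_list` function are hypothetical, and the sketch assumes that the only Key picture still in use is the most recent one (highest POC).

```python
from typing import List, NamedTuple


class Picture(NamedTuple):
    poc: int        # picture order count
    is_key: bool    # True for Key pictures (lowest temporal level)


def prune_short_term_list(short_term: List[Picture]) -> List[Picture]:
    """Model the result of the MMCO removals issued at a Key picture.

    Non-key pictures and older Key pictures are dropped; only the most
    recent Key picture (assumed to be the one still needed) is kept.
    """
    keys = [p for p in short_term if p.is_key]
    if not keys:
        return []
    return [max(keys, key=lambda p: p.poc)]


# Short-term list after a GOP of size 8: Key pictures at POC 0 and 8.
previous = [Picture(poc, poc % 8 == 0) for poc in range(9)]
print(prune_short_term_list(previous))  # [Picture(poc=8, is_key=True)]
```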
To improve coding efficiency, quantization parameters (QP) can be adapted by scaling factors (SF). Frames are given different QPs at the encoder, which depend on two values according to the formula
$$QP_i = QP_{i-1} - 6 \cdot \log_2(SF)$$
That means that the QP of each temporal level i is adjusted by the scaling factor SF, and the scaling factor is used to balance the residual energies of frames of different temporal levels.
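As a worked illustration of the QP adjustment formula, the following minimal sketch evaluates it for a few scaling factors. The function name `next_level_qp` is hypothetical, and the sketch leaves out any rounding or clipping to a valid QP range that a real encoder would apply.

```python
import math


def next_level_qp(prev_qp: float, scaling_factor: float) -> float:
    """QP of temporal level i from the QP of level i-1 and the scaling factor SF.

    Implements QP_i = QP_{i-1} - 6 * log2(SF). No rounding or clipping is
    applied here (an assumption made for simplicity).
    """
    if scaling_factor <= 0:
        raise ValueError("scaling_factor must be positive")
    return prev_qp - 6.0 * math.log2(scaling_factor)


for sf in (0.5, 1.0, 2.0):
    print(sf, next_level_qp(30.0, sf))
```

A scaling factor of 1 leaves the QP unchanged, doubling the scaling factor lowers the QP by 6, and halving it raises the QP by 6.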
For open-loop encoding, the scaling factor is calculated as the sum of the energy proportions of the blocks. Each block's energy proportion is calculated depending on how it is predicted from other blocks. If it is bi-predicted, the energy proportion is calculated using the filter [−½, 1, −½]. To normalize the energy improvement of this block, a factor of $(-\tfrac{1}{2})^2 + 1^2 + (-\tfrac{1}{2})^2 - 1$ is introduced. If the block is predicted in one direction only, the motion compensation (MC) uses the filter [1, −1]. To normalize the energy improvement for this block, a factor of $(\tfrac{1}{2})^2 + (\tfrac{1}{2})^2 - 1$ is introduced.
In temporal level i, all blocks have these factors, and their sum is used to calculate the scaling factor of level i−1.
$$\mathrm{ScalingFactor}_{i} = \mathrm{ScalingFactor}_{i-1} \cdot \sum_{\text{blocks in temporal level } i-1} \mathrm{factor} + 1$$
For closed-loop encoding, the idea is the same, but when a temporal level i is encoded, it is unknown how many blocks in temporal level i−1 use bi-directional prediction. Therefore, a ratio of bi-directional to uni-directional prediction is estimated, e.g. 60/40.
However, to support interlace coding, which means that all or some frames are coded as a pair of interlaced fields, namely a top field and a bottom field, a different solution is needed.