1. Field of the Invention
The present disclosure relates to encoding of video signals.
A number of video coding systems, such as MPEG-2 (see, e.g., (MPEG-2) ISO/IEC 13818-2, Draft International Standard, 1995(E)), H.263 (see, e.g., DRAFT ITU-T Recommendation H.263, 1995), MPEG-4 (see, e.g., (MPEG-4) ISO/IEC 14496-2, Final Draft of International Standard, December 1998) and H.26L (see, e.g., ITU-T VCEG Draft H.26L - Test Model Long-term number 8 (TML-8)), are based on a video compression procedure that exploits the high degree of spatial and temporal correlation in natural video sequences. In order to do this, video encoders decompose video sequences into Groups Of Pictures (briefly GOPs), assigning every picture within the GOP to one of two different classes: Intra and Inter coded images (or frames). Intra frames can be compressed without extra information, whereas Inter frames need either an Inter or an Intra frame as a reference to start their compression.
By way of example, in FIG. 1 a basic video encoding scheme is shown wherein a hybrid DPCM/DCT encoding loop removes temporal redundancy using inter-frame motion compensation.
In the diagram of FIG. 1, an input data stream IS of digital signals representing a video sequence is subjected to frame reordering FR and motion estimation ME. The residual error images are further processed by discrete cosine transform DCT, which reduces spatial redundancy by de-correlating the pixels within a block and concentrating the energy of the block itself into a few low-order coefficients. Higher compression ratios are achieved through scalar quantization Q: for each macro-block the psycho-visual quantization matrix is multiplied by a scalar parameter, named quantization factor or briefly mquant. Finally, variable length coding VLC produces a bit-stream with good compression efficiency. This is fed via a MUX module to an output buffer OB for the compressed output bitstream OS.
In the block diagram of FIG. 1, RCA designates a module implementing the rate control algorithm, while I-Q and I-DCT designate the inverse quantization and inverse DCT processing functions to which the quantized data are subjected within the framework of the motion estimation/coding processes. Reference FS designates the associated frame memory.
All of the foregoing corresponds to principles and criteria which are well known in the art and do not require to be described in greater detail herein.
Generally speaking, every video sequence can be organized in five hierarchy layers or levels: group of pictures (GOP), picture, slice (a group of macro-blocks), macro-block (briefly MB) and block level. This last one is the elementary unit over which DCT operates and it comprises 8×8 pixels. A macro-block is comprised of four luminance (Y) blocks, covering a 16×16 area in a picture, and two chrominance (U and V) blocks, covering the same area (in the so-called 4:2:0 YUV format). The motion estimation and compensation stages operate on macro-blocks. There are three types of pictures: I-pictures, that are strictly intra-frame encoded; P-pictures, that are temporally predicted from earlier I or P frames; and finally B-pictures, that are bi-directionally interpolated between two I- or P-frames.
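By way of non-limiting illustration, the macro-block and block counts implied by the 4:2:0 layout described above can be worked out as in the following sketch. The function name macroblock_layout is hypothetical, and the rounding of the picture size up to a whole number of 16×16 macro-blocks is an assumption not stated in the foregoing.

```python
def macroblock_layout(width, height):
    """Return (macro-blocks per picture, 8x8 blocks per picture) for 4:2:0 video."""
    # Assumption: the picture is padded up to a whole number of 16x16 macro-blocks.
    mb_cols = (width + 15) // 16
    mb_rows = (height + 15) // 16
    num_mb = mb_cols * mb_rows
    # Each 4:2:0 macro-block holds 4 luminance (Y) blocks plus 1 U and 1 V block.
    blocks_per_mb = 4 + 2
    return num_mb, num_mb * blocks_per_mb


if __name__ == "__main__":
    # A 720x576 picture is 45 macro-blocks wide and 36 macro-blocks high.
    print(macroblock_layout(720, 576))
```

For a 720×576 picture this gives 1620 macro-blocks, i.e., 9720 blocks of 8×8 pixels on which the DCT operates.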
2. Description of the Related Art
Bit-rate control is a central problem in moving picture encoding systems. This aims at ensuring that the number of bits generated may be as close as possible to a target amount that is usually computed at the start of the video sequence encoding process.
Two main different bit-rate control modes are used for encoding any video source, named constant bit-rate (CBR) and variable bit-rate (VBR).
In a CBR mode, output buffer OB of FIG. 1 is required to produce an output stream at a constant rate. Due to the intrinsic structure of any video coding scheme, the final bit-stream is produced at variable bit-rate. Therefore, this has to be transformed into a constant bit-rate by the provision of the output buffer that acts as feedback controller. In this case, the quantization factor is adjusted for each MB by the rate-control mechanism to avoid output buffer overflow/underflow. Hence, a CBR coding mode cannot guarantee a constant video quality for all the scenes.
On the contrary, when the encoder operates in a VBR mode, an almost-constant (or smoothly variable) quantization factor is used for each type of frame and the output rate can vary according to the image content, thus generating an invariable quality regardless of the content of the video scenes. Therefore, a VBR coding mode cannot guarantee a constant output bit-rate for all the scenes.
As already noted, in both VBR and CBR methods, the parameter used to effect rate-control is the quantizer factor, or mquant. Tuning this parameter may be directed in two different directions: picture quality or bit saving. Decreasing mquant leads to an increase in image sharpness, while a higher number of bits is used for encoding, and vice versa.
One of the parameters involved in mquant computation is image “complexity”. A high complexity indicates a “sharp” picture, whereas a low value of this parameter indicates a quite “uniform” picture. Assuming that the human eye has a low perception for details, the rate controller usually adopts lower target-bit values in the areas that have a high complexity.
Generally speaking, rate control is relevant for encoding applications where there are any of:
i) a maximum permitted rate of the transmission channel,
ii) a fixed memory capacity to store the bit-stream in a medium like a CD-ROM, a digital-video-disk (DVD), a hard disk or a digital video cassette, or
iii) editing capabilities that need regularly spaced GOPs.
The encoding systems are supposed to be “single-pass”, that is, the encoding process is done once per picture. “Multi-pass” systems compress several times one or more GOPs and then select the best compression strategy according to the results generated by previous encoding.
Whatever rate control method is applied in the encoding system, either CBR or VBR, the encoding system requires that entering and removing coded data will not cause the video buffer verifier (VBV) to overflow or underflow. The VBV is a virtual buffer (not shown in FIG. 1, where only the output buffer is illustrated) maintained in the encoder scheme: its fullness is updated and monitored to simulate the entering and removing of coded data to and from the physical buffer of the video decoder scheme.
An important parameter is the so-called vbv_delay, that is, the time interval from the arrival of the start code of the current frame to its decode time (in MPEG-2 it is expressed as a number of periods of a 90 kHz clock).
If the vbv_delay value is 0xFFFF, the MPEG-2 video encoding process is supposed to be a “true VBR”. Conversely, if its value is different from 0xFFFF, the video encoding process is either CBR or VBR. In the “true VBR” case, the physical decoder buffer (or encoder VBV) receives data at the maximum rate until it becomes full; the flow of data is then stopped, so that no data is lost. When the decoder extracts the bits related to one picture from the buffer, the flow of data always re-starts at the maximum rate (see Annex C.3.2 of the MPEG-2 recommendation cited in the foregoing). Other compression standards have parameters similar to vbv_delay in MPEG-2 to synchronize the decoding time.
U.S. Pat. No. 5,650,860 discloses in greater detail the management of VBV within the framework of MPEG-2 compression; however, VBV management is not an object of this invention.
The MPEG-2 Test Model 5, briefly TM5 (see, for instance, ISO-IEC/JTC1/SC29/WG11, Test Model, Draft, April 1993 and S. Eckart, C. Fogg, MPEG-2 Encoder/Decoder, Version 1.2, July 1996. Copyright (c) 1996, MPEG Software Simulation Group. http://www.mpeg.org/MSSG/; ftp://ftp.mpeg.org/pub/mpeg/mssg/.), is the reference video encoding system for MPEG-2 compression. Its rate control is of the CBR type, and it is useful for setting a common language for any CBR and VBR rate control methods.
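Purely as an illustrative sketch, the interpretation of vbv_delay described above may be expressed as follows. The helper name vbv_delay_seconds and the choice of returning None for the “true VBR” marker are assumptions made for this example only.

```python
VBV_DELAY_TRUE_VBR = 0xFFFF  # marker value reserved for "true VBR"
VBV_CLOCK_HZ = 90_000        # vbv_delay is counted in periods of a 90 kHz clock


def vbv_delay_seconds(vbv_delay):
    """Convert a vbv_delay field to seconds, or None for the "true VBR" marker."""
    if vbv_delay == VBV_DELAY_TRUE_VBR:
        # No decode-time delay is signalled in the "true VBR" case.
        return None
    return vbv_delay / VBV_CLOCK_HZ


if __name__ == "__main__":
    print(vbv_delay_seconds(45_000))  # half a second of buffering delay
    print(vbv_delay_seconds(0xFFFF))  # "true VBR" marker
```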
Other documents of interest in this area are EP-A-1 005 233 (which corresponds to U.S. Pat. No. 6,215,820) and U.S. Pat. No. 5,801,779, U.S. Pat. No. 5,757,434, U.S. Pat. No. 5,986,712, U.S. Pat. No. 5,691,770, U.S. Pat. No. 5,835,149, U.S. Pat. No. 5,686,964, U.S. Pat. No. 5,949,490, U.S. Pat. No. 5,731,835.
Also of interest is WO-A-99/19664, that describes a CBR method that claims VBR image quality, or vice-versa, a VBR method that achieves bit-rate control accuracy like a CBR method.
Other VBR methods are disclosed, e.g., in N. Mohsenian, R. Rajagopalan, C. A. Gonzales, “Single-pass constant- and variable-bit-rate MPEG-2 video compression”, IBM J. RES. DEVELOP. vol. 43 no. 4, July 1999, Copyright 1999 by IBM Corporation, WO-A-99/38333, U.S. Pat. No. 5,650,860, and U.S. Pat. No. 6,192,075.
TM5 is a reference model proposed by the MPEG-2 Expert Group (see, e.g., ISO-IEC/JTC 1/SC29/WG 11, Test Model Long-term number 8 (TML-8) and the work by S. Eckart et al. already cited in the foregoing). FIG. 2 of the drawings annexed herewith describes the scheme of this CBR control model.
Generally speaking, the parameters of the encoding process should be set to get two bit-rate control objectives:
1) an output bit-rate that is constant and equal to a predefined one (the target bit-rate);
2) a local picture quality as constant as possible throughout the picture sequence.
Unfortunately, these objectives are conflicting in any CBR method: if too many bits are spent for past pictures, the control system must reduce the number of bits for the next pictures and these will have then a lower quality. A key parameter to solve this problem is mquant, which controls the trade-off between image quality and bit-rate.
In general terms, in the diagram of FIG. 2, input data ID corresponding to the bitrate/frame rate are fed to a module DBG that provides for the determination of the bits per GOP yielding an output parameter R. This is fed to a global target module GT which operates at the GOP level GL to generate a target value T as a function of initial parameters IP and a signal BF indicative of buffer fullness.
Target value T is processed at the picture level PL by feeding it, together with a current value for the effective bits EB (derived from the output bit-rate OBR), to a module BFU providing for the update of buffer fullness.
The output signal from module BFU is fed to a local control module LC which controls, at the macroblock level MBL, the generation of the parameter used in adaptive quantization (AQ) to generate mquant.
Concerning TM5, its CBR algorithm is organized in three steps:
i) target bit allocation. In this phase target bits Ti for the current picture i (with i=I, P or B) are decided. Allocation is effected at the start of picture coding with respect to complexity measures derived from past images of the same type;
ii) local control. From the state of “virtual buffer” di a reference quantization parameter, qi[n], is computed for each macro-block n, before MB quantization; and
iii) adaptive quantization. The mquant value is decided to correspond to the effective quantization parameter to be used for current macro-block, knowing qi[n] from the local control step and MB complexity (also named “activity”).
Concerning the target bit allocation or global control phase, it is important to consider image sharpness, expressed in terms of complexity X: a detailed picture requires more bits to achieve a certain quality than a less complex one. The global complexity measure for picture i (with i=I, P or B type) can be computed as:
$$X_i = S_i \cdot Q_i \qquad \text{(C1)}$$
where Si is the effective number of bits that are used to encode an image of type i (with i=I, P or B) and Qi is the average mquant over the whole picture.
TM5 assumes that the complexity for pictures of the same type is constant over the sequence; hence, the bits needed for pictures of the same type are equal. The control system tries to obtain the same quality (sharpness) for each picture.
QI, QP, QB are parameters that can give an objective measure for spatial quality at a global level.
So, a constant ratio between these parameters is imposed:
$$K_{IP} = \frac{Q_P}{Q_I}, \qquad K_{PB} = \frac{Q_B}{Q_P} \qquad \text{(C2)}$$
The TM5 CBR method uses KIP=1 and KPB=1.4.
Considering the current GOP, let NI, NP and NB be the numbers of pictures of each type not yet encoded, R the number of remaining bits still to be allocated, and TI, TP and TB the estimated amount (or target) of bits needed for each picture of the GOP itself. Consequently:
$$R = N_I \cdot T_I + N_P \cdot T_P + N_B \cdot T_B \qquad \text{(C3)}$$
At the beginning of each n-th GOP, R assumes the following value R(n):
$$R(n) = R(n-1) + \frac{(N_I + N_P + N_B) \cdot \mathrm{bitrate}}{\mathrm{frame\_rate}} \qquad \text{(C4)}$$
that is the number of bits allowed for the current GOP number n, taking into account possible (positive or negative) remaining bits from previous (n−1) GOPs.
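As a worked illustration of Equation C4, the following sketch computes the bit budget of a GOP. The function name gop_bit_budget and the numerical values (a 12-picture GOP at 4 Mbit/s and 25 frames per second) are assumptions chosen only for the example.

```python
def gop_bit_budget(prev_remaining, n_i, n_p, n_b, bitrate, frame_rate):
    """Equation C4: bits available for GOP n, including the carry-over R(n-1)."""
    return prev_remaining + (n_i + n_p + n_b) * bitrate / frame_rate


if __name__ == "__main__":
    # Hypothetical GOP: 1 I-picture, 3 P-pictures, 8 B-pictures.
    print(gop_bit_budget(0, 1, 3, 8, 4_000_000, 25))
    # The same GOP when the previous GOP overspent by 20000 bits.
    print(gop_bit_budget(-20_000, 1, 3, 8, 4_000_000, 25))
```

With these values the nominal budget is 1,920,000 bits, reduced to 1,900,000 bits when the previous GOP left a negative remainder of 20,000 bits.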
The target bits for each type of image are computed as:
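$$T_I = \frac{R}{1 + \dfrac{N_P \cdot X_P}{X_I \cdot K_{IP}} + \dfrac{N_B \cdot X_B}{X_I \cdot K_{PB} \cdot K_{IP}}}$$
$$T_P = \frac{R}{N_P + \dfrac{N_B \cdot X_B}{X_P \cdot K_{PB}}}$$
$$T_B = \frac{R}{N_B + \dfrac{N_P \cdot K_{PB} \cdot X_P}{X_B}} \qquad \text{(C5)}$$
where the final Ti's are chosen to be max(Ti, Tmin) and:
$$T_{min} = \frac{\mathrm{bitrate}}{8 \cdot \mathrm{frame\_rate}} \qquad \text{(C6)}$$
At the end of the current picture encoding step, R is modified by subtracting the actual number of bits Si generated for image i:
$$R = R - S_i \qquad \text{(C7)}$$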
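The target bit allocation step of Equations C1, C5 and C6 may be sketched, under stated assumptions, as follows. The function names and the complexity values used in the example (XI=160, XP=60, XB=42) are hypothetical and serve only to make the arithmetic easy to follow; the constants KIP=1 and KPB=1.4 are those indicated in the foregoing for TM5.

```python
K_IP = 1.0  # ratio Q_P / Q_I used by TM5
K_PB = 1.4  # ratio Q_B / Q_P used by TM5


def complexity(bits_used, avg_mquant):
    """Equation C1: global complexity of an already encoded picture."""
    return bits_used * avg_mquant


def target_bits(pic_type, r, n_p, n_b, x_i, x_p, x_b, bitrate, frame_rate):
    """Equations C5 and C6: target bits for the next picture of the given type."""
    if pic_type == "I":
        t = r / (1 + n_p * x_p / (x_i * K_IP) + n_b * x_b / (x_i * K_PB * K_IP))
    elif pic_type == "P":
        t = r / (n_p + n_b * x_b / (x_p * K_PB))
    elif pic_type == "B":
        t = r / (n_b + n_p * K_PB * x_p / x_b)
    else:
        raise ValueError("pic_type must be 'I', 'P' or 'B'")
    t_min = bitrate / (8 * frame_rate)  # Equation C6: lower bound on the target
    return max(t, t_min)


if __name__ == "__main__":
    # Hypothetical state: 1,920,000 bits left, 3 P- and 8 B-pictures remaining.
    for kind in ("I", "P", "B"):
        print(kind, target_bits(kind, 1_920_000, 3, 8, 160, 60, 42, 4_000_000, 25))
```

After each picture is encoded, the remaining budget would be reduced by the bits actually produced, as in Equation C7.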
A local control phase ensures that the bits spent after having encoded a picture will be the same number as decided by the Global Control. For this purpose three “virtual buffers” are used, one for each type of picture.
Before encoding MB n, buffer fullness is calculated according to the relationship:
$$d_i[n] = d_i[0] + B(n-1) - (n-1) \cdot \frac{T_i}{\mathrm{numMB}} \qquad \text{(C8)}$$
with i=I, P or B, while B(n−1) represents the number of bits generated by the encoding process until macro-block n. At the end of each picture, the virtual buffer fullness is updated:
$$d_i = d_i + S_i - T_i \qquad \text{(C9)}$$
where di represents the error between the number of bits Ti decided by the Global Control and effective encoding bits Si.
In Equation C8, bits spent until reaching the actual MB are related to a notional uniform distribution of target bits over all macro-blocks of the picture (numMB).
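The reference quantization parameter qi[n] is calculated considering the virtual buffer fullness, as in the following:
$$q_i[n] = \mathrm{round}\left(\frac{31 \cdot d_i[n]}{r}\right) \qquad \text{(C10)}$$
where the reaction parameter r is defined as:
$$r = \frac{2 \cdot \mathrm{bitrate}}{\mathrm{frame\_rate}} \qquad \text{(C11)}$$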
This is a proportional (local) controller that uses a non-realistic uniform model for the distribution of bits over the picture.
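The proportional local control of Equations C8 to C11 may be illustrated by the following minimal sketch. The function names and the numerical values are hypothetical; moreover, Python's built-in round() rounds exact halves to the nearest even integer, which is an assumption about the rounding rule that the foregoing does not specify.

```python
def reaction_parameter(bitrate, frame_rate):
    """Equation C11: reaction parameter r."""
    return 2 * bitrate / frame_rate


def buffer_fullness(d0, bits_so_far, n, target, num_mb):
    """Equation C8: virtual buffer fullness before encoding macro-block n (1-based)."""
    return d0 + bits_so_far - (n - 1) * (target / num_mb)


def reference_q(d_n, r):
    """Equation C10: reference quantization parameter from buffer fullness."""
    return round(31 * d_n / r)


def end_of_picture_update(d, bits_spent, target):
    """Equation C9: carry the allocation error over to the next picture."""
    return d + bits_spent - target


if __name__ == "__main__":
    r = reaction_parameter(4_000_000, 25)
    # Half-way through a 1620-macro-block picture with a 160000-bit target,
    # 100000 bits have been spent instead of the uniform share of 80000.
    d = buffer_fullness(100_000, 100_000, 811, 160_000, 1620)
    print(r, d, reference_q(d, r))
```

In this example the overspend raises the buffer fullness to about 120,000 bits, so that the reference quantization parameter becomes 12.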
Additional rate-control-accuracy can be achieved by using Proportional-Integrative (PI) controllers, as reported in G. Keesman, I. Shah, R. Klein-Gunnewiek, “Bit-rate control for MPEG encoders”, Signal Processing: Image Communication, vol. 6, pp. 545-560, 1995.
In normal-life video pictures, there are highly sharp areas and quite uniform ones. Therefore, it is preferable to use a variable quantizer that follows changes in the local variance as a possible measure of the image local activity.
The activity of macro-block n, named act[n], will modulate the reference quantization parameter qi[n] to produce the final quantization step mquant[n], as shown in Equation C14 in the following.
Spatial activity act[n] is computed as the minimum of the variances (varu) of the four blocks in a macro-block either in field or frame mode (so that we get eight variance values):
$$\mathrm{act}[n] = 1 + \min_{u=1,\ldots,8} (\mathrm{var}_u) \qquad \text{(C12)}$$
Then a normalized activity Nact[n] is computed as:
$$\mathrm{Nact}[n] = \frac{2 \cdot \mathrm{act}[n] + \mathrm{AvgAct}}{\mathrm{act}[n] + 2 \cdot \mathrm{AvgAct}} \qquad \text{(C13)}$$
where AvgAct is the average value of act[n] over the last encoded picture.
It would be natural to think of encoding more accurately the most detailed zones (high local activity) by giving them more bits, while neglecting the uniform areas that carry less information. However, quantization noise is much more visible in uniform (low activity) areas; hence these areas must be quantized less coarsely. The expression used to produce mquant is:
$$\mathrm{mquant}[n] = q_i[n] \cdot \left( \frac{2 \cdot \mathrm{act}[n] + \mathrm{AvgAct}}{\mathrm{act}[n] + 2 \cdot \mathrm{AvgAct}} \right) \qquad \text{(C14)}$$
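The adaptive quantization step of Equations C12 to C14 may be sketched as follows. The function names and the sample variances are hypothetical. It may be noted that, for positive activities, the normalized activity of Equation C13 always lies between 0.5 and 2, so that mquant is at most halved in uniform areas and at most doubled in highly detailed ones.

```python
def spatial_activity(block_variances):
    """Equation C12: 1 + the minimum of the eight block variances."""
    return 1 + min(block_variances)


def normalized_activity(act, avg_act):
    """Equation C13: activity normalized against the previous picture's average."""
    return (2 * act + avg_act) / (act + 2 * avg_act)


def mquant_for_mb(q_ref, act, avg_act):
    """Equation C14: modulate the reference quantizer by the normalized activity."""
    return q_ref * normalized_activity(act, avg_act)


if __name__ == "__main__":
    # Hypothetical variances: four frame-mode and four field-mode blocks.
    act = spatial_activity([9, 4, 7, 30, 12, 5, 8, 6])
    print(act, mquant_for_mb(12, act, 100))
```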
After the three steps of TM5 CBR control, mquant is weighted by visibility matrices (that can change at picture level) before actually quantizing each coefficient (by stage Q of FIG. 1). A visibility matrix sets the relative coarseness of quantization allowing a larger quantization error at higher frequencies where human eye is less sensitive. The TM5 matrix to quantize DCT coefficients of Intra Macro-Blocks is reported in the following Table 1. The DC coefficient is uniformly quantized with weight 8,4,2,1, depending on the desired accuracy.
TABLE 1
 8 16 19 22 26 27 29 34
16 16 22 24 27 29 34 37
19 22 26 27 29 34 34 38
22 22 26 27 29 34 37 40
22 26 27 29 32 35 40 48
26 27 29 32 35 40 48 58
26 27 29 34 38 46 56 69
27 29 35 38 46 56 69 83
The AC coefficients AC(i, j) are weighted according to the following relationship:
$$ac(i, j) = \frac{32 \cdot AC(i, j) + w_I(i, j)/2}{w_I(i, j)} \qquad \text{(C15)}$$
where wI(i, j) is the component (i, j) of the matrix reported in Table 1. The quantized coefficient QAC(i, j) depends on the mquant (quantizer scale) parameter:
$$QAC(i, j) = \frac{ac(i, j) + (3 \cdot mquant + 2)/4}{2 \cdot mquant} \qquad \text{(C16)}$$
The computed value QAC(i, j) is then in any case clipped in the range [−2048; 2047].
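The weighting, quantization and clipping of an intra AC coefficient may be illustrated by the following sketch, which uses the weights of Table 1. Several points are assumptions made only for this example: the function name quantize_intra_ac, the use of integer (floor) division throughout, an integer mquant, and the handling of negative coefficients by quantizing the magnitude and restoring the sign afterwards. Position (0, 0) holds the DC coefficient, which is quantized separately.

```python
# Intra weighting matrix w_I(i, j) as reported in Table 1.
W_INTRA = [
    [8, 16, 19, 22, 26, 27, 29, 34],
    [16, 16, 22, 24, 27, 29, 34, 37],
    [19, 22, 26, 27, 29, 34, 34, 38],
    [22, 22, 26, 27, 29, 34, 37, 40],
    [22, 26, 27, 29, 32, 35, 40, 48],
    [26, 27, 29, 32, 35, 40, 48, 58],
    [26, 27, 29, 34, 38, 46, 56, 69],
    [27, 29, 35, 38, 46, 56, 69, 83],
]


def quantize_intra_ac(coef, i, j, mquant):
    """Quantize one intra AC coefficient at position (i, j), not (0, 0)."""
    w = W_INTRA[i][j]
    magnitude = abs(coef)  # assumption: sign-symmetric quantization
    ac = (32 * magnitude + w // 2) // w                  # weighting step
    qac = (ac + (3 * mquant + 2) // 4) // (2 * mquant)   # scaling by mquant
    signed = qac if coef >= 0 else -qac
    return max(-2048, min(2047, signed))                 # clip to [-2048; 2047]


if __name__ == "__main__":
    print(quantize_intra_ac(100, 0, 1, 4))
    print(quantize_intra_ac(-100, 0, 1, 4))
```

A coarser mquant maps the same coefficient to a smaller quantized level, which is how the rate controller trades picture quality for bits.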
The TM5 CBR control method is implemented with a simple proportional controller based on virtual buffer state as given by Equations C8, C9 and C10.
A problem with this algorithm is the ideal uniform distribution of target bits over macro-blocks. If a picture is considered divided in two parts, the upper part being uniform (lower activity) and the lower part being full of details, TM5 would try to give the upper area half of bits available, even if there are few DCT coefficients. As a consequence, bits left for the bottom part of the picture would not be enough, so TM5 would use a coarse quantization with a quality deterioration of that image part, to maintain constant the output bit-rate.
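The effect of the uniform distribution assumed by Equation C8 may be illustrated with a deliberately simplified sketch. The function name uniform_model_drift is hypothetical, and the per-macro-block bit counts are held fixed (that is, the feedback on the quantizer is ignored), which is an assumption made only to expose the bias of the model.

```python
def uniform_model_drift(bits_per_mb, target):
    """Deviation B(n-1) - (n-1)*T/numMB seen by the controller before each MB."""
    num_mb = len(bits_per_mb)
    spent = 0
    drift = []
    for n, bits in enumerate(bits_per_mb, start=1):
        # Bits spent so far minus the uniform share assumed by Equation C8.
        drift.append(spent - (n - 1) * target / num_mb)
        spent += bits
    return drift


if __name__ == "__main__":
    # Toy picture of four macro-blocks: uniform upper half, detailed lower half.
    print(uniform_model_drift([10, 10, 90, 90], 200))
```

In this toy picture the controller sees a growing apparent surplus across the uniform upper half (negative drift), which would lead it to quantize that area more finely than needed, while the detailed lower half, which actually requires most of the bits, is reached with a budget that the uniform model has misjudged.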
Another drawback of the TM5 algorithm is a poor behavior in case of scene changes. In fact, during the Target Bit Allocation step, the global complexity is obtained from previously encoded pictures of the same type. This could be a problem for I-frames because the last I-picture may be more than ten images earlier (our GOP is composed of a number of frames at least greater than 12). To improve performance in these situations it is necessary to use a pre-analysis that gives more recent information on the global complexity.
CBR control methods based on pre-analysis can produce an output bit-rate very close to the desired one. They use information from a pre-analysis of the current picture, where such pre-analysis is a complete encoding of the image with a constant quantizer, as reported in the work by Keesman et al. already cited in the foregoing. CBR controls with pre-analysis employ a P-I controller in the local control phase.
In the case of slice pre-analysis, some information deriving from past pre-analysis or encoding is necessary to compensate the limited amount of knowledge achieved by pre-encoding only the current slice. Pre-analysis is done over only one slice to reduce:
i) the memory capacity to store pre-encoding information that will be used during the real encoding process (a smaller capacity means less silicon area and thus a lower cost);
ii) the processing delay, because only one slice is pre-encoded (pre-analyzed) at a time.
The solution disclosed in WO-A-99/49664 tries to solve the key problems faced by any CBR controller, namely:
i) non-uniform distortion within a sequence (and even within a picture) gives rise to an output with non-uniform visual quality;
ii) bit allocation methods are inefficient, especially for sequence containing scenes of varying complexity; and
iii) poor performance at scene changes.
The method of WO-A-99/49664 was designed to:
i) keep the reference quantization parameter, qi[n] (I/P/B frame individually) uniform within a sub-sequence of pictures (named “segment”) with similar complexity in order to achieve uniform output quality (that is, qi[n]=Qpi=constant);
ii) dynamically change the bit-rate of the encoder, according to the complexity of the picture being coded, with information fed back via a closed control loop;
iii) improve encoder performance at scene changes;
iv) efficiently allocate bits according to picture type and complexity; and
v) strike a balance between adaptation rate of the encoder to changes in picture complexity and gradual degradation capability.
The VBR method reported in WO-A-99/38333 is based either on MSE or PSNR or SNR computation to update the target bit-rate. The method uses the same Target Bit Allocation phase of TM5 CBR.
The CBR/VBR methods reported in N. Mohsenian, R. Rajagopalan, C. A. Gonzales, “Single-pass constant- and variable-bit-rate MPEG-2 video compression”, IBM J. RES. DEVELOP. vol. 43 no. 4, July 1999, Copyright 1999 by IBM Corporation and U.S. Pat. No. 5,650,860 use a relationship between bit-rate and Qi. Such a curve, named “rate-distortion” in W. Ding, B. Liu, “Rate Control of MPEG video coding and recording by Rate-Quantization modeling”, IEEE Transactions on Circuits and Systems for Video Technology, vol. 6, no. 1, February 1996, can be experimentally modeled in several ways, either linear or non-linear, depending on the approximations applied and the related computational complexity required. Also, the methods of WO-A-99/38333 and WO-A-99/49664 employ rate-distortion curves and therefore are exposed to the same problems.
In fact, to be very precise, U.S. Pat. No. 5,650,860 uses a slightly misleading terminology, by designating the “rate-distortion” as “bit budget” within a context that has nothing to do with the proper meaning of this latter term.
The VBR method reported in U.S. Pat. No. 5,650,860 is a non-real time multi-pass encoder: the same GOP is encoded several times with different strategies to gather information about the way bits can be used in the final effective encoding; basically this represents a very sophisticated and expensive pre-analysis, to be used only on professional high-end equipment. Finally, U.S. Pat. No. 5,650,860 discloses CBR and VBR methods that are heavily integrated with VBV management and consequently depend thereon.
Both the methods of the work of Mohsenian et al. and WO-A-99/38333 are very complex: the first due to its PSNR computation, the second due to the minimization of a Lagrangian cost function (through Lagrange multipliers method).
The VBR method reported in U.S. Pat. No. 6,192,075 uses a relationship for “excessive bit usage”. This relationship is not exploited in terms of zones, slopes and related Look-Up-Tables to update QPi. Conversely such a relationship is just used to set the target bits for the current GOP and to clip the effective bit-rate to the target bit-rate. Furthermore, the rate control action uses virtual buffers and proportional-controller similar to TM5 (see Equations C8 and C9 and C10), thus being exposed to all the limitations of TM5.
A number of problems related to CBR and VBR methods have been already indicated in the foregoing.
Even though it provides advantages over a CBR method, the solution of WO-A-99/49664 dramatically limits both positive and negative variations of the effective bit-rate, in order to achieve an effective bit-rate very close to the target value. This behavior also negatively affects the QPi variance and average values, which are both higher than those for TM5. Higher QPi variances and average values indicate a worse image quality from an objective point of view.
Furthermore, the experimental “rate-distortion” curves that model the bit-rate vs QPi relationship in the solutions of WO-A-99/38333, U.S. Pat. No. 5,650,860, WO-A-99/49664 and the work of Mohsenian et al. are strongly dependent on image complexity (or activity). The results obtained by using such curves are sub-optimal, and the curves require accurate modeling that is very expensive in terms of computation.
The best option would certainly be to re-compute these curves several times at any frame, as done in the work of Ding et al., but this is prohibitive in terms of (extremely high) CPU computation, (longer) processing delay, (larger) memory capacity, for products targeting consumer application markets.
Last but not least, both the solutions of WO-A-99/38333 and WO-A-99/49664 are derived from TM5 CBR.
In particular they even use the same TM5 target bit allocation phase. Therefore they inherently suffer—although partially—from the same limitations of TM5.