The invention is directed to a novel system and method for the optimal quantization of transform coefficients with minimal bit rate overhead by means of a novel method of reduced temporal resolution update.
In essence, a video transmission is a sequence of pictures transmitted at relatively fixed time intervals for reproduction at a receiving site. For digital transmissions, such sequences of pictures are transmitted in the form of a digital bit stream that is stored at the receiving site and reproduced in some form. In practice, such digitized video transmissions have accompanying audio, and together they add up to a large amount of data. The video and audio data can occupy a vast amount of storage space and of transmission bandwidth.
In order to save transmission bandwidth and storage space, video and audio data are compressed at the transmission end and decompressed at the receiving end. Video compression typically involves taking the differences between adjacent pictures in a stream of pictures or frames and then coding most frames as differences relative to neighboring pictures. This may be done in several ways through the process of motion estimation and compensation at the encoder, and motion compensation at the decoder. An encoder at the beginning of the transmission process is required to determine the way in which a picture is compressed, solely at its own discretion. This is frequently done through code sequences represented by a long decision tree. In contrast, the decoder at the receiving end is configured merely to perform decoding operations according to the decisions made by the encoder; that is, it “does what it is told to do.”
To serve as a basis for the prediction of other frames and to provide functionalities such as random access to the compressed bitstream, the encoder will occasionally encode input video frames, in addition to the first frame, independently of other frames. Such frames are termed “Intra” coded frames. In contrast, frames that are encoded as the difference between the input and the motion compensated prediction are termed “Inter” coded frames. The encoder sometimes uses information from “future” frames in a sequence of frames to code current frames. Thus, the coding order, which is the order in which compressed frames are transmitted, is not the same as the display order, which is the order in which the frames are presented to a viewer. Frames encoded with reference to both future and past frames are termed “B” (bi-directional) frames.
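The relationship between display order and coding order can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that every B frame references the nearest preceding and the nearest following I or P frame, so that each such anchor frame must be transmitted before the B frames that precede it in display order. The function name `coding_order` and the frame pattern used in the note below are hypothetical and are not taken from any standard.

```python
def coding_order(display_types):
    """Return display indices rearranged into one plausible coding order.

    display_types is a sequence of "I", "P" or "B" in display order.
    Assumption: each B frame depends on the next I/P frame in display
    order, so that anchor frame is emitted first.
    """
    order = []
    pending_b = []  # B frames waiting for their future anchor frame
    for index, frame_type in enumerate(display_types):
        if frame_type == "B":
            pending_b.append(index)
        else:
            # An I or P frame is sent first, then the B frames that
            # precede it in display order.
            order.append(index)
            order.extend(pending_b)
            pending_b = []
    # Trailing B frames without a future anchor are emitted last.
    order.extend(pending_b)
    return order
```

For the display pattern I B B P B B P, this sketch yields the index sequence 0, 3, 1, 2, 6, 4, 5, in which each P frame is sent ahead of the two B frames displayed before it.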
Video coding standards such as MPEG-2, MPEG-4 and H.264/AVC are specifically engineered as hybrid coding schemes that combine intra-frame and inter-frame (motion) compression of video sequences. FIGS. 1 and 2 illustrate a group of pictures in display order and in coding order, respectively. In FIGS. 1 and 2, “I” represents intra coded frames, “B” represents bidirectionally predicted coded pictures, and “P” represents forward predicted coded pictures. FIG. 3 illustrates the use of forward prediction reference pictures and backward prediction reference pictures to generate a current picture. Specifically, FIG. 3 illustrates motion compensation, that is, how a current picture is predicted from previous pictures (and future pictures). If motion occurs in a sequence of frames, prediction is carried out by coding differences relative to areas that are shifted with respect to the area being coded. This is known as “motion compensation,” and the process of determining the motion vectors is called “motion estimation.” The resulting motion vectors, describing the direction and amount of motion of a macroblock, are stored and transmitted to the decoder as part of the compressed bitstream. In operation, the decoder uses the origin and length of the motion vector to reconstruct the frame.
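Motion estimation and compensation as described above can be sketched, under simplifying assumptions, as an exhaustive block-matching search. The sketch below is in Python, uses the sum of absolute differences (SAD) as the matching cost, and searches integer displacements only; the cost measure, the search strategy, and all function and parameter names are illustrative assumptions, since the standards leave the encoder's search method open.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between the n x n block of `cur` at
    (bx, by) and the block of `ref` displaced by (dx, dy)."""
    total = 0
    for y in range(n):
        for x in range(n):
            total += abs(cur[by + y][bx + x] - ref[by + y + dy][bx + x + dx])
    return total


def estimate_motion(cur, ref, bx, by, n, search):
    """Full search over displacements in [-search, search].

    Returns ((dx, dy), cost) for the best match found, or
    ((0, 0), None) if no candidate lies inside the reference frame.
    """
    height, width = len(ref), len(ref[0])
    best_mv, best_cost = (0, 0), None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            # Skip candidates that fall outside the reference frame.
            if by + dy < 0 or by + dy + n > height:
                continue
            if bx + dx < 0 or bx + dx + n > width:
                continue
            cost = sad(cur, ref, bx, by, dx, dy, n)
            if best_cost is None or cost < best_cost:
                best_mv, best_cost = (dx, dy), cost
    return best_mv, best_cost


def residual_block(cur, ref, bx, by, mv, n):
    """Prediction error left after motion compensation with vector mv."""
    dx, dy = mv
    return [[cur[by + y][bx + x] - ref[by + y + dy][bx + x + dx]
             for x in range(n)] for y in range(n)]
```

When the current frame is simply a shifted copy of the reference frame, the search recovers the shift as the motion vector and the residual block is all zeros, which is the case in which Inter coding is most efficient.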
In coding a single frame, the basic building block is the macroblock. Typically, the macroblock is a 16×16 array of luminance (gray scale) samples together with one 8×8 block of samples for each of the two chrominance (color) components. Next in the hierarchy is what is known as the “slice,” a group of macroblocks in a given scan order. The slice starts at a specific address or position in the picture, and the address (in H.264/AVC the scan pattern is signaled) is specified in a slice header.
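The macroblock structure described above implies some simple arithmetic, sketched here in Python. The sketch assumes picture dimensions that are exact multiples of 16 and the typical layout of one 16×16 luminance array plus one 8×8 block per chrominance component; the function name `macroblock_layout` and the dictionary keys are hypothetical.

```python
def macroblock_layout(width, height, mb_size=16):
    """Count macroblocks and samples per macroblock for a picture.

    Assumes width and height are multiples of mb_size and that each
    chrominance component is subsampled by two in both directions.
    """
    if width % mb_size or height % mb_size:
        raise ValueError("dimensions must be multiples of the macroblock size")
    mbs_wide = width // mb_size
    mbs_high = height // mb_size
    luma_samples = mb_size * mb_size             # 16 x 16 = 256
    chroma_samples = 2 * (mb_size // 2) ** 2     # two 8 x 8 blocks = 128
    return {
        "mbs_wide": mbs_wide,
        "mbs_high": mbs_high,
        "mbs_total": mbs_wide * mbs_high,
        "samples_per_mb": luma_samples + chroma_samples,
    }
```

For a 352×288 picture, this gives 22 × 18 = 396 macroblocks of 384 samples each.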
Intercoding and intracoding are both built on the Discrete Cosine Transform (hereinafter the “DCT”) or a DCT-like integer transform, which represents the prediction error after motion compensation (in the case of Inter coding) or the input signal itself (in the case of Intra coding; in H.264/AVC, spatial prediction is applied before the transform) as a linear combination of spatial frequencies. Each spatial frequency pattern has a corresponding transform coefficient, that is, the amplitude needed to represent the contribution of the specific spatial frequency to the block of data being represented.
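The representation of a block as a linear combination of spatial frequencies can be illustrated with a floating-point, orthonormal two-dimensional DCT written in plain Python. This is a sketch for illustration only: the standards specify particular block sizes and, in the case of H.264/AVC, an integer approximation, neither of which is reproduced here, and the function names are hypothetical.

```python
import math


def _scale(k, n):
    """Orthonormal DCT-II scaling factor for frequency index k."""
    return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)


def dct_1d(x):
    """Forward DCT-II: one coefficient per spatial frequency k."""
    n = len(x)
    return [_scale(k, n) * sum(x[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                               for i in range(n))
            for k in range(n)]


def idct_1d(c):
    """Inverse transform: rebuild samples as a weighted sum of cosines."""
    n = len(c)
    return [sum(_scale(k, n) * c[k] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                for k in range(n))
            for i in range(n)]


def _transpose(m):
    return [list(row) for row in zip(*m)]


def dct_2d(block):
    """Separable 2-D DCT: transform the rows, then the columns."""
    rows = [dct_1d(r) for r in block]
    return _transpose([dct_1d(col) for col in _transpose(rows)])


def idct_2d(coeffs):
    """Separable 2-D inverse DCT: invert the columns, then the rows."""
    cols = _transpose([idct_1d(col) for col in _transpose(coeffs)])
    return [idct_1d(r) for r in cols]
```

A flat block, which contains no spatial variation, transforms to a single non-zero coefficient at the lowest frequency position, while detailed texture spreads energy into the higher-frequency coefficients.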
DCT coefficients are then quantized by a scalar quantizer via division by a non-zero “quantization step size,” after which the quotient is either truncated or rounded to the nearest integer. The resulting integers are termed quantization levels. At the decoder, the inverse operation (“de-quantization”) is performed by multiplying the quantization level by the same quantization step size used by the encoder. Both the quantization step size and the quantization levels for each DCT coefficient are signaled in the compressed bitstream. The reconstruction values, as determined by the above process, will always be a multiple of the quantization step size of the corresponding coefficient used by the encoder.
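The quantization and de-quantization steps just described can be sketched in Python as follows. The sketch assumes a positive step size and, where rounding is selected, rounds ties away from zero; the tie-breaking rule and the function names `quantize` and `dequantize` are assumptions made for illustration and are not prescribed by the text above.

```python
import math


def quantize(coeff, step, rounding=True):
    """Map a transform coefficient to an integer quantization level.

    The coefficient is divided by the step size, and the quotient is
    either rounded to the nearest integer or truncated toward zero.
    """
    if step <= 0:
        raise ValueError("this sketch assumes a positive step size")
    quotient = coeff / step
    if rounding:
        # Round to nearest, with ties going away from zero (assumption).
        return int(math.copysign(math.floor(abs(quotient) + 0.5), quotient))
    return int(quotient)  # int() truncates toward zero


def dequantize(level, step):
    """Reconstruct a coefficient: always a multiple of the step size."""
    return level * step
```

For example, a coefficient of 37 with a step size of 8 gives a quotient of 4.625. Rounding yields level 5 and a reconstruction of 40, while truncation yields level 4 and a reconstruction of 32; in both cases the reconstruction value is a multiple of the step size.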
It is to be noted that the larger the quantization step size, the lower the precision of the quantized DCT coefficient, and the smaller the quantization level. Physiologically, large quantization step sizes for high spatial frequencies allow the encoder to discard high frequency activity that is of lower perceptibility to the human eye. This saves bandwidth and storage space by discarding data that the human eye is unlikely to detect.
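The effect of using larger step sizes at higher spatial frequencies can be seen in a short Python sketch. The step sizes here grow linearly with the frequency index, which is a purely hypothetical choice made for illustration and does not correspond to the quantization matrix of any standard; truncation toward zero is likewise assumed.

```python
def step_matrix(n, base, slope):
    """Hypothetical step sizes that grow with horizontal frequency u
    and vertical frequency v (not taken from any standard)."""
    return [[base + slope * (u + v) for u in range(n)] for v in range(n)]


def quantize_block(coeffs, steps):
    """Quantize each coefficient with its own step size, truncating
    the quotient toward zero."""
    return [[int(c / s) for c, s in zip(coeff_row, step_row)]
            for coeff_row, step_row in zip(coeffs, steps)]
```

With a 4×4 block whose coefficients all equal 20, a base step of 4 and a slope of 8, the lowest-frequency coefficient is quantized to level 5, while every coefficient with u + v of 3 or more is quantized to level 0 and is therefore discarded.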
FIG. 4 shows the basic encoding structure of existing coding standards such as H.263[1], MPEG-2[2], MPEG-4[3] and H.264/AVC[4]. The De-blocking Filter exists only in H.263 and H.264/AVC. In H.264/AVC, the Intra-frame prediction is performed in the pixel domain. The shaded area is equivalent to the decoder.
In low bit rate video applications, frame dropping (FD) is commonly used in video encoding as a compromise between temporally and spatially perceived quality; that is, it increases the bit budget for individual coded frames and therefore produces better quality for each coded frame, at the expense of a lowered frame rate after compression and the resulting unsmooth, “jerky” motion. For some applications, such artifacts are not allowed. For example, MPEG-2 does not allow frame dropping.
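The trade-off made by frame dropping can be expressed as simple arithmetic, sketched here in Python. The sketch assumes a constant channel bit rate shared evenly among the coded frames and ignores header overhead and rate control variations; the function name and its parameters are hypothetical.

```python
def bits_per_coded_frame(bitrate_bps, source_fps, keep_every=1):
    """Average bit budget per coded frame when only every
    `keep_every`-th source frame is coded.

    Returns (bits_per_frame, coded_fps).
    """
    if keep_every < 1:
        raise ValueError("keep_every must be at least 1")
    coded_fps = source_fps / keep_every
    return bitrate_bps / coded_fps, coded_fps
```

At a hypothetical 60,000 bits per second and 30 frames per second, each frame receives 2,000 bits on average. Coding only every second frame doubles the budget to 4,000 bits per coded frame, but the frame rate falls to 15 frames per second.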
Annex Q of the H.263 standard, Reduced-Resolution Update (RRU), describes a method that reduces the bitrate required for temporal and spatial update by reducing the spatial resolution of the prediction errors. The DCT, quantization and entropy coding are then all performed on the resolution-reduced prediction errors, thereby, to a large extent, removing the need to drop frames while meeting the bitrate requirement. In exchange, some high frequency texture details are lost. FIG. 5 shows the block diagram of block decoding in the Reduced-Resolution Update mode of Annex Q in H.263. The Pseudo Vector and the Result of the Inverse Transform are the process steps that require scaling.
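The general idea of coding prediction errors at reduced spatial resolution, and the loss of high frequency detail that results, can be illustrated with the following Python sketch. It does not reproduce the filters, block sizes or motion vector handling of Annex Q; the 2×2 averaging and the sample replication used here are stand-in assumptions chosen only because they are simple, and the function names are hypothetical.

```python
def downsample_2x(block):
    """Halve the resolution of a square block with even side length
    by averaging each 2 x 2 group of samples (illustrative filter)."""
    n = len(block)
    return [[(block[2 * y][2 * x] + block[2 * y][2 * x + 1] +
              block[2 * y + 1][2 * x] + block[2 * y + 1][2 * x + 1]) / 4.0
             for x in range(n // 2)]
            for y in range(n // 2)]


def upsample_2x(block):
    """Restore the original size by replicating each sample 2 x 2."""
    out = []
    for row in block:
        wide = [value for value in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out
```

A flat prediction error block survives the round trip unchanged while requiring only a quarter of the samples to be transformed and coded. A fine checkerboard pattern, by contrast, averages to zero and is lost entirely, which is the kind of high frequency texture loss noted above.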
Unfortunately, except for H.263, none of the existing standards, including MPEG-2, MPEG-4 and H.264/AVC, supports RRU. Even for H.263 products, RRU can only be employed when both the encoder and the decoder, which are often provided by different manufacturers, support Annex Q. This is not desirable, because many existing decoding products do not support RRU.
Therefore, it is highly desirable to design a coding system that can benefit from differentiated quantization of transform coefficients and from control over the preservation of high frequency details, without the need for special processing by decoders at the receiving destinations. As will be seen, the invention accomplishes this while obviating the need for changing the standard syntax, and further overcomes the shortcomings of the prior art in an elegant manner.