1. Field of the Invention
The present invention relates to a video coding technology, and more particularly, to a method of controlling a bit rate of a bitstream composed of a plurality of quality layers.
2. Description of the Related Art
With the development of information and communication technologies, multimedia communications are increasing in addition to text and voice communications. The existing text-based communication systems are insufficient to satisfy consumers' diverse needs, and thus multimedia services that can accommodate diverse forms of information, such as text, image, music, and others, are increasing. Since multimedia data is large, mass storage media and wide bandwidths are respectively required for storing and transmitting it. Accordingly, compression coding techniques are required to transmit the multimedia data.
The basic principle of data compression is to remove data redundancy. Data can be compressed by removing spatial redundancy such as a repetition of the same color or object in images, temporal redundancy such as similar neighboring frames in moving images or continuous repetition of sounds, and visual/perceptual redundancy which considers human insensitivity to high frequencies. In a general video coding method, temporal redundancy is removed by temporal filtering based on motion compensation, and spatial redundancy is removed by a spatial transform.
Transmitting multimedia data after the data redundancy is removed requires transmission media, and the performance of these media differs. Presently used transmission media have diverse transmission speeds. For example, an ultrahigh-speed communication network can transmit several tens of megabits of data per second, whereas a mobile communication network has a transmission speed of 384 kilobits per second. To support transmission media in such an environment, and to transmit multimedia at a transmission rate suitable for that environment, a scalable video coding method is most suitable.
The scalable video coding method is a coding method that can adjust a video resolution, a frame rate, and a signal-to-noise ratio (SNR), that is, a coding method that supports diverse scalabilities by truncating a part of a compressed bitstream in accordance with peripheral conditions such as a transmission bit rate, a transmission error rate, and system resources.
The current scalable video coding (SVC) standard, being advanced by the Joint Video Team (JVT), which is a joint working group of the Moving Picture Experts Group (MPEG) and the International Telecommunication Union (ITU), is based on H.264. The SVC standard contains fine granularity scalability (FGS) technology for supporting SNR scalability.
FIG. 1 shows an example of a scalable video codec using a multi-layer structure. Referring to FIG. 1, a first layer has a Quarter Common Intermediate Format (QCIF) resolution and a frame rate of 15 Hz, a second layer has a Common Intermediate Format (CIF) resolution and a frame rate of 30 Hz, and a third layer has a Standard Definition (SD) resolution and a frame rate of 60 Hz.
A layer correlation may be used for encoding multi-layer video frames that have various resolutions and/or frame rates. For example, an area 12 of a first enhancement layer frame is efficiently encoded through a prediction from an area 13, corresponding to the area 12, of a base layer frame. An area 11 of a second enhancement layer frame may be efficiently encoded through a prediction using the area 12.
FIG. 2 is a schematic diagram for explaining inter prediction and intra-base prediction of a scalable video coding method. A block 24 in a current layer frame 21 may be predicted with reference to a block 25 in another frame 22 of the current layer, which is called inter prediction. The inter prediction includes motion estimation for obtaining a motion vector indicating the corresponding block.
The block 24 may also be predicted with reference to a block 26 in a lower layer (base layer) frame 23 that is located at the same temporal position and has the same picture order count (POC) as the frame 21, which is called intra-base prediction. In the intra-base prediction, motion estimation is not required.
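The difference between the two prediction modes can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the procedure of the SVC standard: frames are assumed to be lists of rows of integer samples, the matching cost is assumed to be a sum of absolute differences over an exhaustive search window, and the base layer frame is assumed to have already been brought to the resolution of the current layer. The names `sad`, `motion_search`, and `intra_base_residual` are hypothetical.

```python
def sad(cur, ref, bx, by, dx, dy, size):
    """Sum of absolute differences between a block and a displaced block."""
    total = 0
    for y in range(size):
        for x in range(size):
            total += abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
    return total


def motion_search(cur, ref, bx, by, size, search_range):
    """Inter prediction (assumed full search): return the best (dx, dy)."""
    height, width = len(ref), len(ref[0])
    best_mv, best_cost = (0, 0), None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            # Skip candidates that fall outside the reference frame.
            if not (0 <= bx + dx and bx + dx + size <= width):
                continue
            if not (0 <= by + dy and by + dy + size <= height):
                continue
            cost = sad(cur, ref, bx, by, dx, dy, size)
            if best_cost is None or cost < best_cost:
                best_mv, best_cost = (dx, dy), cost
    return best_mv


def intra_base_residual(cur, base, bx, by, size):
    """Intra-base prediction: residual against the co-located block.

    No search is performed, since the base layer block sits at the same
    position (assumes `base` already matches the resolution of `cur`).
    """
    return [[cur[by + y][bx + x] - base[by + y][bx + x] for x in range(size)]
            for y in range(size)]
```

In this sketch, inter prediction has to evaluate every candidate displacement in the window, whereas intra-base prediction reads the co-located block directly, which is why the latter needs no motion estimation.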
FIG. 3 illustrates an example of applying FGS to a residual picture obtained through the prediction of FIG. 2. The residual picture 30 may be represented as a plurality of quality layers in order to support SNR scalability. These quality layers are used to express various levels of video quality, and are distinct from the layers for resolution and/or frame rate.
The plurality of quality layers may consist of one discrete layer 31 and at least one of the FGS layers 32, 33 and 34. The video quality measured in the video decoder is lowest when only the discrete layer 31 is received, and increases in order when the discrete layer 31 and the first FGS layer 32 are received, when the discrete layer 31 and the first and second FGS layers 32 and 33 are received, and when all layers 31, 32, 33 and 34 are received.
FIG. 4 illustrates a process of expressing a single picture or slice as one discrete layer and two FGS layers.
An original picture (or slice) 41 is quantized by a first quantization parameter QP1 (S1). The quantized picture 42 forms a discrete layer. The quantized picture 42 is inverse-quantized (S2), and provided to a subtractor 44. The subtractor 44 subtracts the provided picture 43 from the original picture 41 (S3). The subtracted result is quantized again using a second quantization parameter QP2 (S4). The quantized result 45 forms the first FGS layer.
The quantized result 45 is inverse-quantized (S5), and provided to an adder 47. The provided picture 46 and the provided picture 43 are added by the adder 47 (S6), and provided to a subtractor 48. The subtractor 48 subtracts the added result from the original picture 41 (S7). The subtracted result is quantized again using a third quantization parameter QP3 (S8). The quantized result 49 forms the second FGS layer.
Through the above operations, the plurality of quality layers as illustrated in FIG. 3 can be formed.
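The layering of FIG. 4 can be sketched roughly as below. Several simplifying assumptions apply: the picture is a flat list of sample values, quantization is plain uniform rounding, and each quantization parameter is represented directly by a step size (in H.264 the mapping from a quantization parameter to a step size is not this direct). The function names are hypothetical.

```python
def quantize(values, step):
    """Uniform scalar quantization (assumed): sample -> integer level."""
    return [int(round(v / step)) for v in values]


def dequantize(levels, step):
    """Inverse quantization: integer level -> sample value."""
    return [level * step for level in levels]


def encode_quality_layers(original, steps):
    """Return one list of quantized levels per quality layer.

    steps[0] stands in for QP1 (discrete layer); steps[1:] stand in for
    QP2, QP3, ... (FGS layers).
    """
    layers = []
    reconstruction = [0.0] * len(original)
    for step in steps:
        # Subtract what the lower layers already represent (S3, S7).
        residual = [o - r for o, r in zip(original, reconstruction)]
        levels = quantize(residual, step)  # S1, S4, S8
        layers.append(levels)
        # Inverse-quantize and accumulate (S2, S5, S6).
        delta = dequantize(levels, step)
        reconstruction = [r + d for r, d in zip(reconstruction, delta)]
    return layers


def decode_quality_layers(layers, steps):
    """Sum the inverse-quantized layers that were actually received."""
    reconstruction = [0.0] * len(layers[0])
    for levels, step in zip(layers, steps):
        delta = dequantize(levels, step)
        reconstruction = [r + d for r, d in zip(reconstruction, delta)]
    return reconstruction
```

Decoding only the first element of `layers` corresponds to receiving the discrete layer alone. Each additional layer passed to `decode_quality_layers` refines the reconstruction, provided the step sizes decrease from layer to layer.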
FIGS. 5A and 5B illustrate the quality layer truncating method used in the current SVC standard. As illustrated in FIG. 5A, a current picture 30 is expressed as a residual picture by being predicted from a reference picture 35 through the inter prediction or the intra-base prediction. The current picture 30 expressed as the residual picture consists of a plurality of quality layers 31, 32, 33 and 34. The reference picture 35 also consists of a plurality of quality layers 36, 37, 38 and 39.
According to the current SVC standard, a bitstream extractor truncates some of the quality layers in order to control the SNR of the bitstream, as illustrated in FIG. 5B. That is, the bitstream extractor truncates the quality layers of the current picture 30, which is located in a high resolution and/or frame rate layer (hereinafter referred to as a “layer” to distinguish it from a “quality layer”), from the highest downward. After all the quality layers of the current picture 30 are truncated, the quality layers of the reference picture 35 are truncated from the highest downward.
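The truncation order described above can be sketched as follows. Quality layers are identified here by their reference numerals, and each list is assumed to be ordered from the lowest quality layer to the highest (so 36 is assumed to be the lowest quality layer of the reference picture, which the description does not state explicitly). The function names are hypothetical.

```python
def truncation_order(current_layers, reference_layers):
    """List quality layers in the order the extractor would drop them."""
    # Current picture first, from its highest quality layer downward,
    # then the reference picture, again from highest downward.
    return list(reversed(current_layers)) + list(reversed(reference_layers))


def truncate(current_layers, reference_layers, num_to_drop):
    """Return the quality layers of each picture that remain."""
    order = truncation_order(current_layers, reference_layers)
    dropped = set(order[:num_to_drop])
    kept_current = [q for q in current_layers if q not in dropped]
    kept_reference = [q for q in reference_layers if q not in dropped]
    return kept_current, kept_reference
```

For example, `truncate([31, 32, 33, 34], [36, 37, 38, 39], 5)` removes all four quality layers of the current picture before touching the reference picture, and then removes only layer 39.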
The above truncation is best for reconstructing a picture (reference picture) of a lower layer (e.g., QCIF), but is not best for reconstructing a picture (current picture) of a higher layer (e.g., CIF). Quality layers of some lower-layer pictures may be less important than those of higher-layer pictures. Accordingly, there is a need to achieve efficient SNR scalability by truncating quality layers according to whether a video encoder mainly aims at a high-layer picture or a low-layer picture.