1. Field of the Invention
Apparatuses and methods consistent with the present invention relate to video compression. More particularly, the present invention relates to a method and an apparatus for realizing signal to noise ratio (SNR) scalability in a video stream server in order to transmit a video stream in a variable network environment.
2. Description of the Related Art
Development of communication technologies such as the Internet has led to an increase in video communication in addition to text and voice communication. However, consumers have not been satisfied with existing text-based communication schemes. To satisfy various consumer needs, services for multimedia data containing text, images, music and the like have been increasingly provided. Multimedia data is usually voluminous and requires a large capacity storage medium. Also, a wide bandwidth is required for transmitting the multimedia data. For example, digitizing one frame of a 24-bit true color image with a resolution of 640×480 requires 640×480×24 bits, that is, 7.37 mega bits (Mbits). Accordingly, a bandwidth of approximately 221 Mbits per second is needed to transmit this data at the rate of 30 frames per second, and a storage space of approximately 1,200 giga bits (Gbits) is needed to store a 90-minute movie. For this reason, a compression coding scheme is required when transmitting multimedia data.
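The figures in the preceding example can be checked with a short calculation. The following Python sketch is illustrative only: the function name and its parameters are assumptions introduced here, and 1 Mbit is taken as 10^6 bits and 1 Gbit as 10^9 bits.

```python
def raw_video_requirements(width, height, bits_per_pixel, fps, duration_s):
    """Return (bits per frame, bits per second, total bits) for uncompressed video."""
    frame_bits = width * height * bits_per_pixel
    rate_bps = frame_bits * fps
    total_bits = rate_bps * duration_s
    return frame_bits, rate_bps, total_bits


# 640x480, 24-bit color, 30 frames per second, 90-minute movie
frame_bits, rate_bps, total_bits = raw_video_requirements(640, 480, 24, 30, 90 * 60)
print(f"{frame_bits / 1e6:.2f} Mbits per frame")    # 7.37
print(f"{rate_bps / 1e6:.0f} Mbits per second")     # 221
print(f"{total_bits / 1e9:.0f} Gbits for 90 minutes")  # 1194
```

The exact storage figure is about 1,194 Gbits, which the text rounds to approximately 1,200 Gbits.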
A basic principle of data compression is to eliminate redundancy in the data. The three types of data redundancy are spatial redundancy, temporal redundancy, and perceptual-visual redundancy. Spatial redundancy refers to the duplication of identical colors or objects in an image. Temporal redundancy refers to little or no variation between adjacent frames in a moving picture, or successive repetition of the same sounds in audio. Perceptual-visual redundancy refers to information that humans cannot perceive because of the limitations of human vision and the inability to hear high frequencies. By eliminating these redundancies, data can be compressed. Data compression types can be classified into lossy/lossless compression depending upon whether source data is lost, intraframe/interframe compression depending upon whether each frame is compressed independently, and symmetrical/asymmetrical compression depending upon whether compression and decompression take the same amount of time. In addition, when the total end-to-end delay time in compression and decompression does not exceed 50 ms, this is referred to as real-time compression. When frames have a variety of resolutions, this is referred to as scalable compression. Lossless compression is mainly used in compressing text data or medical data, and lossy compression is mainly used in compressing multimedia data. Intraframe compression is generally used for eliminating spatial redundancy and interframe compression is used for eliminating temporal redundancy.
Transmission media to transmit multimedia data have different capacities. Transmission media in current use have a variety of transmission speeds, covering ultra-high-speed communication networks capable of transmitting data at a rate of tens of Mbits per second, mobile communication networks having a transmission speed of 384 kilo bits (Kbits) per second and so on. In conventional video encoding algorithms, e.g., MPEG-1, MPEG-2, MPEG-4, H.263 and H.264 (Advanced Video Coding), temporal redundancy is eliminated by motion compensation, and spatial redundancy is eliminated by spatial transformations. These schemes have good performance in compression but they have little flexibility for a true scalable bit-stream because main algorithms of the schemes employ recursive approaches.
For this reason, research has recently focused on wavelet-based scalable video coding. Scalable video coding refers to video coding having scalability in a spatial domain, that is, in terms of resolution. Scalability is the property of enabling a compressed bit-stream to be decoded partially or in advance, whereby videos having a variety of resolutions can be played.
The term “scalability” herein is used to collectively refer to spatial scalability for controlling the resolution of a video, signal-to-noise ratio (SNR) scalability for controlling the quality of a video, and temporal scalability for controlling the frame rates of a video, and combinations thereof.
As described above, the spatial scalability may be implemented based on the wavelet transformation. Also, temporal scalability has been implemented using motion compensated temporal filtering (MCTF) and unconstrained MCTF (UMCTF). SNR scalability may be implemented based on the embedded quantization coding scheme that considers spatial correlation or on the fine granular scalability (FGS) coding scheme used for MPEG series codecs.
An overall construction of a video coding system to support scalability is depicted in FIG. 1. A video encoder 45 encodes an input video 10 through temporal filtering, spatial transformation, and quantization to thereby generate a bit-stream 20. A pre-decoder 50 may implement a variety of scalabilities relative to texture data in a simple manner by truncating or extracting a part of the bit-stream 20 received from the video encoder 45. Picture quality, resolution or frame rate may be considered for the truncating. The process of implementing the scalability by truncating a part of the bit-stream is called “pre-decoding.”
The video decoder 60 reconstructs the output video 30 from the pre-decoded bit-stream 25 by inversely performing the processes conducted by the video encoder 45. Pre-decoding of the bit-stream according to pre-decoding conditions is not necessarily conducted by the pre-decoder 50. When it is difficult to process the whole video of the bit-stream 20 generated at the video encoder 45 side in real time because of insufficient processing capability of the video decoder 60, the bit-stream may be pre-decoded at the video decoder 60 side.
Standardization with respect to video coding technologies to support scalability is under development in the moving picture experts group-21 (MPEG-21) PART-13. In particular, there have been many attempts to implement multi-layered video coding methods. A multi-layer structure may comprise a base layer, a first enhancement layer and a second enhancement layer, and the layers have different resolutions (QCIF, CIF, 2CIF) or different frame rates.
FIG. 2 illustrates an example of a scalable video codec using a multi-layer structure. A base layer is defined in the quarter common intermediate format (QCIF) having a frame rate of 15 Hz, a first enhancement layer is defined in the common intermediate format (CIF) having a frame rate of 30 Hz, and a second enhancement layer is defined in standard definition (SD) having a frame rate of 60 Hz. When a stream of CIF 0.5M is required, the bit-stream of the first enhancement layer (CIF-30 Hz@0.7M) can be pre-decoded, that is, truncated until its bitrate falls to 0.5M. In this manner, spatial, temporal and SNR scalabilities may be implemented. Since there exist similarities between the textures and motion vectors of the layers, redundancies between the layers are generally removed when encoding a plurality of layers. The layers illustrated in FIG. 2 have different resolutions and frame rates. However, there may exist layers having the same resolution but different frame rates, or having the same frame rate but different resolutions.
A conventional method to implement the SNR scalability at the pre-decoder 50 side is illustrated in FIG. 3. A bit-stream 20 generated by a video encoder consists of a plurality of groups of pictures (GOPs), and each GOP consists of a plurality of pieces of frame information. Frame information 40 consists of a motion component 41 and a texture component 42. The pre-decoder 50 determines a transmissible bitrate according to the bandwidth of the network connected to the decoder side, and truncates a part of the original texture component 42 based on the determined bitrate. The texture component left after truncating the original texture component 42, that is, the texture component 43 pre-decoded based on the SNR, and the motion component 41 are transmitted to the video decoder side.
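The conventional pre-decoding of FIG. 3 can be sketched roughly as follows. This is a minimal illustration and not the method of any particular codec: the names FrameInfo, predecode_frame and predecode_gop are hypothetical, the bit budget is assumed to be split evenly across frames, and the texture component is assumed to be encoded so that its leading bytes remain decodable after the trailing bytes are dropped.

```python
from dataclasses import dataclass


@dataclass
class FrameInfo:
    motion: bytes   # motion component 41, transmitted untouched
    texture: bytes  # texture component 42, assumed truncatable from the end


def predecode_frame(frame: FrameInfo, budget_bytes: int) -> FrameInfo:
    """Keep the motion component and as much leading texture as the budget allows."""
    texture_budget = max(0, budget_bytes - len(frame.motion))
    return FrameInfo(frame.motion, frame.texture[:texture_budget])


def predecode_gop(frames, target_bps: int, fps: int):
    """Apply an equal per-frame byte budget (a simplifying assumption)."""
    budget_bytes = target_bps // (8 * fps)
    return [predecode_frame(f, budget_bytes) for f in frames]


# Hypothetical GOP: 30 frames, each with 100 bytes of motion and 2000 bytes of texture.
gop = [FrameInfo(b"m" * 100, b"t" * 2000) for _ in range(30)]
out = predecode_gop(gop, target_bps=240_000, fps=30)
print(len(out[0].motion), len(out[0].texture))  # 100 900
```

With a 240,000 bits-per-second target at 30 frames per second, each frame receives 1,000 bytes, of which 100 go to the motion component and the remaining 900 to the truncated texture component.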
Since this texture component is encoded by a method that supports SNR scalability, the SNR scalability can be implemented by the simple operation of truncating a part of the texture component from its end. Encoding methods that support SNR scalability include fine granular scalability (FGS) coding, used in codecs of the MPEG series, and embedded quantization coding, used in codecs of the wavelet series. The bit-stream generated by embedded quantization has an additional merit: it can be pre-decoded more finely than the bit-stream generated by FGS coding.
However, because of overhead due to the motion information and a structural problem of multi-layered video coding, the bit-stream may not approach a target bitrate desired by a user when the SNR changes in a layer. In this case, either the quality of the picture is degraded because of excessive truncation of data, or the bit-stream is transmitted as it is because no bits remain to be truncated, which may cause a network delay in real-time streaming. Therefore, there is a need for a pre-decoding method to solve this problem.
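The limitation can be illustrated with a small numeric sketch. The function name and all bit counts below are hypothetical, and the sketch assumes that only the texture component of a single layer can be truncated while the motion component must always be sent.

```python
def achievable_bits(motion_bits: int, texture_bits: int, target_bits: int):
    """Return (bits sent, texture bits kept, whether the target was met)."""
    texture_kept = min(texture_bits, max(0, target_bits - motion_bits))
    sent = motion_bits + texture_kept
    return sent, texture_kept, sent <= target_bits


# Target above the motion overhead: truncating texture reaches the target.
print(achievable_bits(300_000, 400_000, 500_000))  # (500000, 200000, True)

# Target below the motion overhead: all texture is removed, yet the
# stream still exceeds the target because no bits remain to be truncated.
print(achievable_bits(300_000, 400_000, 200_000))  # (300000, 0, False)
```

In the second case the picture quality has already been reduced to the minimum and the transmitted stream is still larger than the network can carry, which corresponds to the situation described above.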