1. Field of the Invention
The present invention relates to a video compression method, and more particularly, to a method and apparatus for improving the compression efficiency of a motion vector by efficiently predicting a motion vector in an enhancement layer from a motion vector in a base layer in a video coding method using a multi-layer structure.
2. Description of the Related Art
With the development of information and communication technology, including the Internet, video communication, as well as text and voice communication, has increased dramatically. Conventional text communication cannot satisfy users' various demands, and thus multimedia services that can provide various types of information such as text, pictures, and music have increased. However, because the amount of multimedia data is usually large, multimedia data requires large-capacity storage media and a wide bandwidth for transmission. Accordingly, a compression coding method is required for transmitting multimedia data including text, video, and audio.
A basic principle of data compression is removing data redundancy. Data can be compressed by removing spatial redundancy, in which the same color or object is repeated in an image; temporal redundancy, in which there is little change between adjacent frames in a moving image or the same sound is repeated in audio; or psychovisual redundancy, which takes into account human eyesight and its limited perception of high frequencies. In general video coding, temporal redundancy is removed by motion estimation and motion compensation, and spatial redundancy is removed by transform coding.
To transmit multimedia data after data redundancy has been removed, transmission media are necessary. Transmission performance differs depending on the transmission medium, and currently used transmission media have various transmission rates. For example, an ultrahigh-speed communication network can transmit data at several tens of megabits per second, while a mobile communication network has a transmission rate of 384 kilobits per second. Accordingly, to support transmission media having various speeds, or to transmit multimedia at a data rate suited to the transmission environment, data coding methods having scalability, such as wavelet video coding and subband video coding, may be suitable for a multimedia environment.
Scalability is the ability of a decoder or a pre-decoder to partially decode a single compressed bitstream according to conditions such as bit rate, error rate, or system resources. A decoder or a pre-decoder decompresses only a portion of a bitstream coded by scalable coding and reconstructs from it multimedia sequences having different video quality levels, resolution levels, or frame rates.
FIG. 1 is a schematic diagram of a typical scalable video coding system. First, an encoder 50 codes an input video 51, thereby generating a bitstream 52. A pre-decoder 60 can extract different bitstreams 53 by cutting the bitstream 52 received from the encoder 50 in various ways according to an extraction condition, such as a bit rate, a resolution, or a frame rate. The extraction condition depends on the environment of communication with a decoder 70 or on the mechanical performance of the decoder 70. Typically, the pre-decoder 60 is included in a video stream server that provides variable video streams to an end-user in variable network environments.
The decoder 70 reconstructs an output video 54 from the extracted bitstream 53. Extraction of a bitstream according to the extraction condition may be performed by the decoder 70 instead of the pre-decoder 60, or may be performed by both the pre-decoder 60 and the decoder 70.
MPEG-4 (Moving Picture Experts Group 4) Part 13 standardization for scalable video coding is under way. In particular, much effort is being made to implement scalability based on a multi-layered structure. For example, a bitstream may consist of multiple layers, i.e., a base layer and first and second enhancement layers, with different resolutions (QCIF, CIF, and 2CIF) or frame rates.
As when a video is encoded into a single layer, when a video is encoded into multiple layers, a motion vector (MV) is obtained for each of the multiple layers to remove temporal redundancy. The motion vector may be searched separately for each layer (the former approach), or a motion vector obtained by a motion vector search for one layer may be used for another layer, either as is or after being upsampled or downsampled (the latter approach). The former approach has the advantage of obtaining accurate motion vectors but suffers from the overhead of the motion vectors generated for each layer. Thus, efficiently reducing the redundancy between the motion vectors of the respective layers is a very challenging task.
FIG. 2 shows an example of a scalable video codec using a multi-layered structure. Referring to FIG. 2, a base layer has a quarter common intermediate format (QCIF) resolution and a frame rate of 15 Hz, a first enhancement layer has a common intermediate format (CIF) resolution and a frame rate of 30 Hz, and a second enhancement layer has a standard definition (SD) resolution and a frame rate of 60 Hz. For example, to obtain a CIF stream at 0.5 Mbps, the enhancement layer bitstream (CIF, 30 Hz, 0.7 Mbps) may be truncated to meet the bit rate of 0.5 Mbps. In this way, it is possible to implement spatial, temporal, and SNR scalabilities. Because the increase in the number of motion vectors shown in FIG. 2 produces about twice as much overhead as that generated for a single-layer bitstream, motion prediction from the base layer is very important. Of course, since a motion vector is used only for an inter-frame, which is coded by referring to neighboring frames, it is not used for an intra-frame, which is coded without reference to adjacent frames.
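The extraction described for FIG. 2 can be sketched roughly as follows. This is only an illustrative model, not the codec's actual bitstream handling: the `Layer` type and the `extract` function are hypothetical names, each rate is assumed to be the cumulative rate of the stream up to and including that layer, and truncation is modeled simply as capping that rate. Only the 0.7 Mbps figure for the CIF layer comes from the text; any other rate supplied by a caller is a placeholder.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Layer:
    """One layer of a multi-layer bitstream (hypothetical model)."""
    resolution: str       # e.g. "QCIF", "CIF", "SD"
    frame_rate_hz: int
    rate_mbps: float      # assumed: cumulative rate up to and including this layer


def extract(layers: List[Layer], resolution: str,
            target_mbps: float) -> Optional[Layer]:
    """Select the layer with the requested resolution and cap its rate.

    Truncation is modeled as min(available rate, target rate); a real
    pre-decoder would discard actual bits from the layer's bitstream.
    Returns None when no layer has the requested resolution.
    """
    for layer in layers:
        if layer.resolution == resolution:
            capped = min(layer.rate_mbps, target_mbps)
            return Layer(layer.resolution, layer.frame_rate_hz, capped)
    return None
```

Under these assumptions, requesting a CIF stream at 0.5 Mbps from a stream whose CIF layer is available at 0.7 Mbps yields a CIF, 30 Hz description capped at 0.5 Mbps.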
As shown in FIG. 2, frames 10, 20, and 30 in the respective layers, which have the same temporal position, can be expected to have similar images and thus similar motion vectors. Accordingly, one of the currently used methods for efficiently representing a motion vector includes predicting a motion vector for a current layer from a motion vector for a lower layer and coding the difference between the predicted value and the actual motion vector.
FIG. 3 is a diagram for explaining a conventional method for efficiently representing a motion vector using motion prediction. Referring to FIG. 3, a motion vector in a lower layer frame having the same temporal position as a current layer frame has conventionally been used as a predicted motion vector for the current layer motion vector.
An encoder obtains motion vectors MV0, MV1, and MV2 for a base layer, a first enhancement layer, and a second enhancement layer at predetermined accuracies and performs temporal transformation using the motion vectors MV0, MV1, and MV2 to remove temporal redundancies in the respective layers. However, rather than sending all three motion vectors, the encoder sends the base layer motion vector MV0, a first enhancement layer motion vector component D1, and a second enhancement layer motion vector component D2 to the pre-decoder (or video stream server). To adapt to network conditions, the pre-decoder may transmit to a decoder only the base layer motion vector; the base layer motion vector and the first enhancement layer motion vector component D1; or the base layer motion vector, the first enhancement layer motion vector component D1, and the second enhancement layer motion vector component D2.
The decoder then uses the received data to reconstruct a motion vector for an appropriate layer. For example, when the decoder receives the base layer motion vector and the first enhancement layer motion vector component D1, the first enhancement layer motion vector component D1 is added to the base layer motion vector MV0 in order to reconstruct the first enhancement layer motion vector MV1. The reconstructed motion vector MV1 is used to reconstruct texture data for the first enhancement layer.
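The differential representation above can be illustrated with a short sketch. The text states only that D1 is added to MV0 to reconstruct MV1; the sketch assumes by analogy that D2 is added to the reconstructed MV1 to obtain MV2. The function names, the tuple representation of a motion vector, and the optional `factor` used to scale a lower layer vector to a higher resolution are all hypothetical and introduced for illustration only.

```python
from typing import List, Tuple

MV = Tuple[int, int]  # (dx, dy); this representation is an assumption


def scale(mv: MV, factor: int) -> MV:
    """Scale a lower layer vector (factor=1 leaves it unchanged)."""
    return (mv[0] * factor, mv[1] * factor)


def encode_layers(mvs: List[MV], factor: int = 1) -> List[MV]:
    """Turn [MV0, MV1, MV2, ...] into [MV0, D1, D2, ...].

    Each Dk is the current layer vector minus the prediction taken
    from the layer directly below it.
    """
    components = [mvs[0]]
    for lower, current in zip(mvs, mvs[1:]):
        pred = scale(lower, factor)
        components.append((current[0] - pred[0], current[1] - pred[1]))
    return components


def decode_layers(components: List[MV], factor: int = 1) -> List[MV]:
    """Rebuild motion vectors from however many components were received."""
    mvs = [components[0]]
    for d in components[1:]:
        pred = scale(mvs[-1], factor)
        mvs.append((pred[0] + d[0], pred[1] + d[1]))
    return mvs
```

Because each component depends only on the layers below it, a decoder that receives just MV0 and D1 can call `decode_layers` on that prefix and still recover MV1, which mirrors the way the pre-decoder drops higher layer components to adapt to the network.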
However, when the current layer has a different frame rate than the lower layer as shown in FIG. 2, a lower layer frame having the same temporal position as the current frame may not exist. For example, because no lower layer frame exists at the temporal position of frame 40, motion prediction using a lower layer motion vector cannot be performed for that frame. That is, since a motion vector in frame 40 cannot be predicted, the motion vector in the first enhancement layer is inefficiently represented as a redundant motion vector.
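The frame rate mismatch can be made concrete with a small check of which frames in a layer have a lower layer frame at the same temporal position. This is a sketch under simplifying assumptions: both layers are assumed to start at the same instant and to have uniformly spaced frames, and the function name and index convention are hypothetical rather than taken from FIG. 2.

```python
def has_colocated_lower_frame(frame_index: int,
                              current_fps: int,
                              lower_fps: int) -> bool:
    """Return True if the lower layer has a frame at the same time.

    Frame i of the current layer is at time i / current_fps. A lower
    layer frame j sits at the same time when j / lower_fps equals
    i / current_fps, i.e. when i * lower_fps is divisible by current_fps.
    """
    return (frame_index * lower_fps) % current_fps == 0


def frames_without_prediction(num_frames: int,
                              current_fps: int,
                              lower_fps: int) -> list:
    """List frame indices that have no co-located lower layer frame."""
    return [i for i in range(num_frames)
            if not has_colocated_lower_frame(i, current_fps, lower_fps)]
```

With a 30 Hz layer above a 15 Hz layer, every other frame of the upper layer has no co-located lower layer frame, so under the conventional method roughly half of its frames cannot use lower layer motion prediction.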