1. Field of the Invention
The present invention relates to scalable encoding and decoding of a video signal.
2. Description of the Related Art
It is difficult to allocate the high bandwidth required for TV signals to digital video signals that are transmitted and received wirelessly by mobile phones and notebook computers. Similar difficulties are expected with mobile TVs and handheld PCs, which will come into widespread use in the future. Thus, video compression standards for use with mobile devices should have high video signal compression efficiencies.
Such mobile devices have a variety of processing and presentation capabilities, so a variety of compressed video data forms should be prepared. This means that video data of varying quality, with different combinations of variables such as the number of frames transmitted per second, the resolution, and the number of bits per pixel, should be provided based on a single video source. This imposes a great burden on content providers.
Because of the above, content providers prepare high-bitrate compressed video data for each source video and perform, when receiving a request from a mobile device, a process of decoding compressed video and encoding it back into video data suited to the video processing capabilities of the mobile device. However, this method entails a transcoding procedure including decoding, scaling, and encoding processes, which causes some time delay in providing the requested data to the mobile device. The transcoding procedure also requires complex hardware and algorithms to cope with the wide variety of target encoding formats.
The Scalable Video Codec (SVC) has been developed in an attempt to overcome these problems. This scheme encodes video into a sequence of pictures with the highest image quality while ensuring that part of the encoded picture (frame) sequence (specifically, a partial sequence of frames intermittently selected from the total sequence of frames) can be decoded to produce a certain level of image quality.
Motion Compensated Temporal Filtering (MCTF) is an encoding scheme that has been suggested for use in the Scalable Video Codec. The MCTF scheme has a high compression efficiency (i.e., a high coding efficiency) for reducing the number of bits transmitted per second. The MCTF scheme is likely to be applied to transmission environments such as a mobile communication environment where bandwidth is limited.
Although it is ensured that part of a sequence of pictures encoded in the scalable MCTF coding scheme can be received and processed into video with a certain level of image quality as described above, there is still a problem in that the image quality is significantly reduced if the bitrate is lowered. One solution to this problem is to provide an auxiliary picture sequence for low bitrates, for example, a sequence of pictures that have a small screen size and/or a low frame rate. One example is to encode and transmit to decoders not only a main picture sequence of 4CIF (4 times Common Intermediate Format) but also an auxiliary picture sequence of CIF and an auxiliary picture sequence of QCIF (Quarter CIF). Each sequence is referred to as a layer, and the higher of two given layers is referred to as an enhanced layer and the lower is referred to as a base layer.
More often, the auxiliary picture sequence is referred to as a base layer (BL), and the main picture sequence is referred to as an enhanced or enhancement layer. Video signals of the base and enhanced layers have redundancy since the same video content is encoded into two layers with different spatial resolution or different frame rates. To increase the coding efficiency of the enhanced layer, a video signal of the enhanced layer may be predicted using motion information and/or texture information of the base layer. This prediction method is referred to as inter-layer prediction.
FIG. 11 illustrates examples of an intra BL prediction method and an inter-layer residual prediction method, which are inter-layer prediction methods for encoding the enhanced layer using the base layer.
The intra BL prediction method uses a texture (or image data) of the base layer. Specifically, the intra BL prediction method produces predictive data of a macroblock of the enhanced layer using a corresponding block of the base layer encoded in an intra mode. The term “corresponding block” refers to a block which is located in a base layer frame temporally coincident with a frame including the macroblock and which would have an area covering the macroblock if the base layer frame were enlarged by the ratio of the screen size of the enhanced layer to the screen size of the base layer. The intra BL prediction method uses the corresponding block of the base layer after enlarging the corresponding block by the ratio of the screen size of the enhanced layer to the screen size of the base layer through upsampling.
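The two steps described above, locating the corresponding block and enlarging it, can be sketched as follows. This is a minimal illustration only, not the method of any particular codec: the function names, the 16x16 macroblock size, the origin-aligned layers, and the nearest-neighbour upsampling are all assumptions chosen for brevity, whereas a practical encoder would typically use an interpolation filter.

```python
from fractions import Fraction
import math


def corresponding_base_region(mb_x, mb_y, enh_size, base_size, mb_size=16):
    """Return (x0, y0, x1, y1), the half-open base-layer sample region that
    would cover the enhanced-layer macroblock at (mb_x, mb_y) if the base
    layer frame were enlarged to the enhanced-layer screen size.

    Assumption: both layers share the same picture origin (no offset).
    """
    enh_w, enh_h = enh_size
    base_w, base_h = base_size
    sx = Fraction(base_w, enh_w)  # horizontal base/enhanced size ratio
    sy = Fraction(base_h, enh_h)  # vertical base/enhanced size ratio
    x0 = math.floor(mb_x * sx)
    y0 = math.floor(mb_y * sy)
    x1 = math.ceil((mb_x + mb_size) * sx)
    y1 = math.ceil((mb_y + mb_size) * sy)
    return x0, y0, x1, y1


def upsample_nearest(block, out_w, out_h):
    """Enlarge a 2-D list of samples to out_w x out_h by sample repetition
    (a hypothetical stand-in for a real upsampling filter)."""
    in_h = len(block)
    in_w = len(block[0])
    return [
        [block[(y * in_h) // out_h][(x * in_w) // out_w] for x in range(out_w)]
        for y in range(out_h)
    ]
```

For example, with a 4CIF enhanced layer (704x576) and a CIF base layer (352x288), the macroblock whose top-left sample is at (32, 16) maps to the 8x8 base-layer region spanning columns 16 to 23 and rows 8 to 15, which is then enlarged back to 16x16 to serve as predictive data.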
The inter-layer residual prediction method is similar to the intra BL prediction method except that it uses a corresponding block of the base layer encoded so as to contain residual data (i.e., data of an image difference) rather than a corresponding block containing image data. Specifically, the inter-layer residual prediction method produces predictive data of a macroblock of the enhanced layer, which is itself encoded so as to contain residual data, using the corresponding residual block of the base layer. As in the intra BL prediction method, the corresponding block of the base layer is used after being enlarged through upsampling by the ratio of the screen size of the enhanced layer to the screen size of the base layer.
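The residual case can be illustrated in the same way. The sketch below assumes that the predictive data is formed as the sample-wise difference between the enhanced-layer residual block and the upsampled base-layer residual block; the function names and the nearest-neighbour upsampling are hypothetical simplifications, not taken from any standard.

```python
def upsample_nearest(block, out_w, out_h):
    """Enlarge a 2-D list of samples by sample repetition (assumed filter)."""
    in_h = len(block)
    in_w = len(block[0])
    return [
        [block[(y * in_h) // out_h][(x * in_w) // out_w] for x in range(out_w)]
        for y in range(out_h)
    ]


def residual_prediction(enh_residual, base_residual):
    """Subtract the upsampled base-layer residual from the enhanced-layer
    residual, leaving only what the base layer fails to predict."""
    out_h = len(enh_residual)
    out_w = len(enh_residual[0])
    predicted = upsample_nearest(base_residual, out_w, out_h)
    return [
        [e - p for e, p in zip(enh_row, pred_row)]
        for enh_row, pred_row in zip(enh_residual, predicted)
    ]
```

When the two residual blocks are similar, most of the resulting values are at or near zero, which is what allows the enhanced layer to be coded with fewer bits.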
A base layer with lower resolution for use in the inter-layer prediction method is produced by downsampling a video source. Corresponding pictures (frames or blocks) in enhanced and base layers produced from the same video source may be out of phase since a variety of different downsampling techniques and downsampling ratios (i.e., horizontal and/or vertical size reduction ratios) may be employed.
FIG. 12 illustrates a phase relationship between enhanced and base layers. A base layer may be produced (i) by sampling a video source at lower spatial resolution separately from an enhanced layer or (ii) by downsampling an enhanced layer with higher spatial resolution. In the example of FIG. 12, the downsampling ratio between the enhanced and base layers is ⅔.
A video signal is managed as separate components, namely, a luma component and two chroma components. The luma component is associated with luminance information Y and the two chroma components are associated with chrominance information Cb and Cr. A ratio of 4:2:0 (Y:Cb:Cr) between luma and chroma signals is widely used. Samples of the chroma signal are typically located midway between samples of the luma signal. When an enhanced layer and/or a base layer are produced directly from a video source, luma and chroma signals of the enhanced layer and/or the base layer are sampled so as to satisfy the 4:2:0 ratio and a position condition according to the 4:2:0 ratio.
In the above case (i), the enhanced and base layers may be out of phase as shown in section (a) of FIG. 12 since the enhanced and base layers may have different sampling positions. In the example of section (a), luma and chroma signals of each of the enhanced and base layers satisfy the 4:2:0 ratio and a position condition according to the 4:2:0 ratio.
In the above case (ii), the base layer is produced by downsampling luma and chroma signals of the enhanced layer by a specific ratio. If the base layer is produced such that luma and chroma signals of the base layer are in phase with luma and chroma signals of the enhanced layer, the luma and chroma signals of the base layer do not satisfy a position condition according to the 4:2:0 ratio as illustrated in section (b) of FIG. 12.
In addition, if the base layer is produced such that luma and chroma signals of the base layer satisfy a position condition according to the 4:2:0 ratio, the chroma signal of the base layer is out of phase with the chroma signal of the enhanced layer as illustrated in section (c) of FIG. 12. In this case, if the chroma signal of the base layer is upsampled by a specific ratio according to the inter-layer prediction method, the upsampled chroma signal of the base layer is out of phase with the chroma signal of the enhanced layer.
Also in case (ii), the enhanced and base layers may be out of phase as illustrated in section (a).
That is, the phase of the base layer may be changed in the downsampling procedure for producing the base layer and in the upsampling procedure of the inter-layer prediction method, so that the base layer is out of phase with the enhanced layer, thereby reducing coding efficiency.
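The size of the chroma phase shift in case (ii), where the base layer satisfies the 4:2:0 position condition, can be estimated with a short calculation. The sketch below considers a single axis and rests on two assumptions that are not stated in the figures: the first luma samples of the two layers coincide, and each chroma sample lies midway between a pair of luma samples of its own layer. The function names are hypothetical.

```python
from fractions import Fraction


def chroma_position(index, scale=Fraction(1)):
    """Position of chroma sample `index` along one axis, expressed in
    enhanced-layer luma sample units.

    `scale` is the number of enhanced-layer luma samples spanned by one
    luma sample of the layer in question (1 for the enhanced layer).
    """
    return (2 * index + Fraction(1, 2)) * scale


def chroma_phase_offset(downsampling_ratio):
    """Offset between the first base-layer chroma sample and the first
    enhanced-layer chroma sample, in enhanced-layer luma sample units."""
    scale = 1 / Fraction(downsampling_ratio)
    return chroma_position(0, scale) - chroma_position(0)
```

Under these assumptions, a downsampling ratio of 2/3 gives an offset of 1/4 of an enhanced-layer luma sample, and a ratio of 1/2 gives an offset of 1/2. An upsampling step that ignores this offset carries the shift over into the upsampled chroma signal.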
Also, video frames in sequences of different layers may have different aspect ratios. For example, video frames of the higher sequence (i.e., the enhanced layer) may have a wide aspect ratio of 16:9, whereas video frames of the lower sequence (i.e., the base layer) may have a narrow aspect ratio of 4:3. In this case, when performing prediction of the enhanced layer picture, there may be a need to determine which part of a base layer picture is to be used for the enhanced layer picture, or for which part of the enhanced layer picture the base layer picture is to be used.
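As one illustration of the kind of determination involved, the sketch below computes the part of an enhanced layer picture that a base layer picture of a different aspect ratio would cover. It assumes, purely for the example, that the base layer picture corresponds to a centered window of the enhanced layer picture; in general the correspondence could be any region, and the function name is hypothetical.

```python
from fractions import Fraction


def centered_window(enh_w, enh_h, base_aspect):
    """Return (left, top, width, height) of the enhanced-layer region
    covered by a base-layer picture with aspect ratio `base_aspect`.

    Assumption: the base picture is a centered crop of the enhanced picture.
    """
    enh_aspect = Fraction(enh_w, enh_h)
    base_aspect = Fraction(base_aspect)
    if base_aspect <= enh_aspect:
        # Base picture is narrower: full height, reduced width.
        height = enh_h
        width = int(enh_h * base_aspect)
    else:
        # Base picture is wider: full width, reduced height.
        width = enh_w
        height = int(enh_w / base_aspect)
    left = (enh_w - width) // 2
    top = (enh_h - height) // 2
    return left, top, width, height
```

For a 1280x720 (16:9) enhanced layer picture and a 4:3 base layer picture, the window is 960 samples wide and starts 160 samples from the left edge, so the outer 160 columns on each side have no counterpart in the base layer.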