1. Field of the Invention
The present invention relates to video encoding and decoding apparatuses for encoding a picture signal at a high efficiency and transmitting or storing the encoded signal and, more particularly, to video encoding and decoding apparatuses with a scalability function, that is, scalable coding in which the resolution and the image quality can be varied across multiple layers.
2. Description of the Related Art
Generally, a picture signal is compression-encoded before being transmitted or stored because the signal has an enormous amount of information. To encode a picture signal at a high efficiency, pictures whose unit is a frame are divided into a plurality of blocks in units of a predetermined number of pixels. Orthogonal transform is performed for each block to separate the picture into spatial frequency components. Each frequency component is obtained as a transform coefficient and encoded.
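The block-based transform described above can be sketched as follows. This is a minimal illustration only, assuming an orthonormal DCT-II as the orthogonal transform and a frame whose dimensions are multiples of the block size; the function names `dct_1d`, `dct_2d`, and `split_into_blocks` are hypothetical and do not come from the cited standards.

```python
import math

def dct_1d(x):
    """Orthonormal DCT-II of a 1-D sequence (assumed transform for this sketch)."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n)) for i in range(n))
        scale = math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)
        out.append(scale * s)
    return out

def dct_2d(block):
    """Separable 2-D DCT: transform every row, then every column."""
    rows = [dct_1d(list(r)) for r in block]
    cols = [dct_1d(list(c)) for c in zip(*rows)]
    # Transpose back so that result[v][u] holds vertical frequency v, horizontal frequency u.
    return [list(r) for r in zip(*cols)]

def split_into_blocks(frame, size):
    """Cut a frame (list of pixel rows) into size x size blocks in raster order."""
    height, width = len(frame), len(frame[0])
    return [
        [row[x:x + size] for row in frame[y:y + size]]
        for y in range(0, height, size)
        for x in range(0, width, size)
    ]

# Usage: transform every 8x8 block of a 16x16 frame.
frame = [[float((x + y) % 256) for x in range(16)] for y in range(16)]
coefficients = [dct_2d(b) for b in split_into_blocks(frame, 8)]
```

For a block in which every pixel has the same value, only the lowest-frequency (DC) coefficient is non-zero, which is why smooth picture areas compress well under this scheme.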
As one function of video encoding, a scalability function is demanded by which the image quality (SNR: Signal to Noise Ratio), the spatial resolution, and the temporal resolution can be changed step by step by partially decoding a bit stream.
The scalability function is incorporated into the Video Part (IS13818-2) of MPEG2, which is standardized by ISO/IEC.
This scalability is realized by hierarchical encoding methods. It takes two forms, SNR scalability and spatial scalability, each of which has its own encoder and decoder.
In the encoder, layers are divided into a base layer (lower layer) whose image quality is low and an enhancement layer (upper layer) whose image quality is high.
In the base layer, data is encoded by MPEG1 or MPEG2. In the enhancement layer, the data encoded by the base layer is reconstructed and the reconstructed base layer data is subtracted from the enhancement layer data. Only the resulting error is quantized by a quantization step size smaller than the quantization step size in the base layer and encoded. That is, the data is more finely quantized and encoded. The resolution can be increased by adding the enhancement layer information to the base layer information, and this makes the transmission and storage of high-quality pictures feasible.
As described above, pictures are divided into the base layer and the enhancement layer, data encoded by the base layer is reconstructed, the reconstructed data is subtracted from the original data, and only the resulting error is quantized by a quantization step size smaller than the quantization step size in the base layer and encoded. Consequently, pictures can be encoded and decoded at a high resolution. This technique is called SNR scalability.
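The two-layer quantization described above can be sketched as follows. This is a simplified illustration that assumes uniform scalar quantizers applied to a short list of values and omits motion compensation and entropy coding; the names `encode_two_layers` and `decode_layers` and the step sizes are hypothetical.

```python
def quantize(value, step):
    """Uniform scalar quantizer: map a value to the nearest integer level."""
    return round(value / step)

def dequantize(level, step):
    """Reconstruct the value represented by a quantization level."""
    return level * step

def encode_two_layers(values, base_step, enh_step):
    """Coarsely quantize the input (base layer), then quantize only the
    remaining error with a smaller step size (enhancement layer)."""
    base_levels = [quantize(v, base_step) for v in values]
    base_recon = [dequantize(l, base_step) for l in base_levels]
    residual = [v - r for v, r in zip(values, base_recon)]
    enh_levels = [quantize(e, enh_step) for e in residual]
    return base_levels, enh_levels

def decode_layers(base_levels, enh_levels, base_step, enh_step, use_enhancement=True):
    """Decode the base layer alone, or add the enhancement layer for higher quality."""
    out = []
    for b, e in zip(base_levels, enh_levels):
        value = dequantize(b, base_step)
        if use_enhancement:
            value += dequantize(e, enh_step)
        out.append(value)
    return out

# Usage: the same bit stream yields a coarse or a fine reconstruction.
values = [100.0, -37.0, 12.0, 3.0]
base, enh = encode_two_layers(values, base_step=16, enh_step=4)
coarse = decode_layers(base, enh, 16, 4, use_enhancement=False)
fine = decode_layers(base, enh, 16, 4)
```

With uniform quantizers of this kind, the base-only reconstruction is accurate to within half the base step size, and adding the enhancement layer tightens this to within half the enhancement step size.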
In the encoder, an input picture is supplied to the base layer and the enhancement layer. In the base layer, the input picture is so processed as to obtain an error from a motion compensation prediction value obtained from a picture of the previous frame, and the error is subjected to orthogonal transform (DCT). The transform coefficient is quantized and variable-length-encoded to obtain a base layer output. The quantized output is dequantized, subjected to inverse DCT, and added to the motion compensation prediction value of the previous frame, thereby obtaining a frame picture. Motion compensation prediction is performed on the basis of this frame picture to obtain the motion compensation prediction value of the previous frame.
In the enhancement layer, on the other hand, the input picture is delayed until the prediction value is obtained from the base layer, and processing is performed to obtain an error from a motion compensation prediction value in the enhancement layer obtained from the picture of the previous frame. The error is then subjected to orthogonal transform (DCT), and the transform coefficient is corrected by using the dequantized output from the base layer, quantized, and variable-length-encoded, thereby obtaining an enhancement layer output. The quantized output is dequantized, added to the motion compensation prediction value of the previous frame obtained in the base layer, and subjected to inverse DCT. A frame picture is obtained by adding to the result of the inverse DCT the motion compensation prediction value of the previous frame obtained in the enhancement layer. Motion compensation prediction is performed on the basis of this frame picture to obtain a motion compensation prediction value of the previous frame in the enhancement layer.
In this way, video pictures can be encoded by using the SNR scalability. Note that although this SNR scalability is expressed by two layers, various SNR reconstructed pictures can be obtained by increasing the number of layers.
In the decoder, the variable-length encoded data of the enhancement layer and the variable-length encoded data of the base layer, which are supplied separately, are separately variable-length-decoded and dequantized. The two dequantized results are added, and the sum is subjected to inverse DCT. The picture signal is restored by adding the motion compensation prediction value of the previous frame to the result of the inverse DCT. Also, motion compensation prediction is performed on the basis of a picture in an immediately previous frame obtained from the restored picture signal, thereby obtaining a motion compensation prediction value of the previous frame.
The foregoing are examples of encoding and decoding using the SNR scalability.
On the other hand, the spatial scalability is done on the basis of the spatial resolution, and encoding is separately performed in a base layer whose spatial resolution is low and an enhancement layer whose spatial resolution is high. In the base layer, encoding is performed by using a normal MPEG2 encoding method. In the enhancement layer, up-sampling (in which a high-resolution picture is formed by adding pixels such as average values between pixels of a low-resolution picture) is performed for the picture from the base layer to thereby form a picture having the same size as the enhancement layer. Prediction is adaptively performed on the basis of motion compensation prediction using the picture of the enhancement layer and motion compensation prediction using the up-sampled picture. Consequently, encoding can be performed at a high efficiency.
The spatial scalability exists in order to achieve backward compatibility by which, for example, a portion of a bit stream of MPEG2 can be extracted and decoded by MPEG1. That is, the spatial scalability is not a function capable of reconstructing pictures with various resolutions (reference: "Special Edition MPEG", Television Magazine, Vol. 49, No. 4, pp. 458-463, 1995).
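The up-sampling step mentioned above (inserting average values between pixels of the low-resolution picture) can be sketched as follows. This is an illustrative 2x interpolator only; the handling of the last pixel in each row and column (repeating it) is an assumption of this sketch, and the names `upsample_row` and `upsample_2x` are hypothetical.

```python
def upsample_row(row):
    """Double the length of a row by inserting the average of each pair of
    neighboring pixels. The last pixel is repeated at the edge (assumption)."""
    out = []
    for i, pixel in enumerate(row):
        neighbor = row[i + 1] if i + 1 < len(row) else pixel
        out.append(pixel)
        out.append((pixel + neighbor) / 2)
    return out

def upsample_2x(picture):
    """Up-sample a picture by 2 horizontally, then by 2 vertically."""
    wide = [upsample_row(list(r)) for r in picture]
    tall_columns = [upsample_row(list(c)) for c in zip(*wide)]
    return [list(r) for r in zip(*tall_columns)]

# Usage: a 2x2 base-layer picture becomes a 4x4 prediction candidate
# for the enhancement layer.
base_picture = [[0, 4], [8, 12]]
prediction = upsample_2x(base_picture)
```

The up-sampled picture has the same size as the enhancement-layer picture, so the encoder can choose adaptively between it and the enhancement layer's own motion-compensated prediction.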
More specifically, the video encoding technology of MPEG2 aims to accomplish high-efficiency encoding of high-quality pictures and high-quality reconstruction of the encoded pictures. In this technology, pictures faithful to encoded pictures can be reconstructed.
With the spread of multimedia, however, the demands placed on the reconstruction side have diversified. There is a demand for a reconstructing apparatus capable of fully decoding data of high-quality pictures encoded at a high efficiency. In addition, there are demands for a system, such as a portable system, which is only required to reconstruct pictures regardless of whether the image quality is high, and for a simplified system whose price is reduced.
To meet these demands, a picture is divided into, e.g., 8×8 pixel matrix blocks and DCT is performed in units of blocks. In this case, 8×8 transform coefficients are obtained. Although it is originally necessary to decode the data from the first low frequency component to the eighth low frequency component, the data is decoded from the first low frequency component to the fourth low frequency component or from the first low frequency component to the sixth low frequency component. In this manner decoding is simplified by restoring the picture by reconstructing the signal of 4×4 resolution or the signal of 6×6 resolution, rather than the signal of 8×8 resolution.
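Reduced-resolution decoding of this kind can be sketched as follows. The sketch assumes an orthonormal DCT and rescales the retained coefficients by keep/8 so that the mean brightness of the block is preserved; that scaling rule and the names `idct_1d` and `decode_low_frequency` are assumptions of this illustration rather than details taken from the text.

```python
import math

def _scale(k, n):
    return math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)

def dct_1d(x):
    """Orthonormal DCT-II."""
    n = len(x)
    return [_scale(k, n) * sum(x[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                               for i in range(n)) for k in range(n)]

def idct_1d(c):
    """Inverse of dct_1d (orthonormal DCT-III)."""
    n = len(c)
    return [sum(_scale(k, n) * c[k] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                for k in range(n)) for i in range(n)]

def _apply_2d(transform, block):
    """Apply a 1-D transform to every row, then to every column."""
    rows = [transform(list(r)) for r in block]
    cols = [transform(list(c)) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]

def dct_2d(block):
    return _apply_2d(dct_1d, block)

def decode_low_frequency(coeffs, keep):
    """Reconstruct a keep x keep block from the keep x keep lowest-frequency
    coefficients of a larger block (e.g. 4x4 or 6x6 out of 8x8)."""
    n = len(coeffs)
    # Rescale so that a flat block keeps its pixel value (assumed rule).
    sub = [[coeffs[v][u] * keep / n for u in range(keep)] for v in range(keep)]
    return _apply_2d(idct_1d, sub)

# Usage: decode an 8x8 block at 4x4 resolution.
block = [[float(x + 8 * y) for x in range(8)] for y in range(8)]
small = decode_low_frequency(dct_2d(block), 4)
```

Because the discarded high-frequency coefficients are simply ignored, the 4×4 result is an approximation of the original block, which is the source of the encoder/decoder mismatch discussed next.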
Unfortunately, when a picture which originally has 8×8 information is restored by using 4×4 or 6×6 information, a mismatch occurs between the restored value and the motion compensation prediction value, and errors are accumulated. This significantly degrades the picture. Therefore, it is an important subject to overcome this mismatch between the encoding side and the decoding side.
Note that as a method of converting the spatial resolution in order to control the difference between the spatial resolutions on the encoding side and the decoding side, there is another method, although the method is not standardized, by which the spatial resolution is made variable by inversely converting some coefficients of orthogonal transform (e.g., DCT (Discrete Cosine Transform)) by an order smaller than the original order.
Unfortunately, when motion compensation prediction is performed by using the resolution-converted picture, image quality degradation called a drift resulting from the motion compensation prediction occurs in the reconstructed picture (reference: Iwahashi et al., "Motion Compensation for Reducing Drift in Scalable Decoder", Shingaku Giho IE94-97, 1994).
Accordingly, the method has a problem as a technique to overcome the mismatch between the encoding side and the decoding side.
On the other hand, the spatial scalability exists in order to achieve backward compatibility by which, for example, a portion of a bit stream of MPEG2 can be extracted and decoded by MPEG1. That is, the spatial scalability is not a function capable of reconstructing pictures with various resolutions (reference: "Special Edition MPEG", Television Magazine, Vol. 49, No. 4, pp. 458-463, 1995). Since hierarchical encoding is performed to realize the scalability function as described above, information is divisionally encoded and this decreases the coding efficiency.
A video encoding system belonging to a category called mid-level encoding is proposed in J. Y. A. Wang et al., "Applying Mid-level Vision Techniques for Video Data Compression and Manipulation", M.I.T. Media Lab. Tech. Report No. 263, February 1994.
In this system, a background and an object are separately encoded. To separately encode the background and the object, an alpha-map signal which represents the shape of the object and the position of the object in a frame is necessary. An alpha-map signal of the background can be uniquely obtained from the alpha-map signal of the object.
In an encoding system like this, a picture with an arbitrary shape must be encoded. As a method of encoding an arbitrary-shape picture, there is an arbitrary-shape picture signal orthogonal transform method described in previously filed Japanese Patent Application No. 7-97073. In this orthogonal transform method, the values of pixels contained in a specific domain are separated from an input edge block signal by a separation circuit (SEP), and an average value calculation circuit (AVE) calculates an average value a of the separated pixel values.
If an alpha-map indicates a pixel in the specific domain, a selector (SEL) outputs the pixel value in the specific domain stored in a block memory (MEM). If the alpha-map indicates another pixel, the selector outputs the average value a. The block signal thus processed is subjected to two-dimensional DCT to obtain transform coefficients for pixels in the specific domain.
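The separation, averaging, and selection steps described above can be sketched as follows. This is a minimal illustration that models the alpha-map as a block of 0/1 flags; the function name `pad_with_object_mean` is hypothetical, and the two-dimensional DCT that follows is omitted.

```python
def pad_with_object_mean(block, alpha):
    """Replace every pixel outside the specific domain with the average of
    the pixels inside it, as preparation for a block DCT.

    block: rows of pixel values for one edge block.
    alpha: rows of flags, truthy where the pixel belongs to the specific domain.
    """
    # SEP: separate the pixel values that lie inside the specific domain.
    inside = [p for row, flags in zip(block, alpha) for p, a in zip(row, flags) if a]
    if not inside:
        raise ValueError("an edge block must contain at least one object pixel")
    # AVE: average value a of the separated pixel values.
    a_mean = sum(inside) / len(inside)
    # SEL: keep object pixels, output the average value everywhere else.
    return [[p if a else a_mean for p, a in zip(row, flags)]
            for row, flags in zip(block, alpha)]

# Usage: the bottom-right pixel lies outside the object and is replaced.
block = [[10, 20], [30, 99]]
alpha = [[1, 1], [1, 0]]
padded = pad_with_object_mean(block, alpha)
```

Filling the outside pixels with the object average keeps the DC level of the block equal to that of the object pixels, at the cost of a possible step in pixel values at the object boundary.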
On the other hand, inverse transform is accomplished by separating the pixel values in the specific domain from pixel values in the block obtained by performing inverse DCT for the transform coefficient.
As described above, in the scalable encoding method capable of dividing pictures into multiple layers, the coding efficiency is sometimes greatly decreased when video pictures are encoded. In addition, scalable encoding by which the resolution and the image quality can be made variable is also required in an arbitrary-shape picture encoding apparatus which separately encodes the background and the object. It is also necessary to improve the efficiency of motion compensation prediction encoding for an arbitrary-shape picture.
On the other hand, the mid-level encoding system has the advantage that a method of evenly arranging the internal average value of the object in the background can be realized with a few calculations. However, a step in pixel values is sometimes formed at the boundary between the object and the background. If DCT is performed in such a case, a large quantity of high-frequency components is generated, and so the amount of code is not decreased.