1. Field of the Invention
The present invention relates to video encoding and decoding apparatuses for encoding a picture signal at a high efficiency and transmitting or storing the encoded signal and, more particularly, to video encoding and decoding apparatuses with a scalable function capable of scalable coding by which the resolution and the image quality can be changed into multiple layers.
2. Description of the Related Art
Generally, a picture signal is compression-encoded before being transmitted or stored because the signal has an enormous amount of information. To encode a picture signal at a high efficiency, pictures whose unit is a frame are divided into a plurality of blocks in units of a predetermined number of pixels. Orthogonal transform is performed for each block to separate the spatial frequency of a picture into frequency components. Each frequency component is obtained as a transform coefficient and encoded.
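The block-wise orthogonal transform described above can be illustrated with a minimal Python sketch. It assumes the orthogonal transform is an orthonormal two-dimensional DCT, that the block size is 8, and that the frame dimensions are multiples of the block size; the function names and the sample frame are hypothetical and are not taken from the specification.

```python
import math

def dct_matrix(n):
    """Return the n x n orthonormal DCT-II matrix (row k = k-th frequency)."""
    matrix = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        matrix.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                       for i in range(n)])
    return matrix

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def dct2(block):
    """Two-dimensional DCT of a square block: C * block * C^T."""
    c = dct_matrix(len(block))
    c_t = [list(col) for col in zip(*c)]
    return matmul(matmul(c, block), c_t)

def split_into_blocks(frame, n):
    """Cut a frame into n x n blocks in raster order (dimensions assumed divisible by n)."""
    return [[row[x:x + n] for row in frame[y:y + n]]
            for y in range(0, len(frame), n)
            for x in range(0, len(frame[0]), n)]

# Hypothetical 16 x 16 test frame; each 8 x 8 block yields 8 x 8 transform coefficients.
frame = [[(3 * x + 5 * y) % 256 for x in range(16)] for y in range(16)]
blocks = split_into_blocks(frame, 8)
coefficients = [dct2(block) for block in blocks]
```

With this normalization, a block of constant pixel value produces a single nonzero (DC) coefficient, which is why smooth picture areas compress well.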
As one function of video encoding, a scalability function is demanded by which the image quality (SNR: Signal to Noise Ratio), the spatial resolution, and the temporal resolution can be changed step by step by partially decoding a bit stream.
The scalability function is incorporated into the Video Part (ISO/IEC 13818-2) of MPEG2, which is standardized by ISO/IEC.
This scalability is realized by hierarchical encoding methods, which comprise an encoder and a decoder for SNR scalability and an encoder and a decoder for spatial scalability.
In the encoder, layers are divided into a base layer (lower layer) whose image quality is low and an enhancement layer (upper layer) whose image quality is high.
In the base layer, data is encoded by MPEG1 or MPEG2. In the enhancement layer, the data encoded by the base layer is reconstructed and the reconstructed base layer data is subtracted from the enhancement layer data. Only the resulting error is quantized by a quantization step size smaller than the quantization step size in the base layer and encoded. That is, the data is more finely quantized and encoded. The resolution can be increased by adding the enhancement layer information to the base layer information, and this makes the transmission and storage of high-quality pictures feasible.
As described above, pictures are divided into the base layer and the enhancement layer, data encoded by the base layer is reconstructed, the reconstructed data is subtracted from the original data, and only the resulting error is quantized by a quantization step size smaller than the quantization step size in the base layer and encoded. Consequently, pictures can be encoded and decoded at a high resolution. This technique is called SNR scalability.
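The two-layer quantization just described can be sketched as follows. This is a simplified illustration that operates on a short list of transform coefficients and omits motion compensation and entropy coding; the step sizes, the rounding rule, and all names are assumptions chosen for the example.

```python
import math

def quantize(value, step):
    """Uniform quantizer with round-half-up (an assumed rounding rule)."""
    return math.floor(value / step + 0.5)

def dequantize(level, step):
    return level * step

def encode_two_layers(coeffs, base_step, enh_step):
    """Quantize coarsely in the base layer, then finely quantize the remaining error."""
    base_levels = [quantize(c, base_step) for c in coeffs]
    base_recon = [dequantize(l, base_step) for l in base_levels]
    enh_levels = [quantize(c - r, enh_step) for c, r in zip(coeffs, base_recon)]
    return base_levels, enh_levels

def reconstruct(base_levels, enh_levels, base_step, enh_step):
    """Base-only reconstruction when enh_levels is None, otherwise base plus enhancement."""
    base = [dequantize(l, base_step) for l in base_levels]
    if enh_levels is None:
        return base
    return [b + dequantize(e, enh_step) for b, e in zip(base, enh_levels)]

coeffs = [100.0, -37.0, 12.0, 3.0]   # hypothetical transform coefficients
BASE_STEP, ENH_STEP = 16, 4          # enhancement step is smaller than the base step
base_levels, enh_levels = encode_two_layers(coeffs, BASE_STEP, ENH_STEP)
low_quality = reconstruct(base_levels, None, BASE_STEP, ENH_STEP)
high_quality = reconstruct(base_levels, enh_levels, BASE_STEP, ENH_STEP)
```

Decoding only the base layer gives a coarse reconstruction, while adding the enhancement layer bounds the error by half the smaller step size.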
In the encoder, an input picture is supplied to the base layer and the enhancement layer. In the base layer, the input picture is so processed as to obtain an error from a motion compensation prediction value obtained from a picture of the previous frame, and the error is subjected to orthogonal transform (DCT). The transform coefficient is quantized and variable-length-encoded to obtain a base layer output. The quantized output is dequantized, subjected to inverse DCT, and added to the motion compensation prediction value of the previous frame, thereby obtaining a frame picture. Motion compensation prediction is performed on the basis of this frame picture to obtain the motion compensation prediction value of the previous frame.
In the enhancement layer, on the other hand, the input picture is delayed until the prediction value is obtained from the base layer, and processing is performed to obtain an error from a motion compensation prediction value in the enhancement layer obtained from the picture of the previous frame. The error is then subjected to orthogonal transform (DCT), and the transform coefficient is corrected by using the dequantized output from the base layer, quantized, and variable-length-encoded, thereby obtaining an enhancement layer output. The quantized output is dequantized, added to the motion compensation prediction value of the previous frame obtained in the base layer, and subjected to inverse DCT. A frame picture is obtained by adding to the result of the inverse DCT the motion compensation prediction value of the previous frame obtained in the enhancement layer. Motion compensation prediction is performed on the basis of this frame picture to obtain a motion compensation prediction value of the previous frame in the enhancement layer.
In this way, video pictures can be encoded by using the SNR scalability. Note that although this SNR scalability is expressed by two layers, various SNR reconstructed pictures can be obtained by increasing the number of layers.
In the decoder, the variable-length encoded data of the enhancement layer and the variable-length encoded data of the base layer, which are separately supplied, are separately variable-length-decoded and dequantized. The two dequantized data are added, and the result is subjected to inverse DCT. The picture signal is restored by adding the motion compensation prediction value of the previous frame to the result of the inverse DCT. Also, motion compensation prediction is performed on the basis of a picture in an immediately previous frame obtained from the restored picture signal, thereby obtaining a motion compensation prediction value of the previous frame.
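The decoder steps above (dequantize both layers, add, inverse DCT, add the prediction) can be sketched for a single block as follows. The sketch assumes the variable-length decoding has already produced integer quantization levels and that the motion compensation prediction block is supplied by the caller; the names, step sizes, and the 2 x 2 block size are hypothetical.

```python
import math

def idct2(coeffs):
    """Inverse orthonormal 2-D DCT of a square block: C^T * coeffs * C."""
    n = len(coeffs)
    c = [[(math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n))
          * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
          for i in range(n)] for k in range(n)]
    tmp = [[sum(c[k][i] * coeffs[k][j] for k in range(n)) for j in range(n)]
           for i in range(n)]
    return [[sum(tmp[i][k] * c[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def decode_block(base_levels, enh_levels, base_step, enh_step, prediction):
    """Add the two dequantized layers, apply inverse DCT, then add the prediction."""
    n = len(base_levels)
    summed = [[base_levels[u][v] * base_step + enh_levels[u][v] * enh_step
               for v in range(n)] for u in range(n)]
    residual = idct2(summed)
    return [[prediction[y][x] + residual[y][x] for x in range(n)]
            for y in range(n)]

# Hypothetical 2 x 2 example: only the DC coefficient is nonzero in either layer.
base_levels = [[2, 0], [0, 0]]
enh_levels = [[1, 0], [0, 0]]
prediction = [[100.0, 100.0], [100.0, 100.0]]
restored = decode_block(base_levels, enh_levels, 8, 2, prediction)
```

In a complete decoder the restored block would also be stored as the reference for predicting the next frame, which this sketch leaves out.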
The foregoing are examples of encoding and decoding using the SNR scalability.
On the other hand, the spatial scalability is done on the basis of the spatial resolution, and encoding is separately performed in a base layer whose spatial resolution is low and an enhancement layer whose spatial resolution is high. In the base layer, encoding is performed by using a normal MPEG2 encoding method. In the enhancement layer, up-sampling (in which a high-resolution picture is formed by adding pixels such as average values between pixels of a low-resolution picture) is performed for the picture from the base layer to thereby form a picture having the same size as the enhancement layer. Prediction is adaptively performed on the basis of motion compensation prediction using the picture of the enhancement layer and motion compensation prediction using the up-sampled picture. Consequently, encoding can be performed at a high efficiency.
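The up-sampling by inserted average values mentioned above can be sketched as follows. This is an illustration only: it doubles the resolution in each direction, inserts the average of neighbouring pixels, and repeats the last pixel at the picture edge, which is an assumption and not the interpolation filter defined by MPEG2. The function names are hypothetical.

```python
def upsample_line(line):
    """Double the length of a line by inserting the average of each neighbouring pair."""
    out = []
    for i, pixel in enumerate(line):
        following = line[i + 1] if i + 1 < len(line) else pixel  # edge: repeat last pixel
        out.append(pixel)
        out.append((pixel + following) / 2)
    return out

def upsample_2x(picture):
    """Up-sample horizontally, then vertically, to twice the size in each direction."""
    widened = [upsample_line(row) for row in picture]
    columns = [upsample_line(list(col)) for col in zip(*widened)]
    return [list(row) for row in zip(*columns)]

base_picture = [[0, 4],
                [8, 12]]          # hypothetical low-resolution base layer picture
enlarged = upsample_2x(base_picture)
```

The enlarged picture has the same size as the enhancement layer picture and can serve as one of the prediction candidates.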
The spatial scalability exists in order to achieve backward compatibility by which, for example, a portion of a bit stream of MPEG2 can be extracted and decoded by MPEG1. That is, the spatial scalability is not a function capable of reconstructing pictures with various resolutions (reference: "Special Edition MPEG", Television Magazine, Vol. 49, No. 4, pp. 458-463, 1995).
More specifically, the video encoding technology of MPEG2 aims to accomplish high-efficiency encoding of high-quality pictures and high-quality reconstruction of the encoded pictures. In this technology, pictures faithful to encoded pictures can be reconstructed.
Unfortunately, with the spread of multimedia, there is a demand for a reconstructing apparatus capable of fully decoding data of high-quality pictures encoded at a high efficiency, as a system on the reconstruction side. In addition, there are demands for a system such as a portable system which is only required to reconstruct pictures regardless of whether the image quality is high, and for a simplified system by which the system price is decreased.
To meet these demands, a picture is divided into, e.g., 8×8 pixel matrix blocks and DCT is performed in units of blocks. In this case, 8×8 transform coefficients are obtained. Although it is originally necessary to decode the data from the first low frequency component to the eighth low frequency component, the data is decoded from the first low frequency component to the fourth low frequency component or from the first low frequency component to the sixth low frequency component. In this manner, decoding is simplified by restoring the picture by reconstructing the signal of 4×4 resolution or the signal of 6×6 resolution, rather than the signal of 8×8 resolution.
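The simplified decoding described above can be sketched as follows: only the lowest M×M coefficients of an N×N block are kept and an M-point inverse DCT is applied. The sketch assumes an orthonormal DCT, for which the retained coefficients are scaled by M/N so that the average brightness is preserved; that scaling and all names are assumptions made for illustration.

```python
import math

def idct2(coeffs):
    """Inverse orthonormal 2-D DCT of a square block: C^T * coeffs * C."""
    n = len(coeffs)
    c = [[(math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n))
          * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
          for i in range(n)] for k in range(n)]
    tmp = [[sum(c[k][i] * coeffs[k][j] for k in range(n)) for j in range(n)]
           for i in range(n)]
    return [[sum(tmp[i][k] * c[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def reduced_decode(coeffs, m):
    """Reconstruct an m x m picture from the m x m lowest-frequency coefficients."""
    n = len(coeffs)
    scale = m / n  # keeps the mean pixel value unchanged for an orthonormal DCT
    low = [[coeffs[u][v] * scale for v in range(m)] for u in range(m)]
    return idct2(low)

# Hypothetical 8 x 8 coefficient block of a flat picture area with pixel value 10
# (for the orthonormal DCT the DC coefficient is 8 * 10 = 80).
coeffs_8x8 = [[0.0] * 8 for _ in range(8)]
coeffs_8x8[0][0] = 80.0
picture_4x4 = reduced_decode(coeffs_8x8, 4)
picture_6x6 = reduced_decode(coeffs_8x8, 6)
```

Because the discarded high-frequency coefficients are never used, the reduced picture generally differs from what the encoder used as its prediction reference.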
Unfortunately, when a picture which originally has 8×8 information is restored by using 4×4 or 6×6 information, a mismatch occurs between the restored value and the motion compensation prediction value, and errors are accumulated. This significantly degrades the picture. Therefore, it is an important subject to overcome this mismatch between the encoding side and the decoding side.
Note that as a method of converting the spatial resolution in order to control the difference between the spatial resolutions on the encoding side and the decoding side, there is another method, although the method is not standardized, by which the spatial resolution is made variable by inversely converting some coefficients of orthogonal transform (e.g., DCT (Discrete Cosine Transform)) by an order smaller than the original order.
Unfortunately, when motion compensation prediction is performed by using the resolution-converted picture, image quality degradation called drift resulting from the motion compensation prediction occurs in the reconstructed picture (reference: Iwahashi et al., "Motion Compensation for Reducing Drift in Scalable Decoder", Shingaku Giho IE94-97, 1994).
Accordingly, the method has a problem as a technique to overcome the mismatch between the encoding side and the decoding side.
Furthermore, since hierarchical encoding is performed to realize the scalability function as described above, information is divisionally encoded, and this decreases the coding efficiency.
A video encoding system belonging to a category called mid-level encoding is proposed in J. Y. A. Wang et al., "Applying Mid-level Vision Techniques for Video Data Compression and Manipulation", M.I.T. Media Lab. Tech. Report No. 263, February 1994.
In this system, a background and an object are separately encoded. To separately encode the background and the object, an alpha-map signal which represents the shape of the object and the position of the object in a frame is necessary. An alpha-map signal of the background can be uniquely obtained from the alpha-map signal of the object.
In an encoding system like this, a picture with an arbitrary shape must be encoded. As a method of encoding an arbitrary-shape picture, there is an arbitrary-shape picture signal orthogonal transform method described in previously filed Japanese Patent Application No. 7-97073. In this orthogonal transform method, the values of pixels contained in a specific domain are separated from an input edge block signal by a separation circuit (SEP), and an average value calculation circuit (AVE) calculates an average value a of the separated pixel values.
If the alpha-map indicates a pixel in the specific domain, a selector (SEL) outputs the pixel value in the specific domain stored in a block memory (MEM). If the alpha-map indicates another pixel, the selector outputs the average value a. The block signal thus processed is subjected to two-dimensional DCT to obtain transform coefficients for pixels in the specific domain.
On the other hand, inverse transform is accomplished by separating the pixel values in the specific domain from pixel values in the block obtained by performing inverse DCT for the transform coefficient.
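The average-value padding and the subsequent separation described above can be sketched as follows. The sketch assumes a binary alpha-map in which 1 marks a pixel of the specific domain (the object) and 0 marks any other pixel; the DCT and inverse DCT stages are omitted, and the function names are hypothetical.

```python
def pad_with_average(block, alpha):
    """Replace every pixel outside the specific domain by the average of the pixels inside it."""
    inside = [p for row, arow in zip(block, alpha)
              for p, a in zip(row, arow) if a]
    if not inside:                       # no object pixel in this block: nothing to pad
        return [list(row) for row in block]
    average = sum(inside) / len(inside)  # the average value "a" of the separated pixels
    return [[p if a else average for p, a in zip(row, arow)]
            for row, arow in zip(block, alpha)]

def separate_object(decoded_block, alpha):
    """After the inverse transform, keep only the pixels of the specific domain."""
    return [[p if a else None for p, a in zip(row, arow)]
            for row, arow in zip(decoded_block, alpha)]

edge_block = [[10, 20],
              [30, 99]]      # hypothetical edge block; 99 belongs to the background
alpha_map = [[1, 1],
             [1, 0]]
padded = pad_with_average(edge_block, alpha_map)
object_only = separate_object(padded, alpha_map)
```

The padded block is a full rectangular block, so an ordinary two-dimensional DCT can be applied to it.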
As described above, in the scalable encoding method capable of dividing pictures into multiple layers, the coding efficiency is sometimes greatly decreased when video pictures are encoded. In addition, scalable encoding by which the resolution and the image quality can be made variable is also required in an arbitrary-shape picture encoding apparatus which separately encodes the background and the object. It is also necessary to improve the efficiency of motion compensation prediction encoding for an arbitrary-shape picture.
On the other hand, the mid-level encoding system has the advantage that a method of evenly arranging the internal average value of the object in the background can be realized with a few calculations. However, a step of pixel values is sometimes formed in the boundary between the object and the background. If DCT is performed in a case like this, a large quantity of high-frequency components are generated and so the amount of codes is not decreased.
It is an object of the present invention to provide an encoding apparatus and a decoding apparatus capable of improving the coding efficiency when video pictures are encoded by a scalable encoding method by which pictures can be divided into multiple layers.
It is another object of the present invention to provide a scalable encoding apparatus and a scalable decoding apparatus capable of making the resolution and the image quality variable and improving the coding efficiency in an arbitrary-shape picture encoding apparatus which separately encodes a background and an object.
It is still another object of the present invention to improve the efficiency of motion compensation prediction encoding for arbitrary-shape pictures.
It is still another object of the present invention to alleviate the drawback that the code amount is not decreased due to the generation of a large quantity of high-frequency components when DCT is performed, even if a step of pixel values is formed in the boundary between an object and a background when a method of evenly arranging an internal average value of the object in the background is used.
According to the present invention, there is provided a video encoding apparatus comprising: an orthogonal transform circuit for orthogonally transforming an input picture signal to obtain a plurality of transform coefficients; a first local decoder for outputting first transform coefficients for a fine motion compensation prediction picture on the basis of a previous picture; a second local decoder for outputting second transform coefficients for a coarse motion compensation prediction picture on the basis of a current picture corresponding to the input picture signal; means for detecting a degree of motion compensation prediction in the second local decoder; a selector for selectively outputting the first and second transform coefficients in accordance with the degree of motion compensation prediction; a first calculator for calculating a difference between the transform coefficients of the orthogonal transform circuit and ones of the first and second transform coefficients which are selected by the selector, and outputting a motion compensation prediction error signal; a first quantizer for quantizing the motion compensation prediction error signal from the first calculator and outputting a first quantized motion compensation prediction error signal; a second calculator for calculating a difference between the second transform coefficients from the second local decoder and the transform coefficients from the orthogonal transform circuit, and outputting a second motion compensation prediction error signal; a second quantizer for quantizing the motion compensation prediction error signal from the second calculator, and outputting a second quantized motion compensation prediction error signal; and an encoder for encoding the first and second quantized motion compensation prediction error signals and outputting encoded signals.
According to the present invention, there is provided a video encoding apparatus comprising: an orthogonal transform circuit for dividing an input video signal into a plurality of blocks each containing N×N pixels and orthogonally transforming the input video signal in units of blocks to obtain a plurality of transform coefficients divided in spatial frequency bands; a first motion prediction processing section for performing motion compensation prediction processing for the plurality of transform coefficients in order to obtain an upper-layer motion compensation prediction signal having the number of data enough to obtain a high image quality; a second motion prediction processing section for performing motion compensation prediction processing for the plurality of transform coefficients in order to obtain a lower-layer motion compensation prediction signal upon reducing the number of data; a decision section for deciding in motion compensation on the basis of the lower-layer motion compensation prediction signal whether motion compensation prediction is correct; a selector for selecting the upper-layer motion compensation prediction signal in response to a decision representing a correct motion compensation prediction from the decision section, and the lower-layer motion compensation prediction signal in response to a decision representing an incorrect motion compensation prediction; and an encoder for encoding one of the upper-layer motion compensation prediction signal and the lower-layer motion compensation prediction signal which is selected by the selector.
According to the present invention, there is provided a video encoding apparatus for realizing SNR scalability in M layers, comprising: an orthogonal transform circuit for dividing an input video signal into a plurality of blocks each containing N×N pixels and orthogonally transforming the input video signal in units of blocks to obtain a plurality of transform coefficients divided in spatial frequency bands; a first motion compensation prediction processing section for performing motion compensation prediction processing for the plurality of transform coefficients in order to obtain an mth-layer (m=2 to M) motion compensation prediction signal; a second motion compensation prediction processing section for performing motion compensation prediction processing for the plurality of transform coefficients in order to obtain an (m−1)th-layer motion compensation prediction signal; switching means for selecting the mth-layer motion compensation prediction signal of the first motion compensation prediction processing section in order to obtain an mth-layer prediction value when a quantized output from the second motion compensation prediction processing section is 0, and switching between the mth-layer motion compensation prediction signal and the (m−1)th-layer motion compensation prediction signal in units of transform coefficients in order to select the (m−1)th-layer motion compensation prediction signal when the quantized output is not less than 1; means for calculating a difference signal between an (m−1)th-layer dequantized output from the second motion compensation prediction processing section and an mth-layer motion compensation prediction error signal obtained by a difference between the mth-layer motion compensation prediction signal and the transform coefficient from the orthogonal transform circuit; and encoding means for quantizing and encoding the difference signal to output an encoded bit stream.
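The coefficient-by-coefficient switching recited above can be illustrated with a simplified two-layer sketch. It is one possible reading, not the claimed apparatus: "not less than 1" is interpreted as a nonzero quantized magnitude, the prediction error is formed against the selected prediction, and the step size, rounding rule, and all names are assumptions introduced for the example.

```python
import math

def quantize(value, step):
    """Uniform quantizer with round-half-up (an assumed rounding rule)."""
    return math.floor(value / step + 0.5)

def select_prediction(upper_pred, lower_pred, lower_levels):
    """Per coefficient: use the upper-layer prediction where the lower-layer
    quantized output is 0, otherwise use the lower-layer prediction."""
    return [up if level == 0 else lo
            for up, lo, level in zip(upper_pred, lower_pred, lower_levels)]

def upper_layer_difference(coeffs, upper_pred, lower_pred, lower_step):
    """Difference signal handed to the upper-layer quantizer."""
    lower_levels = [quantize(c - p, lower_step) for c, p in zip(coeffs, lower_pred)]
    lower_dequant = [level * lower_step for level in lower_levels]
    selected = select_prediction(upper_pred, lower_pred, lower_levels)
    error = [c - s for c, s in zip(coeffs, selected)]
    difference = [e - d for e, d in zip(error, lower_dequant)]
    return lower_levels, selected, difference

coeffs = [10, 20, 30]        # hypothetical transform coefficients of the input
upper_pred = [9, 15, 25]     # hypothetical mth-layer prediction
lower_pred = [8, 12, 40]     # hypothetical (m-1)th-layer prediction
lower_levels, selected, difference = upper_layer_difference(
    coeffs, upper_pred, lower_pred, lower_step=8)
```

Where the lower layer already carries a nonzero level, the difference reduces to the lower layer's quantization error; elsewhere it is the prediction error against the finer upper-layer prediction.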
According to the present invention, there is provided a video encoding/decoding system comprising: a video encoding apparatus for realizing SNR (Signal to Noise Ratio) scalability in M layers, which includes an orthogonal transform circuit for dividing an input video signal into a plurality of blocks each containing N×N pixels and orthogonally transforming the input video signal in units of blocks to obtain a plurality of transform coefficients divided in spatial frequency bands, a first motion compensation prediction processing section for performing motion compensation prediction processing for the plurality of transform coefficients in order to obtain an mth-layer (m=2 to M) motion compensation prediction signal, a second motion compensation prediction processing section for performing motion compensation prediction processing for the plurality of transform coefficients in order to obtain an (m−1)th-layer motion compensation prediction signal, switching means for selecting the mth-layer motion compensation prediction signal of the first motion compensation prediction processing section in order to obtain an mth-layer prediction value when a quantized output from the second motion compensation prediction processing section is 0, and switching between the mth-layer motion compensation prediction signal and the (m−1)th-layer motion compensation prediction signal in units of transform coefficients in order to select the (m−1)th-layer motion compensation prediction signal when the quantized output is not less than 1, means for calculating a difference signal between an (m−1)th-layer dequantized output from the second motion compensation prediction processing section and an mth-layer motion compensation prediction error signal obtained by a difference between the mth-layer motion compensation prediction signal and the transform coefficient from the orthogonal transform circuit, and encoding means for quantizing and encoding the difference signal to output an encoded bit stream; and a video decoding apparatus which includes means for extracting codes up to a code in the mth (m=2 to M) layer from the encoded bit stream from the video encoding apparatus, decoding means for decoding the codes of respective layers up to the mth layer, dequantization means for dequantizing, in the respective layers, the quantized values decoded by the decoding means, switching means for switching the mth-layer (m=2 to M) motion compensation prediction value and the (m−1)th-layer motion compensation prediction value in units of transform coefficients, and outputting the mth-layer motion compensation prediction value for the quantized output of 0 in the (m−1)th layer and the (m−1)th-layer motion compensation prediction value for the quantized output of not less than 1 in the (m−1)th layer in units of transform coefficients in order to obtain the mth-layer prediction value, and means for adding the mth-layer motion compensation prediction value and the (m−1)th-layer motion compensation prediction value to reconstruct the mth-layer motion compensation prediction error signal.
According to the present invention, there is provided a video encoding apparatus comprising: an orthogonal transform circuit for dividing an input video signal into a plurality of blocks each containing N×N pixels and orthogonally transforming an arbitrary-shape picture in units of blocks to obtain a plurality of transform coefficients; means for encoding and outputting an alpha-map signal for discriminating a background of a picture from an object thereof; means for calculating an average value of pixel values of an object portion using the alpha-map signal in units of blocks; means for assigning the average value to a background portion of the block; means for deciding using the alpha-map signal whether a pixel in the object is close to the background; means for compressing, about the average value, the pixel in the object decided to be close to the background; and means for orthogonally transforming each block to output an orthogonal transform coefficient.
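The padding and boundary compression recited above can be illustrated with a minimal sketch. Several details are assumptions made only for the example: an object pixel is treated as "close to the background" when one of its four neighbours inside the block is a background pixel, and "compressing about the average value" is modelled as scaling the pixel's deviation from the object average by a hypothetical factor of 0.5. The orthogonal transform stage is omitted.

```python
def object_average(block, alpha):
    """Average of the object pixels (alpha = 1) in the block."""
    values = [p for row, arow in zip(block, alpha)
              for p, a in zip(row, arow) if a]
    return sum(values) / len(values) if values else 0.0

def near_background(alpha, y, x):
    """True if an object pixel has a 4-connected background neighbour inside the block."""
    for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        ny, nx = y + dy, x + dx
        if 0 <= ny < len(alpha) and 0 <= nx < len(alpha[0]) and not alpha[ny][nx]:
            return True
    return False

def pad_and_compress(block, alpha, factor=0.5):
    """Assign the object average to the background and pull boundary pixels toward it."""
    average = object_average(block, alpha)
    result = []
    for y, row in enumerate(block):
        out_row = []
        for x, pixel in enumerate(row):
            if not alpha[y][x]:
                out_row.append(average)                              # background pixel
            elif near_background(alpha, y, x):
                out_row.append(average + factor * (pixel - average))  # boundary pixel
            else:
                out_row.append(pixel)                                # interior object pixel
        result.append(out_row)
    return result

block = [[10, 30, 0],
         [10, 30, 0],
         [10, 30, 0]]        # hypothetical block; right column is background
alpha = [[1, 1, 0],
         [1, 1, 0],
         [1, 1, 0]]
smoothed = pad_and_compress(block, alpha)
```

Pulling the boundary pixels toward the average reduces the step between object and background, which is the source of the high-frequency components discussed earlier.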
Additional objects and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objects and advantages of the invention may be realized and obtained by means of the instrumentalities and combinations particularly pointed out in the appended claims.