1. Field of the Invention
The present invention relates to a motion picture encoding system which can encode a series of interlaced motion picture objects with a high encoding efficiency, and to a motion picture decoding system which can decode a series of coded interlaced motion picture objects.
2. Description of the Prior Art
The video encoding verification model, or VM, of Moving Picture Experts Group Phase-4 (MPEG-4), which is being standardized by ISO/IEC JTC1/SC29/WG11, is known as an example of a method of encoding shape information for use in a prior art motion picture encoding system. The contents of the video encoding VM continue to change with the ongoing MPEG-4 standardization effort. Hereinafter, a description of the video encoding VM will be made assuming that the video encoding VM is that of version 7.0, which will be referred to simply as the VM.
In the VM, a sequence of motion pictures is defined as a collection of motion picture objects each having an arbitrary shape with respect to time and space, and an encoding process is carried out for each motion picture object. Referring now to FIG. 25, there is illustrated a diagram showing the structure of video data in the VM. In the VM, one specific scene of a motion picture is called a video session, or VS. Furthermore, one motion picture object which can vary with time is called a video object, or VO, and is a component of a VS. Accordingly, a VS is defined as a collection of one or more VO's.
One video object layer, or VOL, is a component of a VO and is comprised of a plurality of video object planes, or VOP's. A VOL is provided with the aim of displaying motion pictures in a hierarchical form. An important factor in providing a plurality of layers for each VO with respect to time is the frame rate, and an important factor in providing a plurality of layers for each VO with respect to space is the spatial resolution, i.e., the display roughness. Each VO corresponds to one of a plurality of objects in a scene, such as one of the conferees attending a TV conference or the background seen behind the conferees. Each VOP is image data representing the state of a corresponding VO at a given time; it corresponds to one frame and is the unit on which an encoding process is performed.
Referring next to FIG. 26, there is illustrated a view showing an example of VOP's in one scene. Two VOP's, i.e., VOP1 representing a person and VOP2 representing a painting hung on the wall behind the person, are shown in FIG. 26. Each VOP is constructed from a texture data showing the color and gray-scale levels of the VOP and a shape data showing the shape of the VOP. The texture data of each pixel is comprised of an 8-bit luminance signal and a chrominance signal having one half the size of the luminance signal in both the horizontal and vertical dimensions. The shape data is a matrix of binary values in which an element is set to 1 when the corresponding pixel lies in the interior of the VOP, and is set to 0 otherwise. Each shape data has the same size as the corresponding luminance signal. In representation of a motion picture using VOP's, a conventional frame image can be formed by arranging a plurality of VOP's within one frame, as shown in FIG. 26. When only one VO exists in a motion picture sequence, each VOP is synonymous with a frame. In this case, each VOP has no shape data, and therefore only the texture data of each VOP is encoded.
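The VOP data layout described above can be sketched in a few lines of code. The following is an illustrative example only, not part of the VM: `make_vop`, the 4×4 size, and the membership test are hypothetical names and values chosen to show the pairing of a binary shape plane with a same-sized luminance plane.

```python
def make_vop(width, height, inside):
    """Build the two planes of a VOP: a binary shape (alpha) plane and
    an 8-bit luminance plane of the same size. `inside` is a predicate
    telling whether pixel (x, y) lies in the interior of the VOP."""
    shape = [[1 if inside(x, y) else 0 for x in range(width)]
             for y in range(height)]
    # Pixels outside the VOP carry no meaningful texture; set them to 0.
    texture = [[128 if shape[y][x] else 0 for x in range(width)]
               for y in range(height)]
    return shape, texture

# Example: a VOP occupying the left half of a 4x4 region.
shape, texture = make_vop(4, 4, lambda x, y: x < 2)
```

Only the shape plane is binary; the luminance value 128 here merely stands in for arbitrary 8-bit texture samples.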
Referring next to FIG. 27, there is illustrated a block diagram showing the structure of a prior art VOP encoding device for use in a VM encoding system disclosed in ISO/IEC JTC1/SC29/WG11 MPEG97/N1642, MPEG-4 Verification Model Version 7.0. In the figure, reference character P1 denotes an input VOP data, P2 denotes a shape data which is extracted from the input VOP data, P3 denotes a shape encoding unit which can encode the shape data P2, P4 denotes a shape memory which can store a local decoded shape data P7 furnished by the shape encoding unit P3, P5 denotes a motion vector of shape furnished by the shape encoding unit P3, and P6 denotes a coded shape data furnished by the shape encoding unit P3.
Furthermore, reference character P8 denotes a texture data which is extracted from the input VOP data P1, P9 denotes a texture motion detecting unit which receives the texture data P8 and then detects a motion vector of texture P10, P11 denotes a texture motion compensation unit which receives the motion vector of texture P10 and delivers a prediction data for texture P12, P13 denotes a texture encoding unit which can encode the prediction data for texture P12, P14 denotes a coded texture data furnished by the texture encoding unit P13, P16 denotes a texture memory which can store the local decoded texture data P15 furnished by the texture encoding unit P13, and P17 denotes a variable length encoding and multiplexing unit which can receive the motion vector of shape P5, the coded shape data P6, the motion vector of texture P10, and the coded texture data P14, and then furnishes a coded bitstream.
In operation, the input VOP data P1 is first divided into the shape data P2 and the texture data P8. The shape data P2 is delivered to the shape encoding unit P3, and the texture data P8 is delivered to the texture motion detecting unit P9. Each of the shape data and the texture data is then partitioned into a plurality of blocks of 16×16 pixels, and the encoding process is carried out on each 16×16-pixel block. As shown in FIG. 26, each of the plurality of blocks of the shape data on which the shape encoding process is performed is hereafter referred to as an alpha block, and each of the plurality of blocks of the texture data on which the texture encoding process is performed is hereafter referred to as a macroblock.
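The block partitioning described above can be sketched as follows. This is an illustrative example, not the VM's code: `partition` is a hypothetical helper, and the plane is assumed to be already padded to a multiple of the block size.

```python
def partition(plane, block_size=16):
    """Split a 2-D pixel plane into block_size x block_size blocks,
    scanning left to right and top to bottom. The plane is assumed
    to be padded to a multiple of block_size."""
    h, w = len(plane), len(plane[0])
    blocks = []
    for by in range(0, h, block_size):
        for bx in range(0, w, block_size):
            blocks.append([row[bx:bx + block_size]
                           for row in plane[by:by + block_size]])
    return blocks

# A 32x32 plane yields four 16x16 blocks (alpha blocks or macroblocks).
plane = [[0] * 32 for _ in range(32)]
blocks = partition(plane)
```

The same routine would serve for both planes, producing alpha blocks from the shape data and macroblocks from the texture data.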
First, the description will be directed to the encoding process for the shape data. Referring next to FIG. 28, there is illustrated a block diagram showing the structure of the shape encoding unit P3. In the figure, reference character P19 denotes a shape motion detecting unit which can receive the shape data P2 and then detect a motion vector of shape P5, P20 denotes a shape motion compensation unit which can receive the motion vector of shape P5 and then furnish a prediction data for shape P21, P22 denotes an arithmetic encoding unit which can receive the prediction data for shape P21 and then furnish a coded shape data P23, and P24 denotes a shape encoding mode selecting unit which can receive the coded shape data P23 and then furnish a coded shape data P6.
First, a description will be made as to the motion detection which is carried out for the input shape data P2. When the shape motion detecting unit P19 receives the shape data P2 of each of the plurality of alpha blocks into which the shape data of the VOP has been partitioned, it detects a motion vector of shape P5 for each alpha block by referring to the motion vectors of shape of other alpha blocks around the current alpha block, which have been stored in the shape motion detecting unit P19, and to the motion vectors of texture of macroblocks around the macroblock at the same location, which have been furnished by the texture motion detecting unit P9. A block matching method such as that used for detecting the motion vector of texture of each macroblock can be used as a method of detecting the motion vector of shape of each alpha block. Using this method, a motion vector of shape is detected for each alpha block by searching a small area in the vicinity of the positions indicated by the motion vectors of shape of the surrounding alpha blocks and by the motion vectors of texture of the macroblocks around the macroblock at the same location as the alpha block being tested. The motion vector of shape P5 of each alpha block to be encoded is delivered to the variable length encoding and multiplexing unit P17 and is multiplexed into the coded bitstream P18 as needed.
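The block matching method mentioned above can be sketched as an exhaustive search minimizing the sum of absolute differences (SAD). This is a minimal illustration of the general technique, not the VM's normative search; `sad`, `block_match`, and the ±2-pixel window are hypothetical choices for the example.

```python
def sad(block, ref, rx, ry):
    """Sum of absolute differences between `block` and the same-sized
    region of `ref` whose top-left corner is (rx, ry)."""
    n = len(block)
    return sum(abs(block[y][x] - ref[ry + y][rx + x])
               for y in range(n) for x in range(n))

def block_match(block, ref, bx, by, search=2):
    """Exhaustive block matching in a small window around (bx, by);
    returns the displacement (dx, dy) with the smallest SAD."""
    n = len(block)
    best, best_mv = None, (0, 0)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            rx, ry = bx + dx, by + dy
            if 0 <= rx and 0 <= ry and rx + n <= len(ref[0]) and ry + n <= len(ref):
                cost = sad(block, ref, rx, ry)
                if best is None or cost < best:
                    best, best_mv = cost, (dx, dy)
    return best_mv
```

In the VM the search for the shape vector is centered on candidate positions given by neighboring shape and texture vectors rather than on the block's own position, but the SAD minimization itself is the same.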
Next, a description will be made as to the motion compensation and the arithmetic encoding for the shape data of each alpha block to be encoded. The shape motion compensation unit P20 generates and furnishes a prediction data for shape P21 used for the encoding process from a reference shape data stored in the shape memory P4 according to the motion vector of shape P5 determined in the above-mentioned process. The prediction data for shape P21, together with the shape data P2 of each alpha block to be encoded, is applied to the arithmetic encoding unit P22. The arithmetic encoding process is then done for each alpha block to be encoded. The arithmetic encoding method is the encoding method that can adapt dynamically to the frequency of occurrence of a series of symbols. Therefore, it is necessary to obtain the probability that the value of each pixel in the alpha block currently being encoded is 0 or 1.
In the VM, the arithmetic encoding process is done in the following manner.
(1) A pixel distribution pattern, or context, around the target pixel to be arithmetically encoded is examined.
The context construction used in the intra (intra-coding) mode, that is, when the shape data of the alpha block being encoded is encoded by using only the shape data within the VOP currently being encoded, is shown in FIG. 29a. The context construction used in the inter (inter-coding) mode, that is, when the shape data of the alpha block being encoded is encoded by also using the prediction data for shape which has been extracted in the motion compensation process, is shown in FIG. 29b. In the figures, the target pixel to be encoded is marked with `?`. In either pattern, a context number C is computed according to the following equation:

C = Σk ck·2^k

where ck is the value of a pixel in the vicinity of the pixel to be encoded, as shown in FIGS. 29a and 29b.
(2) The probability that the value of the target pixel to be encoded is 0 or 1 is obtained by indexing a probability table using the context number.
(3) The arithmetic encoding is carried out according to the indexed probability of the value of the target pixel to be encoded.
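Steps (1) and (2) above can be sketched as follows. The context computation follows the equation given above; the probability table, however, is a placeholder, since the VM defines fixed tables whose values are not reproduced here. `context_number`, `prob_zero`, and `pixel_probability` are hypothetical names for this illustration.

```python
# The intra context of FIG. 29a uses 10 already-coded neighbour pixels
# c0..c9, giving 2**10 = 1024 possible contexts.

def context_number(neighbours):
    """Compute C = sum_k c_k * 2**k from binary neighbour values."""
    return sum(c << k for k, c in enumerate(neighbours))

# Placeholder table: probability that the target pixel is 0, as a
# 16-bit integer, one entry per context. The VM's real tables are
# fixed, non-uniform values learned from training data.
prob_zero = [0x8000] * 1024

def pixel_probability(neighbours):
    """Index the probability table with the context number (step 2)."""
    return prob_zero[context_number(neighbours)]
```

The arithmetic coder of step (3) then codes the target pixel's value against this probability, so contexts whose statistics are strongly skewed toward 0 or 1 cost far less than one bit per pixel.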
The procedures mentioned above are carried out in both the intra mode and the inter mode. The shape encoding mode selecting unit P24 then selects whichever of the coded result obtained in the intra shape encoding mode and the coded result obtained in the inter shape encoding mode has the shorter code length. The final coded shape data P6 thus obtained, including information indicating the selected shape encoding mode, is delivered to the variable length encoding and multiplexing unit P17, in which the coded shape data as well as the corresponding texture data is multiplexed into the coded bitstream P18 according to a given syntax (or grammatical rules which coded data must obey). The local decoded shape data P7 of the alpha block is stored in the shape memory P4 and is also furnished to the texture motion detecting unit P9, the texture motion compensation unit P11, and the texture encoding unit P13.
Next, a description will be made as to the texture encoding. After the texture data of the VOP to be encoded is partitioned into a plurality of macroblocks, the texture data P8 of a macroblock to be encoded is applied to the texture motion detecting unit P9. The texture motion detecting unit P9 then detects a motion vector of texture P10 from the texture data P8. In the case where the texture data P8 of the macroblock to be encoded is an interlaced signal, the texture motion detecting unit P9 can perform a frame-based motion detecting operation on each macroblock composed of lines taken alternately from the two fields, and can also, independently, perform a field-based motion detecting operation on each macroblock composed of lines from only one of the two fields, as shown in FIG. 30. Of the two fields, the one whose lines are each spatially located above the corresponding lines of the other field is called the top field, and the other is called the bottom field. By using this motion detecting process, a reduction in the encoding efficiency due to the difference in the position of a moving object between the two fields of an interlaced frame can be prevented, and therefore the efficiency of the frame-based prediction can be improved.
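The separation of an interlaced macroblock into its two fields, as used for the field-based motion detection above, amounts to splitting even and odd lines. A minimal sketch, with `split_fields` as a hypothetical helper name:

```python
def split_fields(macroblock):
    """Separate a 16x16 interlaced macroblock into its top field
    (even-numbered lines) and bottom field (odd-numbered lines),
    each 8 lines tall, for field-based motion detection."""
    top = macroblock[0::2]
    bottom = macroblock[1::2]
    return top, bottom

# Line y of this macroblock is filled with the value y, so the field
# membership of each line is visible in the first column.
mb = [[y] * 16 for y in range(16)]
top, bottom = split_fields(mb)
```

A field-based motion vector is then searched for each half independently, while the frame-based search operates on the interleaved 16×16 block as a whole.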
The texture motion compensation unit P11 generates and furnishes a prediction data for texture P12 from a reference texture data stored in the texture memory P16 according to the motion vector of texture P10 of the macroblock to be encoded, furnished by the texture motion detecting unit P9. The prediction data for texture P12, together with the texture data P8, is then delivered to the texture encoding unit P13. From the texture data P8 (or intra texture data) and the difference (or inter texture data) between the texture data P8 and the prediction data for texture P12, the texture encoding unit P13 selects the one which offers a higher degree of encoding efficiency, and then compresses and encodes the selected data using DCT and scalar quantization. When the texture data P8 is an interlaced signal, the texture motion detecting unit P9 estimates both a frame-based motion vector of texture and field-based motion vectors of texture for each macroblock, and then selects, from all selectable texture encoding modes, the one which offers the highest degree of encoding efficiency.
In addition, the texture encoding unit P13 can select either frame-based DCT coding or field-based DCT coding in the case where the texture data P8 is an interlaced signal. As shown in FIG. 31, in the case of the frame-based DCT encoding, each block is composed of lines taken alternately from the pair of fields, i.e., the top and bottom fields, and the frame-based DCT encoding process is performed on each 8×8 block. In the case of the field-based DCT encoding, each block is composed of lines from only one of the two fields, and the field-based DCT encoding process is performed on each 8×8 block of each of the top and bottom fields. Accordingly, the generation of high-frequency coefficients in the vertical direction due to the difference in the position of a moving object between the two fields of an interlaced frame can be prevented, and hence the energy concentration effect of the DCT can be improved.
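The frame/field DCT block formation of FIG. 31 is purely a regrouping of lines before the 8×8 transform is applied. A sketch under the assumption of a 16×16 luminance macroblock; `frame_dct_blocks` and `field_dct_blocks` are hypothetical helper names, and the DCT itself is omitted:

```python
def frame_dct_blocks(mb):
    """Frame mode: four 8x8 blocks cut directly from the 16x16
    macroblock, so each block interleaves top- and bottom-field lines."""
    return [[row[bx:bx + 8] for row in mb[by:by + 8]]
            for by in (0, 8) for bx in (0, 8)]

def field_dct_blocks(mb):
    """Field mode: lines are first regrouped so that the upper 8 lines
    are the top field and the lower 8 lines are the bottom field, and
    the four 8x8 blocks are then cut from the regrouped macroblock."""
    regrouped = mb[0::2] + mb[1::2]
    return frame_dct_blocks(regrouped)

# Line y is filled with the value y, making the regrouping visible.
mb = [[y] * 16 for y in range(16)]
```

With a moving object, adjacent lines within a frame-mode block come from different time instants, which injects vertical high-frequency energy; field-mode blocks avoid this because all eight lines belong to one field.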
After the quantized DCT coefficients undergo inverse quantization, inverse DCT, and an addition to the reference texture data, they are written into the texture memory P16 as the local decoded texture data P15. The local decoded texture data P15 is used for the prediction of VOP's which will be encoded later. The texture encoding mode information indicating the selected texture encoding mode (the intra mode, the inter mode with frame-based prediction, or the inter mode with field-based prediction) and the DCT encoding mode information indicating the selected DCT encoding mode (the frame-based DCT encoding mode or the field-based DCT encoding mode), both included in the coded texture data P14, are delivered to the variable length encoding and multiplexing unit P17 and are then multiplexed into the coded bitstream P18 according to the given syntax.
When the VOP to be encoded is an interlaced image, there is a difference in the position of a moving object between the two fields of the interlaced image, caused by the difference in time between the two fields, as previously mentioned. Therefore, in the prior art encoding system mentioned above, the texture encoding process is performed by switching between the frame-based encoding and the field-based encoding so as to compensate for the displacement between the two field pictures of the interlaced frame. In the shape encoding process, on the other hand, prediction and encoding are carried out on each frame picture composed of a pair of fields without any correction for the difference in the position of a moving object between the two fields. Accordingly, a problem with the prior art encoding system is that the prediction and encoding efficiencies of the shape encoding are relatively low due to the difference in the position of a moving object between the two fields of the interlaced frame.