This application is the national phase under 35 U.S.C. §371 of prior PCT International Application No. PCT/JP98/00232 which has an International filing date of Jan. 22, 1998 which designated the United States of America, the entire contents of which are hereby incorporated by reference.
The present invention relates to the prediction of a moving picture implemented, for example, in
a moving picture encoder/decoder used in a portable/stationary video communication device and the like for visual communications in a video telephone system, a video conference system or the like,
a moving picture encoder/decoder used in a picture storage/recording apparatus such as a digital VTR and a video server, and
a moving picture encoding/decoding program implemented as standalone software or as firmware for a Digital Signal Processor (DSP).
MPEG-4 (Moving Picture Experts Group Phase-4) Video Encoding/Decoding Verification Model (hereinafter referred to by the initials VM) whose standardization is in progress by ISO/IEC JTC1/SC29/WG11 may be introduced as a conventional type of predictive encoding/decoding in an encoding/decoding system of moving pictures. The VM continues to revise its contents according to the progress being made in standardization of MPEG-4. Here, Version 5.0 of the VM is designated to represent the VM and will be simply referred to as VM hereinafter.
The VM is a system for encoding/decoding each video object as one unit in view of a moving picture sequence being an aggregate of video objects changing their shapes time-/space-wise arbitrarily. FIG. 29 shows a VM video data structure. According to the VM, a time-based moving picture object is called a Video Object (VO), and picture data representing each time instance of the VO, as an encoding unit, is called a Video Object Plane (VOP). If the VO is layered in time/space, a special unit called a Video Object Layer (VOL) is provided between the VO and the VOP for representing a layered VO structure. Each VOP includes shape information and texture information to be separated. If the moving picture sequence includes a single VO, then the VOP is equated to a frame. There is no shape information included, in this case, and the texture information alone is then to be encoded/decoded.
The VOP includes alpha data representing the shape information and texture data representing the texture information, as illustrated in FIG. 30. Each kind of data is defined as an aggregate of blocks (alphablocks/macroblocks), and each block in the aggregate is composed of 16×16 samples. Each alphablock sample is represented in eight bits. A macroblock includes the chrominance signals associated with the 16×16-sample luminance signals. VOP data are obtained from a moving picture sequence processed outside of the encoder.
FIG. 31 is a diagram showing the configuration of a VOP encoder according to the VM encoding system. The diagram includes original VOP data P1 to be inputted, an alphablock P2 representing the shape information of the VOP, a switch P3a for passing the shape information, if there is any, of the inputted original VOP data, a shape encoder P4 for compressing and encoding the alphablock, compressed alphablock data P5, a locally decoded alphablock P6, texture data (a macroblock) P7, a motion detector P8, a motion parameter P9, a motion compensator P10, a predicted picture candidate P11, a prediction mode selector P12, a prediction mode P13, a predicted picture P14, a prediction error signal P15, a texture encoder P16, texture encoding information P17, a locally decoded prediction error signal P18, a locally decoded macroblock P19, a sprite memory update unit P20, a VOP memory P21, a sprite memory P22, a variable-length encoder/multiplexer P23, a buffer P24, and an encoded bitstream P25.
FIG. 32 shows a flowchart outlining an operation of the encoder.
Referring to the encoder of FIG. 31, the original VOP data P1 are decomposed into the alphablocks P2 and the macroblocks P7 (Steps PS2 and PS3). The alphablocks P2 and the macroblocks P7 are transferred to the shape encoder P4 and the motion detector P8, respectively. The shape encoder P4 is a processing block for data compression of the alphablock P2 (step PS4), the process of which is not discussed here further in detail because the compression method of shape information is not particularly relevant to the present invention.
The shape encoder P4 outputs the compressed alphablock data P5 which is transferred to the variable-length encoder/multiplexer P23, and the locally decoded alpha data P6 which is transferred sequentially to the motion detector P8, the motion compensator P10, the prediction mode selector P12, and the texture encoder P16.
The motion detector P8, upon reception of the macroblock P7, detects a local-motion vector on a macroblock basis using reference picture data stored in the VOP memory P21 and the locally decoded alphablock P6 (step PS5). Here, the motion vector is one example of a motion parameter. The VOP memory P21 stores the locally decoded picture of a previously encoded VOP. The content of the VOP memory P21 is sequentially updated with the locally decoded picture of a macroblock whenever the macroblock is encoded. In addition, the motion detector P8 detects a global warping parameter, upon reception of the full texture data of the original VOP, by using reference picture data stored in the sprite memory P22 and locally decoded alpha data. The sprite memory P22 will be discussed later in detail.
The motion compensator P10 generates the predicted picture candidate P11 by using the motion parameter P9, which is detected in the motion detector P8, and the locally decoded alphablock P6 (step PS6). Then, the prediction mode selector P12 determines the final predicted picture P14 and the corresponding prediction mode P13 of the macroblock by using a prediction error signal power and an original signal power (step PS7). In addition, the prediction mode selector P12 judges whether the coding type of the data is intra-frame coding or inter-frame coding.
The texture encoder P16 processes the prediction error signal P15 or the original macroblock through Discrete Cosine Transformation (DCT) and quantization to obtain a quantized DCT coefficient based upon the prediction mode P13. An obtained quantized DCT coefficient is transferred, directly or after prediction, to the variable-length encoder/multiplexer P23 to be encoded (steps PS8 and PS9). The variable-length encoder/multiplexer P23 converts the received data into a bitstream and multiplexes the data based upon predetermined syntaxes and variable-length codes (step PS10). The quantized DCT coefficient is subject to dequantization and inverse DCT to obtain the locally decoded prediction error signal P18, which is added to the predicted picture P14, and the locally decoded macroblock P19 (step PS11) is obtained. The locally decoded macroblock P19 is written into the VOP memory P21 and the sprite memory P22 to be used for a later VOP prediction (step PS12).
Dominant portions of prediction including a prediction method, a motion compensation, and the update control of the sprite memory P22 and the VOP memory P21 will be discussed below in detail.
(1) Prediction Method in the VM
Normally, four different types of VOP encoding shown in FIG. 33 are processed in the VM. Each encoding type is associated with a prediction type or method marked by a circle on a macroblock basis. With an I-VOP, intra-frame coding is used singly involving no prediction. With a P-VOP, past VOP data can be used for prediction. With a B-VOP, both past and future VOP data can be used for prediction.
All the aforementioned prediction types are motion vector based. On the other hand, with a Sprite-VOP, a sprite memory can be used for prediction. The sprite is a picture space generated through a step-by-step mixing process of VOPs based upon a warping parameter set
$$\vec{\alpha} = (a, b, c, d, e, f, g, h)$$
detected on a VOP basis (an arrow over a symbol denotes a vector hereinafter). The warping parameter set is determined by the following parametric equations.
$$x' = \frac{ax + by + c}{gx + hy + 1}$$
$$y' = \frac{dx + ey + f}{gx + hy + 1}$$
The sprite is stored in the sprite memory P22.
Referring to the parametric equations, $(x, y)$ represents the pixel position of an original VOP in a two-dimensional coordinate system. $(x', y')$ represents a pixel position in the sprite memory corresponding to $(x, y)$ based upon a warping parameter. With the Sprite-VOP, the warping parameter set can be used uniformly with each macroblock to determine $(x', y')$ in the sprite memory for prediction to generate a predicted picture. In a strict sense, the sprite includes "Dynamic Sprite" used for prediction and "Statistic Sprite" used for prediction as well as for another purpose of an approximate representation of VOP at a decoding station. In FIGS. 34 through 37 below, "sprite" stands for Dynamic Sprite.
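The coordinate mapping defined by the parametric equations can be illustrated with a minimal sketch. The function name `warp_point`, the tuple layout of the parameters, and the handling of a zero denominator are assumptions made for illustration and are not taken from the VM.

```python
def warp_point(x, y, params):
    """Map a VOP pixel (x, y) to a sprite position (x', y').

    `params` is the warping parameter set (a, b, c, d, e, f, g, h).
    The function name and the error handling are illustrative assumptions.
    """
    a, b, c, d, e, f, g, h = params
    denom = g * x + h * y + 1.0
    if denom == 0:
        raise ValueError("degenerate warp: denominator is zero")
    x_prime = (a * x + b * y + c) / denom
    y_prime = (d * x + e * y + f) / denom
    return x_prime, y_prime


# With g = h = 0 the mapping reduces to an affine transform;
# here it is a pure translation by (3, -2).
translation = (1, 0, 3, 0, 1, -2, 0, 0)
print(warp_point(10, 20, translation))  # (13.0, 18.0)
```

Because the same parameter set is applied to every macroblock of a Sprite-VOP, a routine of this kind would be evaluated once per predicted sample position.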
The motion detector P8 detects the motion vector and the warping parameter to be used for the aforementioned prediction types. The motion vectors and the warping parameters are generically called the motion parameter P9 hereinafter.
(2) Motion Compensation
FIG. 34 is a diagram showing the configuration of the motion compensator P10 in detail. In the figure, a warping parameter P26, a motion vector P27, a global-motion compensator P28, a local-motion compensator P29, a warping-parameter based predicted picture candidate P30, and a motion-vector based predicted picture candidate P31 are shown. The warping-parameter and motion-vector based predicted picture candidates P30, P31 are generically called the predicted picture candidates P11 hereinafter.
FIG. 35 shows a flowchart outlining the operation of the motion compensator P10 including steps PS14 through PS21.
The motion compensator P10 generates the predicted picture candidate P11 using either the warping parameter P26, which the motion detector P8 detects for a full VOP, or the motion vector P27, which the motion detector P8 detects on a macroblock P7 basis. The global-motion compensator P28 performs a motion compensation using the warping parameter P26, and the local-motion compensator P29 performs a motion compensation using the motion vector P27.
With the I-VOP, the motion compensator P10 does not operate. (The operating step proceeds to step PS21 from step PS14.) With a VOP other than the I-VOP, the local-motion compensator P29 reads out a predicted picture candidate PR1 from the locally decoded picture of a past VOP stored in the VOP memory P21 by using the motion vector P27 (step PS15). With the P-VOP, the predicted picture candidate PR1 is only available to be used.
When the B-VOP is identified in step PS16, the local-motion compensator P29 further reads out a predicted picture candidate PR2 from the locally decoded picture of a future VOP stored in the VOP memory P21 by using the motion vector P27 (step PS17). In addition, an arithmetic mean of the predicted picture candidates PR1, PR2 obtained from the past and future VOP locally decoded pictures is calculated to obtain a predicted picture candidate PR3 (step PS18).
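The averaging of step PS18 can be sketched as follows. The function name `average_prediction`, the representation of a block as a list of rows, and the round-half-up integer rounding are assumptions made for illustration; the text above only states that an arithmetic mean is taken.

```python
def average_prediction(pr1, pr2):
    """Return PR3 as the sample-wise mean of the candidates PR1 and PR2.

    Blocks are lists of rows of integer samples. The "+ 1" implements
    round-half-up, which is an assumption and not stated in the text.
    """
    if len(pr1) != len(pr2) or any(len(r1) != len(r2) for r1, r2 in zip(pr1, pr2)):
        raise ValueError("PR1 and PR2 must have the same dimensions")
    return [[(p + q + 1) // 2 for p, q in zip(row1, row2)]
            for row1, row2 in zip(pr1, pr2)]


past = [[10, 20], [30, 40]]
future = [[20, 21], [30, 44]]
print(average_prediction(past, future))  # [[15, 21], [30, 42]]
```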
A predicted picture candidate PR4 is generated also through Direct Prediction (step PS19). (Direct Prediction is based upon a prediction method corresponding to B-Frame in the encoding method of ITU-T Recommendation H.263. A vector for B-Frame is produced based upon a group of P-VOP vectors, which is not discussed further here in detail.) In FIG. 34, the motion-vector based predicted picture candidate P31 is a generic term for all or part of the predicted picture candidates PR1 through PR4.
If a VOP is neither an I-VOP nor a B-VOP, then the VOP is a Sprite-VOP. With the Sprite-VOP, the predicted picture candidate PR1 is read out from the VOP memory based upon the motion vector. In addition, the global-motion compensator P28 reads out the predicted picture candidate P30 from the sprite memory P22 based upon the warping parameter P26 in step PS20.
The global-motion compensator P28 calculates the address of a predicted picture candidate in the sprite memory P22 based upon the warping parameter P26, and reads out the predicted picture candidate P30 from the sprite memory P22 to be outputted based upon a resultant address. The local-motion compensator P29 calculates the address of a predicted picture candidate in the VOP memory P21 based upon the motion vector P27 and reads out the predicted picture candidate P31 to be outputted based upon a resultant address.
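The address calculation performed by the local-motion compensator P29 can be sketched as follows. The sketch assumes an integer-sample motion vector and clamps addresses at the picture boundary; the function name `read_block`, the `(dx, dy)` vector layout, and the clamping rule are illustrative assumptions rather than details given above.

```python
def read_block(reference, top, left, mv, size=16):
    """Read a size x size predicted block from a reference picture.

    `reference` is a list of rows, (top, left) is the block position in the
    current VOP, and mv = (dx, dy) is an integer motion vector. Addresses
    falling outside the picture are clamped to the nearest edge sample
    (an assumption made for this sketch).
    """
    height, width = len(reference), len(reference[0])
    dx, dy = mv
    block = []
    for row in range(size):
        y = min(max(top + row + dy, 0), height - 1)
        line = []
        for col in range(size):
            x = min(max(left + col + dx, 0), width - 1)
            line.append(reference[y][x])
        block.append(line)
    return block


# A 4x4 reference picture whose sample value encodes its own address.
reference = [[r * 10 + c for c in range(4)] for r in range(4)]
print(read_block(reference, 0, 0, (1, 1), size=2))  # [[11, 12], [21, 22]]
```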
These predicted picture candidates P11 are evaluated along with an intra-frame coding signal of the texture data P7 in the prediction mode selector P12, which selects a predicted picture candidate having the least power of a prediction error signal along with a prediction mode.
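The selection rule of the prediction mode selector P12 can be sketched as a minimum search over the candidates. The names `error_power` and `select_prediction`, the use of a sum of squared differences as the power measure, and the dictionary keyed by mode name are assumptions made for illustration. The comparison against the intra-frame coding signal is omitted here for brevity.

```python
def error_power(original, candidate):
    """Sum of squared differences between two blocks (lists of rows)."""
    return sum((o - c) ** 2
               for row_o, row_c in zip(original, candidate)
               for o, c in zip(row_o, row_c))


def select_prediction(original, candidates):
    """Return (mode, block) for the candidate with the least error power.

    `candidates` maps a hypothetical mode name to its predicted block.
    """
    mode = min(candidates, key=lambda m: error_power(original, candidates[m]))
    return mode, candidates[mode]


original = [[10, 10]]
candidates = {"PR1": [[8, 8]], "PR2": [[10, 11]]}
print(select_prediction(original, candidates))  # ('PR2', [[10, 11]])
```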
(3) Updating of Memories
The memory update unit P20 controls the VOP memory P21 and sprite memory P22 to be updated (step PS12). The contents of these memories are updated regardless of the prediction mode P13 selected on a macroblock basis.
FIG. 36 is a diagram showing the configuration of the memory update unit P20. FIG. 37 shows a flowchart including steps PS22 through PS28 illustrating the operation of the memory update unit P20.
In FIG. 36, an externally supplied VOP encoding type P32, an externally supplied sprite prediction identification flag P33 for indicating the use of the sprite memory for prediction, an externally supplied blend factor P34 used for prediction with the sprite memory, switches P35, P36, a sprite blender P37, a sprite transformer P38, a VOP memory update signal P39, and a sprite memory update signal P40 are shown.
Firstly, the sprite prediction identification flag P33 is examined to determine whether the use of the sprite is designated for the current VO or VOL (step PS22). If the use of the sprite is not designated, the data are examined to determine whether they belong to a B-VOP (step PS27). With the B-VOP, no updating is performed on the VOP memory P21. With either the I-VOP or the P-VOP, the VOP memory P21 is overwritten with the locally decoded macroblock P19 on a macroblock basis (step PS28).
With the use of the sprite designated in step PS22, the VOP memory P21 is updated in the same manner as above (steps PS23, PS24), and in addition, the sprite memory P22 is updated through the following procedure.
a) Sprite Warping (Step PS25)
In the sprite transformer P38, an area
$$M(\vec{R}, t-1)$$
in the sprite memory P22 ($M(\vec{R}, t-1)$ is an area having the same size as that of a VOP having the origin of the coordinates at a position in the sprite memory P22 with the VOP at a time t) is subject to warping (transformation) based upon a warping parameter
$$\vec{\alpha} = (a, b, c, d, e, f, g, h).$$
b) Sprite Blending (Step PS26)
By using a resultant warped picture from a) above, a new sprite memory area is calculated in the sprite blender P37 according to the following expression,
$$M(\vec{R}, t) = (1 - \alpha) \cdot W_b[M(\vec{R}, t-1), \vec{\alpha}] + \alpha \cdot VO(\vec{r}, t),$$
where $\alpha$ is the blend factor P34, $W_b[M, \vec{\alpha}]$ is the resultant warped picture, and $VO(\vec{r}, t)$ is a pixel value of a locally decoded VOP with a location $\vec{r}$ and a time t.
With a non-VOP area in a locally decoded macroblock, it is assumed that
$$VO(\vec{r}, t) = 0.$$
As the blend factor $\alpha$ is assigned on a VOP basis, a locally decoded VOP is collectively blended into the sprite memory P22 based upon a weight $\alpha$, regardless of the contents of a VOP area.
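The blending expression can be illustrated with a minimal sketch that operates on an already warped sprite area. The function name `blend_sprite`, the list-of-rows block layout, and the binary `mask` marking the VOP area are assumptions made for illustration; the warping of step a) is assumed to have been carried out beforehand.

```python
def blend_sprite(warped, decoded, blend_factor, mask):
    """Blend a locally decoded VOP into a warped sprite area.

    Each output sample is (1 - blend_factor) * warped + blend_factor * VO,
    where VO is the decoded sample inside the VOP area (mask == 1) and 0
    outside it, following the expression for the sprite update.
    """
    result = []
    for w_row, d_row, m_row in zip(warped, decoded, mask):
        result.append([(1.0 - blend_factor) * w + blend_factor * (d if m else 0)
                       for w, d, m in zip(w_row, d_row, m_row)])
    return result


warped = [[100, 100]]
decoded = [[200, 50]]
mask = [[1, 0]]  # the second sample lies outside the VOP area
print(blend_sprite(warped, decoded, 0.5, mask))  # [[150.0, 50.0]]
```

The second sample shows the effect noted above: a single weight applied to the whole VOP pulls the sprite toward zero in non-VOP areas.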
According to the aforementioned prediction system in the conventional encoding system, the video object is predicted by using one memory designed to be used for detecting the motion vector alone and another memory designed to be used for detecting the warping parameter alone, each of which is structurally limited to holding a single screen at most. Thus, only a limited set of reference pictures is available for prediction, which hinders a sufficient improvement in prediction efficiency.
Further, in such a system where two or more video objects are encoded concurrently, these memories only include a reference picture representing the past record of a video object to be predicted alone, which limits the variation of a reference picture and precludes the utilization of a correlation among video objects for prediction.
Further, the memories are updated regardless of such items as the internal structure, the characteristics, and the past record of the video object. As a result, the memories fail to retain information that is significant for predicting a video object, which poses a problem of failing to enhance prediction efficiency.
The present invention is directed to solving the aforementioned problems. An objective of this invention is to provide a prediction system for encoding/decoding of picture data where two or more memories are provided to store the past record of the moving picture sequence effectively in consideration of the internal structure and characteristics of the moving picture sequence, thereby achieving a highly efficient prediction as well as encoding/decoding. In addition, the prediction system provides a sophisticated inter-video-object prediction performed among two or more video objects.
According to the present invention, a moving picture prediction system, for predicting a moving picture to be implemented in at least one of an encoder and a decoder, includes a plurality of memories for storing picture data for reference to be used for prediction, the plurality of memories corresponding to different transform methods, respectively, and a prediction picture generation section for receiving a parameter representing a motion of a picture segment to be predicted, and for generating a predicted picture using the picture data stored in one of the plurality of memories used for the picture segment to be predicted based upon the parameter and one of the transform methods corresponding to the one of the plurality of memories.
The encoder generates a prediction memory indication information signal indicating the one of the plurality of memories used for generating the predicted picture and transmits the prediction memory indication information signal and the parameter to a decoding station so as to generate the predicted picture using the picture data stored in the one of the plurality of memories based upon the one of the transform methods corresponding to the one of the plurality of memories in the decoding station.
The decoder receives the parameter and a prediction memory indication information signal indicating the one of the plurality of memories used for generating the predicted picture from an encoding station, wherein the prediction picture generation section generates the predicted picture using the picture data stored in the one of the plurality of memories based upon the parameter and the one of the transform methods corresponding to the one of the plurality of memories.
Further, according to the present invention, a moving picture prediction system, for predicting a moving picture to be implemented in at least one of an encoding and a decoding, includes a plurality of memories for storing picture data for reference to be used for prediction, the plurality of memories being assigned to different parameter effective value ranges, respectively, and a prediction picture generation section for receiving a parameter representing a motion of a picture segment to be predicted, for selecting one of the plurality of memories assigned to one of the parameter effective value ranges including a value of the parameter, and for generating a predicted picture using the picture data stored in a selected memory.
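The selection of a memory by parameter effective value range can be sketched as a simple range lookup. The function name `select_memory`, the half-open range convention, the memory names, and the example ranges are hypothetical and serve only to illustrate the idea.

```python
def select_memory(ranges, value):
    """Return the memory whose effective value range contains `value`.

    `ranges` is a list of (low, high, memory_name) entries, each covering
    the half-open interval [low, high). The layout is an assumption.
    """
    for low, high, memory_name in ranges:
        if low <= value < high:
            return memory_name
    raise ValueError(f"no memory is assigned to parameter value {value}")


# Hypothetical assignment: small motion uses memory_a, larger motion memory_b.
ranges = [(0, 8, "memory_a"), (8, 64, "memory_b")]
print(select_memory(ranges, 3))   # memory_a
print(select_memory(ranges, 8))   # memory_b
```

Under such an arrangement the parameter value alone identifies the memory, so a decoder holding the same range table could reach the same selection from the received parameter.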
Still further, according to the present invention, a moving picture prediction system, for predicting a moving picture to be implemented in at least one of an encoding and a decoding, includes a plurality of memories for storing picture data for reference to be used for prediction and a prediction picture generation section including a motion compensator for receiving a parameter representing a motion of a picture segment to be predicted, and for generating a predicted picture by using the picture data stored in the plurality of memories based upon the parameter, and a memory update unit for updating the picture data stored in at least one of the plurality of memories at an arbitrary timing.
The moving picture prediction system predicts the moving picture in a moving picture sequence having first and second video objects, wherein the plurality of memories includes separate first and second pluralities of memories corresponding to the first and second video objects, respectively, and the prediction picture generation section includes separate first and second generators, respectively, corresponding to the first and second video objects, wherein the first generator uses the picture data stored in at least one of the first and second pluralities of memories to generate the predicted picture when predicting the first object, and generates information indicating a use of the second plurality of memories for predicting the first object, the information being added to the predicted picture.
The prediction picture generation section generates the predicted picture through a change of either one of a number and a size of the plurality of memories in response to a change in the moving picture at each time instance.
The prediction picture generation section generates the predicted picture in a limited use of memories for prediction in response to a change in the moving picture at each time instance.
The prediction picture generation section generates the predicted picture by calculating a plurality of the predicted pictures generated by using the respective picture data stored in the plurality of memories.
The moving picture prediction system further includes a significance detector for detecting a feature parameter representing a significance of the picture segment to be predicted, wherein the prediction picture generation section generates the predicted picture by selecting at least one of choices of at least one of a plurality of prediction methods, the plurality of memories, and a plurality of memory update methods.
The moving picture prediction system further includes a significance detector for detecting a parameter representing at least one of an amount of bits available for coding the picture segment to be predicted, an amount of change of the picture segment at each time instance, and a significance of the picture segment, wherein the prediction picture generation section generates the predicted picture by selecting at least one of choices of at least one of a plurality of prediction methods, the plurality of memories, and a plurality of memory update methods.
The moving picture prediction system predicts the moving picture on a video object basis, wherein the moving picture prediction system further includes a significance detector for detecting a parameter representing at least one of an amount of bits available for coding a video object to be predicted, an amount of change in the video object at each time instance, and a significance of the video object, wherein the prediction picture generation section generates the predicted picture by selecting at least one of choices of at least one of a plurality of prediction methods, the plurality of memories, and a plurality of memory update methods.
The moving picture prediction system further includes a prediction information encoder for encoding prediction relating information of the moving picture, wherein the prediction picture generation section counts times of a memory used for prediction and determines a rank of the plurality of memories based upon a counted number of the times, wherein the prediction information encoder allocates a code length to the prediction relating information to be encoded based upon the rank of a memory used for prediction.
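The ranking of memories by usage count and the rank-dependent code length can be sketched as follows. The function name `code_lengths_by_rank`, the tie-breaking by memory name, and the rule that the memory of rank n receives a code of n bits are assumptions made for illustration; no particular code table is specified above.

```python
from collections import Counter


def code_lengths_by_rank(usage_log):
    """Assign shorter code lengths to memories used more often.

    `usage_log` lists the memory used for each predicted segment. The
    most frequently used memory gets rank 1 and a 1-bit code, the next
    gets 2 bits, and so on (an illustrative allocation rule).
    """
    counts = Counter(usage_log)
    # Sort by descending count; break ties by name for a stable ranking.
    ranked = sorted(counts, key=lambda memory: (-counts[memory], memory))
    return {memory: rank + 1 for rank, memory in enumerate(ranked)}


log = ["a", "b", "a", "c", "a", "b"]
print(code_lengths_by_rank(log))  # {'a': 1, 'b': 2, 'c': 3}
```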
The plurality of memories includes at least a frame memory for storing the picture data on a frame basis and a sprite memory for storing a sprite picture.
The sprite memory includes at least one of a dynamic sprite memory involving a regular updating, and a static sprite memory not involving the regular updating.
The one of the transform methods corresponding to the one of the plurality of memories is at least one of a parallel translation, an affine transformation, and a perspective transformation in an interchangeable manner.
Still further, according to the present invention, a method for predicting a moving picture to be implemented in at least one of an encoding or a decoding, includes the steps of storing picture data for reference to be used for prediction in a plurality of memories, associating different transform methods with the plurality of memories, respectively, receiving a parameter representing a motion of a picture segment to be predicted, and generating a predicted picture using the picture data stored in one of the plurality of memories used for predicting the picture segment based upon the parameter and one of the transform methods corresponding to the one of the plurality of memories.
The method for predicting a moving picture further includes the steps of generating a prediction memory indication information signal indicating the one of the plurality of memories used for the picture segment to be predicted, and transmitting the prediction memory indication information signal and the parameter to a decoding station.
The method for predicting a moving picture is implemented in the decoding, and further includes the step of receiving a prediction memory indication information signal indicating the one of the plurality of memories used for generating the predicted picture and the parameter representing a motion of the picture segment to be predicted from an encoding station.
Still further, according to the present invention, a method, for predicting a moving picture to be implemented in at least one of an encoding and a decoding, includes the steps of storing picture data for reference to be used for prediction in a plurality of memories, assigning separate parameter effective value ranges to the plurality of memories, respectively, receiving a parameter representing a motion of a picture segment to be predicted, selecting one of the plurality of memories assigned to one of the parameter effective value ranges including a value of the parameter, and generating a predicted picture using the picture data stored in a selected memory.
Still further, according to the present invention, a method, for predicting a moving picture to be implemented in at least one of an encoding and a decoding, includes the steps of storing picture data for reference to be used for prediction in a plurality of memories, receiving a parameter representing a motion of a picture segment to be predicted, generating a predicted picture using the picture data stored in the plurality of memories based upon the parameter, and updating the picture data stored in at least one of the plurality of memories at an arbitrary timing.