MPEG-4 (Moving Picture Experts Group Phase-4) Video Encoding/Decoding Verification Model (hereinafter referred to by the initials VM) whose standardization is in progress by ISO/IEC JTC1/SC29/WG11 may be introduced as a conventional type of predictive encoding/decoding in an encoding/decoding system of moving pictures. The VM continues to revise its contents according to the progress being made in standardization of MPEG-4. Here, Version 5.0 of the VM is designated to represent the VM and will be simply referred to as VM hereinafter.
The VM is a system for encoding/decoding each video object as one unit in view of a moving picture sequence being an aggregate of video objects changing their shapes time-/space-wise arbitrarily. FIG. 29 shows a VM video data structure. According to the VM, a time-based moving picture object is called a Video Object (VO), and picture data representing each time instance of the VO, as an encoding unit, is called a Video Object Plane (VOP). If the VO is layered in time/space, a special unit called a Video Object Layer (VOL) is provided between the VO and the VOP for representing a layered VO structure. Each VOP includes shape information and texture information to be separated. If the moving picture sequence includes a single VO, then the VOP is equated to a frame. There is no shape information included, in this case, and the texture information alone is then to be encoded/decoded.
The VOP includes alpha data representing the shape information and texture data representing the texture information, as illustrated in FIG. 30. Each kind of data is defined as an aggregate of blocks (alphablocks or macroblocks, respectively), and each block in the aggregate is composed of 16×16 samples. Each alphablock sample is represented in eight bits. A macroblock consists of a 16×16 sample luminance signal together with its associated chrominance signals. VOP data are obtained from a moving picture sequence processed outside of the encoder.
FIG. 31 is a diagram showing the configuration of a VOP encoder according to the VM encoding system. The diagram includes original VOP data P1 to be inputted, an alphablock P2 representing the shape information of the VOP, a switch P3a for passing the shape information, if there is any, of the inputted original VOP data, a shape encoder P4 for compressing and encoding the alphablock, compressed alphablock data P5, a locally decoded alphablock P6, texture data (a macroblock) P7, a motion detector P8, a motion parameter P9, a motion compensator P10, a predicted picture candidate P11, a prediction mode selector P12, a prediction mode P13, a predicted picture P14, a prediction error signal P15, a texture encoder P16, texture encoding information P17, a locally decoded prediction error signal P18, a locally decoded macroblock P19, a sprite memory update unit P20, a VOP memory P21, a sprite memory P22, a variable-length encoder/multiplexer P23, a buffer P24, and an encoded bitstream P25.
FIG. 32 shows a flowchart outlining an operation of the encoder.
Referring to the encoder of FIG. 31, the original VOP data P1 are decomposed into the alphablocks P2 and the macroblocks P7 (Steps PS2 and PS3). The alphablocks P2 and the macroblocks P7 are transferred to the shape encoder P4 and the motion detector P8, respectively. The shape encoder P4 is a processing block for data compression of the alphablock P2 (step PS4), the process of which is not discussed here further in detail because the compression method of shape information is not particularly relevant to the present invention.
The shape encoder P4 outputs the compressed alphablock data P5 which is transferred to the variable-length encoder/multiplexer P23, and the locally decoded alpha data P6 which is transferred sequentially to the motion detector P8, the motion compensator P10, the prediction mode selector P12, and the texture encoder P16.
The motion detector P8, upon reception of the macroblock P7, detects a local-motion vector on a macroblock basis using reference picture data stored in the VOP memory P21 and the locally decoded alphablock P6 (step PS5). Here, the motion vector is one example of a motion parameter. The VOP memory P21 stores the locally decoded picture of a previously encoded VOP. The content of the VOP memory P21 is sequentially updated with the locally decoded picture of a macroblock whenever the macroblock is encoded. In addition, the motion detector P8 detects a global warping parameter, upon reception of the full texture data of the original VOP, by using reference picture data stored in the sprite memory P22 and locally decoded alpha data. The sprite memory P22 will be discussed later in detail.
The motion compensator P10 generates the predicted picture candidate P11 by using the motion parameter P9, which is detected in the motion detector P8, and the locally decoded alphablock P6 (step PS6). Then, the prediction mode selector P12 determines the final predicted picture P14 and the corresponding prediction mode P13 of the macroblock by using a prediction error signal power and an original signal power (step PS7). In addition, the prediction mode selector P12 judges whether the coding type of the data is intra-frame coding or inter-frame coding.
The texture encoder P16 processes the prediction error signal P15 or the original macroblock through Discrete Cosine Transformation (DCT) and quantization to obtain a quantized DCT coefficient based upon the prediction mode P13. An obtained quantized DCT coefficient is transferred, directly or after prediction, to the variable-length encoder/multiplexer P23 to be encoded (steps PS8 and PS9). The variable-length encoder/multiplexer P23 converts the received data into a bitstream and multiplexes the data based upon predetermined syntaxes and variable-length codes (step PS10). The quantized DCT coefficient is subject to dequantization and inverse DCT to obtain the locally decoded prediction error signal P18, which is added to the predicted picture P14, and the locally decoded macroblock P19 (step PS11) is obtained. The locally decoded macroblock P19 is written into the VOP memory P21 and the sprite memory P22 to be used for a later VOP prediction (step PS12).
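The transform, quantization, and local decoding loop described above can be illustrated with a minimal sketch. It assumes a one-dimensional orthonormal DCT and a uniform quantizer with a single step size; the function names, the step size, and the one-dimensional layout are illustrative assumptions and are not taken from the VM, which operates on two-dimensional blocks.

```python
import math

def dct(x):
    """Orthonormal 1-D DCT-II (illustrative stand-in for the block DCT)."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        out.append(scale * sum(
            x[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
            for i in range(n)))
    return out

def idct(c):
    """Inverse of dct() above (orthonormal 1-D DCT-III)."""
    n = len(c)
    out = []
    for i in range(n):
        out.append(sum(
            math.sqrt((1.0 if k == 0 else 2.0) / n) * c[k]
            * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
            for k in range(n)))
    return out

def encode_residual(original, predicted, step):
    """Prediction error -> DCT -> uniform quantization (assumed quantizer)."""
    residual = [o - p for o, p in zip(original, predicted)]
    return [round(c / step) for c in dct(residual)]

def local_decode(levels, predicted, step):
    """Dequantize, inverse transform, and add back the predicted picture."""
    decoded_residual = idct([q * step for q in levels])
    return [p + r for p, r in zip(predicted, decoded_residual)]
```

Because the encoder reconstructs the samples from the quantized levels rather than from the original, its reference memory holds the same data a decoder would hold.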
Dominant portions of prediction including a prediction method, a motion compensation, and the update control of the sprite memory P22 and the VOP memory P21 will be discussed below in detail.
(1) Prediction Method in the VM
Normally, four different types of VOP encoding shown in FIG. 33 are processed in the VM. Each encoding type is associated with a prediction type or method marked by a circle on a macroblock basis. With an I-VOP, intra-frame coding is used singly involving no prediction. With a P-VOP, past VOP data can be used for prediction. With a B-VOP, both past and future VOP data can be used for prediction.
All the aforementioned prediction types are motion vector based. On the other hand, with a Sprite-VOP, a sprite memory can be used for prediction. The sprite is a picture space generated through a step-by-step mixing process of VOPs based upon a warping parameter set
$$\vec{\alpha} = (a, b, c, d, e, f, g, h)$$
detected on a VOP basis (the mark → denotes a vector hereinafter). The warping parameter set is determined by the following parametric equations.
$$x' = \frac{ax + by + c}{gx + hy + 1}$$
$$y' = \frac{dx + ey + f}{gx + hy + 1}$$
The sprite is stored in the sprite memory P22.
Referring to the parametric equations, (x, y) represents the pixel position of an original VOP in a two-dimensional coordinate system. (x′, y′) represents a pixel position in the sprite memory corresponding to (x, y) based upon a warping parameter. With the Sprite-VOP, the warping parameter set can be used uniformly with each macroblock to determine (x′, y′) in the sprite memory for prediction to generate a predicted picture. In a strict sense, the sprite includes “Dynamic Sprite” used for prediction and “Static Sprite” used for prediction as well as for another purpose of an approximate representation of a VOP at a decoding station. In FIGS. 34 through 37 below, “sprite” stands for Dynamic Sprite.
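As an illustration of the parametric equations, the following minimal sketch maps a pixel position (x, y) of the VOP to the corresponding position (x′, y′) in the sprite memory. The function name and the tuple layout of the parameter set are assumptions made for this example only.

```python
def warp_point(x, y, params):
    """Map (x, y) to (x', y') with the eight-parameter warping set.

    params is assumed to be the tuple (a, b, c, d, e, f, g, h).
    """
    a, b, c, d, e, f, g, h = params
    denom = g * x + h * y + 1.0
    if denom == 0.0:
        raise ZeroDivisionError("warping denominator gx + hy + 1 is zero")
    x_warped = (a * x + b * y + c) / denom
    y_warped = (d * x + e * y + f) / denom
    return x_warped, y_warped
```

With g = h = 0 the denominator is 1 and the mapping reduces to an affine transformation; for example, the set (1, 0, 5, 0, 1, -3, 0, 0) is a pure translation by (5, -3).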
The motion detector P8 detects the motion vector and the warping parameter to be used for the aforementioned prediction types. The motion vectors and the warping parameters are generically called the motion parameter P9 hereinafter.
(2) Motion Compensation
FIG. 34 is a diagram showing the configuration of the motion compensator P10 in detail. In the figure, a warping parameter P26, a motion vector P27, a global-motion compensator P28, a local-motion compensator P29, a warping-parameter based predicted picture candidate P30, and a motion-vector based predicted picture candidate P31 are shown. The warping-parameter and motion-vector based predicted picture candidates P30, P31 are generically called the predicted picture candidates P11 hereinafter.
FIG. 35 shows a flowchart outlining the operation of the motion compensator P10 including steps PS14 through PS21.
The motion compensator P10 generates the predicted picture candidate P11 for each macroblock P7 by using either the warping parameter P26 of the full VOP or the macroblock-based motion vector P27, both of which are detected in the motion detector P8. The global-motion compensator P28 performs a motion compensation using the warping parameter P26, and the local-motion compensator P29 performs a motion compensation using the motion vector P27.
With the I-VOP, the motion compensator P10 does not operate. (The operating step proceeds to step PS21 from step PS14.) With a VOP other than the I-VOP, the local-motion compensator P29 reads out a predicted picture candidate PR1 from the locally decoded picture of a past VOP stored in the VOP memory P21 by using the motion vector P27 (step PS15). With the P-VOP, the predicted picture candidate PR1 is only available to be used.
When the B-VOP is identified in step PS16, the local-motion compensator P29 further reads out a predicted picture candidate PR2 from the locally decoded picture of a future VOP stored in the VOP memory P21 by using the motion vector P27 (step PS17). In addition, an arithmetic mean of the predicted picture candidates PR1, PR2 obtained from the past and future VOP locally decoded pictures is calculated to obtain a predicted picture candidate PR3 (step PS18).
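The averaging of step PS18 can be sketched as follows. The flat-list block layout and the function name are assumptions for illustration, and no particular rounding rule is implied.

```python
def average_candidates(pr1, pr2):
    """Return PR3 as the sample-wise arithmetic mean of PR1 and PR2."""
    if len(pr1) != len(pr2):
        raise ValueError("candidates must have the same number of samples")
    return [(p + q) / 2 for p, q in zip(pr1, pr2)]
```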
A predicted picture candidate PR4 is generated also through Direct Prediction (step PS19). (Direct Prediction is based upon a prediction method corresponding to B-Frame in an encoding method H.263, Recommendation ITU-T. A vector for B-Frame is produced based upon a group of P-VOP vectors, which is not discussed further here in detail.) In FIG. 34, the motion-vector based predicted picture candidates P31 is a generic term for all or part of the predicted picture candidates PR1 through PR4.
If a VOP is neither an I-VOP nor a B-VOP, then the VOP is a Sprite-VOP. With the Sprite-VOP, the predicted picture candidate PR1 is read out from the VOP memory based upon the motion vector. In addition, the global-motion compensator P28 reads out the predicted picture candidate P30 from the sprite memory P22 based upon the warping parameter P26 in step PS20.
The global-motion compensator P28 calculates the address of a predicted picture candidate in the sprite memory P22 based upon the warping parameter P26, and reads out the predicted picture candidate P30 from the sprite memory P22 to be outputted based upon a resultant address. The local-motion compensator P29 calculates the address of a predicted picture candidate in the VOP memory P21 based upon the motion vector P27 and reads out the predicted picture candidate P31 to be outputted based upon a resultant address.
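The address calculation of the local-motion compensator can be illustrated with a minimal sketch. It assumes an integer-pel motion vector, a reference picture held as a list of rows, and clamping of addresses at the picture border; the clamping rule and all names are assumptions of this example and are not stated in the text.

```python
def read_candidate(reference, top, left, motion_vector, size=16):
    """Read a size x size block displaced by motion_vector = (dx, dy).

    top/left give the block position in the current VOP. Addresses that
    fall outside the reference picture are clamped to its border (an
    assumption made for this sketch).
    """
    dx, dy = motion_vector
    height, width = len(reference), len(reference[0])
    block = []
    for row in range(size):
        y = min(max(top + dy + row, 0), height - 1)
        line = []
        for col in range(size):
            x = min(max(left + dx + col, 0), width - 1)
            line.append(reference[y][x])
        block.append(line)
    return block
```

The global-motion compensator differs only in how the address is derived: it evaluates the warping equations for each pixel instead of adding one displacement per macroblock.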
These predicted picture candidates P11 are evaluated along with an intra-frame coding signal of the texture data P7 in the prediction mode selector P12, which selects a predicted picture candidate having the least power of a prediction error signal along with a prediction mode.
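The selection rule can be sketched as a comparison of signal powers. In this minimal example, the power of a candidate is taken as the sum of squared prediction errors, and the intra-frame alternative is represented by the power of the original samples about their mean; that intra measure, the mode labels, and the function name are assumptions made for illustration.

```python
def select_prediction_mode(original, candidates):
    """Pick the mode with the least prediction error power.

    candidates maps a mode label to a predicted block (flat list).
    Intra coding is kept unless some candidate has strictly less power.
    """
    mean = sum(original) / len(original)
    best_mode = "intra"
    best_power = sum((v - mean) ** 2 for v in original)
    for mode, predicted in candidates.items():
        error_power = sum((o - p) ** 2 for o, p in zip(original, predicted))
        if error_power < best_power:
            best_mode, best_power = mode, error_power
    return best_mode, best_power
```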
(3) Updating of Memories
The memory update unit P20 controls the VOP memory P21 and sprite memory P22 to be updated (step PS12). The contents of these memories are updated regardless of the prediction mode P13 selected on a macroblock basis.
FIG. 36 is a diagram showing the configuration of the memory update unit P20. FIG. 37 shows a flowchart including steps PS22 through PS28 illustrating the operation of the memory update unit P20.
In FIG. 36, an externally supplied VOP encoding type P32, an externally supplied sprite prediction identification flag P33 for indicating the use of the sprite memory for prediction, an externally supplied blend factor P34 used for prediction with the sprite memory, switches P35, P36, a sprite blender P37, a sprite transformer P38, a VOP memory update signal P39, and a sprite update signal P40 are shown.
Firstly, the sprite prediction identification flag P33 is examined to determine whether the use of the sprite is designated for the current VO or VOL (step PS22). If the use of the sprite is not designated, the data are examined to determine whether they belong to a B-VOP (step PS27). With the B-VOP, no updating is performed on the VOP memory P21. With either the I-VOP or the P-VOP, the VOP memory P21 is overwritten with the locally decoded macroblock P19 on a macroblock basis (step PS28).
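The update decision can be summarized in a minimal sketch. The function name, the string labels for the VOP encoding type, and the returned pair are assumptions of this example; the sprite flag is carried through on the further assumption that the sprite memory is refreshed whenever the use of the sprite is designated, with the VOP memory handled in the same way in both branches.

```python
def memories_to_update(vop_type, sprite_in_use):
    """Return (update_vop_memory, update_sprite_memory).

    vop_type is an assumed label such as "I", "P", "B" or "S".
    A B-VOP never overwrites the VOP memory; the sprite memory is
    updated only when sprite prediction is designated.
    """
    update_vop_memory = vop_type != "B"
    update_sprite_memory = bool(sprite_in_use)
    return update_vop_memory, update_sprite_memory
```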
With the use of the sprite designated in step PS22, the VOP memory P21 is updated in the same manner as above (steps PS23, PS24), and in addition, the sprite memory P22 is updated through the following procedure.
a) Sprite Warping (Step PS25)
In the sprite transformer P38, an area $M(\vec{R}, t-1)$ in the sprite memory P22 ($M(\vec{R}, t-1)$ is an area having the same size as that of a VOP having the origin of the coordinates at a position in the sprite memory P22 with the VOP at a time t) is subject to warping (transformation) based upon a warping parameter $\vec{\alpha} = (a, b, c, d, e, f, g, h)$.
b) Sprite Blending (Step PS26)
By using a resultant warped picture from a) above, a new sprite memory area is calculated in the sprite blender P37 according to the following expression,
$$M(\vec{R}, t) = (1 - \alpha) \cdot W_b[M(\vec{R}, t-1), \vec{\alpha}] + \alpha \cdot VO(\vec{r}, t),$$
where α is the blend factor P34, $W_b[M, \vec{\alpha}]$ is the resultant warped picture, and $VO(\vec{r}, t)$ is a pixel value of a locally decoded VOP with a location $\vec{r}$ and a time t.
With a non-VOP area in a locally decoded macroblock, it is assumed that
$$VO(\vec{r}, t) = 0.$$
As the blend factor α is assigned on a VOP basis, a locally decoded VOP is collectively blended into the sprite memory P22 based upon a weight α, regardless of the contents of a VOP area.
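The blending expression can be illustrated with a minimal sketch that operates on flat lists of samples. The function name, the boolean shape mask, and the list layout are assumptions of this example; the warped sprite area is taken as an input, so the warping step itself is not shown.

```python
def blend_into_sprite(warped, decoded_vop, inside_vop, blend_factor):
    """Compute the new sprite area from the warped area and the decoded VOP.

    warped       : samples of the warped sprite area
    decoded_vop  : samples of the locally decoded VOP
    inside_vop   : True where the sample belongs to the VOP area
    blend_factor : the weight alpha, one value for the whole VOP
    """
    new_area = []
    for w, v, inside in zip(warped, decoded_vop, inside_vop):
        vo = v if inside else 0  # non-VOP samples are treated as zero
        new_area.append((1 - blend_factor) * w + blend_factor * vo)
    return new_area
```

For a blend factor of 0.25, a sprite sample of 100 blended with a VOP sample of 200 becomes 125, while the same sprite sample outside the VOP area decays to 75, which shows how a single weight applied to the whole VOP also affects sprite content that the VOP does not cover.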
According to the aforementioned prediction system in the conventional encoding system, the video object is predicted by using one memory designed to be used with the motion vector alone and one memory designed to be used with the warping parameter alone, each of which is structurally limited to holding a single screen at most. Thus, only a limited number of reference pictures are available for prediction, which hinders a sufficient improvement in prediction efficiency.
Further, in a system where two or more video objects are encoded concurrently, these memories hold only a reference picture representing the past record of the video object to be predicted itself, which limits the variation of reference pictures and precludes the utilization of a correlation among video objects for prediction.
Further, the memories are updated regardless of such items as the internal structure, a characteristic, and the past record of the video object. As a result, the memories may fail to hold data that are significant for predicting a video object, thereby posing a problem of failing to enhance prediction efficiency.
The present invention is directed to solving the aforementioned problems. An objective of this invention is to provide a prediction system for encoding/decoding of picture data where two or more memories are provided to store the past record of the moving picture sequence effectively in consideration of the internal structure and characteristic of the moving picture sequence, thereby achieving a highly efficient prediction as well as encoding/decoding. In addition, the prediction system provides a sophisticated inter-video object prediction performed among two or more video objects.