1. Field of the Invention
This invention relates to a digital moving picture encoding method and apparatus, a moving picture decoding method and apparatus, and a recording medium, intended for transmission of moving picture signals by a transmission apparatus having variable transmission rates, such as analog or digital telephone networks or dedicated data transmission networks, and for recording on storage media having variable recording capacities, such as optical/magnetic discs or random access memories (RAMs).
2. Description of the Related Art
Among picture encoding systems, there is an encoding technique termed an object scalable encoding system. This encoding system divides a picture into groups termed objects and performs encoding from one object to another.
For example, in object-scalable encoding of a picture V1 made up of a human being and the background, the picture V1 is divided into an object representing the human being and an object representing the background. A picture V2 constituting the object representing the human being and a picture V3 constituting the object representing the background are encoded independently. This enables control such as finely quantizing the picture V2 of the human-being object while coarsely quantizing the picture V3 of the background object, or encoding every frame of the picture V2 while encoding the picture V3 only once every several frames. This encoding technique has the advantage that the subjective picture quality can be improved for the same amount of generated codes or that the amount of generated codes can be decreased for the same subjective picture quality.
For realizing this object scalable encoding, it is necessary to encode the shape of an object in addition to the usually encoded texture image (or simply the texture) representing the brightness and color tone of the picture. The object shape is termed the shape picture or simply the shape. It is also occasionally termed a key signal. In the example of FIG. 1, the picture V2 of the object of the human being is divided into a texture picture V2a and a shape picture V2b, these pictures V2a and V2b being encoded independently.
The signals representing the shape are classed into hard key signals and soft key signals. The hard key signal is a bi-valued picture representing the inside or the outside of the object. If a pixel is indicated as being inside the object, the texture of the object is used as the output picture. If a pixel is indicated as being outside the object, the texture of the background is used as the output picture. The soft key signal is a multi-valued picture representing the ratio of synthesizing the texture inside the object and that outside the object. For a pixel at which the soft key signal has its maximum value, the pixel value of the texture of the object is directly used as the output picture, whereas, for a pixel at which the soft key signal has its minimum value, the pixel value of the texture of the background is directly used as the output picture. For a pixel of an intermediate value, the pixel values of both textures are synthesized in a ratio depending on the key value, and the resulting synthesized pixel value is used as the output picture.
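The synthesis by key signals described above may be illustrated by the following minimal sketch. It assumes a linear blend with rounding and an 8-bit key range (key_max = 255); the text only states that the textures are synthesized in a ratio depending on the key value, so the exact formula, the function names composite_pixel and composite, and the representation of a picture as a list of rows are illustrative assumptions. A hard key signal is the special case key_max = 1.

```python
def composite_pixel(obj, bg, key, key_max=255):
    """Blend one object pixel with one background pixel using a key value.

    Assumption: a linear blend. key == key_max selects the object texture,
    key == 0 selects the background texture.
    """
    if not 0 <= key <= key_max:
        raise ValueError("key value out of range")
    # Integer arithmetic with rounding to the nearest pixel value.
    return (key * obj + (key_max - key) * bg + key_max // 2) // key_max


def composite(obj_tex, bg_tex, key_pic, key_max=255):
    """Apply composite_pixel over equally sized pictures (lists of rows)."""
    return [
        [composite_pixel(o, b, k, key_max)
         for o, b, k in zip(obj_row, bg_row, key_row)]
        for obj_row, bg_row, key_row in zip(obj_tex, bg_tex, key_pic)
    ]
```

With a soft key, composite_pixel(200, 100, 128) yields an intermediate value between the two textures, whereas with a hard key (key_max=1) each output pixel is copied unchanged from either the object texture or the background texture.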
On the other hand, in a system for transmitting or storing moving picture signals, the picture signals may be compression-coded by exploiting the intra-frame or inter-frame correlation of the moving picture signals for enabling efficient utilization of the transmission channel or the storage medium. Among the techniques of compression-coding the moving picture signals, there is an encoding system standardized by the group of experts on moving picture encoding termed MPEG (Moving Picture Experts Group).
As the encoding method for picture signals exploiting the above-mentioned intra-frame correlation, an orthogonal transform, which concentrates the coefficients to be encoded, is frequently used as far as the texture is concerned, while a method based on the so-called MMR (Modified Modified READ) or a method based on JBIG (Joint Bi-level Image Experts Group) is conceivable as far as the shape is concerned.
As a method utilizing the above-mentioned inter-frame correlation, motion compensated inter-frame prediction is frequently used. The principle of this motion compensated inter-frame prediction is now explained with reference to FIG. 2.
It is assumed that pictures P1 and P2 have been generated at time points t1 and t2, the picture P1 is already sent and the picture P2 is being newly sent, as shown in FIG. 2. At this time, the picture P2 is divided into plural blocks for each of which the amount of motion (motion vector) between it and the picture P1 is detected. The picture P1 is moved in translation in an amount equal to the motion vector to give a prediction picture for the block, and a difference picture between the prediction picture and the block of the picture P2 is found. The difference picture and the motion vector are encoded by way of the above-mentioned motion compensated inter-frame prediction.
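The principle explained with reference to FIG. 2 may be sketched as follows. The sketch assumes a full search over a small window with the sum of absolute differences (SAD) as the matching criterion; neither the search strategy nor the criterion is specified in the text, and the function names are hypothetical. A motion vector (dx, dy) here denotes the displacement from a block of the picture P2 (cur) to the matching area of the picture P1 (ref).

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between an n x n block of cur at (bx, by)
    and the area of ref displaced by (dx, dy)."""
    return sum(
        abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
        for y in range(n) for x in range(n)
    )


def find_motion_vector(cur, ref, bx, by, n, search):
    """Full search over displacements within +/-search pixels (assumed method)."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            # Skip candidates whose reference area leaves the picture.
            if by + dy < 0 or by + dy + n > height:
                continue
            if bx + dx < 0 or bx + dx + n > width:
                continue
            # Ties are broken in favour of the shorter vector.
            cost = (sad(cur, ref, bx, by, dx, dy, n), abs(dx) + abs(dy))
            if best is None or cost < best[0]:
                best = (cost, (dx, dy))
    return best[1]


def difference_block(cur, ref, bx, by, mv, n):
    """Difference picture between the block of cur and its prediction from ref."""
    dx, dy = mv
    return [
        [cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x] for x in range(n)]
        for y in range(n)
    ]
```

The motion vector returned by find_motion_vector and the block returned by difference_block correspond to the two items that are encoded for each block.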
Since the motion compensated inter-frame prediction is effective both for encoding the texture and for encoding the shape, it is used for both in the object scalable encoding. Since the motion vector of the texture is correlated with that of the shape, it is common practice to use the motion vector of the texture for predicting the motion vector of the shape.
FIG. 3 shows an illustrative structure of an encoding device for the shape moving picture and the texture moving picture exploiting the above-mentioned motion compensated inter-frame prediction and the motion vector prediction, while FIG. 4 shows the structure of a decoding device which is a counterpart of the encoding device.
The moving picture encoding device shown in FIG. 3 encodes the shape moving picture entering a shape input terminal 101 and the texture moving picture entering a texture input terminal 108 to output the resulting encoded signals at a code output terminal 112.
The texture entering the texture input terminal 108 is sent to a texture motion detector 109 and to a texture encoder 111. The texture motion detector 109 detects the amount of motion between the input texture and the texture picture locally decoded by the texture encoder 111, as later explained, to output the texture motion vector on the block basis. In detecting the texture motion vector, a locally decoded shape picture, as later explained, is used. That is, since the texture motion vector is detected on the block basis, the locally decoded shape picture is used to omit the detecting operation for the background portion if the block contains an edge between the human being and the background.
The texture motion vector detected by the texture motion detector 109 is sent to a texture motion compensation unit 110 and to a texture motion vector encoder 106 for texture encoding, while also being sent to a shape motion detector 102 and to a shape motion vector encoder 105 for shape encoding. The texture motion compensation unit 110 creates a prediction texture picture from the locally decoded picture, using the texture motion vector, and enters the prediction texture picture to the texture encoder 111. The texture encoder 111 encodes the input texture on the block basis. The texture motion vector encoder 106 calculates the difference between the texture motion vector and the texture motion vector of a previously encoded block to encode the resulting difference texture motion vector.
The shape entered from the shape input terminal 101 is sent to the shape motion detector 102 and to a shape encoder 104, as later explained. The shape motion detector 102 detects, on the block basis, the amount of motion between the input shape and the shape picture locally decoded by the shape encoder 104. In detecting this shape motion vector, reference is made to the texture motion vector in order to find a motion vector having a lesser difference from the texture motion vector, so as not to increase the amount of generated bits at the time of encoding the shape motion vector, as later explained. The detected shape motion vector is entered to a shape motion compensation unit 103 and to the shape motion vector encoder 105 for shape encoding. The shape motion vector encoder 105 calculates a difference between the shape motion vector and the texture motion vector of the previously encoded block to encode the difference shape motion vector. The shape motion compensation unit 103 generates a prediction shape picture from the locally decoded shape picture, using the shape motion vector, to enter the produced prediction shape picture in the shape encoder 104. The shape encoder 104 encodes the input shape, based on the prediction shape picture, from one block to another.
Output signals of the shape encoder 104, shape motion vector encoder 105, texture motion vector encoder 106 and the texture encoder 111 are multiplexed by a multiplexer 107 so as to be outputted as encoded data at the code output terminal 112. This encoded data is transmitted over a communication network to a receiving side, or recorded on a recording medium for later reproduction by a reproducing device.
The encoding method for the shape motion vector and the texture motion vector is summarized as follows. The texture motion vector is encoded as a difference from the texture motion vector of the previously detected block (difference texture motion vector). The shape motion vector is encoded as a difference from the texture motion vector of the previously encoded block, that is the directly previous block (difference shape motion vector).
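The difference calculation summarized above may be sketched as follows. The sketch assumes that motion vectors are integer pairs (x, y), that the blocks are processed in a single sequence, and that the predictor for the first block is (0, 0); these points, as well as the function name, are illustrative assumptions not stated in the text. The subsequent variable length encoding of the differences is omitted.

```python
def encode_motion_vectors(texture_mvs, shape_mvs):
    """Return one (difference shape MV, difference texture MV) pair per block.

    Both differences are taken from the texture motion vector of the directly
    previous block, as in the related-art method described in the text.
    """
    prev_tex = (0, 0)  # assumed predictor for the first block
    differences = []
    for tex, shp in zip(texture_mvs, shape_mvs):
        d_shape = (shp[0] - prev_tex[0], shp[1] - prev_tex[1])
        d_tex = (tex[0] - prev_tex[0], tex[1] - prev_tex[1])
        differences.append((d_shape, d_tex))
        prev_tex = tex  # becomes the predictor for the next block
    return differences
```

It is seen from the sketch that the shape motion vector of a block is never predicted from the texture motion vector of the same block, only from that of the preceding block.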
For texture encoding, the locally decoded shape picture must be found beforehand. The texture encoder 111 encodes the texture on the block basis. If a block contains an edge between the human being and the background, and the texture within the block is encoded in this state, high frequency components are produced, which precludes efficient encoding. Thus, if a block contains an edge, pixel substitution between the background portion and the human-being portion of the block is performed by exploiting the locally decoded shape picture.
The time flow of encoding of the above-mentioned shape motion vector and texture motion vector is as shown in FIG. 5. The processing of FIG. 5 is iterated from block to block. The following processing is carried out for each block.
First, at step ST101, one of the previously encoded texture motion vectors (usually, the texture motion vector lying on the left or upper side of the block being encoded) is selected, and a difference shape motion vector between the texture motion vector and the shape motion vector of the block being encoded is calculated and encoded.
Then, at step ST102, the shape is encoded, using the shape motion vector, and the resulting encoded shape is locally decoded to find the locally decoded shape picture. Then, at step ST103, the texture motion vector is found using the locally decoded shape picture. As for the texture motion vector, a difference between the texture motion vector of the block for encoding and the previously encoded texture motion vector is calculated and encoded.
Then, at step ST104, the texture is encoded, using the texture motion vector, and the resulting encoded texture is locally decoded to find the locally decoded picture. Finally, at step ST105, it is judged whether or not the processing for all blocks has come to a close. If the processing has not come to a close, processing reverts to step ST101 to repeat the above processing. If the processing has come to a close, the flow of the flowchart is terminated.
The reciprocal reference between the texture motion vector and the shape motion vector is as shown in FIG. 6, from which it is seen that, in encoding the texture motion vector and the shape motion vector of the blocks B101 to B103, reference is made to the texture motion vector of the previously detected (encoded) other blocks and the differences (residuals) are encoded.
The decoding device for the moving pictures of the shape and the texture, shown in FIG. 4, outputs the shape moving picture, decoded from code data entering a code input terminal 121, at a shape output terminal 127, while outputting the texture moving picture at a texture output terminal 130.
That is, in FIG. 4, the encoded data from a transmission network, received by a receiving device, not shown, or encoded data from a recording medium, reproduced by the reproducing device, are separated by a demultiplexer 122 into codes of the shape, shape motion vector, texture and the texture motion vector.
The separated codes are sent to the associated decoders, that is a shape decoder 126, a shape motion vector decoder 123, a texture decoder 129 and a texture motion vector decoder 124, for decoding. The texture motion vector decoder 124 decodes the input codes to generate a difference texture motion vector. The texture motion vector of the previously decoded block (texture motion vector of a block lying on the left or upper side of the block being decoded) is added to the difference texture motion vector to decode the texture motion vector. This texture motion vector is entered to a texture motion compensation unit 128 and to the shape motion vector decoder 123.
The shape motion vector decoder 123 decodes the input code to generate the difference shape motion vector. The texture motion vector of the previously decoded block (texture motion vector of a block lying on the left or upper side of the block being decoded) is added to the difference shape motion vector to decode the shape motion vector. This shape motion vector is entered to a shape motion compensation unit 125.
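The reconstruction of the motion vectors on the decoding side may be sketched as follows. The sketch assumes that each block supplies an already entropy-decoded pair (difference shape motion vector, difference texture motion vector), that the previously decoded block is the directly preceding block in a single sequence, and that the predictor for the first block is (0, 0); the pair ordering, the predictor for the first block and the function name are illustrative assumptions.

```python
def decode_motion_vectors(differences):
    """Rebuild texture and shape motion vectors from per-block differences.

    differences: list of (difference shape MV, difference texture MV) pairs.
    Both vectors of a block are obtained by adding the texture motion vector
    of the previously decoded block to the respective difference.
    """
    prev_tex = (0, 0)  # assumed predictor for the first block
    texture_mvs, shape_mvs = [], []
    for d_shape, d_tex in differences:
        shp = (prev_tex[0] + d_shape[0], prev_tex[1] + d_shape[1])
        tex = (prev_tex[0] + d_tex[0], prev_tex[1] + d_tex[1])
        shape_mvs.append(shp)
        texture_mvs.append(tex)
        prev_tex = tex  # predictor for the next block
    return texture_mvs, shape_mvs
```

Since every block depends on the decoded texture motion vector of the preceding block, the blocks must be processed strictly in order.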
The shape motion compensation unit 125 generates a prediction shape picture, using the shape motion vector and the decoded shape picture from the shape decoder 126, as later explained, to send the generated prediction shape picture to the shape decoder 126. The shape decoder 126 decodes the codes from the demultiplexer 122, using the prediction shape picture, to produce a decoded shape picture which is outputted. This decoded shape picture is sent both to a shape output terminal 127 and to the shape motion compensation unit 125.
The texture motion compensation unit 128 generates a prediction texture picture, using the texture motion vector and the decoded texture picture from the texture decoder 129, as later explained, and sends the generated prediction texture picture to the texture decoder 129. The texture decoder 129 decodes the code from the demultiplexer 122, using the prediction texture picture, to produce a decoded texture picture which is outputted. This decoded texture picture is sent both to a texture output terminal 130 and to the texture motion compensation unit 128.
Although not shown, this decoded shape picture is used for synthesizing the decoded picture with the background picture, not shown, for producing a decoded reproduced picture.
With the above-described encoding method, the shape motion vector of a given block is encoded as its difference from the texture motion vector of another block. The encoding efficiency is therefore low, since code volume is required both for the difference arising from the use of the motion vector of a different block and for the difference arising between motion vectors of different sorts of pictures, namely the shape and texture pictures.
Moreover, simply using the texture motion vector of the same block for encoding the shape motion vector is difficult except in the case of reversible shape encoding, since a locally decoded shape picture is required in order to find the texture motion vector, as described above. In the case of reversible encoding, however, the encoding efficiency cannot be increased, which worsens the overall encoding efficiency.
In addition, since the texture motion vector of the previous block is required in order to find the shape motion vector, and the locally decoded shape picture or decoded reproduced picture is required for encoding or decoding the texture, the relation of interdependence is complex, thus complicating the control.
There is also the problem that, if the amplitude of the texture picture is smaller than the texture noise amplitude and the motion vector giving the smallest difference (residual) is selected, motion vectors of arbitrary directions are generated from block to block, producing a disorderly state of the motion vectors and increasing the code volume required for encoding the texture motion vector.