This invention relates to a motion detection apparatus for detecting motion of a block picture by so-called block matching.
As a system for high-efficiency encoding of picture signals, for example, a high-efficiency encoding system for picture signals for a so-called digital storage medium has been prescribed in a standardization proposal by the Moving Picture Experts Group (MPEG). The storage medium considered for the system is a storage medium having a continuous transfer rate of approximately 1.5 Mbits/sec or less, such as a compact disc (CD) or a digital audio tape (DAT). The storage medium is intended to be connected to a decoder not only directly but also via a transmission medium, such as a computer bus, a local area network (LAN) or a telecommunication link. In addition, the storage medium is intended to support not only forward reproduction, but also special functions, such as random access, high-speed reproduction, and reverse reproduction.
The following is the principle of the high-efficiency encoding for picture signals by MPEG.
That is, with the present high-efficiency encoding system, the difference between pictures is taken to lower redundancy along a time axis. Then, so-called discrete cosine transform (DCT) and variable length encoding are used to lower redundancy along a spatial axis.
The above-mentioned redundancy along the time axis is hereinafter explained.
In a continuous moving picture, in general, a picture at a certain moment is similar to pictures temporally preceding and succeeding it. Therefore, by taking the difference between a picture now to be encoded and the pictures temporally preceding it and then transmitting the difference, as shown for example in FIG. 1, it becomes possible to diminish redundancy along the time axis and to reduce the amount of the transmitted information. The picture encoded in this manner is termed a predictive-coded picture, P-picture or P-frame, as later explained. Similarly, by taking the difference between the picture now to be encoded and the pictures temporally preceding or succeeding it or interpolated pictures prepared from the temporally preceding and succeeding pictures, and then transmitting a smaller one of the differences, it becomes possible to diminish redundancy along the time axis to reduce the amount of the transmitted information. The picture encoded in this manner is termed a bidirectionally predictive-coded picture, B-picture or B-frame, as later explained. In FIG. 1, a picture shown by I indicates an intra-coded picture, I-picture or I-frame, while pictures P and B indicate the above-mentioned P-pictures and B-pictures, respectively.
For producing the predictive-coded pictures, so-called motion compensation is carried out.
According to the motion compensation, a block of 16×16 pixels, hereinafter referred to as a macro-block, is formed, constituted for example by a plurality of unit blocks of 8×8 pixels. A nearby block showing the minimum difference from the macro-block is then searched for, and the difference between the block thus found and the macro-block is taken, so that the volume of transmitted data can be reduced. For example, in the above-mentioned P-picture (predictive-coded picture), picture data produced by taking the difference from the motion-compensated predicted picture and picture data produced without taking the difference from the motion-compensated predicted picture are compared, and whichever has the smaller data volume is selected for each 16×16 pixel macro-block for encoding.
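The search for the nearby block showing the minimum difference can be illustrated by the following minimal sketch of an exhaustive block-matching search. The function names `sad` and `block_match`, the use of the sum of absolute differences as the difference measure, and the search range of 7 pixels are assumptions made for illustration only and are not prescribed by the encoding system described above.

```python
def sad(cur, ref, cx, cy, rx, ry, n):
    """Sum of absolute differences between the n x n block of `cur` whose
    top-left corner is (cx, cy) and the n x n block of `ref` at (rx, ry)."""
    return sum(
        abs(cur[cy + j][cx + i] - ref[ry + j][rx + i])
        for j in range(n)
        for i in range(n)
    )


def block_match(cur, ref, cx, cy, n=16, search=7):
    """Exhaustive search over displacements within +/- `search` pixels.

    `cur` and `ref` are pictures stored as lists of rows of pixel values.
    Returns ((dx, dy), cost) for the reference block with the smallest
    difference measure; (dx, dy) is the displacement from (cx, cy).
    """
    height, width = len(ref), len(ref[0])
    best_mv = (0, 0)
    best_cost = sad(cur, ref, cx, cy, cx, cy, n)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            rx, ry = cx + dx, cy + dy
            # Skip candidates that fall outside the reference picture.
            if rx < 0 or ry < 0 or rx + n > width or ry + n > height:
                continue
            cost = sad(cur, ref, cx, cy, rx, ry, n)
            if cost < best_cost:
                best_mv, best_cost = (dx, dy), cost
    return best_mv, best_cost
```

An exhaustive search is the simplest form of block matching; its cost grows with the square of the search range, which is why the range is kept small in this sketch.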
However, for a portion or picture which has emerged from behind a moved object, the volume of data to be transmitted is increased. Thus, with the B-picture, one having the smallest volume of the following four kinds of picture data is encoded: picture data obtained by taking the difference between a picture now to be encoded and the decoded and motion-compensated temporally preceding picture; picture data obtained by taking the difference between the picture now to be encoded and the decoded motion-compensated temporally succeeding picture; picture data obtained by taking the difference between a picture now to be encoded and an interpolated picture obtained by summing the temporally preceding and succeeding pictures; and the picture now to be encoded.
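The selection among the four kinds of picture data may be sketched as follows. The function name `select_b_mode`, the rounded average used to form the interpolated prediction, and the sum of absolute values used as a stand-in for the encoded data volume are all assumptions for illustration; an actual encoder would compare the volumes of the encoded data.

```python
def select_b_mode(cur, fwd, bwd):
    """Choose the candidate with the smallest (proxy) data volume.

    cur: block now to be encoded.
    fwd: motion-compensated block from the temporally preceding picture.
    bwd: motion-compensated block from the temporally succeeding picture.
    All three are equally sized lists of rows of pixel values.
    """
    def diff(a, b):
        return [[p - q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]

    def volume(block):
        # Assumption: sum of absolute values approximates the coded volume.
        return sum(abs(v) for row in block for v in row)

    # Assumption: the interpolated prediction is the rounded average.
    interp = [[(p + q + 1) // 2 for p, q in zip(rf, rb)]
              for rf, rb in zip(fwd, bwd)]

    candidates = {
        "forward": diff(cur, fwd),
        "backward": diff(cur, bwd),
        "interpolated": diff(cur, interp),
        "intra": cur,  # the picture now to be encoded, without a difference
    }
    mode = min(candidates, key=lambda name: volume(candidates[name]))
    return mode, candidates[mode]
```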
The redundancy along the spatial axis is hereinafter explained.
The difference between picture data is not transmitted directly, but is processed by discrete cosine transform (DCT) for each of the 8×8 pixel unit blocks. The DCT expresses the picture not by the pixel level but by how much and which frequency components of the cosine function are contained in the picture. For example, two-dimensional DCT converts data of the 8×8 pixel unit block into 8×8 coefficient blocks of cosine function components. For example, image signals of a natural scene photographed with a camera tend to become smooth signals. In this case, the data volume may be efficiently diminished by DCT processing of the picture signals.
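As an illustration of the two-dimensional DCT of an 8×8 unit block, the following sketch evaluates the transform directly from its definition. The function name `dct_2d` and the orthonormal scaling of the DCT-II are assumptions made for this example; a practical encoder would use a fast algorithm rather than the direct quadruple loop.

```python
import math

N = 8  # unit block size in pixels


def dct_2d(block):
    """Two-dimensional DCT-II of an N x N block (list of rows).

    Assumption: orthonormal scaling, so a flat block of value v yields a
    DC coefficient of N * v and zero for every other coefficient.
    """
    def scale(k):
        return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)

    coeffs = [[0.0] * N for _ in range(N)]
    for v in range(N):          # vertical frequency index
        for u in range(N):      # horizontal frequency index
            total = 0.0
            for y in range(N):
                for x in range(N):
                    total += (
                        block[y][x]
                        * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                        * math.cos((2 * y + 1) * v * math.pi / (2 * N))
                    )
            coeffs[v][u] = scale(u) * scale(v) * total
    return coeffs
```

For a perfectly smooth (constant) block, all of the energy ends up in the single coefficient at position (0, 0), which is the extreme case of the concentration described above.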
That is, if the picture signals are smooth as in the case of picture signals for a natural scene, larger coefficient values concentrate around a certain coefficient value as a result of DCT processing. If these values are quantized, most of the coefficients in the 8×8 coefficient blocks become zero, leaving only the larger coefficients. Thus, data of the 8×8 coefficient blocks are transmitted in the so-called zigzag scan sequence, by employing a so-called Huffman code consisting of a non-zero coefficient and a so-called zero run indicating how many 0s preceded the coefficient, and thereby the transmitted data volume can be reduced. The picture can be reconstructed by the reverse sequence at the decoder.
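The zigzag scan and the pairing of each non-zero coefficient with its preceding zero run may be sketched as follows. The function names `zigzag_order` and `run_level_pairs` are hypothetical, the coefficients are assumed to be already quantized integers, and the Huffman code table that maps each pair to a code word is not reproduced here.

```python
def zigzag_order(n=8):
    """Return the (row, column) positions of an n x n block in zigzag order.

    Positions are grouped by anti-diagonal (row + column). Odd diagonals
    are traversed with the row increasing, even diagonals with the row
    decreasing.
    """
    return sorted(
        ((r, c) for r in range(n) for c in range(n)),
        key=lambda rc: (
            rc[0] + rc[1],
            rc[0] if (rc[0] + rc[1]) % 2 else -rc[0],
        ),
    )


def run_level_pairs(coeffs):
    """Convert a quantized coefficient block into (zero_run, value) pairs.

    Trailing zeros produce no pair; in this sketch they are implied by the
    end of the block.
    """
    pairs = []
    run = 0
    for r, c in zigzag_order(len(coeffs)):
        value = coeffs[r][c]
        if value == 0:
            run += 1
        else:
            pairs.append((run, value))
            run = 0
    return pairs
```

Because the zigzag sequence visits low-frequency coefficients first, the non-zero values of a smooth block cluster at the start of the sequence and the long tail of zeros costs almost nothing to transmit.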
FIG. 2 shows a data structure handled by the above-described encoding system. The data structure is made up, from the bottom, of a block layer, a macro-block layer, a slice layer, a picture layer, a group-of-picture (GOP) layer, and a video sequence layer, as shown in FIG. 2. These layers are explained from the bottom side in FIG. 2.
Referring first to the block layer, the blocks of the block layer are constituted by 8×8 pixels (8 lines×8 pixels) having neighboring luminances or chrominances. Each of these unit blocks is processed by discrete cosine transform (DCT).
Referring to the macro-block layer, it is made up of six blocks, namely left upper, right upper, left lower and right lower unit luminance blocks Y0, Y1, Y2 and Y3 and unit chrominance blocks Cr, Cb which are in the same positions as those of the unit luminance blocks when viewed on the picture. These blocks are transmitted in the sequence of Y0, Y1, Y2, Y3, Cr and Cb. In the present encoding system, which picture to use as a predictive picture or a reference picture for difference taking, or whether to transmit the difference, is decided from one macro-block to another.
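The arrangement of the six blocks of a macro-block may be sketched as follows. The function name `macroblock_blocks` is hypothetical, and the sketch assumes that the luminance data are given as a 16×16 array and that each chrominance block is already available as an 8×8 array covering the same picture area.

```python
def macroblock_blocks(y16, cr8, cb8):
    """Split a macro-block into its six 8 x 8 unit blocks.

    y16: 16 x 16 luminance samples (list of rows).
    cr8, cb8: 8 x 8 chrominance samples for the same picture area.
    Returns (name, block) tuples in the transmission sequence
    Y0, Y1, Y2, Y3, Cr, Cb.
    """
    def sub_block(row0, col0):
        return [row[col0:col0 + 8] for row in y16[row0:row0 + 8]]

    return [
        ("Y0", sub_block(0, 0)),  # left upper
        ("Y1", sub_block(0, 8)),  # right upper
        ("Y2", sub_block(8, 0)),  # left lower
        ("Y3", sub_block(8, 8)),  # right lower
        ("Cr", cr8),
        ("Cb", cb8),
    ]
```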
The slice layer is made up of one or more macro-blocks continuously arrayed in the picture scan sequence. At a leading end (header) of the slice, a difference of a dc component and a motion vector in the picture is reset, and the first macro-block has data indicating its position in the picture. Accordingly, the position in the image may be restored on error occurrence. Therefore, the length and the starting position of the slice are arbitrarily set and may be changed depending on the state of errors produced in the transmission path.
In the picture layer, each picture is made up of one or more of the slices. The pictures are classified into the above-mentioned four types of pictures, that is, the intra-coded picture (I-picture or I-frame), the predictive-coded picture (P-picture or P-frame), the bidirectionally predictive-coded picture (B-picture or B-frame), and DC coded picture, each in accordance with the encoding system.
In the intra-coded picture (I-picture), only the information contained within that particular picture is employed at the time of encoding. In other words, the picture can be reconstructed solely from the information within that I-picture at the time of decoding. In effect, encoding is carried out by direct DCT processing without taking the difference. Although this type of encoding generally has a poor efficiency, random access or high-speed reproduction may be realized by inserting I-pictures at various places in the MPEG encoding system.
In the predictive-coded picture (P-picture), the I-picture or the P-picture which is in a temporally preceding position at the input and has already been decoded is used as the reference picture. In effect, whichever is higher in efficiency, of the encoded data obtained after taking the difference from the motion-compensated reference picture and the encoded data obtained without taking the difference (intra-coded), is selected from one macro-block to another.
In the bidirectionally predictive-coded picture (B-picture), three types of pictures are used as the reference picture, that is, the I-picture or P-picture which temporally precedes and has already been decoded, the I-picture or P-picture which temporally succeeds and has already been decoded, and interpolated pictures produced from these two pictures. In this manner, the most efficient one of the encoded data of the difference after the motion compensation and the intra-coded data may be selected from one macro-block to another.
The above-mentioned DC coded I-picture is an intra-coded picture constituted solely by DC coefficients of DCT, and cannot exist in the same sequence as the remaining three picture types.
The group-of-picture (GOP) layer is made up of one or more I-pictures and zero or plural non-I-pictures. If the input sequence to the encoder is, for example, 1I, 2B, 3B, 4P*5B, 6B, 7I, 8B, 9B, 10I, 11B, 12B, 13P, 14B, 15B, 16P*17B, 18B, 19I, 20B, 21B and 22P, the output sequence of the encoder, that is, the input sequence to the decoder, is 1I, 4P, 2B, 3B*7I, 5B, 6B, 10I, 8B, 9B, 13P, 11B, 12B, 16P, 14B, 15B, 19I, 17B, 18B, 22P, 20B and 21B. The reason such an exchange of sequence is carried out in the encoder is that, for encoding or decoding the B-picture, for example, the I-picture or the P-picture, which is the reference picture therefor, has to be encoded in advance of the B-picture. The interval for the I-picture, such as 9, or the interval for the I-picture or the B-picture, such as 3, is arbitrarily set. Besides, the I-picture or P-picture interval may be changed within the GOP layer. The junction point of the GOP layer is indicated by *, while the I-picture, P-picture and the B-picture are indicated by I, P and B, respectively.
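The exchange of sequence described above may be sketched as follows. The function name `coding_order` and the representation of each picture as a string ending in its type letter are assumptions for illustration; the GOP junction marks are omitted from the data.

```python
def coding_order(display_order):
    """Reorder pictures so that each reference picture precedes the
    B-pictures that depend on it.

    display_order: picture labels such as "1I", "2B" or "4P" in input
    sequence. B-pictures are held back until the next I-picture or
    P-picture has been emitted.
    """
    output = []
    pending_b = []
    for picture in display_order:
        if picture.endswith("B"):
            pending_b.append(picture)
        else:
            output.append(picture)     # I-picture or P-picture goes first
            output.extend(pending_b)   # then the B-pictures held back
            pending_b = []
    # Assumption: B-pictures with no later reference are emitted last.
    output.extend(pending_b)
    return output
```

Applied to the input sequence given above, this sketch yields the stated output sequence, beginning 1I, 4P, 2B, 3B, 7I, 5B, 6B.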
The video sequence layer is made up of one or more of the GOP layers having the same picture size and the same picture rate.
With the moving picture encoding system standardized by MPEG, the information of a picture compressed in itself is first transmitted, and then the difference between the picture and a motion-compensated picture therefor is transmitted.
In addition, in the conventional motion compensation, a motion vector is transmitted for each macro-block and, at the picture decoder, the already decoded picture is translated for motion compensation based on the motion vector, thereby reducing the difference from the picture and hence the volume of the transmitted information for enabling efficient picture transmission.
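The translation of the already decoded picture by the motion vector, and the reconstruction from the transmitted difference, may be sketched as follows. The function names `predict_block`, `residual` and `reconstruct` are hypothetical, and the sketch assumes whole-pixel motion vectors that keep the displaced block inside the reference picture.

```python
def predict_block(ref, x, y, mv, n=16):
    """Fetch the n x n block of the decoded reference picture `ref`
    displaced from (x, y) by the motion vector mv = (dx, dy)."""
    dx, dy = mv
    rx, ry = x + dx, y + dy
    if rx < 0 or ry < 0 or ry + n > len(ref) or rx + n > len(ref[0]):
        raise ValueError("displaced block lies outside the reference picture")
    return [row[rx:rx + n] for row in ref[ry:ry + n]]


def residual(cur_block, pred_block):
    """Difference transmitted by the encoder: current minus prediction."""
    return [[c - p for c, p in zip(rc, rp)]
            for rc, rp in zip(cur_block, pred_block)]


def reconstruct(pred_block, res_block):
    """Decoder side: prediction plus the transmitted difference."""
    return [[p + r for p, r in zip(rp, rr)]
            for rp, rr in zip(pred_block, res_block)]
```

Since the prediction is a pure translation of a block, a change in the angle of an object cannot be represented by the motion vector alone and must be carried entirely by the difference.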
However, in the case of a picture showing, for example, a man raising his arm, in which not only the position but also the angle of the arm is changed, the above-mentioned motion compensation cannot provide an appropriate predictive picture, thereby increasing the volume of the difference and deteriorating the picture quality.