In video coding in general, interframe predictive coding is used to achieve high encoding efficiency by exploiting temporal correlation. Frame encoding modes include the “I frame”, which is encoded without using interframe correlation; the “P frame”, which is predicted from one previously encoded frame; and the “B frame”, which can be predicted from two previously encoded frames.
More specifically, a “P frame” can be predicted using an “I frame” or a “P frame”, and a “B frame” can be predicted using an “I frame”, a “P frame”, or a “B frame”. In particular, in the video coding standard H.264, decoded images of a plurality of frames are stored in a reference image (or picture) memory in an encoding apparatus, and any reference image can be selected and read out from the memory to be used for prediction. Additionally, in a P frame, prediction is performed using a temporally past frame in the input video image, whereas in a B frame, prediction can be performed using not only a temporally past frame but also a future frame.
In FIG. 7, part (a) shows an example of a prediction relationship assigned to a video image.
Regarding a B frame for which prediction is performed using two frames (bidirectional prediction), the image data of the two relevant frames are interpolated to generate image data for one frame. When the first to seventh frames are encoded with the encoding mode sequence “IBBPBBP”, the prediction relationship is as shown in part (a). Because a B frame may refer to a temporally later frame, that later frame must be encoded before it. Therefore, with frame numbers 1 to 7 assigned to the frames shown in part (a) from the left, the frames are actually encoded in the frame-number order “1→4→2→3→7→5→6”, as shown in part (b) of FIG. 7.
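The reordering described above can be illustrated with a minimal Python sketch. It assumes that every B frame depends on the nearest I or P frame that follows it in display order, which matches the “IBBPBBP” example; the function name `coding_order` is hypothetical and is not part of any standard.

```python
def coding_order(frame_types):
    """Return 1-based display numbers in the order the frames are encoded.

    Assumption: each B frame needs the next I/P frame (in display order)
    as a reference, so that I/P frame is encoded before the B frames.
    """
    order = []
    pending_b = []  # B frames waiting for their future reference frame
    for number, frame_type in enumerate(frame_types, start=1):
        if frame_type == "B":
            pending_b.append(number)
        else:  # "I" or "P": a frame that B frames may refer to
            order.append(number)
            order.extend(pending_b)
            pending_b.clear()
    order.extend(pending_b)  # trailing B frames with no later I/P frame
    return order


print(coding_order("IBBPBBP"))  # [1, 4, 2, 3, 7, 5, 6]
```

For the sequence “IBBPBBP” this reproduces the order “1→4→2→3→7→5→6” given in the text.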
In an interlaced video image, one frame includes two fields. In this case also, a prediction relationship is determined for each field, as in the above case. A frame or a field is generically called a “picture”. In the bidirectional prediction for a B frame, prediction can also be performed using two past frames or two future frames. For example, in the video coding standard H.264, a plurality of frames of decoded images are stored in a reference image memory, and reference images for two frames can be selected and read out from the memory so as to perform the prediction. Here, the index times of the selected frames after decoding may be before or after the index time of a target frame to be encoded.
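As a rough illustration of prediction from two selected reference images, the following Python sketch picks two entries from a reference image memory and combines them sample by sample. The equal-weight rounded average is an assumption made for simplicity, and motion compensation is omitted entirely; the names `reference_memory` and `bi_predict` are hypothetical.

```python
def bi_predict(ref_a, ref_b):
    """Combine two equally sized reference blocks by a rounded average.

    Assumption: both references are weighted equally and are already
    aligned with the target block (no motion compensation).
    """
    if len(ref_a) != len(ref_b):
        raise ValueError("reference blocks must have the same size")
    return [(a + b + 1) >> 1 for a, b in zip(ref_a, ref_b)]


# Hypothetical reference image memory: time index -> decoded samples.
reference_memory = {
    1: [10, 20, 30, 40],
    4: [20, 21, 30, 44],
    7: [30, 22, 31, 48],
}

# Either reference may lie before or after the target frame in time.
prediction_past_future = bi_predict(reference_memory[1], reference_memory[4])
prediction_two_future = bi_predict(reference_memory[4], reference_memory[7])
```

The encoder would then encode only the difference between the target block and such a prediction.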
In addition, when a set of pictures headed by an “I picture” is defined as a GOP (group of pictures), a temporal random access function for the encoded data can easily be realized in units of GOPs.
With regard to this GOP, data indicating the head of the GOP is provided before the encoded data of a specific picture, so as to indicate that this picture is the head of a GOP consisting of a plurality of pictures starting from this picture. In the MPEG-2 standard, the head of a GOP is indicated by inserting a code having a specific bit pattern.
That is, the encoded data of one GOP lies between two consecutive codes that each indicate the head of a GOP. Instead of including a code indicating the head of a GOP in the encoded data, GOP formation data independent of the encoded data may be employed.
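The idea that one GOP lies between two head-of-GOP codes can be sketched in Python as follows. The byte pattern used is the MPEG-2 group_start_code (0x000001B8); the function name `split_gops` and the toy byte stream are hypothetical, and the sketch ignores everything else a real MPEG-2 parser must handle.

```python
GOP_START_CODE = b"\x00\x00\x01\xb8"  # MPEG-2 group_start_code


def split_gops(stream):
    """Split a byte stream into chunks that each begin with a GOP start code.

    Bytes before the first start code (for example, other headers) are
    not returned by this simplified sketch.
    """
    starts = []
    position = stream.find(GOP_START_CODE)
    while position != -1:
        starts.append(position)
        position = stream.find(GOP_START_CODE, position + len(GOP_START_CODE))
    ends = starts[1:] + [len(stream)]
    return [stream[start:end] for start, end in zip(starts, ends)]


# Toy stream: one leading byte, then two GOPs with placeholder payloads.
toy_stream = b"\xaa" + GOP_START_CODE + b"\x01\x02" + GOP_START_CODE + b"\x03"
gops = split_gops(toy_stream)
```

Random access per GOP then amounts to jumping to the start offset of the desired chunk.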
Generally, after the data for indicating the head of the GOP, time data of the head frame of the GOP is also provided, which is used for implementing a temporal random access function. In addition, each picture may be provided with time data.
For example, such time data is called “TR” (temporal reference) in the H.263 standard. TR is data indicating the output order of frames in terms of a unit time. If the unit time is set to 1/30 sec, incrementing the value by one for each frame corresponds to a frame rate of 30 frames/sec. Usually, TR is subjected to fixed-length encoding.
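The relation between TR values and output times can be illustrated with a short Python sketch. Because a fixed-length TR wraps around, the sketch accumulates differences modulo the field size; the 8-bit width, the requirement that the values are supplied in output order, and the name `tr_to_seconds` are assumptions made for this illustration.

```python
from fractions import Fraction


def tr_to_seconds(tr_values, unit=Fraction(1, 30), bits=8):
    """Convert fixed-length TR values, given in output order, to seconds.

    Assumptions: TR is `bits` wide and wraps around, and consecutive
    frames are less than one full wrap apart.
    """
    modulus = 1 << bits
    times = []
    elapsed = 0
    previous = None
    for tr in tr_values:
        if previous is None:
            elapsed = tr
        else:
            elapsed += (tr - previous) % modulus  # handles wrap-around
        previous = tr
        times.append(elapsed * unit)
    return times


# TR increasing by one per frame with a 1/30 sec unit: 30 frames/sec.
print(tr_to_seconds([0, 1, 2]))
# TR increasing by two per frame: every other time slot, i.e. 15 frames/sec.
print(tr_to_seconds([0, 2, 4]))
```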
In order to encode video images obtained by a plurality of (video) cameras, a method has been proposed in which each camera image is treated as a GOP, and predictive encoding is applied between GOPs so as to generate a single set of encoded video data.
For example, in Non-Patent Document 1 or Non-Patent Document 2 described later, “Base GOP” and “InterGOP” are defined so as to indicate a prediction relationship between the GOPs. Each picture included in the Base GOP refers to only pictures included in the same GOP, and each picture included in the InterGOP refers to pictures included in the same GOP or another GOP. The header portion of the InterGOP includes reference GOP data for indicating a GOP to be referred to.
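The reference rule for the two GOP types can be expressed as a small Python sketch. The class `Gop`, its field names, and the function `allowed_reference` are hypothetical and do not reproduce the syntax of Non-Patent Documents 1 and 2; in particular, the sketch assumes that an InterGOP may refer only to the other GOPs listed in its reference GOP data.

```python
from dataclasses import dataclass, field


@dataclass
class Gop:
    gop_id: int
    kind: str  # "base" or "inter"
    # Reference GOP data carried in the InterGOP header (empty for a Base GOP).
    reference_gops: list = field(default_factory=list)


def allowed_reference(current, target_gop_id):
    """Return True if a picture in `current` may refer to a picture in the target GOP."""
    if target_gop_id == current.gop_id:
        return True  # both GOP types may refer to pictures in the same GOP
    return current.kind == "inter" and target_gop_id in current.reference_gops


base = Gop(gop_id=0, kind="base")
inter = Gop(gop_id=1, kind="inter", reference_gops=[0])

print(allowed_reference(base, 1))   # False: a Base GOP refers only to itself
print(allowed_reference(inter, 0))  # True: GOP 0 is listed as a reference GOP
```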
When a plurality of video images input from a plurality of cameras are obtained in advance, the viewing position and direction can be changed by switching between the input images. In this way, an image corresponding to a position where photographing was actually performed is obtained. In addition to this, a technique has also been proposed for producing an image corresponding to a viewing position or direction at or in which no photographing is performed.
For example, Non-Patent Document 3 described below discloses a technique for producing an image corresponding to a viewing position or direction at or in which no photographing is performed, by generating a ray space using images input from a plurality of cameras and extracting image data from the ray space.
Generally in such a video generating technique, when the same subject is included in input images obtained by a plurality of cameras, image data of the subject corresponding to a viewing position or direction at or in which no photographing is performed is generated using the obtained image data. That is, image data for a subject which is present over input images obtained by a plurality of cameras is generated using part of each input image.
An adaptive filtering method (refer to Non-Patent Document 4) and a table reference method (refer to Non-Patent Document 5) are examples of the above image generating technique.
Non-Patent Document 1: Hideaki Kimata and Masaki Kitahara, “Preliminary results on multiple view video coding (3DAV),” document M10976 MPEG Redmond Meeting, July, 2004.
Non-Patent Document 2: Hideaki Kimata, Masaki Kitahara, Kazuto Kamikura, Yoshiyuki Yashima, Toshiaki Fujii, and Masayuki Tanimoto, “System Design of Free Viewpoint Video Communication,” CIT2004, September, 2004.
Non-Patent Document 3: T. Fujii, T. Kimoto, and M. Tanimoto, “Compression of 3D Space Information based on the Ray Space Representation,” 3D Image Conference '96, pp. 1-6, July, 1996.
Non-Patent Document 4: T. Kobayashi, T. Fujii, T. Kimoto, and M. Tanimoto, “Interpolation of Ray-Space Data by Adaptive Filtering,” IS&T/SPIE Electronic Imaging 2000, 2000.
Non-Patent Document 5: M. Kawaura, T. Ishigami, T. Fujii, T. Kimoto, and M. Tanimoto, “Efficient Vector Quantization of Epipolar Plane Images of Ray Space By Dividing into Oblique Blocks,” Picture Coding Symposium 2001, pp. 203-206, 2001.
With regard to the video images obtained by a plurality of video cameras, when images having sufficiently high quality can be obtained by a technique for generating a video image corresponding to a viewing position or direction at or in which no photographing is performed, a desired image can be reproduced on the video decoding side without encoding corresponding image data obtained by a certain camera, thereby improving the encoding efficiency with respect to the images obtained by the plurality of video cameras.
However, conventional video coding methods provide no means for determining, on the video decoding side, whether a desired image can be reproduced without using the corresponding image obtained by a certain camera, and no means for encoding data indicating that such reproduction is possible. Consequently, the video images of all the video cameras are in fact encoded and output, and the encoding efficiency cannot be improved.