A multi-viewpoint video image indicates a plurality of video images obtained by a plurality of (video) cameras disposed at different positions, which photograph the same subject and background. Below, a video image obtained by one camera is called a “two-dimensional video image”, and a set of two-dimensional video images obtained by photographing the same subject and background is called a “multi-viewpoint video image”.
Each two-dimensional video image included in the multi-viewpoint video image has strong temporal correlation. In addition, when the cameras are operated in synchronism with each other, the frames obtained by the cameras at the same time capture the subject and background in exactly the same state; thus, there is also strong correlation between the cameras.
A conventional technique relating to the coding of the two-dimensional video image will be discussed.
In a number of known two-dimensional video coding methods such as H.264, MPEG-4, or MPEG-2 which are international coding standards, highly efficient coding is performed using techniques such as motion compensation, orthogonal transformation, quantization, or entropy coding.
For example, in the case of H.264, “I frame” can be encoded using intra-frame correlation, “P frame” can be encoded using inter-frame correlation with respect to a plurality of past frames, and “B frame” can be encoded using inter-frame correlation with respect to past or future frames appearing at intervals.
The I frame is divided into blocks (called “macroblocks”, each having a size of 16×16), and intra prediction is performed in each macroblock. In the intra prediction, each macroblock may be further divided into smaller blocks (called “sub-blocks” below) so that intra prediction may be performed in each sub-block.
In the P frame, intra prediction or inter prediction can be performed in each macroblock. The intra prediction for the P frame is similar to that for the I frame. In the inter prediction, motion compensation is performed. Also in the motion compensation, the macroblock can be divided into smaller blocks, and divided sub-blocks may have different motion vectors and different reference images (or pictures).
Also in the B frame, intra prediction or inter prediction can be performed. In the inter prediction for the B frame, not only a past frame but also a future frame can be a reference image (or picture) for motion compensation. For example, a frame configuration of “I frame→B frame→B frame→P frame” is encoded in the order of I→P→B→B, so that the P frame is already available as a reference when the B frames are encoded. For the B frame, motion compensation can be performed with reference to the I frame and the P frame, and, as in the P frame, sub-blocks obtained by dividing each macroblock may have different motion vectors.
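The reordering described above can be illustrated by the following minimal sketch. It assumes that every B frame refers only to the nearest preceding and following I or P frame; the function name `coding_order` is hypothetical and is not taken from any standard.

```python
def coding_order(display_types):
    """Return frame indices in coding order for a list of frame types
    given in display order (assumption: B frames reference only the
    nearest I/P frames before and after them)."""
    order = []
    pending_b = []
    for idx, ftype in enumerate(display_types):
        if ftype == "B":
            pending_b.append(idx)    # must wait for the next I or P frame
        else:
            order.append(idx)        # the I or P frame is coded first
            order.extend(pending_b)  # then the B frames displayed before it
            pending_b = []
    order.extend(pending_b)          # trailing B frames without a future reference
    return order


display = ["I", "B", "B", "P"]
order = coding_order(display)
print([display[i] for i in order])
```

For the display order I, B, B, P, the sketch yields the coding order I, P, B, B given in the text.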
When intra or inter prediction is performed, a prediction residual is obtained. In each macroblock, a prediction residual block is defined and subjected to DCT (discrete cosine transform), and the resulting coefficients are quantized. More specifically, a macroblock having a block size of 16×16 is divided into sub-blocks, each having a size of 4×4, and 4×4 DCT is performed. The sequence of quantized values of DCT coefficients is represented using the following data:
    (i) Coded block pattern: data for indicating in which block a DCT coefficient which is not zero (called a “non-zero coefficient”) is present among four 8×8 blocks which can be defined in the relevant macroblock,
    (ii) Coded block flag: data for indicating in which 4×4 block the non-zero coefficient is present among four 4×4 blocks in the 8×8 block in which the non-zero coefficient is present,
    (iii) Significance map: data for indicating which coefficient is the non-zero coefficient among DCT coefficients which are present in the 4×4 block indicated by the coded block flag data, and
    (iv) Level data: data indicating the value of the non-zero coefficient indicated by the significance map data.
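The four kinds of data listed above can be illustrated by the following simplified sketch. It is not the H.264 bitstream syntax: it assumes a single 16×16 array of quantized luma coefficients, scans each 4×4 block in raster order rather than in the scan order used by the standard, and uses a hypothetical function name `describe_macroblock`.

```python
def describe_macroblock(coeffs):
    """coeffs: 16x16 list of lists of quantized coefficient values.
    Returns (coded_block_pattern, details), where details maps the origin
    of each coded 8x8 block to a list of four per-4x4-block entries."""
    def block(y0, x0, size):
        # Values of a size x size block, in raster order (a simplification).
        return [coeffs[y][x]
                for y in range(y0, y0 + size)
                for x in range(x0, x0 + size)]

    cbp = []       # (i) one flag per 8x8 block
    details = {}
    for y8 in (0, 8):
        for x8 in (0, 8):
            has_nonzero = any(v != 0 for v in block(y8, x8, 8))
            cbp.append(int(has_nonzero))
            if not has_nonzero:
                continue
            entries = []
            for y4 in (y8, y8 + 4):
                for x4 in (x8, x8 + 4):
                    vals = block(y4, x4, 4)
                    flag = int(any(v != 0 for v in vals))   # (ii)
                    entry = {"coded_block_flag": flag}
                    if flag:
                        entry["significance_map"] = [int(v != 0) for v in vals]  # (iii)
                        entry["levels"] = [v for v in vals if v != 0]            # (iv)
                    entries.append(entry)
            details[(y8, x8)] = entries
    return cbp, details


mb = [[0] * 16 for _ in range(16)]
mb[0][0] = 5      # one non-zero coefficient in the top-left 8x8 block
mb[9][13] = -2    # one non-zero coefficient in the bottom-right 8x8 block
cbp, details = describe_macroblock(mb)
print(cbp)
print(details[(8, 8)][1])
```

In this example only two of the four 8×8 blocks are flagged in the coded block pattern, so no further data needs to be produced for the other two.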
In addition to the data relating to the DCT coefficients, data indicating the method of dividing each macroblock into sub-blocks and the motion vectors are subjected to reversible encoding called “entropy encoding”, and encoding is completed.
Here, data to be entropy-encoded other than quantized values in pixel area and quantized values of transformation coefficients resulting from orthogonal transformation applied to an image block (the quantized values correspond to the above level data for the case of the DCT coefficients) is called “auxiliary data” below. In the case of H.264, the following are examples of the auxiliary data other than those relating to the DCT coefficients. This auxiliary data is provided for each macroblock:
    (i) Macroblock type or sub-macroblock type: the macroblock type is an index which indicates a combination of a designation whether intra prediction or inter prediction is performed in the macroblock, a prediction method, a block dividing method, and the like, and the sub-macroblock type is an index which indicates a combination of a prediction method in the sub-block, a block dividing method, and the like,
    (ii) Reference image index: an index value of a reference image (or picture) used for motion compensation in each sub-block, and
    (iii) Motion vector in each sub-block: in H.264, the motion vector is represented as a residual of prediction using peripheral motion vectors.
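The representation of a motion vector as a residual of prediction, mentioned in item (iii), can be sketched as follows. The sketch assumes a component-wise median of three peripheral motion vectors (left, above, and above-right) as the predictor, which corresponds to the common case in H.264; the special cases of the standard are omitted, and the function names are hypothetical.

```python
def median_predict(mv_left, mv_above, mv_above_right):
    """Component-wise median of three neighbouring motion vectors."""
    xs = sorted(v[0] for v in (mv_left, mv_above, mv_above_right))
    ys = sorted(v[1] for v in (mv_left, mv_above, mv_above_right))
    return (xs[1], ys[1])


def mv_residual(mv, neighbours):
    """Encoder side: the value that is actually entropy-encoded."""
    px, py = median_predict(*neighbours)
    return (mv[0] - px, mv[1] - py)


def mv_reconstruct(residual, neighbours):
    """Decoder side: same predictor plus the decoded residual."""
    px, py = median_predict(*neighbours)
    return (residual[0] + px, residual[1] + py)


neighbours = [(4, 0), (6, -2), (5, 1)]
mv = (6, 1)
res = mv_residual(mv, neighbours)
print(res, mv_reconstruct(res, neighbours))
```

Because neighbouring blocks tend to move similarly, the residual is usually small, and small values can be given short codes.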
A general explanation about entropy encoding will be provided below.
Entropy encoding is reversible encoding. Generally, reversible encoding is a process of converting a symbol to be encoded (which may be interpreted as a value extracted from a set of integers) to a bit sequence including digits 1 and 0. For example, when the symbol to be encoded is a value included in a set of integers “0, 1, 2, 3”, reversible encoding is implemented by encoding the symbol to (i) 00 when the symbol is 0, (ii) 01 when the symbol is 1, (iii) 10 when the symbol is 2, and (iv) 11 when the symbol is 3. Such encoding is called fixed-length encoding. A set of codes for encoding the symbol (in this example, “00, 01, 10, 11”) is called a “code table”.
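The fixed-length code table of this example can be written down directly. The following is a minimal sketch; the names `CODE_TABLE`, `encode`, and `decode` are chosen for illustration only.

```python
# Code table from the text: each symbol in {0, 1, 2, 3} maps to two bits.
CODE_TABLE = {0: "00", 1: "01", 2: "10", 3: "11"}
DECODE_TABLE = {code: symbol for symbol, code in CODE_TABLE.items()}


def encode(symbols):
    """Concatenate the 2-bit code of every symbol."""
    return "".join(CODE_TABLE[s] for s in symbols)


def decode(bits):
    """Split the bit string into 2-bit codes and look each one up."""
    return [DECODE_TABLE[bits[i:i + 2]] for i in range(0, len(bits), 2)]


bits = encode([0, 3, 1, 2])
print(bits, decode(bits))
```

Decoding recovers the original symbol sequence exactly, which is what is meant by reversible encoding.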
Fixed-length encoding is reversible; however, its encoding efficiency is poor. In information theory, it is known that highly efficient reversible encoding can be performed by using the probability of symbol appearance (i.e., the probability distribution over the set of integers). Generally, a short code is allocated to a symbol having a high probability of appearance, while a long code is allocated to a symbol having a low probability of appearance, so that, on average, encoding is more efficient than fixed-length encoding. Such reversible encoding using a probability distribution is called “entropy encoding”.
However, in order to perform such highly efficient entropy encoding, the probability distribution of the symbol to be encoded must be known before encoding. Therefore, conventionally, the probability distribution is determined empirically, or learned while executing encoding. In addition, there are known methods of obtaining an optimum code table based on the probability distribution of the symbol (e.g., methods using Huffman codes or arithmetic codes). Therefore, in the following explanation, the probability distribution is treated as equivalent to the code table.
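As an illustration of how a code table follows from a probability distribution, the following sketch derives Huffman code lengths for the four symbols of the earlier example. The probability values are assumed purely for illustration, and the function name `huffman_code_lengths` is hypothetical.

```python
import heapq
from itertools import count


def huffman_code_lengths(probs):
    """Return {symbol: code length in bits} for a Huffman code
    built from the given {symbol: probability} mapping."""
    tiebreak = count()  # keeps heap entries comparable when probabilities tie
    heap = [(p, next(tiebreak), [sym]) for sym, p in probs.items()]
    heapq.heapify(heap)
    lengths = {sym: 0 for sym in probs}
    while len(heap) > 1:
        p1, _, syms1 = heapq.heappop(heap)
        p2, _, syms2 = heapq.heappop(heap)
        for s in syms1 + syms2:
            lengths[s] += 1  # every merge adds one bit to the merged symbols
        heapq.heappush(heap, (p1 + p2, next(tiebreak), syms1 + syms2))
    return lengths


# Assumed distribution: symbol 0 is frequent, symbols 2 and 3 are rare.
probs = {0: 0.5, 1: 0.25, 2: 0.125, 3: 0.125}
lengths = huffman_code_lengths(probs)
average = sum(probs[s] * lengths[s] for s in probs)
print(lengths, average)
```

With this assumed distribution the average code length is 1.75 bits per symbol, compared with 2 bits per symbol for the fixed-length code table.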
When entropy encoding is applied to the encoding of auxiliary data, pixel values, and transformation coefficient values of a video image, the probability distribution of such data varies with the position in the image. Therefore, in order to perform highly efficient encoding, it is necessary to switch the code table in accordance with the position in the image, so as to select an appropriate code table used for encoding.
In H.264, such switching is executed using a method called “CABAC (context-adaptive binary arithmetic coding)” (see Reference Document 1: Non-Patent Document 1 described later). Below, a general explanation of CABAC in H.264 will be provided, using the encoding of a macroblock type as an example.
In CABAC, when the macroblock type of a macroblock is encoded, the code table is switched with reference to the already-encoded macroblock types of the macroblocks which are positioned above and to the left of the target macroblock.
FIG. 17 shows the concept of such a reference relationship. In FIG. 17, the macroblocks indicated by reference symbols A and B have a strong correlation with the target macroblock to be encoded.
In CABAC, an optimum code table is estimated using this correlation. Specifically, code tables are respectively assigned to all possible macroblock-type combinations between the macroblocks A and B, and the target macroblock type (to be encoded) is entropy-encoded using a code table (i.e., probability distribution) corresponding to the actual values of the macroblock types of the macroblocks A and B. Other data to be encoded is also entropy-encoded based on the same concept.
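The concept of assigning one code table to each combination of the macroblock types of A and B can be sketched as follows. This is not the CABAC algorithm itself, which binarizes each symbol and drives an arithmetic coder; the sketch only keeps one adaptively learned probability distribution per context and reports the ideal code length of a symbol, and the class name `ContextModel` is hypothetical.

```python
import math
from collections import defaultdict


class ContextModel:
    """One probability distribution (i.e., code table) per context,
    where the context is the pair (type of A, type of B)."""

    def __init__(self, num_types):
        # Every count starts at 1 so that no symbol has probability zero.
        self.counts = defaultdict(lambda: [1] * num_types)

    def cost_bits(self, type_a, type_b, target):
        """Ideal code length of `target` under the table selected by (A, B)."""
        table = self.counts[(type_a, type_b)]
        return -math.log2(table[target] / sum(table))

    def update(self, type_a, type_b, target):
        """Learn the distribution while encoding proceeds."""
        self.counts[(type_a, type_b)][target] += 1


model = ContextModel(num_types=3)
before = model.cost_bits(0, 0, 0)
for _ in range(10):
    model.update(0, 0, 0)   # type 0 keeps occurring when A and B are both type 0
after = model.cost_bits(0, 0, 0)
print(before, after, model.cost_bits(1, 2, 0))
```

After the updates, type 0 becomes cheaper to encode in the context where A and B are both type 0, while the table of an unrelated context is unchanged.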
Next, conventional encoding of a multi-viewpoint video image will be discussed.
In the conventional encoding of a multi-viewpoint video image, in order to improve encoding efficiency using the temporal correlation and the parallactic (or inter-view) correlation (i.e., correlation between cameras) described above, encoding that employs temporal prediction and compensation between cameras is used. Reference Document 2 (Non-Patent Document 2 described later) shows an example of such a method.
In this method, sets of frames called “GOPs” are classified into “Base GOPs” and “InterGOPs” for encoding. For the Base GOPs, all frames in the GOP are encoded by intra or inter prediction using images obtained by the same camera; for the InterGOPs, in addition to such intra or inter prediction, parallactic prediction using an image obtained by another camera may be used. Here, parallactic prediction means that when a macroblock of an image of a camera is encoded, an image obtained by another camera is used as a reference image so as to perform a process identical to motion compensation.
FIG. 18 shows an example of the GOP configuration in this method. In this example, each GOP has two images (or pictures), and each arrow indicates a reference relationship between images. In this method, temporal correlation and parallactic correlation are used for encoding, thereby obtaining an encoding efficiency higher than that obtained when using only temporal correlation.
    Non-Patent Document 1: Detlev Marpe, et al., “Context-Based Adaptive Binary Arithmetic Coding in the H.264/AVC Video Compression Standard”, IEEE Transactions on Circuits and Systems for Video Technology, Vol. 13, No. 7, pp. 620-636, July, 2003.
    Non-Patent Document 2: Hideaki Kimata and Masaki Kitahara, “Preliminary results on multiple view video coding (3DAV)”, document M10976, MPEG Redmond Meeting, July, 2004.
However, in the method disclosed by Reference Document 2, when a target multi-viewpoint video image is encoded, whether encoding is performed using temporal correlation or parallactic correlation is determined first based on the encoding efficiency, and once it has been determined that the encoding is performed using temporal correlation, parallactic correlation is never considered in the actual encoding.
In this case, if temporal variation in the subject and background is not so large in the multi-viewpoint video image to be encoded and thus temporal correlation is stronger than parallactic correlation, this method of Reference Document 2 cannot improve the encoding efficiency in comparison with a method using only temporal correlation.
This is because when the method of Reference Document 2 is applied to such a multi-viewpoint video image, temporal prediction is always used and encoding almost identical to the method using only temporal correlation is performed.
However, even when only temporal correlation is used, auxiliary data such as a prediction residual, coefficients of orthogonal transformation, motion vectors, or macroblock types has correlation between cameras, and this correlation may be used in encoding such data.
On the other hand, regarding encoding in which a prediction error in motion compensation is quantized in a pixel area and an obtained quantized value is encoded, a method for encoding a two-dimensional image disclosed in Reference Document 3 (T. Shiodera, I. Matsuda, S. Itoh, “Lossless Video Coding Based on Motion Compensation and 3D Prediction ~A Study on Context Modeling~”, the proceedings of the FIT 2003, No. J-053, pp. 303-304, September, 2003) may be used.
In the method of Reference Document 3, when a quantized value of a prediction residual is entropy-encoded, a motion vector obtained for each block is used for referring to a quantized value of a previous frame which has been encoded, so as to switch the code table.
Specifically, when given a position (x, y) of a target pixel to be encoded and a motion vector (mx, my) of a block which includes this pixel, quantized values corresponding to pixels in the vicinity of the position (x+mx, y+my) in the previous frame are referred to so as to switch the code table. The quantized values corresponding to the pixels in the vicinity of the position (x+mx, y+my) have correlation with the quantized value of the target pixel to be encoded; thus, the encoding efficiency can be improved using this method.
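The reference to the previous frame described above can be sketched as follows. The 3×3 neighbourhood, the clamping at the frame border, and the rule that maps the neighbouring quantized values to a code table index are assumptions made for illustration and are not taken from Reference Document 3; the function name `context_index` is hypothetical.

```python
def context_index(prev_q, x, y, mx, my, num_contexts=4):
    """Select a code table index for the target pixel at (x, y).

    prev_q: 2-D list of already-encoded quantized values of the
            previous frame, indexed as prev_q[row][column].
    (mx, my): motion vector of the block containing the target pixel.
    """
    height, width = len(prev_q), len(prev_q[0])
    cx, cy = x + mx, y + my   # position referred to in the previous frame
    total = 0
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            # Clamp to the frame so that border pixels stay valid.
            px = min(max(cx + dx, 0), width - 1)
            py = min(max(cy + dy, 0), height - 1)
            total += abs(prev_q[py][px])
    # Assumed rule: larger neighbouring magnitudes select a higher index.
    return min(total // 3, num_contexts - 1)


prev_q = [[0] * 6 for _ in range(6)]
prev_q[3][4] = 5   # a large quantized value at column 4, row 3
print(context_index(prev_q, 2, 2, 2, 1))   # motion vector points at that area
print(context_index(prev_q, 2, 2, 0, 0))   # zero motion vector misses it
```

The same target pixel is assigned a different code table depending on the motion vector, because the motion vector determines which previously encoded values are consulted.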
When this method is applied to entropy encoding of the quantized value of a prediction residual (in a pixel area) in a multi-viewpoint video image, a parallax (or disparity) vector for the same frame with respect to adjacent cameras is obtained for each block, and the process performed with respect to time using the method of Reference Document 3 may be performed with respect to parallax.
However, in such a straightforward extension, although the encoding efficiency of the quantized value of the prediction residual itself can be improved, the parallax vector for each block must also be encoded; thus, it is difficult to improve the overall encoding efficiency.
In addition, such a straightforward extension cannot efficiently encode data other than the prediction residual in the pixel area (e.g., auxiliary data such as coefficients of orthogonal transformation, motion vectors, or macroblock types). This is because there is no correlation between the prediction residual in the pixel area and the other data to be encoded.