In recent years, apparatuses that comply with compression schemes such as MPEG have become increasingly prevalent, both for the distribution of information by broadcast stations and the like and for the receipt of information in general consumer homes. Such schemes compress image information by means of an orthogonal transform, such as a discrete cosine transform, and motion compensation, utilizing redundancy specific to image information in order to realize high-efficiency transmission and accumulation of information.
In particular, MPEG2 (ISO/IEC 13818-2) is defined as a general-purpose image encoding scheme. It is a standard that covers both interlaced scanned images and progressive scanned images as well as standard-definition images and high-definition images, and it is now widely used for a wide variety of applications including professional applications and consumer applications.
With the use of MPEG2, a high compression ratio and high image quality can be achieved by, for example, assigning a code rate (bit rate) of 4 to 8 Mbps to a standard-definition interlaced scanned image having 720×480 pixels. Similarly, a high compression ratio and high image quality can be achieved by assigning a code rate of 18 to 22 Mbps to a high-definition interlaced scanned image having 1920×1088 pixels.
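As a rough illustration of the compression ratios implied by these code rates, the following sketch compares an uncompressed bit rate with the coded bit rate. The frame rate (30 frames per second), 4:2:0 chroma sampling, and 8-bit sample depth are assumptions made only for this example and are not stated in the text above; the function names are likewise hypothetical.

```python
def raw_bit_rate(width, height, fps=30, bits_per_sample=8, chroma_factor=1.5):
    """Uncompressed bit rate in bits per second.

    chroma_factor=1.5 models 4:2:0 sampling (one luma plane plus two
    quarter-size chroma planes). All defaults are assumptions.
    """
    return width * height * chroma_factor * bits_per_sample * fps


def compression_ratio(width, height, coded_mbps, **kwargs):
    """Ratio of the uncompressed bit rate to a coded rate given in Mbps."""
    return raw_bit_rate(width, height, **kwargs) / (coded_mbps * 1_000_000)


if __name__ == "__main__":
    for label, (w, h), rates in [
        ("SD 720x480", (720, 480), (4, 8)),
        ("HD 1920x1088", (1920, 1088), (18, 22)),
    ]:
        for mbps in rates:
            ratio = compression_ratio(w, h, mbps)
            print(f"{label} at {mbps} Mbps: about {ratio:.1f}:1")
```

Under these assumptions the standard-definition case corresponds to a ratio of roughly 15:1 to 31:1; different frame rates or chroma formats would change the figures.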
MPEG2 has been intended primarily for high-quality encoding suitable for broadcasting, but has not supported encoding schemes of a lower code rate, that is, a higher compression ratio, than that of MPEG1.
With the increase in popularity of mobile terminals, the demand for such encoding schemes will increase in the future. To meet the demand, the MPEG4 encoding scheme was standardized. As to image encoding schemes, the ISO/IEC 14496-2 standard was approved as an international standard in December 1998.
Furthermore, a standard called H.264/AVC (MPEG-4 part 10, ISO/IEC 14496-10|ITU-T H.264) has also been standardized. The standard was developed by an organization named JVT (Joint Video Team), jointly established by ITU-T and ISO/IEC to promote standardization of video encoding.
It is known that H.264/AVC requires a larger amount of computation for its encoding and decoding than conventional encoding schemes such as MPEG2 and MPEG4, but makes a higher encoding efficiency feasible.
[H.264/AVC]
FIG. 1 is a block diagram illustrating an example configuration of an image information encoding apparatus that implements image compression based on an orthogonal transform, such as a discrete cosine transform or a Karhunen-Loève transform, and motion compensation. In FIG. 1, reference numeral 1 denotes an A/D conversion unit, 2 denotes a screen rearrangement buffer, 3 denotes an adder unit, 4 denotes an orthogonal transform unit, 5 denotes a quantization unit, 6 denotes a lossless encoding unit, 7 denotes an accumulation buffer, 8 denotes a dequantization unit, 9 denotes an inverse orthogonal transform unit, 10 denotes a frame memory, 11 denotes a motion prediction/compensation unit, and 12 denotes a rate control unit.
An image signal that is input is first converted into a digital signal by the A/D conversion unit 1.
Then, frames are rearranged by the screen rearrangement buffer 2 in accordance with the GOP (Group of Pictures) structure of image compression information that is output.
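The rearrangement performed by the screen rearrangement buffer 2 can be sketched as follows. This is a minimal illustration under an assumed rule typical of MPEG-style coding: each run of B pictures is encoded after the I or P picture that follows it in display order, because the B pictures refer to that picture. The GOP pattern and the function name are hypothetical and are not taken from the text above.

```python
def display_to_coding_order(picture_types):
    """Return display-order indices rearranged into coding order.

    picture_types is a string such as "IBBPBBP" given in display order.
    Assumed rule: B pictures are held back until the next I or P picture
    (their forward anchor) has been emitted.
    """
    order = []
    pending_b = []
    for index, picture_type in enumerate(picture_types):
        if picture_type == "B":
            pending_b.append(index)
        else:
            order.append(index)      # emit the anchor first
            order.extend(pending_b)  # then the B pictures that precede it
            pending_b = []
    order.extend(pending_b)          # trailing B pictures with no later anchor
    return order


if __name__ == "__main__":
    gop = "IBBPBBP"
    coding = display_to_coding_order(gop)
    print("display:", list(gop))
    print("coding: ", [gop[i] for i in coding])
```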
For an image to be subjected to intra-encoding, image information about the entire frame is input to the orthogonal transform unit 4, where an orthogonal transform such as a discrete cosine transform or a Karhunen-Loève transform is performed.
A transform coefficient that is the output of the orthogonal transform unit 4 is subjected to quantization processing by the quantization unit 5.
A quantized transform coefficient that is the output of the quantization unit 5 is input to the lossless encoding unit 6, where lossless coding such as variable length coding or arithmetic coding is performed. Thereafter, the losslessly coded transform coefficient is accumulated in the accumulation buffer 7, and is output as image compression information. The operation of the quantization unit 5 is controlled by the rate control unit 12.
Simultaneously, the quantized transform coefficient that is the output of the quantization unit 5 is input to the dequantization unit 8, and is then subjected to inverse orthogonal transform processing by the inverse orthogonal transform unit 9 to yield decoded image information. The decoded image information is accumulated in the frame memory 10.
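The transform, quantization, and local decoding path described above can be sketched as follows. This is a minimal illustration and not the apparatus's actual implementation: the 4×4 block size, the floating-point orthonormal DCT, the uniform quantizer with a single step size, the sample values, and all function names are assumptions made for the example. A larger step size stands in for the rate control unit 12 requesting a lower code rate.

```python
import math


def dct_matrix(n):
    """Orthonormal DCT-II basis as an n x n matrix (rows are basis vectors)."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                     for i in range(n)])
    return rows


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def forward_transform(block):       # role of orthogonal transform unit 4
    c = dct_matrix(len(block))
    return matmul(matmul(c, block), transpose(c))


def inverse_transform(coeffs):      # role of inverse orthogonal transform unit 9
    c = dct_matrix(len(coeffs))
    return matmul(matmul(transpose(c), coeffs), c)


def quantize(coeffs, step):         # role of quantization unit 5
    return [[int(round(v / step)) for v in row] for row in coeffs]


def dequantize(levels, step):       # role of dequantization unit 8
    return [[v * step for v in row] for row in levels]


# Hypothetical 4x4 block of sample values.
block = [[52, 55, 61, 66],
         [70, 61, 64, 73],
         [63, 59, 55, 90],
         [67, 61, 68, 104]]
step = 4
levels = quantize(forward_transform(block), step)          # sent to lossless coding
recon = inverse_transform(dequantize(levels, step))         # stored in frame memory
max_err = max(abs(recon[i][j] - block[i][j])
              for i in range(4) for j in range(4))

if __name__ == "__main__":
    print("quantized levels:", levels)
    print("largest reconstruction error:", round(max_err, 3))
```

The reconstructed block, not the original, is what would be stored for later reference, so that the encoder and the decoder predict from identical data.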
An image to be subjected to inter-encoding is first input from the screen rearrangement buffer 2 to the motion prediction/compensation unit 11.
Simultaneously, image information to be referred to is retrieved from the frame memory 10 and is subjected to motion prediction/compensation processing, whereby reference image information is generated.
The reference image information is sent to the adder unit 3, where it is converted into a difference signal between the image information to be encoded and the reference image information.
The motion prediction/compensation unit 11 simultaneously outputs motion vector information to the lossless encoding unit 6. The motion vector information is subjected to lossless coding processing such as variable length coding or arithmetic coding, and is inserted in the header portion of the image compression information. Other processing is similar to that for the image compression information to be subjected to intra-encoding.
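The inter-encoding path can be illustrated with a small block-matching sketch. It assumes a full search over integer displacements using the sum of absolute differences (SAD) as the matching cost; the text above does not specify the search method, so this choice, the block size, the synthetic test frames, and all names are assumptions made for illustration.

```python
def sad(cur, ref, bx, by, dx, dy, size):
    """Sum of absolute differences between a block of cur at (bx, by)
    and the block of ref displaced by (dx, dy)."""
    return sum(abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
               for y in range(size) for x in range(size))


def motion_search(cur, ref, bx, by, size, search_range):
    """Full search; returns (dx, dy, cost) of the best match inside ref."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            inside = (0 <= by + dy and by + dy + size <= height and
                      0 <= bx + dx and bx + dx + size <= width)
            if not inside:
                continue
            cost = sad(cur, ref, bx, by, dx, dy, size)
            if best is None or cost < best[2]:
                best = (dx, dy, cost)
    return best


def residual_block(cur, ref, bx, by, mv, size):
    """Difference signal formed by the adder: current minus prediction."""
    dx, dy = mv
    return [[cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x]
             for x in range(size)] for y in range(size)]


# Synthetic 8x8 reference frame and a current frame that is the reference
# shifted by 2 samples horizontally and 1 sample vertically.
ref = [[x * x + 3 * y * y + x * y for x in range(8)] for y in range(8)]
cur = [[ref[min(y + 1, 7)][min(x + 2, 7)] for x in range(8)] for y in range(8)]

dx, dy, cost = motion_search(cur, ref, bx=2, by=2, size=4, search_range=2)
residual = residual_block(cur, ref, 2, 2, (dx, dy), 4)

if __name__ == "__main__":
    print("motion vector:", (dx, dy), "cost:", cost)
```

In this sketch the motion vector would go to lossless coding and the residual to the orthogonal transform; when the match is exact, the residual is all zeros and costs very few bits.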
FIG. 2 is a block diagram illustrating an example configuration of an image information decoding apparatus. In FIG. 2, reference numeral 21 denotes an accumulation buffer, 22 denotes a lossless encoding/decoding unit, 23 denotes a dequantization unit, 24 denotes an inverse orthogonal transform unit, 25 denotes an adder unit, 26 denotes a screen rearrangement buffer, 27 denotes a D/A conversion unit, 28 denotes a frame memory, and 29 denotes a motion prediction/compensation unit.
Image compression information (bit stream) that is input is first stored in the accumulation buffer 21, and is thereafter transferred to the lossless encoding/decoding unit 22.
In the lossless encoding/decoding unit 22, processing such as variable length decoding or arithmetic decoding is performed in accordance with a determined image compression information format.
Simultaneously, if the frame is an inter-encoded frame, the lossless encoding/decoding unit 22 also decodes motion vector information stored in the header portion of the image compression information, and outputs the information to the motion prediction/compensation unit 29.
A quantized transform coefficient that is the output of the lossless encoding/decoding unit 22 is input to the dequantization unit 23, and is here output as a transform coefficient.
The transform coefficient is subjected to an inverse orthogonal transform such as an inverse discrete cosine transform or an inverse Karhunen-Loève transform by the inverse orthogonal transform unit 24 in accordance with a determined image compression information format.
In a case where the frame is an intra-encoded frame, image information subjected to inverse orthogonal transform processing is stored in the screen rearrangement buffer 26, and is output after D/A conversion processing.
In a case where the frame is an inter-encoded frame, a reference image is generated based on the motion vector information subjected to lossless decoding processing and the image information stored in the frame memory 28. The reference image and the output of the inverse orthogonal transform unit 24 are combined by the adder unit 25. Other processing is similar to that for the intra-encoded frame.
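The combination performed by the adder unit 25 can be sketched as follows. This is a simplified illustration: integer motion vectors, clipping of the result to an 8-bit sample range, and the function name are assumptions not stated above.

```python
def reconstruct_block(residual, ref_frame, bx, by, mv, intra):
    """Rebuild one block on the decoder side.

    residual  : output of the inverse orthogonal transform for the block
    ref_frame : previously decoded picture held in the frame memory
    mv        : (dx, dy) decoded motion vector, ignored for intra blocks
    intra     : True when no reference image is added
    """
    dx, dy = mv
    size = len(residual)
    out = []
    for y in range(size):
        row = []
        for x in range(size):
            pred = 0 if intra else ref_frame[by + dy + y][bx + dx + x]
            # Clip to the assumed 8-bit sample range.
            row.append(max(0, min(255, pred + residual[y][x])))
        out.append(row)
    return out


if __name__ == "__main__":
    frame_memory = [[10, 20, 30, 40],
                    [50, 60, 70, 80],
                    [90, 100, 110, 120],
                    [130, 140, 150, 160]]
    res = [[1, -2], [300, -200]]
    print(reconstruct_block(res, frame_memory, 0, 0, (1, 1), intra=False))
```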
The AVC standard developed by the JVT described previously is a hybrid coding scheme formed of motion compensation and a discrete cosine transform, like MPEG2 or MPEG4.
The discrete cosine transform here may be an integer transform that approximates a real-valued discrete cosine transform. The details differ, for example in that the transform uses integer coefficients with a 4×4 block size and in that the block size used for motion compensation is variable, but the basic scheme is similar to that of the encoding scheme implemented with the configuration in FIG. 1.
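The 4×4 integer transform can be illustrated with the forward core transform matrix commonly associated with H.264/AVC. The sketch below computes W = Cf·X·Cfᵀ using integer arithmetic only. The inverse shown here is an exact rational inverse added purely to demonstrate that the matrix rows are mutually orthogonal; it is an assumption of this example and is not the inverse procedure of the standard, which folds the scaling factors into quantization and dequantization.

```python
from fractions import Fraction

# Forward core transform matrix (integer approximation of a 4-point DCT).
CF = [[1, 1, 1, 1],
      [2, 1, -1, -2],
      [1, -1, -1, 1],
      [1, -2, 2, -1]]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def forward_core_transform(block):
    """W = CF * X * CF^T, computed entirely with integers."""
    return matmul(matmul(CF, block), transpose(CF))


def exact_inverse(coeffs):
    """Illustrative exact inverse (not the standard's inverse transform).

    Because the rows of CF are orthogonal with squared norms 4, 10, 4, 10,
    X = CF^T * (W[i][j] / (norm_i * norm_j)) * CF.
    """
    norms = [sum(v * v for v in row) for row in CF]
    scaled = [[Fraction(coeffs[i][j], norms[i] * norms[j]) for j in range(4)]
              for i in range(4)]
    return matmul(matmul(transpose(CF), scaled), CF)


if __name__ == "__main__":
    flat = [[10] * 4 for _ in range(4)]
    print(forward_core_transform(flat))  # only the DC coefficient is non-zero
```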
Meanwhile, in recent years, with the advancement of stereoscopic image capture and display technologies, studies on an extension of H.264/AVC to encoding of stereoscopic image signals have been advanced.
Standardization of MVC (Multiview Video Coding), which allows encoding of multi-viewpoint images captured using a plurality of image capture apparatuses, is in progress.
An image that is assumed to be captured and displayed from two viewpoints is called a stereo image. A naked-eye stereo display is capable of supporting multi-viewpoint display.
The following description mainly uses a two-viewpoint stereo image as an example, but it applies in a similar manner to multi-viewpoint images obtained from three or more viewpoints.
[MVC]
FIG. 3 is a diagram illustrating a multi-viewpoint encoding apparatus.
In a multi-viewpoint encoding apparatus 41, video signals supplied from two image capture apparatuses, that is, image capture apparatuses 31 and 32, are encoded, and bit streams generated by encoding are output. The bit streams composed of data of two-viewpoint images may be multiplexed into a single stream which is output, or may be output as two or more bit streams.
FIG. 4 is a block diagram illustrating an example configuration of the multi-viewpoint encoding apparatus 41 in FIG. 3.
In the multi-viewpoint encoding apparatus 41, a one-viewpoint image among multi-viewpoint images is encoded as a Base stream, and the other images are encoded as Dependent streams.
In the case of a stereo image, one image out of an L image (left-viewpoint image) and an R image (right-viewpoint image) is encoded as a Base stream, and the other image is encoded as a Dependent stream.
The Base stream is a bit stream similar to an existing AVC bit stream encoded using H.264 AVC/High Profile or the like. Therefore, the Base stream becomes a stream that can be decoded using an existing AVC decoder supporting H.264 AVC/High Profile.
Images to be encoded as a Base stream are input to a rearrangement buffer 51, and are rearranged in an order suitable for encoding as I pictures, P pictures, and B pictures. The rearranged images are output to a video encoding unit 52.
The video encoding unit 52 has a similar configuration to the image information encoding apparatus in FIG. 1. In the video encoding unit 52, for example, encoding is performed in compliance with H.264 AVC/High Profile, and a resulting bit stream is output to a multiplexing unit 57. In addition, a local decoded image is saved in a frame memory 53, and is used as a reference image for encoding the next picture or a picture in the Dependent stream.
In the meantime, images to be encoded as a Dependent stream are input to a rearrangement buffer 54, and are rearranged in an order suitable for encoding as I pictures, P pictures, and B pictures. The rearranged images are output to a dependent stream encoding unit 55.
In the dependent stream encoding unit 55, in addition to normal AVC encoding, encoding is performed using, as a reference image, a local decoded image in the Base stream stored in the frame memory 53, and a bit stream is output to the multiplexing unit 57. In addition, the local decoded image is saved in a frame memory 56, and is used as a reference image for encoding the next picture.
In the multiplexing unit 57, the Base stream and the Dependent stream are multiplexed into a single bit stream which is output. The Base stream and the Dependent stream may be output as separate bit streams.
FIG. 5 is a diagram illustrating an example of an MVC reference image.
A Base stream is encoded by performing only prediction in the time direction in a manner similar to that in normal AVC.
A Dependent stream is encoded by performing, in addition to prediction in the time direction within a same-viewpoint image, which is similar to that in normal AVC, prediction using an image in the Base stream that is obtained at the same time point as a reference image. Even in a case where prediction in the time direction cannot be suitably performed, the capability of referring to an other-viewpoint image obtained at the same time point can improve encoding efficiency.
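The benefit of having an other-viewpoint reference available can be sketched with a simple selection rule. The example assumes that the encoder compares co-located blocks using the sum of absolute differences and picks the cheaper reference; actual encoders would typically also perform motion or disparity search and weigh the rate cost, so this rule, the sample values, and the names are illustrative assumptions only.

```python
def block_sad(a, b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(p - q) for row_a, row_b in zip(a, b)
               for p, q in zip(row_a, row_b))


def choose_reference(current, temporal_ref, inter_view_ref):
    """Pick the reference that predicts the current block best.

    temporal_ref   : block from an earlier picture of the same viewpoint
    inter_view_ref : block from the Base stream picture at the same time
    Returns the chosen label and the cost of each candidate.
    """
    costs = {
        "temporal": block_sad(current, temporal_ref),
        "inter_view": block_sad(current, inter_view_ref),
    }
    best = min(costs, key=costs.get)  # ties resolve to "temporal"
    return best, costs


if __name__ == "__main__":
    # Hypothetical case: a scene change makes the temporal reference poor,
    # while the other viewpoint at the same instant is nearly identical.
    current = [[100, 102], [98, 101]]
    temporal = [[10, 12], [200, 50]]
    inter_view = [[101, 101], [99, 100]]
    print(choose_reference(current, temporal, inter_view))
```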
FIG. 6 is a block diagram illustrating the configuration of the video encoding unit 52 in FIG. 4 that generates a Base stream, and the frame memory 53.
The configuration illustrated in FIG. 6 is similar to the configuration of the image information encoding apparatus in FIG. 1, except for the point that an image saved in the frame memory 53 is referred to by the dependent stream encoding unit 55.
FIG. 7 is a block diagram illustrating the configuration of the dependent stream encoding unit 55 in FIG. 4 that generates a Dependent stream, and the frame memory 56.
The configuration illustrated in FIG. 7 is similar to the configuration of the image information encoding apparatus in FIG. 1, except for the point that an image saved in the frame memory 53 can be referred to. A reference image read from the frame memory 53 is input to a motion prediction/compensation unit 90, and is used for motion prediction and motion compensation.
FIG. 8 is a block diagram illustrating an example configuration of a multi-viewpoint decoding apparatus 101.
A Base stream supplied from the multi-viewpoint encoding apparatus 41 via a network or a recording medium is input to a buffer 111, and a Dependent stream is input to a buffer 114. In a case where a single multiplexed stream is supplied, the stream is separated into a Base stream and a Dependent stream which are input to the buffer 111 and the buffer 114, respectively.
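The separation of a single multiplexed stream into the two buffers can be sketched as follows. The packet representation used here, a pair consisting of a stream label and a payload, is purely hypothetical; the text above does not describe how the two streams are distinguished within the multiplexed stream.

```python
from collections import deque


def demultiplex(packets):
    """Split a multiplexed packet sequence into two FIFO buffers.

    packets: iterable of (stream_label, payload) pairs, where stream_label
    is "base" or "dependent" (a hypothetical tagging scheme).
    Returns (base_buffer, dependent_buffer), corresponding in role to the
    buffer 111 and the buffer 114.
    """
    base_buffer = deque()
    dependent_buffer = deque()
    for stream_label, payload in packets:
        if stream_label == "base":
            base_buffer.append(payload)
        elif stream_label == "dependent":
            dependent_buffer.append(payload)
        else:
            raise ValueError(f"unknown stream label: {stream_label!r}")
    return base_buffer, dependent_buffer


if __name__ == "__main__":
    multiplexed = [("base", b"B0"), ("dependent", b"D0"),
                   ("base", b"B1"), ("dependent", b"D1")]
    base, dependent = demultiplex(multiplexed)
    print(list(base), list(dependent))
```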
The Base stream which is delayed in the buffer 111 for a predetermined period of time is output to a video decoding unit 112.
In the video decoding unit 112, the Base stream is decoded in accordance with AVC, and a resulting decoded image is saved in a frame memory 113. The decoded image saved in the frame memory 113 is used as a reference image for decoding the next picture or a picture in the Dependent stream.
The decoded image obtained by the video decoding unit 112 is output as a video signal to a 3D display 102 at a predetermined timing.
In the meantime, the Dependent stream which is delayed in the buffer 114 for a predetermined period of time is output to a dependent stream decoding unit 115.
In the dependent stream decoding unit 115, the Dependent stream is decoded, and a resulting decoded image is saved in a frame memory 116. The decoded image saved in the frame memory 116 is used as a reference image for decoding the next picture.
In the dependent stream decoding unit 115, as appropriate, the image saved in the frame memory 113 is used as a reference image in accordance with information (such as a flag) in the bit stream.
The decoded image obtained by the dependent stream decoding unit 115 is output as a video signal to the 3D display 102 at a predetermined timing.
In the 3D display 102, a stereo image is displayed in accordance with the video signal supplied from the video decoding unit 112 and the video signal supplied from the dependent stream decoding unit 115.
FIG. 9 is a diagram illustrating the configuration of the video decoding unit 112 in FIG. 8 that decodes a Base stream, and the frame memory 113.
The configuration illustrated in FIG. 9 is similar to the configuration of the image information decoding apparatus in FIG. 2, except for the point that the image saved in the frame memory 113 is referred to by the dependent stream decoding unit 115.
FIG. 10 is a block diagram illustrating the configuration of the dependent stream decoding unit 115 in FIG. 8 that decodes a Dependent stream, and the frame memory 116.
The configuration illustrated in FIG. 10 is similar to the configuration of the image information decoding apparatus in FIG. 2, except for the point that the image saved in the frame memory 113 can be referred to. A reference image read from the frame memory 113 is input to a motion prediction/compensation unit 148, and is used for motion prediction and motion compensation.