In recent years, devices conforming to formats such as MPEG (Moving Picture Experts Group), which exploit redundancy characteristic of image information and perform compression by orthogonal transform such as the discrete cosine transform and by motion compensation, have come into widespread use both for information distribution at broadcasting stations and the like and for information reception in general homes, in order to transmit and store information with high efficiency when handling image information as digital signals.
That is to say, encoding devices and decoding devices are coming into widespread use which are used for processing performed when receiving image information (a bit stream) compressed by an encoding format using orthogonal transform such as the discrete cosine transform or Karhunen-Loève transform and motion compensation, such as MPEG or H.26x for example, via network media such as satellite broadcasting, cable TV, or the Internet, or when processing such image information on storage media such as optical discs, magnetic disks, flash memory, or the like.
For example, MPEG2 (ISO/IEC 13818-2) is defined as a general-purpose image encoding format, and is a standard which encompasses both interlace scanning images (interlace format images) and sequential scanning images (progressive format images), as well as standard resolution images and high-resolution images, and is currently widely used in a broad range of professional-use and consumer-use applications. Using the MPEG2 compression format enables realization of a high compression rate and good image quality by assigning a code amount (bit rate) of 4 to 8 Mbps for interlace scanning images of standard resolution having a horizontal×vertical size of 720×480 pixels, and 18 to 22 Mbps for interlace scanning images of high resolution having a horizontal×vertical size of 1920×1088 pixels.
MPEG2 was primarily intended for high image quality encoding suitable for broadcasting, but did not handle code amounts (bit rates) lower than that of MPEG1, i.e., encoding formats with even higher compression rates. It is thought that the spread of cellular phones will increase the demand for such encoding formats from now on, and accordingly the MPEG4 encoding format has been standardized. With regard to the image encoding format, the standard thereof was recognized as an international standard as ISO/IEC 14496-2 in December 1998.
Further, in recent years, progress has been made on standardization of a standard called H.264 (ITU-T Q6/16 VCEG), which was initially intended for image encoding for videoconferencing. H.264 is known to realize even higher encoding efficiency as compared with conventional encoding formats such as MPEG2 or MPEG4, though greater computation amounts are required for the encoding and decoding thereof. Also, standardization based on H.264, for realizing even higher encoding efficiency and including functions not supported by H.264, is currently being carried out as part of MPEG4 activities, as the Joint Model of Enhanced-Compression Video Coding.
With regard to the encoding format (JVT Codec) being standardized by the Joint Video Team, various improvements over the existing art such as MPEG2 or MPEG4 are being studied to improve encoding efficiency. For example, with the discrete cosine transform, transform to integer transform coefficients is performed on blocks of 4×4 pixels. Also, with motion compensation, block sizes are variable, and optimal motion compensation can be performed. Note, however, that the basic algorithms for encoding are the same as with the existing art such as MPEG2 or MPEG4.
Now, as for image contents to be subjected to encoding such as described above, there are stereoscopic image contents which can be viewed by stereoscopy, in addition to 2-dimensional images (2D images).
A dedicated device (hereinafter, stereoscopy device) is used for displaying stereoscopic images, an example of such a stereoscopy device being an IP (Integral Photography) stereoscopic image system developed by NHK (Japan Broadcasting Corporation).
Image data of a stereoscopic image is made up of image data from multiple viewpoints (image data of images shot from multiple viewpoints), and the greater the number of viewpoints and the wider the range over which the viewpoints are spread, the more a “television which can be looked into”, as it were, can be realized, where the subject can be seen from various directions.
Now, a method for encoding and decoding image data of stereoscopic images, i.e., image data of multiple viewpoints, is described in, for example, PTL 1.
Of stereoscopic images, that which has the fewest number of viewpoints is a 3D (Dimensional) image having two viewpoints (a stereo image), with the image data of the 3D image being made up of image data of a left eye image, which is an image observed with the left eye (hereinafter also referred to as an L (Left) image), and a right eye image, which is an image observed with the right eye (hereinafter also referred to as an R (Right) image).
As described above, a 3D image (stereo image) is made up of an L image and an R image, so two screens' worth of image data, for the L image and R image (two screens' worth in the case of displaying a 2D image), is necessary to display one screen of a 3D image.
However, depending on the transmission band of a transmission path for transmitting the 3D image, the storage capacity of the recording medium for recording the 3D image, transfer rate restrictions to the recording medium, and so forth, there are cases where it is difficult to transmit (or record to the recording medium) two screens' worth of image data for displaying one screen of the 3D image.
Accordingly, an encoding device has been proposed which converts the image data for displaying one screen of a 3D image into one screen's worth of image data by performing sub-sampling (thinning out), in the spatial direction, of each of the L image and R image making up the 3D image, following which the image data is encoded.
FIG. 1 is a diagram for describing methods for thinning out (pixels of) the L image and R image making up the 3D image.
A in FIG. 1 is a diagram illustrating the L image and R image.
The L image and R image each correspond to one screen's worth of a 2D image (2-dimensional image).
B in FIG. 1 illustrates an image where the spatial resolution in the horizontal direction is made to be ½ that of the original by thinning out the pixels of each of the L image and R image every other line in the vertical direction.
Note that for thinning out every other line in the vertical direction, either odd-numbered or even-numbered pixels from the left of the L image and R image may be thinned out, or an arrangement may be made where, of the L image and R image, one of odd-numbered and even-numbered pixels is thinned out for the L image and the other is thinned out for the R image.
C in FIG. 1 illustrates an image where the spatial resolution in the vertical direction is made to be ½ that of the original by thinning out the pixels of each of the L image and R image every other line in the horizontal direction.
Note that for thinning out every other line in the horizontal direction, either odd or even-numbered pixels from the top of the L image and R image may be thinned out, or an arrangement may be made where, of the L image and R image, one of odd-numbered and even-numbered pixels is thinned out for the L image and the other is thinned out for the R image.
D in FIG. 1 illustrates an image where the spatial resolution in the oblique direction is made to be ½ that of the original by thinning out the pixels of each of the L image and R image every other line in an oblique direction (either the oblique direction toward the upper left or the oblique direction toward the upper right).
The L image and R image after thinning out in D in FIG. 1 are images where pixels are arrayed in checkerboard fashion, due to the thinning out of pixels in the oblique direction.
With thinning out of pixels in the oblique direction, the pixels thinned out from one of the L image and R image may be pixels at the same positions as those thinned out from the other image, or may be pixels at positions other than those thinned out from the other image (i.e., at the positions where pixels remain in the other image following thinning out).
The number of pixels of the L image and R image following thinning out is ½ of the original with any of the thinning of B in FIG. 1 through D in FIG. 1, and consequently the overall data amount (number of pixels) of the L image and R image following thinning out is equal to the data amount of one screen's worth of image data of a 2D image.
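The three thinning patterns of B through D in FIG. 1 can be sketched as follows. This is a minimal illustration in Python; representing an image as a list of rows, and the convention of keeping even-numbered positions (the patterns allow either parity), are assumptions for illustration only:

```python
def thin_horizontal(img):
    """B in FIG. 1: halve horizontal resolution by keeping
    even-numbered columns (every other vertical line)."""
    return [row[0::2] for row in img]

def thin_vertical(img):
    """C in FIG. 1: halve vertical resolution by keeping
    even-numbered rows (every other horizontal line)."""
    return img[0::2]

def thin_oblique(img):
    """D in FIG. 1: keep the pixel at (x, y) only where x and y have
    the same parity, leaving pixels arrayed in checkerboard fashion;
    removed positions are marked None here."""
    return [[p if (x + y) % 2 == 0 else None for x, p in enumerate(row)]
            for y, row in enumerate(img)]
```

With each of the three patterns, the number of remaining pixels is exactly half the original, matching the statement above.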
Note that when thinning out pixels, filtering is necessary to cut out high-band components in order to prevent aliasing from occurring due to the thinning out, and blurring occurs in the L image and R image following thinning out due to this filtering.
Human sight tends to be less sensitive in the oblique direction than in the horizontal direction or vertical direction, so by thinning out pixels in the oblique direction, visually apparent blurring can be reduced.
FIG. 2 is a block diagram illustrating the configuration of an example of a conventional encoding device which thins out the pixels of each of the L image and R image every other line in the oblique direction as described with D in FIG. 1, and encodes the thinned out L image and thinned out R image, with pixels arrayed in checkerboard fashion, that are obtained as a result thereof.
With the encoding device in FIG. 2, (image data of) a 3D image which is a moving image, for example, is supplied to a filter unit 11, in increments of single screens.
That is to say, an L image and R image making up one screen of a 3D image are supplied to the filter unit 11.
The filter unit 11 performs filtering to cut out high-band components (of the oblique direction spatial frequencies) of the L image and R image to prevent aliasing from occurring in the thinned out L image and thinned out R image obtained by thinning out the L image and R image.
That is to say, the filter unit 11 is configured of filters 11L and 11R which are low-pass filters.
The filter 11L performs filtering of the L image supplied to the filter unit 11, and supplies the filtered L image to a thinning out unit 12. The filter 11R performs filtering of the R image supplied to the filter unit 11, and supplies the filtered R image to the thinning out unit 12.
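The actual coefficients of the filters 11L and 11R are not specified in the text; as a hedged sketch, a simple 3×3 box average stands in below for any low-pass filter that attenuates the high-band spatial frequencies before thinning out:

```python
def lowpass_3x3(img):
    """Illustrative low-pass filter: 3x3 box average with edge clamping.
    The real filters 11L/11R would use purpose-designed coefficients;
    any kernel attenuating high spatial frequencies serves the same
    anti-aliasing role described in the text."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    # Clamp coordinates at the image border.
                    yy = min(max(y + dy, 0), h - 1)
                    xx = min(max(x + dx, 0), w - 1)
                    acc += img[yy][xx]
            out[y][x] = acc / 9.0
    return out
```

The blurring noted above is inherent to this step: the filter removes exactly the high-band detail that would otherwise alias after thinning out.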
The thinning out unit 12 performs thinning out of the pixels of the L image supplied from the filter unit 11 every other line in the oblique direction as described with D in FIG. 1, whereby the L image from the filter unit 11 is converted into a thinned out L image with pixels arrayed in checkerboard fashion.
Further, the thinning out unit 12 performs thinning out of the pixels of the R image supplied from the filter unit 11 in the oblique direction in the same way, whereby the R image from the filter unit 11 is converted into a thinned out R image with pixels arrayed in checkerboard fashion.
That is to say, the thinning out unit 12 is configured of thinning out units 12L and 12R.
The thinning out unit 12L performs thinning out of the pixels of the L image supplied from the filter unit 11 every other line in the oblique direction as described with D in FIG. 1, and supplies a thinned out L image with the pixels arrayed in checkerboard fashion (checkerboard pattern) to a combining unit 13.
The thinning out unit 12R performs thinning out of the pixels of the R image supplied from the filter unit 11 every other line in the oblique direction as described with D in FIG. 1, and supplies a thinned out R image with the pixels arrayed in checkerboard fashion to the combining unit 13.
We will say that the thinning out unit 12R thins out from the R image the pixels other than those which the thinning out unit 12L thins out from the L image.
Accordingly, the thinned out L image (or thinned out R image) is an image having pixels at the positions where there are no pixels in the thinned out R image (or thinned out L image).
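The complementary thinning performed by the thinning out units 12L and 12R can be sketched as two disjoint sets of kept positions. The parity assignment below (L keeps positions where x and y have the same parity) is one of the two possible choices and is an assumption for illustration:

```python
def complementary_masks(w, h):
    """Positions kept in the thinned out L image (x and y of equal
    parity) and in the thinned out R image (opposite parity), for a
    w x h screen. The assignment could equally be swapped."""
    l_mask = {(x, y) for y in range(h) for x in range(w)
              if (x + y) % 2 == 0}
    r_mask = {(x, y) for y in range(h) for x in range(w)
              if (x + y) % 2 == 1}
    return l_mask, r_mask
```

The two masks are disjoint and together cover every position of the screen, which is what makes combining into one screen's worth of data possible.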
The combining unit 13 combines the thinned out L image and the thinned out R image supplied from the thinning out unit 12, generates a combined image equal to the data amount of image data of one screen worth of a 2D image, and supplies this to an encoder 14.
The encoder 14 encodes the combined image supplied from the combining unit 13 with the MPEG2 format or H.264/AVC format or the like for example, and outputs the encoded data obtained as a result thereof. The encoded data which the encoder 14 outputs is transmitted via a transmission medium, or recorded in a recording medium.
FIG. 3 is a diagram for describing combining of the thinned out L image and thinned out R image at the combining unit 13 in FIG. 2.
A in FIG. 3 is a diagram illustrating the thinned out L image and thinned out R image to be combined at the combining unit 13.
The thinned out L image and thinned out R image are such that the pixels (remaining after thinning out) are arrayed in checkerboard fashion.
That is to say, we will express (the pixel value of) the pixel x'th from the left and y'th from the top making up the thinned out L image as Lx, y, and the pixel x'th from the left and y'th from the top making up the thinned out R image as Rx, y.
Also, we will say that, with C being the remainder of dividing A by B, this is expressed by the expression mod(A, B)=C.
The thinned out L image is an image where the pixel Lx, y is situated at a position (x, y) satisfying the expression mod(x, 2)=mod(y, 2)=0 and a position (x, y) satisfying the expression mod(x, 2)=mod(y, 2)=1 (or at a position (x, y) satisfying the expression mod(x, 2)=1 and the expression mod(y, 2)=0, and at a position (x, y) satisfying the expression mod(x, 2)=0 and the expression mod(y, 2)=1).
Also, the thinned out R image is an image where the pixel Rx, y is situated at a position (x, y) satisfying the expression mod(x, 2)=1 and the expression mod(y, 2)=0, and at a position (x, y) satisfying the expression mod(x, 2)=0 and the expression mod(y, 2)=1 (or at a position (x, y) satisfying the expression mod(x, 2)=mod(y, 2)=0 and a position (x, y) satisfying the expression mod(x, 2)=mod(y, 2)=1).
B in FIG. 3 illustrates a combined image obtained by combining the thinned out L image and thinned out R image at the combining unit 13 shown in FIG. 2.
The combining unit 13 generates a combined image where the pixels Lx, y of the thinned out L image and the pixels Rx, y of the thinned out R image are arrayed in checkerboard fashion, by fitting, as it were, the pixels Rx, y of the thinned out R image into the thinned out L image at the positions where the pixels Lx, y of the thinned out L image are not arrayed, for example.
That is to say, the combining unit 13 situates the pixels Lx, y of the thinned out L image at a position (x, y) satisfying the expression mod(x, 2)=mod(y, 2)=0 and a position (x, y) satisfying the expression mod(x, 2)=mod(y, 2)=1, and also situates the pixels Rx, y of the thinned out R image at a position (x, y) satisfying the expression mod(x, 2)=1 and the expression mod(y, 2)=0 and a position (x, y) satisfying the expression mod(x, 2)=0 and the expression mod(y, 2)=1, thereby generating a combined image equal to the data amount of one screen's worth of image data of a 2D image.
Accordingly, if we express (the pixel value of) the pixel at position (x, y) in the combined image as Cx, y, Cx, y is equal to the pixel Lx, y of the thinned out L image at a position (x, y) satisfying the expression mod(x, 2)=mod(y, 2)=0 and a position (x, y) satisfying the expression mod(x, 2)=mod(y, 2)=1 (Cx, y=Lx, y).
Also, Cx, y is equal to the pixel Rx, y of the thinned out R image at a position (x, y) satisfying the expression mod(x, 2)=1 and the expression mod(y, 2)=0 and a position (x, y) satisfying the expression mod(x, 2)=0 and the expression mod(y, 2)=1 (Cx, y=Rx, y).
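The combining rule above can be sketched directly: the condition that x and y have equal remainders modulo 2 is equivalent to x + y being even. The representation of the thinned out images as full-size grids with None at removed positions is an assumption for illustration:

```python
def combine(thinned_l, thinned_r):
    """Form the combined image C of B in FIG. 3: C takes the L pixel
    where mod(x, 2) == mod(y, 2) (i.e., x + y even) and the R pixel
    otherwise, yielding a checkerboard of L and R pixels."""
    h, w = len(thinned_l), len(thinned_l[0])
    return [[thinned_l[y][x] if (x + y) % 2 == 0 else thinned_r[y][x]
             for x in range(w)] for y in range(h)]
```

Because the thinned out L image and thinned out R image hold pixels at complementary positions, every position of the combined image receives exactly one pixel, so the result occupies one screen's worth of 2D image data.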
FIG. 4 is a block diagram illustrating the configuration of an example of a conventional decoding device which decodes encoded data output from the encoding device shown in FIG. 2.
With the decoding device in FIG. 4, a decoder 21 is supplied with encoded data which the encoding device outputs.
The decoder 21 performs decoding with a format corresponding to the format with which the encoder 14 in FIG. 2 performs encoding.
That is to say, the decoder 21 decodes the encoded data supplied thereto with the MPEG2 format or H.264/AVC format for example, and supplies the combined image obtained as a result thereof to a 3D display device 22.
The 3D display device 22 is a stereoscopy device capable of 3D display (displaying as a 3D image) of the combined image where the pixels Lx, y of the thinned out L image and the pixels Rx, y of the thinned out R image shown in B in FIG. 3 are arrayed in checkerboard fashion, and displays a 3D image by, for example, displaying the L image and R image in accordance with the combined image from the decoder 21.