1. Field of the Invention
This invention relates to a method, an apparatus, and a computer program for encoding signals representative of multi-view video taken from multiple viewpoints. In addition, this invention relates to a method, an apparatus, and a computer program for decoding coded data representative of multi-view video taken from multiple viewpoints.
2. Description of the Related Art
An MPEG (Moving Picture Experts Group) encoder compressively encodes a digital signal (data) representing a video sequence. The MPEG encoder performs motion-compensated prediction and an orthogonal transform on the video signal to implement highly efficient encoding and data compression. The motion-compensated prediction utilizes a temporal redundancy in the video signal for the data compression. The orthogonal transform utilizes a spatial redundancy in the video signal for the data compression. Specifically, the orthogonal transform is a discrete cosine transform (DCT).
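The way an orthogonal transform exploits spatial redundancy can be illustrated with a minimal sketch. It assumes an orthonormal one-dimensional DCT-II applied to a single row of eight smoothly varying samples. The function name dct_ii and the sample values are hypothetical, and an actual encoder applies a two-dimensional transform to blocks of prediction residuals rather than to raw samples as done here.

```python
import math

def dct_ii(samples):
    """Orthonormal 1-D DCT-II of a list of numbers (illustrative only)."""
    n = len(samples)
    out = []
    for k in range(n):
        # Scaling that makes the transform orthonormal (energy preserving).
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        total = sum(x * math.cos(math.pi * (i + 0.5) * k / n)
                    for i, x in enumerate(samples))
        out.append(scale * total)
    return out

# Hypothetical row of neighbouring pixels with a gentle brightness ramp.
row = [100, 102, 104, 106, 108, 110, 112, 114]
coeffs = dct_ii(row)

# Share of the total signal energy carried by the first (DC) coefficient.
energy = sum(c * c for c in coeffs)
dc_share = coeffs[0] ** 2 / energy
```

For this spatially redundant row, nearly all of the energy falls into the first coefficient, so the remaining coefficients can be represented with few bits.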
MPEG-2 Video (ISO/IEC 13818-2) established in 1995 prescribes the coding of a video sequence. MPEG-2 Video encoders and decoders can handle interlaced scanning pictures, progressive scanning pictures, SDTV (standard definition television) pictures, and HDTV (high definition television) pictures. The MPEG-2 Video encoders and decoders are used in various applications such as the recording and playback of data on and from a DVD or a D-VHS recording medium, and digital broadcasts.
MPEG-4 Visual (ISO/IEC 14496-2) established in 1998 prescribes the highly efficient coding of a video signal in applications such as network-based data transmission and portable terminal devices.
The standard called MPEG-4 AVC/H.264 (ISO/IEC 14496-10, ITU-T H.264) was established jointly by ISO/IEC and ITU-T in 2003. MPEG-4 AVC/H.264 provides a higher coding efficiency than MPEG-2 Video or MPEG-4 Visual.
According to MPEG-2 Video or MPEG-4 Visual, the coding of a P picture of interest includes motion-compensated prediction from only the P picture or I picture immediately preceding the P picture of interest in the picture display order. In contrast, in picture coding based on MPEG-4 AVC/H.264, a plurality of pictures can be used as reference pictures for the coding of a picture of interest. One of the reference pictures can be selected on a block-by-block basis, and motion-compensated prediction for the coding of the picture of interest can be performed using the selected picture. Furthermore, in addition to pictures preceding the picture of interest in the picture display order, pictures following the picture of interest can be used as reference pictures.
According to MPEG-2 Video or MPEG-4 Visual, the coding of a B picture of interest refers to one reference picture preceding the B picture of interest in the picture display order, one reference picture following it, or both the preceding and following reference pictures simultaneously. In the last case, the mean of the preceding and following reference pictures is used as a predicted picture, and the difference between the B picture of interest and the predicted picture is coded. In contrast, in picture coding based on MPEG-4 AVC/H.264, reference pictures can be selected arbitrarily for prediction without being limited to one preceding reference picture and one following reference picture. Furthermore, a B picture can be used as a reference picture.
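The bidirectional case described for MPEG-2 Video and MPEG-4 Visual can be sketched as follows. This is a simplified illustration, not the normative procedure: blocks are flat lists of sample values, motion compensation is assumed to have been applied already, and the round-half-up integer average is an assumption of this sketch. The function name and sample values are hypothetical.

```python
def bidirectional_residual(prev_block, next_block, current_block):
    """Predict a block as the mean of two references and return the residual."""
    # Mean of the preceding and following reference samples
    # (integer average with round-half-up, assumed for this sketch).
    predicted = [(p + n + 1) // 2 for p, n in zip(prev_block, next_block)]
    # Only the difference between the actual block and the prediction is coded.
    residual = [c - q for c, q in zip(current_block, predicted)]
    return predicted, residual

# Hypothetical co-located samples from the two references and the B picture.
prev_block = [100, 110, 120, 130]
next_block = [104, 114, 124, 134]
current_block = [102, 113, 121, 132]

predicted, residual = bidirectional_residual(prev_block, next_block, current_block)
```

When the B picture lies between its two references in content as well as in time, the residual values stay close to zero and are cheap to code.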
According to the MPEG-2 Video, a coding mode is decided on a picture-by-picture basis. According to the MPEG-4 Visual, a coding mode is decided on a VOP-by-VOP basis, where VOP denotes a video object plane. In the picture coding based on the MPEG-4 AVC/H.264, slices are coding units. One picture can be composed of different-type slices such as I slices, P slices, and B slices.
The MPEG-4 AVC/H.264 defines a NAL (network abstraction layer) and a VCL (video coding layer) for encoding and decoding video pixel signals inclusive of a coding mode, motion vectors, DCT coefficients, and others.
A coded bitstream generated in conformity with MPEG-4 AVC/H.264 is composed of NAL units. NAL units are classified into VCL NAL units and non-VCL NAL units. Every VCL NAL unit contains data (a coding mode, motion vectors, DCT coefficients, and others) resulting from the coding by the VCL. Non-VCL NAL units do not contain such data. Non-VCL NAL units include an SPS (sequence parameter set), a PPS (picture parameter set), and SEI (supplemental enhancement information). The SPS contains parameter information about the coding of the whole of the original video sequence. The PPS contains parameter information about the coding of a picture. The SEI is not essential to the decoding of VCL-coded data.
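The distinction between VCL and non-VCL NAL units can be illustrated with a short sketch that reads the first byte of a NAL unit. It relies on the nal_unit_type field occupying the low five bits of that byte, with values 1 to 5 denoting coded slice data, 6 denoting SEI, 7 denoting an SPS, and 8 denoting a PPS, as in the H.264 specification as originally published. The function name and the returned labels are hypothetical.

```python
# nal_unit_type values carrying VCL-coded slice data (1 to 5).
VCL_TYPES = range(1, 6)
# Non-VCL types named in the text above.
NON_VCL_NAMES = {6: "SEI", 7: "SPS", 8: "PPS"}

def classify_nal_header(header_byte):
    """Classify a NAL unit from its one-byte header (illustrative only)."""
    # Low five bits hold nal_unit_type; the upper bits are
    # forbidden_zero_bit (1 bit) and nal_ref_idc (2 bits).
    nal_unit_type = header_byte & 0x1F
    if nal_unit_type in VCL_TYPES:
        return "VCL"
    return NON_VCL_NAMES.get(nal_unit_type, "other")

# Header bytes commonly seen at the start of an H.264 stream.
labels = [classify_nal_header(b) for b in (0x67, 0x68, 0x06, 0x65, 0x41)]
```

A decoder can use such a classification to collect the parameter sets first and to treat SEI as optional.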
In the picture coding based on the MPEG-4 AVC/H.264, every picture is divided into slices, and coding units are such slices. VCL NAL units are assigned to slices, respectively. Access units each composed of some NAL units are introduced in order to handle information represented by the coded bitstream on a picture-by-picture basis. One access unit has one coded picture.
In a binocular stereoscopic television system, two cameras take pictures of a scene for a viewer's left and right eyes (left and right views) from two different directions respectively, and the pictures are displayed on a common screen to present stereoscopic pictures to the viewer. Generally, the left-view picture and the right-view picture are handled as independent pictures respectively. Accordingly, the transmission of a signal representing the left-view picture and the transmission of a signal representing the right-view picture are separate from each other. Similarly, the recording of a signal representing the left-view picture and the recording of a signal representing the right-view picture are separate from each other. When the left-view picture and the right-view picture are handled as independent pictures respectively, the necessary total amount of coded picture information is about twice that of information representing only a monoscopic picture (a single two-dimensional picture).
A stereoscopic television system designed to reduce the total amount of coded picture information has been proposed. In the proposed stereoscopic television system, one of the left-view and right-view pictures is designated as a base picture while the other is designated as a sub picture.
Japanese patent application publication number 61-144191/1986 discloses a transmission system for stereoscopic pictures. In the system of Japanese application 61-144191/1986, each of left-view and right-view pictures is divided into equal-size small areas called blocks. One of the left-view and right-view pictures is referred to as the first picture while the other is called the second picture. A window equal in shape and size to one block is defined in the first picture. For every block of the second picture, the difference between a signal representing a first-picture portion filling the window and a signal representing the present block of the second picture is calculated as the window is moved throughout a given range centered at the first-picture block corresponding to the present block of the second picture. Detection is made as to the position of the window at which the calculated difference is minimized. The deviation of the detected window position from the position in the first-picture block corresponding to the present block of the second picture is labeled as a position change quantity.
In the system of Japanese application 61-144191/1986, the blocks constituting one of the left-view and right-view pictures are shifted in accordance with the position change quantities. A difference signal is generated which represents the difference between the block-shift-resultant picture and the other picture. The difference signal, information representing the position change quantities, and information representing one of the left-view and right-view pictures are transmitted.
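The window search described above can be sketched as an exhaustive block-matching loop. The publication is described here only as calculating "the difference" between signals, so the sum of absolute differences (SAD) used below is an assumption of this sketch, as are the function names, the picture size, and the sample values.

```python
def sad(first, second, fy, fx, sy, sx, size):
    """Sum of absolute differences between a window and a block."""
    return sum(abs(first[fy + r][fx + c] - second[sy + r][sx + c])
               for r in range(size) for c in range(size))

def position_change(first, second, by, bx, size, search):
    """Return (cost, dy, dx) for the window position with the smallest cost."""
    height, width = len(first), len(first[0])
    best = None
    # Move the window through a range centered at the co-located block.
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            wy, wx = by + dy, bx + dx
            if 0 <= wy <= height - size and 0 <= wx <= width - size:
                cost = sad(first, second, wy, wx, by, bx, size)
                if best is None or cost < best[0]:
                    best = (cost, dy, dx)
    return best

# Hypothetical 8x8 first picture; the second picture is the first one
# displaced horizontally by two samples (a pure disparity shift).
first = [[10 * y * y + x * x for x in range(8)] for y in range(8)]
second = [[first[y][x + 2] if x + 2 < 8 else 0 for x in range(8)]
          for y in range(8)]

cost, dy, dx = position_change(first, second, by=2, bx=2, size=2, search=2)
```

Here the search recovers the two-sample horizontal displacement with zero remaining difference. In the described system, such offsets are the position change quantities that are transmitted together with the difference signal.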
Stereoscopic video coding called Multi-view Profile (ISO/IEC 13818-2/AMD3) was added to MPEG-2 Video (ISO/IEC 13818-2) in 1996. The MPEG-2 Video Multi-view Profile is 2-layer coding. A base layer of the Multi-view Profile is assigned to a left view, and an enhancement layer is assigned to a right view. The MPEG-2 Video Multi-view Profile implements the coding of stereoscopic video data by steps including motion-compensated prediction, discrete cosine transform, and disparity-compensated prediction. The motion-compensated prediction utilizes a temporal redundancy in the stereoscopic video data for the data compression. The discrete cosine transform utilizes a spatial redundancy in the stereoscopic video data for the data compression. The disparity-compensated prediction utilizes an inter-view redundancy in the stereoscopic video data for the data compression.
Japanese patent application publication number 6-98312/1994 discloses a system for highly efficiently coding multi-view stereoscopic pictures. The system handles a picture to be coded and two or more reference pictures. The reference pictures are selected from temporally-different pictures (different-frame pictures) of plural channels. Alternatively, the to-be-coded picture and the reference pictures may be pictures that differ in parallax (disparity) and are taken by cameras at slightly different positions respectively. The system includes sections for performing pattern matching between the to-be-coded picture and the reference pictures. Each of the pattern matching sections calculates the error between the to-be-coded picture and the related reference picture, and generates motion-compensation or disparity-compensation vectors.
The system in Japanese application 6-98312/1994 further includes a detector, first and second selectors, and a compensator. The detector senses the smallest among the errors calculated by the pattern matching sections, and generates a selection flag for identifying the reference picture corresponding to the sensed smallest error. The first selector chooses one from the reference pictures in accordance with the selection flag generated by the detector. The second selector responds to the selection flag and chooses, among the compensation vectors generated by the pattern matching sections, ones corresponding to the chosen reference picture. The compensator subjects the chosen reference picture to motion compensation or disparity compensation responsive to the chosen compensation vectors, thereby generating a predicted picture. In the system, a subtracter computes the residual between the to-be-coded picture and the predicted picture, and a DCT-based device encodes the computed residual, the selection flag, and the chosen compensation vectors into a bitstream of a variable-length code.
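The roles of the detector and the two selectors can be sketched as a selection over per-reference matching errors. This is a minimal illustration under stated assumptions: each pattern matching section is assumed to have already produced one scalar error and one list of compensation vectors, and the function name and numeric values are hypothetical.

```python
def select_reference(errors, vectors, references):
    """Pick the reference picture and vectors with the smallest matching error."""
    # Detector: the index of the smallest error serves as the selection flag.
    selection_flag = min(range(len(errors)), key=errors.__getitem__)
    # First selector: the reference picture identified by the flag.
    chosen_reference = references[selection_flag]
    # Second selector: the compensation vectors belonging to that reference.
    chosen_vectors = vectors[selection_flag]
    return selection_flag, chosen_reference, chosen_vectors

# Hypothetical outputs of three pattern matching sections.
errors = [412, 97, 230]
vectors = [[(1, 0)], [(-3, 0)], [(0, 2)]]
references = ["previous frame", "other view", "next frame"]

flag, reference, chosen = select_reference(errors, vectors, references)
```

The selection flag and the chosen vectors are the side information that the described system encodes into the bitstream together with the residual.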
A conceivable system includes a plurality of processors for decoding a coded bitstream representative of multi-view pictures into decoded video signals on a parallel processing basis, and a three-dimensional display for displaying the decoded video signals. The decoded video signals correspond to different viewpoints, respectively. Generally, disparity-compensated prediction (parallax-compensated prediction) implemented for the decoding of a coded picture of a viewpoint refers to a reference picture which is a decoded picture of another viewpoint. In the parallel-processing-based decoding in the conceivable system, it is difficult to know, at the time of decoding a coded picture of a viewpoint, whether or not the decoding of a coded picture of another viewpoint to obtain a related reference picture has been completed. Therefore, the conceivable system cannot utilize disparity-compensated prediction.