A three-dimensional video (3D video) sequence comprises a multi-view (usually at least 2 views) video sequence (corresponding to texture information) and corresponding depth sequences (corresponding to depth information); this representation is usually known as the MVD (multi-view video plus depth) format. View synthesis technology can be used to generate one or more synthesized video sequences from a three-dimensional video sequence. A traditional binocular stereoscopic video is composed of the video sequences of two fixed views (i.e. a left view and a right view), and is also called a stereo-pair. A stereo-pair obtained by a binocular camera may have the problem that the parallax between the two views is too large. Viewing such a stereo-pair may cause relatively severe visual fatigue, or such a stereo-pair may be unsuitable for binocular stereoscopic viewing. By introducing a synthesized video sequence of a virtual view, a video sequence in the stereo-pair and the synthesized video sequence may be used to constitute a pair which is more suitable for binocular stereoscopic viewing; this pair is also referred to here as a stereo-pair.
The virtual view video sequence is generated by view synthesis technology, which mainly uses depth-image-based rendering (DIBR for short): samples of a camera view image are projected onto a virtual view by means of their corresponding depth values and the corresponding camera parameters, thereby generating a projected image; then, after processing such as hole filling, filtering and resampling, the virtual view video sequence finally used for display is generated. Projection in view synthesis may use full-pixel precision, i.e. the resolutions of the projected image and the camera view image are the same, or it may use sub-pixel precision of 1/N (N being a positive integer), i.e. the horizontal resolution of the projected image is N times that of the camera view image, where N is usually a multiple of 2. Generally, sub-pixel projection precision yields a virtual view video sequence of better quality than full-pixel projection precision, but at a higher computational complexity.
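The projection step described above can be sketched for a single image row. This is a minimal illustration, not the method of any particular renderer. It assumes a parallel camera arrangement, so that each sample moves only horizontally by a disparity of focal length times baseline divided by depth. The function name project_row, its parameters, and the use of metric depth values (rather than quantized depth samples) are all assumptions made for the example.

```python
import math


def project_row(texture_row, depth_row, focal_length, baseline, n=1):
    """Forward-project one row onto a virtual view at 1/n sample precision.

    Assumption: parallel cameras, so a sample at depth z moves horizontally
    by disparity = focal_length * baseline / z (hypothetical sign convention:
    a positive baseline moves samples to the left).
    The returned row is n times as wide as the input; None marks a hole
    that a later hole-filling step would have to fill.
    """
    width = len(texture_row) * n
    projected = [None] * width
    nearest = [math.inf] * width  # depth of the sample currently stored
    for x, (sample, z) in enumerate(zip(texture_row, depth_row)):
        disparity = focal_length * baseline / z
        # Round to the nearest position on the 1/n-sample grid.
        target = math.floor((x - disparity) * n + 0.5)
        if 0 <= target < width and z < nearest[target]:
            projected[target] = sample  # nearer samples occlude farther ones
            nearest[target] = z
    return projected
```

With n=1 the projected row has the same width as the camera row; with n=2 a half-sample disparity lands exactly on a grid position instead of being rounded away, which is the quality benefit of sub-pixel precision that the text mentions.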
The three-dimensional video sequence contains several access units, one access unit comprising the pictures (corresponding to texture information) and the corresponding depths (corresponding to depth information) of a plurality of (two or more) views at a certain moment. The three-dimensional video sequence is encoded to form a three-dimensional video bitstream, a bitstream being composed of a plurality of binary bits. The encoding method may be one based on a video coding standard such as MPEG-2, H.264/AVC, HEVC, AVS or VC-1. It should be noted that a video coding standard specifies the syntax of bitstreams and a decoding method for a bitstream conforming to the standard, but does not specify an encoding method for generating the bitstream. However, the applied encoding method must match the decoding method specified by the standard and produce a bitstream conforming to the standard, so that a decoder can decode the bitstream correctly; otherwise the decoding process may fail with errors. The picture obtained by decoding is called a decoded picture or a reconstructed picture.
The video coding standard also specifies syntax units which are not mandatory for decoding a video picture but which may indicate how to render the decoded pictures, how to assist video decoding, etc. Examples of such syntax units are the supplemental enhancement information (SEI, see H.264/AVC standard document) in the H.264/AVC standard, the video usability information (VUI, see H.264/AVC standard document), and the MVC VUI parameters extension (see H.264/AVC standard document) in the subset sequence parameter set (subset SPS, see H.264/AVC standard document). Syntax units with similar functions in other video coding standards are also included. One syntax unit is a segment of bitstream containing a group of syntax elements, and these syntax elements are arranged in the order specified in a video coding standard. The code words corresponding to the syntax elements are connected to form a bitstream.
In order to support virtual view synthesis, the three-dimensional video sequence bitstream may also comprise camera parameter information about the various views, e.g. the focal length, the coordinate location of a view, etc. Generally, the views of the three-dimensional video sequence correspond to a parallel camera arrangement structure (e.g. the test sequences used by the MPEG 3DV group), that is, the optical axes of all the views are parallel to each other and the optical centres are arranged on one straight line.
The SEI of H.264/AVC is constituted by a plurality of SEI messages. Each SEI message has a type number (i.e. payloadType) and a payload length (i.e. payloadSize). The bitstream of an SEI message comprises a payloadType coded as an 8-bit unsigned integer, a payloadSize coded as an 8-bit unsigned integer, and a message content bitstream whose length in bytes (1 byte=8 bits) is indicated by the payloadSize. In H.264/AVC, a payloadType or payloadSize of 255 or more is coded with additional leading 0xFF bytes, each of which adds 255 to the value. There are also similar methods in other video coding standards.
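The layout of an SEI message can be illustrated with a small parser. This is a sketch under stated assumptions: the input is taken to be the raw SEI payload bytes with emulation-prevention bytes already removed, and the function name parse_sei_message is hypothetical. The loop over 0xFF bytes reflects the H.264/AVC rule that each leading 0xFF byte adds 255 to payloadType or payloadSize; for values below 255 this reduces to the single 8-bit code described above.

```python
def parse_sei_message(data, pos=0):
    """Read one SEI message starting at data[pos].

    Returns (payload_type, payload_bytes, next_pos).
    Assumption: `data` holds SEI bytes without emulation-prevention bytes.
    """
    def read_value(p):
        value = 0
        while data[p] == 0xFF:  # each 0xFF byte adds 255 and continues
            value += 255
            p += 1
        return value + data[p], p + 1  # final byte is below 0xFF

    payload_type, pos = read_value(pos)
    payload_size, pos = read_value(pos)
    payload = bytes(data[pos:pos + payload_size])
    return payload_type, payload, pos + payload_size
```

Because the function returns the position after the message, it can be called repeatedly to walk through several SEI messages stored back to back.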
One piece of information may comprise one or more pieces of basic sub-information. The content of a piece of sub-information may be represented as a numerical value within a certain range (generally an integer range). For example, if one piece of sub-information describes two cases, then 0 and 1 may be used to represent these two cases respectively. As another example, if one piece of sub-information describes the 9 numbers from 0 to 1 which are multiples of 0.125, then the integers from 0 to 8 may be used to represent these 9 numbers respectively. The sub-information is numeralized as a syntax element, and the syntax element is coded with an appropriate code, chosen according to the range and distribution of the numerical value, to form a code word (formed by one or more bits), i.e. the syntax element is encoded as a string of bits. Common codes include the n-bit fixed-length code, the Exp-Golomb code, arithmetic coding, etc. More particularly, the 1-bit unsigned integer code comprises the two code words 0 and 1; the 2-bit unsigned integer code comprises the four code words 00, 01, 10 and 11; and the 0th-order Exp-Golomb code comprises the code words 1, 010, 011, etc. A binary flag usually uses a 1-bit unsigned integer code (see H.264/AVC standard document), and an index with more than two values is usually coded with an n-bit unsigned integer code (n being a positive integer), the Exp-Golomb code, etc. A code word is recovered to the numerical value of the syntax element it represents by a corresponding decoding method (e.g. a look-up table method, i.e. looking up the syntax element value corresponding to the code word in a code word table). A plurality of pieces of sub-information may also be jointly numeralized as one syntax element, thereby corresponding to one code word. For example, each combination of two pieces of sub-information may be numbered, and this serial number is taken as one syntax element.
The code words corresponding to a plurality of syntax elements are connected in a specified order, as defined by the encoding and decoding process, to form a bitstream.
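The codes named above can be made concrete with a short sketch. The function names are chosen for this illustration only, and code words are represented as strings of "0" and "1" characters for readability rather than as packed bits.

```python
def fixed_length(value, n):
    """n-bit unsigned integer code word for `value`."""
    if not 0 <= value < (1 << n):
        raise ValueError("value does not fit in n bits")
    return format(value, "0{}b".format(n))


def exp_golomb(value):
    """0th-order Exp-Golomb code word for a non-negative integer."""
    binary = format(value + 1, "b")
    # Prefix of (length - 1) zeros, followed by value + 1 in binary.
    return "0" * (len(binary) - 1) + binary


def decode_exp_golomb(bits, pos=0):
    """Decode one Exp-Golomb code word; return (value, next_pos)."""
    zeros = 0
    while bits[pos + zeros] == "0":
        zeros += 1
    end = pos + 2 * zeros + 1
    return int(bits[pos + zeros:end], 2) - 1, end


def numeralize_eighths(x):
    """Map a multiple of 0.125 in [0, 1] to an integer in 0..8."""
    return round(x / 0.125)
```

For instance, the value 0.375 is numeralized as 3; since the 9 possible values need at least 4 bits in a fixed-length code, its code word would be 0011, whereas the 0th-order Exp-Golomb code word for 3 is 00100.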
For example, one piece of information comprises three pieces of sub-information of A, B and C, and the information is encoded as a bitstream. The common methods comprise:
1. respectively numeralizing the three pieces of sub-information of A, B and C as three syntax elements separately, coding in three code words of MA, MB and MC, and connecting the three code words according to a certain order, e.g. MA-MB-MC (i.e. MB appears after MA and before MC in the bitstream) or MC-MA-MB, to form a bitstream; and
2. jointly numeralizing two pieces of sub-information (e.g. A and B) as one syntax element, and converting the other piece of sub-information (e.g. C) into a syntax element, and connecting the code words of the two syntax elements to form a bitstream.
It should be noted that some extra code words (or bits) irrelevant to the information may also be inserted between any two adjacent code words (e.g. between MA and MB, or between MB and MC) in the bitstream of the information. One example is padding bits consisting of one or more consecutive 0s or 1s. Another example is code words corresponding to information other than the 3 pieces of sub-information mentioned above.
If a certain piece of sub-information A in one piece of information depends on another piece of sub-information B (i.e. when B indicates a certain special condition, the information indicated by A has no meaning), common methods for coding such information as a bitstream comprise:
1. writing the code words corresponding to both A and B in the bitstream, where, when B indicates the special condition, the code word corresponding to A is an agreed code word;
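The two ways of coding the sub-information A, B and C, and the optional padding between code words, can be sketched as follows. The value ranges (A in 0..1, B in 0..3, C in 0..7), the choice of codes and the function names are assumptions made only for this example.

```python
def u(value, n):
    """n-bit unsigned integer code word."""
    return format(value, "0{}b".format(n))


def ue(value):
    """0th-order Exp-Golomb code word."""
    binary = format(value + 1, "b")
    return "0" * (len(binary) - 1) + binary


def encode_separately(a, b, c, padding=""):
    """Method 1: one code word per sub-information, in the order MA-MB-MC.

    `padding` stands for extra bits unrelated to A, B and C that may be
    inserted between adjacent code words.
    """
    return u(a, 1) + padding + u(b, 2) + padding + u(c, 3)


def encode_jointly(a, b, c):
    """Method 2: A and B share one syntax element; C keeps its own."""
    joint = a * 4 + b  # serial number of the (A, B) combination
    return ue(joint) + u(c, 3)
```

A decoder must know the same order and the same padding convention, otherwise it cannot tell where one code word ends and the next begins.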
2. writing the code word corresponding to B in the bitstream. When B indicates the special condition, not writing the code word corresponding to A in the bitstream; otherwise, writing the code word corresponding to A after the code word corresponding to B; and
3. numeralizing all valid combinations of A and B as one syntax element, and writing a code word corresponding to the syntax element in the bitstream.
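The second and third methods above can be sketched as follows. The example assumes B is a 1-bit flag whose value 0 is the special condition and A is a 2-bit index; these sizes and all function names are illustrative assumptions.

```python
def encode_conditional(a, b, special_b=0):
    """Method 2: write B; write A only when B is not the special condition."""
    bits = format(b, "01b")
    if b != special_b:
        bits += format(a, "02b")
    return bits


def decode_conditional(bits, special_b=0):
    """Inverse of encode_conditional; returns (a, b, bits_consumed)."""
    b = int(bits[0], 2)
    if b == special_b:
        return None, b, 1  # A is absent and carries no meaning
    return int(bits[1:3], 2), b, 3


# Method 3: number every valid (A, B) combination as one syntax element.
VALID_COMBINATIONS = [(None, 0), (0, 1), (1, 1), (2, 1), (3, 1)]


def encode_joint(a, b):
    """Return the serial number of the (A, B) combination."""
    return VALID_COMBINATIONS.index((a, b))
```

Method 2 saves bits whenever the special condition occurs, at the cost of making the position of later code words depend on the decoded value of B.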
A common method for decoding a bitstream containing a certain piece of information (constituted by a plurality of pieces of sub-information) comprises splitting the bitstream into several code words (which may include extra code words irrelevant to the information) according to an agreed syntax element order (e.g. the syntax element order specified by a video coding standard), and decoding, among these code words, the ones corresponding to the plurality of pieces of sub-information mentioned above, to obtain the plurality of pieces of sub-information.
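The decoding procedure can be sketched with a small bit reader that follows an agreed syntax element order. The order used here (A, two reserved bits, B, C), the codes assigned to each element and all names are hypothetical and chosen only to show how irrelevant code words are skipped.

```python
class BitReader:
    """Reads code words from a string of '0'/'1' characters."""

    def __init__(self, bits):
        self.bits = bits
        self.pos = 0

    def u(self, n):
        """Read an n-bit unsigned integer code word."""
        value = int(self.bits[self.pos:self.pos + n], 2)
        self.pos += n
        return value

    def ue(self):
        """Read a 0th-order Exp-Golomb code word."""
        zeros = 0
        while self.bits[self.pos] == "0":
            zeros += 1
            self.pos += 1
        return self.u(zeros + 1) - 1


# Agreed order: (name, code, length in bits for fixed-length codes).
SYNTAX = [("A", "u", 1), ("reserved", "u", 2), ("B", "ue", None), ("C", "u", 3)]


def decode_information(bits):
    """Split the bitstream into code words and keep only A, B and C."""
    reader = BitReader(bits)
    result = {}
    for name, code, n in SYNTAX:
        value = reader.u(n) if code == "u" else reader.ue()
        if name != "reserved":  # code words irrelevant to the information
            result[name] = value
    return result
```

For the bit string 100011101 this reads A from the first bit, discards the two reserved bits, reads B from the Exp-Golomb code word 011 and C from the last three bits.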
In order to improve the stereo perception of the stereo-pair, horizontal shifting of the stereo-pair images may generally be applied to adjust the parallax range presented on a display. When the left view image is shifted to the right with respect to the right view image, negative parallax increases and positive parallax decreases; when the left view image is shifted to the left with respect to the right view image, negative parallax decreases and positive parallax increases.
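The effect of horizontal shifting can be checked with a small numerical sketch. It assumes the convention that the parallax of a point is its horizontal position in the right view minus its position in the left view, so that negative values correspond to crossed (in front of the screen) parallax; this convention, the function names and the fill value for uncovered samples are assumptions of the example.

```python
def shift_row(row, shift, fill=0):
    """Shift one image row horizontally by `shift` samples.

    Positive shift moves the content to the right; samples uncovered at
    the border are set to `fill`.
    """
    width = len(row)
    if shift >= 0:
        return ([fill] * shift + row[:max(width - shift, 0)])[:width]
    return (row[-shift:] + [fill] * (-shift))[:width]


def parallax_after_left_shift(parallaxes, shift):
    """Parallax values after shifting the left view image by `shift`.

    Assumed convention: parallax = x_right - x_left, so moving the left
    image to the right by `shift` samples lowers every parallax by `shift`.
    """
    return [p - shift for p in parallaxes]
```

Shifting the left view 2 samples to the right turns the parallaxes -4, 0 and 6 into -6, -2 and 4: the negative parallax grows in magnitude while the positive parallax shrinks, in line with the statement above.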
The video sequences and depth sequences of a three-dimensional video sequence may each be coded by the multi-view video coding (MVC for short) standard (described in Annex H of the H.264/AVC standard). MVC specifies that the video sequence of each view has its own view order index (VOIdx for short, a non-negative integer) indicating the decoding order of the view video. The video sequence of the view with the smallest view order index (i.e. the view with VOIdx=0) is decoded first, and the video sequence of the view with the second smallest view order index (i.e. the view with VOIdx=1) is decoded second. However, MVC does not specify the relative positions of the views. Due to the lack of position information about the views, a stereo display may erroneously output a right view image to the left eye, thereby leading to an erroneous three-dimensional perception. Moreover, the stereo display (or an auxiliary device such as a set-top box) may generate a synthesized view video to serve as one video of a stereo-pair for display, thereby adjusting the parallax perception; in that case the stereo display may also need to know some important parameters related to the synthesized view, e.g. the synthesized view position, the synthesis precision, etc. Therefore, supplemental auxiliary information for instructing the stereo display how to display the stereo-pair should also be added to the three-dimensional video sequence bitstream, comprising indication information of the left view picture, the structure of the stereo-pair, the synthesized view position, the synthesis precision, the number of shifting samples, etc.
However, in the related art, the three-dimensional video sequence bitstream does not carry the supplemental auxiliary information required for constructing a stereo-pair. For lack of this supplemental auxiliary information, a decoding end may output an arbitrarily constructed stereo-pair for display, which may result in poor perception.
No effective solution to the problem mentioned above has yet been proposed.