The bitstream of a video sequence includes the bitstream generated directly by encoding the image of each frame, and also includes parameter information such as the width and the height of the image. These parameters are usually organized in a syntax structure called a parameter set and are encoded into a parameter set bitstream, for example a Sequence Parameter Set (SPS) or a Picture Parameter Set (PPS) in H.264/Advanced Video Coding (H.264/AVC).
Broadly speaking, a video sequence includes V (V is more than or equal to 1) viewpoints, and each viewpoint includes N (N is more than or equal to 1) frames of images; when V=1, the video sequence is a common single-viewpoint video sequence, and when V is more than 1, the video sequence is usually called a multiview video sequence. When a coding standard such as H.264/AVC is used, one frame of image can be divided into multiple slices, and each slice is encoded and decoded. Slice bitstreams are generated by encoding the slices, and each slice bitstream is usually considered to include one slice together with some coding parameters of the slice, such as the identification number of a parameter set referred to by the slice. In the hierarchical structure of H.264/AVC, two levels of parameter sets, i.e. the PPS and the SPS, are used for signalling some parameters of the video sequence. Parameter sets can be referred to by the slices, that is, a parameter set is found according to the parameter set identification number signalled in a slice, and the parameters are acquired from that parameter set. For example, a slice can acquire parameters such as the profile and the level it uses from the SPS according to the parameter set identification number signalled in the slice, and a slice can also acquire information such as the frame number of the slice from the PPS according to the identification number of the PPS signalled in the slice. By utilizing such a mechanism, a slice can also derive more information by combining its own information with the information in the parameter sets it refers to; for example, the decoding order of the view containing the current slice can be derived as a “view order index” based on the view containing the current slice and the view decoding order signalled in the SPS referred to by the current slice.
For example, a certain video sequence consists of three viewpoints with viewpoint numbers 3, 7 and 9; the viewpoint number of the viewpoint where a certain slice is located is 9, and the viewpoint decoding order signalled in the SPS referred to by the slice is 7-3-9 (indicating that the viewpoint with viewpoint number 7 is decoded first and has decoding order 1; similarly, the viewpoint with viewpoint number 3 has decoding order 2 and the viewpoint with viewpoint number 9 has decoding order 3). In this case, both the decoding order and the view order index of the view containing the slice are 3. Sometimes the view order index of the view containing the slice is included directly in the information of the slice, and can be obtained without derivation from the information in the parameter set referred to by the slice.
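The derivation in the example above can be sketched as follows. This is a minimal illustration, not the syntax of any standard: the class names `Sps` and `Slice`, their fields, and the function `view_order_index` are hypothetical, and the index is 1-based only because the example in the text counts from 1.

```python
from dataclasses import dataclass


@dataclass
class Sps:
    """Hypothetical SPS holding only the fields used in this sketch."""
    sps_id: int
    view_decoding_order: list  # viewpoint numbers, in decoding order


@dataclass
class Slice:
    """Hypothetical slice header: which SPS it refers to and its viewpoint."""
    sps_id: int
    view_id: int


def view_order_index(slice_, sps_table):
    """Derive the 1-based view order index of the view containing the slice."""
    # Find the parameter set by the identification number signalled in the slice.
    sps = sps_table[slice_.sps_id]
    # Position of the slice's viewpoint in the signalled decoding order.
    return sps.view_decoding_order.index(slice_.view_id) + 1


# Viewpoints 3, 7 and 9 with signalled decoding order 7-3-9.
sps_table = {0: Sps(sps_id=0, view_decoding_order=[7, 3, 9])}
print(view_order_index(Slice(sps_id=0, view_id=9), sps_table))
```

With the values from the text, the slice in viewpoint 9 obtains view order index 3, and viewpoints 7 and 3 obtain 1 and 2 respectively.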
With the development of technology, particularly of three-dimensional image acquisition technology and three-dimensional display equipment, a video sequence may include the depth information of one or more viewpoints in addition to the texture information of the one or more viewpoints, and may also include camera parameters of the one or more viewpoints. The camera parameters can include intrinsic parameters such as the focal length and extrinsic parameters such as the distance of a camera relative to a certain reference point; when the video sequence includes depth information, the camera parameters usually also include the depth value of the furthest plane and the depth value of the nearest plane corresponding to the depth information. The camera parameters of the video sequence may be necessary for decoding the texture information or depth information of the video sequence, and may also be necessary for synthesizing the texture information or depth information of virtual viewpoints, so that in some applications the camera parameters are required to be encoded into bitstreams for transmission.
A bitstream is a bit string, and a segment of bitstream is usually formed by converting a plurality of numerical values into binary codewords and concatenating the codewords in a certain order. The bitstream parsing rules and code tables specified in an encoding method and in the corresponding decoding method must be consistent, so that a bitstream generated by encoding can be correctly divided into codewords by the decoding method, the actual meaning of each codeword can be found according to the code table (for example, which intra-frame prediction mode is used for each of multiple macroblocks), and the decoding process corresponding to the codewords can be correctly conducted. Meanwhile, the overall bitstream of a video consists of multiple segments of bitstreams representing different meanings; for example, the bitstream corresponding to the camera parameters and the bitstream corresponding to picture contents are each a part of the bitstream of the video sequence. Therefore, the bitstream organization used by an encoder and a decoder for the multiple segments of bitstreams is also required to be kept consistent.
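As an illustration of why the encoder and the decoder must share the same code table, the following sketch concatenates codewords into a bit string and then divides the bit string back into codewords. The code table, the symbol names and the function names are assumptions made for this example only; they are not taken from H.264/AVC or any other standard.

```python
# Hypothetical prefix-free code table shared by encoder and decoder.
CODE_TABLE = {
    "intra_16x16": "1",
    "intra_4x4": "01",
    "intra_8x8": "00",
}


def encode(symbols, table):
    """Convert each symbol to its codeword and concatenate the codewords."""
    return "".join(table[symbol] for symbol in symbols)


def decode(bits, table):
    """Divide a bit string into codewords and map each back to its symbol."""
    inverse = {codeword: symbol for symbol, codeword in table.items()}
    symbols, current = [], ""
    for bit in bits:
        current += bit
        # Because the table is prefix-free, the first match is the codeword.
        if current in inverse:
            symbols.append(inverse[current])
            current = ""
    if current:
        raise ValueError("trailing bits do not form a complete codeword")
    return symbols


modes = ["intra_4x4", "intra_16x16", "intra_8x8"]
bitstream = encode(modes, CODE_TABLE)
print(bitstream, decode(bitstream, CODE_TABLE))
```

If the decoder used a different table from the encoder, the same bit string would be divided at different positions or mapped to different symbols, which is why the parsing rules and code tables have to agree.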
A process of converting a numerical value with an actual meaning, also called a symbol, into a codeword in the form of a bit string is usually called entropy coding, which is a mature technology. Common entropy coding methods include N-bit fixed-length coding, Exponential-Golomb coding, Lempel-Ziv-Welch (LZW) coding, Run-length encoding, Shannon coding, Huffman coding and arithmetic coding. Each entropy coding method has its own advantages. A symbol can also be divided into multiple sub-symbols, in which case the codeword is formed from the bitstreams generated by entropy coding each sub-symbol. For example, a number can be expressed or approximately expressed as a combination of three groups of bits; a typical example is the floating point format specified by the Institute of Electrical and Electronics Engineers (IEEE) 754 standard, which defines a 32-bit single-precision floating point number, a 64-bit double-precision floating point number and a 128-bit quadruple-precision floating point number. On the other hand, multiple symbols can be jointly coded into one codeword; for example, two symbols can be mapped to one codeword by a two-dimensional code table.
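Two of the ideas above can be sketched briefly: an unsigned order-0 Exponential-Golomb code, and the division of a 32-bit single-precision number into three groups of bits (sign, exponent and mantissa). The function names are chosen for this illustration only, and bit strings are used instead of packed bytes purely for readability.

```python
import struct


def exp_golomb_encode(value):
    """Return the unsigned order-0 Exp-Golomb codeword of value (>= 0)."""
    if value < 0:
        raise ValueError("value must be non-negative")
    binary = bin(value + 1)[2:]
    # Prefix of (length - 1) zeros, followed by value + 1 in binary.
    return "0" * (len(binary) - 1) + binary


def exp_golomb_decode(bits, pos=0):
    """Decode one codeword starting at pos; return (value, next position)."""
    zeros = 0
    while bits[pos + zeros] == "0":
        zeros += 1
    end = pos + 2 * zeros + 1
    return int(bits[pos + zeros:end], 2) - 1, end


def split_float32(x):
    """Split an IEEE 754 single-precision number into three sub-symbols."""
    (raw,) = struct.unpack(">I", struct.pack(">f", x))
    sign = raw >> 31
    exponent = (raw >> 23) & 0xFF
    mantissa = raw & 0x7FFFFF
    return sign, exponent, mantissa


# Concatenate two codewords, then parse them back in order.
stream = exp_golomb_encode(3) + exp_golomb_encode(0)
first, pos = exp_golomb_decode(stream)
second, pos = exp_golomb_decode(stream, pos)
print(stream, first, second, split_float32(1.0))
```

Each of the three fields returned by `split_float32` could then be coded separately, for instance with a fixed-length code for the sign and exponent, which is one way of forming a codeword from sub-symbols.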
When the encoder and the decoder agree that, under a certain condition, some symbols can be obtained at the decoder end from context information, these symbols are not required to be written into the bitstream; under such a condition, the corresponding codewords are usually called default codewords, and the symbols are called default symbols. Of course, due to design defects of the coding method and the decoding method, or due to other requirements such as error resilience, some bitstreams still include the default codewords, and the coding efficiency is correspondingly reduced. Sometimes the encoder and the decoder also agree on the adoption of a fixed mode of processing, for example, Inverse Discrete Cosine Transform (IDCT) is adopted in the H.264/AVC standard, and the method adopted under such a condition is usually called a default method.
V (V is more than or equal to 1) video sequences, each of which corresponds to a different viewpoint, form a multiview video sequence, wherein the video sequence of each viewpoint usually includes N (N is more than or equal to 1) temporally synchronized frames, and each frame of the video sequence corresponds to M (M is more than or equal to 1) types of camera parameters. The V*M camera parameters corresponding to the V different viewpoints at the same moment form a camera parameter subset, F (F is more than or equal to 1 and less than or equal to N) camera parameter subsets usually form a camera parameter set, and the camera parameter set therefore includes V*F*M camera parameters.
The camera parameters of each type of each viewpoint can form a camera parameter vector including F camera parameters; furthermore, the camera parameter vectors of a certain type of the V viewpoints can be combined into a two-dimensional camera parameter matrix including V*F camera parameters. A simple method for coding the two-dimensional camera parameter matrix is to perform entropy coding to convert each camera parameter in the matrix into a codeword, and to connect the codewords sequentially to form a bitstream.
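The organization described above, and the simple coding method, can be sketched as follows. The nested-list layout `params[v][f][m]`, the function names, and the use of 8-bit fixed-length codewords on already quantized integer parameters are all assumptions made for this illustration.

```python
def parameter_vector(params, v, m):
    """Camera parameter vector: the F parameters of type m for viewpoint v."""
    return [subset[m] for subset in params[v]]


def parameter_matrix(params, m):
    """Two-dimensional V x F matrix of the parameters of type m."""
    return [parameter_vector(params, v, m) for v in range(len(params))]


def encode_matrix_fixed_length(matrix, bits=8):
    """Simple method: one fixed-length codeword per parameter, concatenated."""
    codewords = []
    for row in matrix:
        for value in row:
            if not 0 <= value < 2 ** bits:
                raise ValueError("parameter does not fit in the codeword")
            codewords.append(format(value, f"0{bits}b"))
    return "".join(codewords)


# V = 2 viewpoints, F = 3 subsets (moments), M = 2 parameter types.
params = [
    [[10, 1], [10, 2], [11, 3]],  # viewpoint 0
    [[20, 4], [20, 5], [21, 6]],  # viewpoint 1
]
matrix = parameter_matrix(params, m=0)
print(matrix, len(encode_matrix_fixed_length(matrix)))
```

In this example the camera parameter set holds V*F*M = 12 parameters, the matrix for one type holds V*F = 6 of them, and the simple method spends the same number of bits on every parameter regardless of how similar neighbouring values are.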
Usually, the camera parameters within a camera parameter vector, being of the same type, have certain temporal correlation; meanwhile, the camera parameter vectors within a two-dimensional camera parameter matrix, also being of the same type, have certain correlation with one another. For example, multiple temporally adjacent camera parameters of the video sequence may have the same value, that is, one camera parameter in a camera parameter vector may be predicted from another, temporally adjacent camera parameter, so that a Run-length encoding method can be adopted for coding. One camera parameter in the video sequence of one viewpoint and the parameter of the temporally corresponding frame in the video sequence of another viewpoint may have the same value, that is, a camera parameter may be unidirectionally predicted from another camera parameter at the corresponding time, and the difference between the real value and the predicted value can be encoded in a direct coding or Run-length encoding manner. One camera parameter in the video sequence of one viewpoint may also equal the weighted sum of the parameters of the temporally corresponding frames of two other viewpoints, that is, a camera parameter may be bidirectionally predicted from two other camera parameters at the corresponding time, and the difference between the real value and the predicted value can likewise be encoded in a direct coding or Run-length encoding manner. Therefore, the joint coding of the camera parameter vector or the two-dimensional camera parameter matrix can exploit the correlation of the parameters in the vector or the matrix, thereby converting the correlated data into a simpler expression. For example, a camera parameter vector can be encoded by Run-length encoding, and the other camera parameter vectors can then be coded in a unidirectional or bidirectional prediction manner based on the already encoded camera parameter vector.
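The three forms of correlation described above can be illustrated with a short sketch. The function names, the equal default weights and the sample values are assumptions for this example; the text does not prescribe particular weights or a particular residual coding.

```python
def run_length_encode(vector):
    """Collapse runs of equal values into (value, run length) pairs."""
    runs = []
    for value in vector:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return [tuple(run) for run in runs]


def run_length_decode(runs):
    """Expand (value, run length) pairs back into the original vector."""
    return [value for value, count in runs for _ in range(count)]


def unidirectional_residual(target, reference):
    """Difference between a vector and its prediction from one other view."""
    return [t - r for t, r in zip(target, reference)]


def bidirectional_residual(target, ref_a, ref_b, weight_a=0.5):
    """Difference between a vector and a weighted sum of two other views."""
    weight_b = 1.0 - weight_a
    return [t - (weight_a * a + weight_b * b)
            for t, a, b in zip(target, ref_a, ref_b)]


left = [5, 5, 5, 7, 7]      # hypothetical parameter vector of one viewpoint
right = [9, 9, 9, 11, 11]   # another viewpoint
middle = [7, 7, 7, 9, 9]    # lies halfway between the two

print(run_length_encode(left))
print(unidirectional_residual(middle, left))
print(bidirectional_residual(middle, left, right))
```

When the prediction is good, the residual vector consists mostly of zeros or of long runs of one value, so it can itself be passed to `run_length_encode` and represented far more compactly than the original parameters.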
In the conventional H.264/AVC standard, the camera parameters can only be included in the syntax structure Supplemental Enhancement Information (SEI). SEI is independent of the video image contents and is configured to store a film introduction, copyright information, user-defined data and the like; it is not referred to in the coding and decoding processes of each slice, that is, the coding and decoding of a video image are independent of SEI. However, for a three-dimensional video sequence including depth information, the camera parameters are required by each frame in the coding and decoding processes; for example, the camera parameters are used for synthesizing a target viewpoint image as a viewpoint reference frame, which is commonly referred to as view synthesis prediction. For such a three-dimensional video sequence, if the camera parameters are included in SEI according to the conventional H.264/AVC standard, a slice cannot refer to the required camera parameters in the coding and decoding processes, so that the coding and decoding processes cannot be normally implemented. If the camera parameters are instead included in the slices, the coding and decoding processes can be normally implemented, but the coding efficiency is very low.
Similar problems exist in similar video coding and decoding standards such as the Audio Video coding Standard (AVS) and High Efficiency Video Coding (HEVC).
At present, some solutions to the problem have already been proposed, for example:
In the Chinese invention patent “video bitstream” with the application number 2011103746640, camera parameter sets are used for transmitting camera parameters, and a reference relationship between slices and the camera parameter sets and a corresponding relationship between the slices and the camera parameters are established. That is, a slice refers to the camera parameter set whose identification number equals the camera parameter set identification number included in the slice, and acquires the corresponding camera parameters from that camera parameter set according to the frame number of the frame where the slice is located. The method has the following defects: (1) the slice is required to acquire the corresponding camera parameters from the camera parameter set it refers to according to the frame number of the frame where the slice is located, while each camera parameter set may include the camera parameters of multiple frames, so that the camera parameter set must indicate the frame numbers of the frames included therein; otherwise the slice cannot acquire the corresponding camera parameters according to the frame number of the frame where the slice is located. Signalling these frame numbers wastes coding rate; and (2) as mentioned before, each frame can be divided into multiple slices, and different slices of the same frame share the same frame number, so that in the method the camera parameters corresponding to different slices in the same frame are necessarily the same; however, in a practical application, the camera parameters corresponding to different slices in the same frame may be different, so that the requirements of the application cannot be met by the method.
In the Chinese invention patent “video bitstream and decoding method for same” with the application number 2012100217669, camera parameter sets are similarly used for transmitting camera parameters, and a corresponding relationship between frame images and the camera parameters in the camera parameter sets is established. That is, a frame acquires the camera parameters from the corresponding camera parameter set according to the view order index of the viewpoint where the frame is located and its frame number. The method has the following defects: (1) because a frame acquires the camera parameters according to the view order index of the viewpoint where the frame is located and its frame number, the camera parameter set is required to include the view order indices of the viewpoints corresponding to the camera parameter set and the frame numbers of the starting frames in those viewpoints, and the frame numbers of the frames in the camera parameter set are required to be continuous; otherwise the frame cannot acquire the camera parameters from the corresponding camera parameter set. Signalling this information wastes coding rate; and (2) as mentioned before, each frame can be divided into multiple slices, and different slices of the same frame share the same frame number, so that in the method the camera parameters corresponding to different slices in the same frame are necessarily the same; however, in a practical application, the camera parameters corresponding to different slices in the same frame may be different, so that the requirements of the application cannot be met by the method.
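The frame-number-based lookup described for the first of the two prior methods above can be sketched as follows, purely to make the two stated defects concrete. The class names, field names and parameter values are hypothetical and do not reproduce the syntax of either cited application; the second method would differ mainly in keying the lookup on the view order index together with the frame number.

```python
from dataclasses import dataclass


@dataclass
class CameraParameterSet:
    """Hypothetical camera parameter set covering several frames."""
    cps_id: int
    # The set has to carry the frame numbers it covers (defect 1).
    params_by_frame: dict  # frame number -> camera parameters


@dataclass
class SliceHeader:
    """Hypothetical slice header fields used for the lookup."""
    cps_id: int
    frame_num: int


def lookup_camera_params(slice_header, cps_table):
    """Find the referred set by id, then select parameters by frame number."""
    cps = cps_table[slice_header.cps_id]
    return cps.params_by_frame[slice_header.frame_num]


cps_table = {
    1: CameraParameterSet(
        cps_id=1,
        params_by_frame={
            0: {"focal_length": 1000.0},
            1: {"focal_length": 1002.0},
        },
    )
}

# Two slices of the same frame share the frame number, so the lookup
# cannot return different parameters for them (defect 2).
slice_a = SliceHeader(cps_id=1, frame_num=0)
slice_b = SliceHeader(cps_id=1, frame_num=0)
print(lookup_camera_params(slice_a, cps_table) ==
      lookup_camera_params(slice_b, cps_table))
```

Because the lookup key contains nothing that distinguishes one slice of a frame from another, any scheme of this shape gives all slices of a frame identical camera parameters.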