This invention relates to a method and apparatus for decoding an enhanced video stream.
Referring to FIG. 1 of the drawings, a video encoder 10 receives raw video data, typically in the HD-SDI format defined in SMPTE 292M, from a source such as a camera. The video encoder utilizes the HD-SDI data to generate a video elementary stream and supplies the video elementary stream to a video packetizer 14, which produces a video packetized elementary stream (PES) composed of variable length packets. Typically, each packet of the video PES contains one or more video frames. Similarly, an audio encoder (not shown) receives raw audio data from, for example, a microphone and supplies an audio elementary stream to an audio packetizer, which creates an audio PES composed of variable length packets.
The video and audio packetizers supply the video and audio PESs to a transport stream multiplexer 18, which assigns different respective program identifiers (PIDs) to the video PES and the audio PES and organizes the variable-length packets of the video and audio PESs as fixed-length MPEG-2 transport stream (TS) packets each having a header that includes the PID of the PES and a payload containing the PES video (or audio) data.
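By way of illustration only, the fixed-length packet header described above may be sketched as follows. The sketch assumes the 188-byte packet length, the 0x47 sync byte, and the 13-bit PID field of the MPEG-2 systems standard. The function names make_ts_packet and read_pid are hypothetical, and padding a short payload with 0xFF bytes is a simplification that stands in for the adaptation-field stuffing a real multiplexer would use.

```python
TS_PACKET_SIZE = 188   # fixed MPEG-2 transport stream packet length in bytes
TS_HEADER_SIZE = 4     # minimal header, no adaptation field
SYNC_BYTE = 0x47


def make_ts_packet(pid: int, payload: bytes, continuity_counter: int = 0,
                   payload_unit_start: bool = False) -> bytes:
    """Build one TS packet carrying `payload` under the given PID (hypothetical helper)."""
    if not 0 <= pid <= 0x1FFF:
        raise ValueError("PID must fit in 13 bits")
    max_payload = TS_PACKET_SIZE - TS_HEADER_SIZE
    if len(payload) > max_payload:
        raise ValueError("payload does not fit in one TS packet")
    header = bytes([
        SYNC_BYTE,
        (0x40 if payload_unit_start else 0x00) | (pid >> 8),  # PUSI flag + PID high 5 bits
        pid & 0xFF,                                            # PID low 8 bits
        0x10 | (continuity_counter & 0x0F),                    # payload only + 4-bit counter
    ])
    # Simplification (assumption): pad short payloads with 0xFF bytes.
    padding = b"\xff" * (max_payload - len(payload))
    return header + payload + padding


def read_pid(packet: bytes) -> int:
    """Extract the 13-bit PID from a TS packet header."""
    if len(packet) != TS_PACKET_SIZE or packet[0] != SYNC_BYTE:
        raise ValueError("not a valid TS packet")
    return ((packet[1] & 0x1F) << 8) | packet[2]
```

In this sketch, a video PES and an audio PES would simply be cut into payloads and passed to make_ts_packet with two different PID values.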
The single program transport stream (SPTS) that is output by the transport stream multiplexer may be supplied to a program multiplexer 22 that combines the SPTS with other transport streams, conveying other programs, to produce a multi-program transport stream (MPTS). The MPTS is transmitted over a channel to a receiver at which a program demultiplexer 26 separates a selected SPTS from the MPTS and supplies it to a transport stream demultiplexer 30. It will be appreciated by those skilled in the art that the SPTS that is output by the transport stream multiplexer may be transmitted directly to the transport stream demultiplexer without first being combined with other transport streams to create the MPTS. In either case, the transport stream demultiplexer receives the transport stream packets of the selected SPTS, separates them on the basis of PID, depacketizes the transport stream packets to recreate the PES packets, and directs the video PES to a so-called video system target decoder (T-STD) 34 and the audio PES to an audio T-STD 38. The subject matter of this application is concerned with decoding a video bitstream and accordingly we will not discuss the audio decoder further.
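The separation on the basis of PID may be illustrated by the following minimal sketch, which is not the demultiplexer 30 itself. It assumes 188-byte transport stream packets with a 0x47 sync byte, a 13-bit PID, and a two-bit adaptation field control, as in the MPEG-2 systems standard. The function name demux_by_pid is hypothetical, and the sketch only concatenates packet payloads per PID without parsing PES headers.

```python
from collections import defaultdict
from typing import Dict

TS_PACKET_SIZE = 188
SYNC_BYTE = 0x47


def demux_by_pid(transport_stream: bytes) -> Dict[int, bytes]:
    """Group TS packet payloads by PID (hypothetical helper, no PES parsing)."""
    streams = defaultdict(bytearray)
    last_start = len(transport_stream) - TS_PACKET_SIZE
    for offset in range(0, last_start + 1, TS_PACKET_SIZE):
        packet = transport_stream[offset:offset + TS_PACKET_SIZE]
        if packet[0] != SYNC_BYTE:
            continue  # assumption: silently skip packets that lost sync
        pid = ((packet[1] & 0x1F) << 8) | packet[2]
        adaptation_field_control = (packet[3] >> 4) & 0x03
        payload_start = 4
        if adaptation_field_control & 0x02:
            # An adaptation field is present; its first byte gives its length.
            payload_start += 1 + packet[4]
        if adaptation_field_control & 0x01:
            # A payload is present; append it to the stream for this PID.
            streams[pid] += packet[payload_start:]
    return {pid: bytes(data) for pid, data in streams.items()}
```

A receiver built along these lines would then hand the bytes collected for the video PID to the video decoder path and those for the audio PID to the audio decoder path.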
The video T-STD 34 comprises a system target decoder buffer 40 and a video decoder 42. The STD buffer 40 is functionally equivalent to a transport buffer Tb, a multiplexing buffer Mb, and an elementary stream buffer Eb. The transport buffer Tb receives the video PES at a variable bit rate and outputs the data at a constant bit rate to the multiplexing buffer Mb, which depacketizes the video PES and supplies an encoded bit stream at a constant bit rate to the elementary stream buffer Eb. The elementary stream buffer, which is sometimes referred to as the decoder buffer or as the coded picture buffer (CPB), receives the CBR bitstream and holds the bits for decoding a picture until they are all removed instantaneously by the video decoder at the picture decode time.
It is important to the proper operation of the decoder that the decoder buffer should neither overflow, so that bits are lost and a picture cannot be decoded, nor underflow, so that the decoder is starved of bits and is unable to decode a picture at the proper time. The supply of bits to the decoder buffer is controlled by a compressed data buffer (CDB) 46 that receives the bitstream from the video encoder 10. The video encoder supplies bits to the CDB at a rate that depends on the fullness of the CDB. The CDB supplies bits to the video packetizer 14 at a constant rate and the multiplexing buffer supplies bits to the decoder buffer at the same rate, and accordingly the fullness of the CDB mirrors the fullness of the decoder buffer. By adjusting the supply of bits to the CDB so as to prevent overflow/underflow of the CDB, we avoid underflow/overflow of the decoder buffer.
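The overflow and underflow conditions may be illustrated with a deliberately simplified model of the decoder buffer. The sketch assumes that all bits of a picture are removed instantaneously at its decode time and that a fixed number of bits arrives at the constant rate between successive decode times. The function name first_buffer_violation and all of its parameters are hypothetical, and the model is not the buffer model of any particular standard.

```python
from typing import Optional, Sequence, Tuple


def first_buffer_violation(picture_bits: Sequence[int],
                           bits_per_interval: int,
                           buffer_size: int,
                           initial_fullness: int) -> Optional[Tuple[int, str]]:
    """Return (picture index, 'underflow' or 'overflow') for the first violation, else None."""
    fullness = initial_fullness
    for index, size in enumerate(picture_bits):
        # Decode time: the whole picture must already be in the buffer.
        if size > fullness:
            return index, "underflow"
        fullness -= size
        # Constant-rate delivery until the next decode time.
        fullness += bits_per_interval
        if fullness > buffer_size:
            return index, "overflow"
    return None
```

Under this model, a run of pictures larger than the per-interval delivery drains the buffer toward underflow, while a run of small pictures fills it toward overflow, which is the behavior the encoder-side rate adjustment is meant to prevent.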
The video compression standard governing operation of the encoder may specify that the CDB should be no larger than the decoder buffer of a hypothetical reference decoder.
The MPEG-2 transport stream is widely used for delivery of encoded video over an error prone channel. The MPEG-2 system layer also provides for transmission of encoded video in the program stream (PS) in an error free environment. FIG. 1 illustrates transmission of the video PES as a program stream to a video P-STD 50 as an alternative to delivery as a transport stream to the video T-STD 34.
The bitstream produced by the video encoder 10 may comply with the video compression standard that is specified in ISO/IEC 14496-10 (MPEG-4 part 10) Advanced Video Coding (AVC), commonly referred to as H.264/AVC. H.264/AVC uses picture as a collective term for a frame or field. H.264/AVC defines an access unit as a set of network abstraction layer (NAL) units and specifies that the decoding of an access unit always results in a decoded picture. A NAL unit of an access unit produced by an AVC encoder may be a video coding layer (VCL) unit, which contains picture information, or a non-VCL unit, which contains other information, such as closed captioning and timing.
Annex G of H.264/AVC prescribes an extension of H.264/AVC known as scalable video coding or SVC. SVC provides scalable enhancements to the AVC base layer, and the scalability includes spatial scalability, temporal scalability, SNR scalability and bit depth scalability. An SVC encoder is expected to create an H.264/AVC conformant base layer and to add enhancement to that base layer in one or more enhancement layers. Each type of scalability that is employed in a particular implementation of SVC may utilize its own enhancement layer. For example, if the raw video data is in the format known as 1080 HD, composed of frames of 1920×1088 pixels, the base layer may be conveyed by a sub-bitstream composed of access units that can be decoded as pictures that are 704×480 pixels whereas an enhancement layer may be conveyed by a sub-bitstream that is composed of access units that enable a suitable decoder to present pictures that are 1920×1088 pixels by combining the base layer access units with the enhancement layer access units.
A decoder having the capability to decode both a base layer and one or more enhancement layers is referred to herein as an SVC decoder whereas a decoder that cannot recognize an enhancement layer and is able to decode only the base layer access units, and therefore does not have SVC capability, is referred to herein as an AVC decoder.
An access unit produced by an SVC encoder comprises not only the base layer NAL units mentioned above, which may be conveniently referred to as AVC NAL units, but also SVC VCL NAL units and SVC non-VCL NAL units. FIG. 2 shows the sequence of AVC NAL units and SVC NAL units in an SVC access unit as prescribed by the SVC standard. In the event that the encoder produces, for example, two enhancement layers, the non-VCL NAL units for the two enhancement layers are in adjacent blocks of the sequence shown in FIG. 2, between the blocks containing the AVC non-VCL NAL units and the AVC VCL NAL units, and the SVC VCL NAL units for the two enhancement layers are in adjacent blocks of the sequence after the block containing the AVC VCL NAL units.
An SVC decoder that extracts the base layer NAL units from the access unit selects only the AVC non-VCL NAL units and the AVC VCL NAL units.
H.264/AVC specifies a five-bit parameter nal_unit_type, or NUT. Under H.264/AVC, AVC NAL units all have NUT values in the range 1-13. SVC adds NUT values 14, 15 and 20. However, a NAL unit having NUT equal to 14 immediately preceding NAL units having NUT equal to 5 or 1 signals base layer slices, such that these NAL units, which are non-VCL NAL units, are compatible with AVC and can be decoded by an AVC decoder.
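As an illustration of how the NUT distinguishes the two groups of NAL units, the following sketch reads the five-bit nal_unit_type from the low-order bits of the first byte of a NAL unit and sorts it into the ranges given above. The function names and the labels "AVC", "SVC" and "other" are hypothetical. The sketch assumes that the NAL unit has already been extracted from the bitstream, without any start code prefix, and it does not implement the special handling of a NUT 14 unit that precedes base layer slices.

```python
AVC_NUT_VALUES = range(1, 14)   # NUT 1-13, AVC NAL units
SVC_NUT_VALUES = {14, 15, 20}   # NUT values added by SVC


def nal_unit_type(nal_unit: bytes) -> int:
    """Return the five-bit nal_unit_type from the first byte of a NAL unit."""
    if not nal_unit:
        raise ValueError("empty NAL unit")
    return nal_unit[0] & 0x1F


def classify_nal_unit(nal_unit: bytes) -> str:
    """Label a NAL unit as 'AVC', 'SVC' or 'other' from its NUT alone (hypothetical labels)."""
    nut = nal_unit_type(nal_unit)
    if nut in AVC_NUT_VALUES:
        return "AVC"
    if nut in SVC_NUT_VALUES:
        return "SVC"
    return "other"
```

For example, a NAL unit whose first byte is 0x65 has NUT 5 and is labeled "AVC" by this sketch, whereas a first byte of 0x74 gives NUT 20 and the label "SVC".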
Referring to FIG. 3, an SVC encoder 10′ generates a unitary bitstream that conveys the base layer and, for example, two enhancement layers ENH1 and ENH2. Depending on its capabilities, a decoder might expect to receive, and decode, the base layer only, or the base layer and enhancement layer ENH1, or the base layer and both enhancement layer ENH1 and enhancement layer ENH2. Under the MPEG-2 systems standard and use case for SVC, the encoder may not provide three bitstreams, conveying respectively the base layer only, the base layer and enhancement layer ENH1, and the base layer and both enhancement layer ENH1 and enhancement layer ENH2, and allow the decoder to select whichever bitstream it is able to decode. The encoder must provide the base layer access units and parts of each enhancement layer in separate bitstreams. It would be possible in principle to comply with the MPEG-2 systems standard by using a NAL separator 48 to separate the unitary bitstream into three sub-bitstreams based on the NUT values of the NAL units. One sub-bitstream would convey the base layer NAL units and the other two sub-bitstreams would convey the NAL units for the two enhancement layers respectively. The three sub-bitstreams would pass to respective video packetizers (generally designated 14), which create respective video PESs. The three video PESs would be supplied to a transport stream multiplexer 18 including a T-STD buffer equivalent to the buffer that is included in an SVC T-STD, for the purpose of multiplexing together the outputs of the three packetizers. The multiplexer 18 would assign different PIDs to the three PESs and output a transport stream conveying the three layers.
The video T-STD 34 shown in FIG. 1 is unable to decode the bitstream conveyed by the transport stream produced by the transport stream multiplexer 18′ shown in FIG. 3 because it provides no capability to reassemble the base layer and enhancement layer access units to produce a complete SVC access unit that can be decoded by an SVC decoder. Neither the H.264/AVC standard nor the MPEG-2 systems standard prescribes how the base layer and enhancement layer access units should be reassembled. Therefore, the architecture shown in FIG. 3 has hitherto lacked practical application.