New digital video coding techniques, which are directed to general improvements in coding efficiency, have a number of new structural characteristics. Specifically, an important new characteristic is scalability. In scalable coding, an original or source signal is represented using two or more hierarchically structured bitstreams. The hierarchical structure implies that decoding of a given bitstream depends on the availability of some or all other bitstreams that are lower in the hierarchy. Each bitstream, together with the bitstreams it depends on, offers a representation of the original signal at a particular temporal, quality (i.e., in terms of Signal-to-Noise Ratio, SNR), or spatial resolution.
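The dependency structure described above can be illustrated with a minimal sketch. The layer names and the `Layer` and `required_layers` helpers are hypothetical and chosen only for illustration; they do not come from any standard or from the referenced applications.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layer:
    """One bitstream in the hierarchy (hypothetical illustration type)."""
    name: str
    depends_on: tuple = ()  # names of layers lower in the hierarchy


def required_layers(target, layers):
    """Return the names of all layers needed to decode `target`, lowest first."""
    needed = []

    def visit(name):
        # Visit dependencies first so lower layers precede higher ones.
        for dep in layers[name].depends_on:
            visit(dep)
        if name not in needed:
            needed.append(name)

    visit(target)
    return needed


# Assumed example hierarchy: one base layer and three enhancement layers.
LAYERS = {
    "base": Layer("base"),
    "temporal_1": Layer("temporal_1", ("base",)),
    "spatial_1": Layer("spatial_1", ("base",)),
    "spatial_1_temporal_1": Layer(
        "spatial_1_temporal_1", ("spatial_1", "temporal_1")
    ),
}

if __name__ == "__main__":
    print(required_layers("spatial_1_temporal_1", LAYERS))
```

In this sketch, a receiver that wants the highest representation needs all four bitstreams, whereas a receiver that is satisfied with the base representation needs only one.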
It is understood that the term ‘scalable’ does not refer to magnitude or scale in terms of numbers, but rather to the ability of the encoding technique to offer a set of different bitstreams corresponding to representations of the original or source signal at different ‘scales’ of resolutions or other qualities in general. The ITU-T H.264 Annex G specification, which is referred to as Scalable Video Coding (SVC), is an example of a video coding standard that offers video coding scalability in the temporal, quality, and spatial dimensions. SVC is an extension of the H.264 standard (also known as Advanced Video Coding or AVC). An example of an earlier standard, which also offered all three types of scalability, is ISO MPEG-2 (also published as ITU-T H.262). ITU-T G.729.1 (also known as G.729EV) is an example of a standard offering scalable audio coding. Scalable video coding techniques which are specifically designed for interactive video communication applications such as videoconferencing are described in commonly assigned International patent application PCT/US06/0288365.
The concept of scalability was introduced in video and audio coding as a solution to distribution problems in streaming and broadcasting, and with a view to allow a given communication system to operate with varying access networks (e.g., clients connected with different bandwidths), network conditions (e.g., bandwidth fluctuation), and client devices (e.g., a personal computer that uses a large monitor vs. a handheld device with a much smaller screen).
Commonly assigned International patent application PCT/US06/028365 describes the design of a new type of server called the Scalable Video Communication Server (SVCS). An SVCS can advantageously use scalable coded video for high-quality and low-delay video communication, and has significantly lower complexity than traditional switching or transcoding Multipoint Control Units (MCUs). Similarly, commonly assigned International patent application PCT/US06/62569 describes a Compositing Scalable Video Coding Server (CSVCS), which has the same benefits as an SVCS but produces a single coded output bitstream. International patent application PCT/US07/80089 describes a Multicast Scalable Video Coding Server (MSVCS), which has the same benefits as an SVCS but utilizes available multicast communication channels. For convenience in the following description, the three different types of servers (SVCS, CSVCS, and MSVCS) will be commonly referred to as an SVCS, unless otherwise stated.
The scalable video coding design and the SVCS architecture can be used in further advantageous ways, which are described, for example, in commonly assigned International patent applications PCT/US06/028367, PCT/US06/027368, PCT/US06/061815, PCT/US07/062357, and PCT/US07/063335. These applications describe use of scalable coding techniques and SVCS architectures for effective trunking between servers, reduced jitter buffer delay, error resilience and random access, “thinning” of scalable video bitstreams to improve coding efficiency with reduced packet loss, and rate control, respectively. Further, commonly assigned International patent application PCT/US07/65554 describes techniques for transcoding between scalable video coding formats and other formats.
The hierarchical coding process in a typical scalable video coding system follows a pyramidal design. A first base layer is constructed using a baseline encoding technique, suitable for single-layer coding. In the case of SVC, the base layer is encoded using H.264 AVC. Encoding the base layer in this way has the benefit that the lowest scalability layer is backwards compatible with systems that are able to process only non-scalable video. Additional layers (referred to as enhancement layers) are constructed by further encoding of the difference between the original signal and the decoded output of a lower layer. The process is similar to successive approximation of the original signal.
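The successive-approximation idea can be sketched with plain scalar quantization standing in for a real codec. This is an illustrative assumption only: an actual base layer would be coded with H.264 AVC, and the function names, sample values, and step sizes below are hypothetical.

```python
def quantize(values, step):
    """Map each value to an integer quantization level."""
    return [round(v / step) for v in values]


def dequantize(levels, step):
    """Map quantization levels back to sample values."""
    return [level * step for level in levels]


def encode_layers(signal, steps):
    """Encode `signal` as one layer per quantizer step, coarsest first.

    The first layer plays the role of the base layer; each later layer
    codes the difference between the original signal and the decoded
    output of the layers below it.
    """
    layers = []
    reconstruction = [0.0] * len(signal)
    for step in steps:
        residual = [s - r for s, r in zip(signal, reconstruction)]
        levels = quantize(residual, step)
        layers.append(levels)
        decoded = dequantize(levels, step)
        reconstruction = [r + d for r, d in zip(reconstruction, decoded)]
    return layers


def decode_layers(layers, steps, count):
    """Reconstruct the signal from the lowest `count` layers."""
    reconstruction = [0.0] * len(layers[0])
    for levels, step in zip(layers[:count], steps[:count]):
        decoded = dequantize(levels, step)
        reconstruction = [r + d for r, d in zip(reconstruction, decoded)]
    return reconstruction


def max_error(signal, reconstruction):
    """Largest absolute sample error of a reconstruction."""
    return max(abs(s - r) for s, r in zip(signal, reconstruction))


SIGNAL = [10, 23, 37, 52]  # assumed sample values
STEPS = [16, 4, 1]         # assumed quantizer steps, coarse to fine
LAYERS = encode_layers(SIGNAL, STEPS)

if __name__ == "__main__":
    for count in range(1, len(STEPS) + 1):
        decoded = decode_layers(LAYERS, STEPS, count)
        print(count, decoded, max_error(SIGNAL, decoded))
```

Decoding more layers in this sketch never increases the maximum error, which mirrors the way each enhancement layer refines the output of the layers beneath it.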
Each additional enhancement layer improves the fidelity of the coded signal in one of three possible fidelity dimensions: temporal, quality (or SNR), or spatial. Temporal enhancement, when added to the base layer, provides a representation of the signal with a higher number of pictures or frames per second (fps). In quality scalability, neither the spatial nor the temporal resolution is changed; rather, the residual coding error is further encoded with finer quantization. Since finer quantization will result in a higher SNR, this form of scalability is often referred to as SNR scalability. SNR scalability is further subdivided into Coarse Grain and Fine Grain Scalability (CGS and FGS, respectively). The difference between the two is that in FGS an embedded coding scheme is used to encode the residual coding error, thus allowing improvement of the lower layer SNR even if only a fraction of the enhancement layer is used in the decoding process. The more bits of the FGS layer that are available, the greater the improvement of the lower layer's SNR. For this reason, the technique is also referred to as “progressive refinement.” In CGS, the entire enhancement layer normally has to be available during the decoding process. Finally, spatial enhancement provides a representation of the signal at a higher spatial resolution (e.g., CIF vs. QCIF). It is noted that in order to construct the residual coding error, i.e., the difference between the original and the decoded output of the lower layer, the output of the lower layer has to be upsampled to the resolution of the original.
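The upsampling step required for spatial enhancement can be sketched on a single row of samples. The sketch assumes pair averaging for downsampling and sample repetition for upsampling, and it omits quantization entirely so that reconstruction is exact; real codecs use more elaborate filters, and all function names and sample values here are hypothetical.

```python
def downsample(row):
    """Halve the resolution by averaging neighboring pairs (even length assumed)."""
    return [(row[i] + row[i + 1]) / 2 for i in range(0, len(row), 2)]


def upsample(row):
    """Double the resolution by repeating each sample."""
    out = []
    for value in row:
        out.extend([value, value])
    return out


def spatial_layers(original):
    """Return (base, residual) for a two-layer spatial hierarchy.

    The base layer is the low-resolution signal. The residual is the
    difference between the original and the base layer upsampled to the
    resolution of the original.
    """
    base = downsample(original)
    predicted = upsample(base)
    residual = [o - p for o, p in zip(original, predicted)]
    return base, residual


def reconstruct(base, residual):
    """Rebuild the full-resolution row from the base layer and residual."""
    return [p + r for p, r in zip(upsample(base), residual)]


ORIGINAL = [10, 12, 20, 24, 30, 30, 8, 4]  # assumed sample row
BASE, RESIDUAL = spatial_layers(ORIGINAL)

if __name__ == "__main__":
    print(BASE)
    print(RESIDUAL)
    print(reconstruct(BASE, RESIDUAL))
```

A decoder that receives only `BASE` obtains the low-resolution representation; one that also receives `RESIDUAL` recovers the full-resolution row.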
High coding efficiency in video coding is achieved by the utilization of effective models for predicting picture content, coupled with appropriate transformation of the resulting prediction errors (e.g., using the discrete cosine transform or integer approximations), quantization, and entropy coding of the resulting quantization levels and side information produced by the prediction and coding process. A simple mechanism for producing multiple representations of a given video signal would be to create two or more separate encodings of corresponding subsets of the original signal. This technique is typically referred to as simulcasting. Scalable video coding achieves further coding gains compared with simulcasting by using lower layers as prediction references for the encoding of higher layers. This inter-layer prediction exploits the inherent redundancy that exists across the three dimensions of a video signal. In a scalable video encoder, lower layer data is thus made available via additional prediction mode options. These additional options give the encoder more flexibility in its task of minimizing the distortion of the coded video signal while maintaining a given bit budget. At the same time, it is noted that the additional options can make the encoding task more complex as more possibilities may be examined.
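The effect of the additional prediction mode options can be sketched as a simple mode decision. The sketch assumes the sum of absolute differences (SAD) as the only cost measure, whereas a practical encoder would also account for the bits spent on each mode; the mode names, block values, and candidate predictions are hypothetical.

```python
def sad(block, prediction):
    """Sum of absolute differences between a block and a prediction."""
    return sum(abs(b - p) for b, p in zip(block, prediction))


def choose_mode(block, candidates):
    """Return (mode, cost) for the candidate prediction with the lowest SAD."""
    costs = ((mode, sad(block, pred)) for mode, pred in candidates.items())
    return min(costs, key=lambda item: item[1])


# Assumed 4-sample block and candidate predictions, for illustration only.
BLOCK = [52, 55, 61, 66]

SINGLE_LAYER_MODES = {
    "intra_dc": [58, 58, 58, 58],  # flat prediction from neighboring samples
    "inter": [50, 50, 60, 60],     # motion-compensated previous picture
}

# A scalable encoder may additionally predict from the lower layer.
SCALABLE_MODES = dict(SINGLE_LAYER_MODES, inter_layer=[52, 54, 62, 66])

if __name__ == "__main__":
    print(choose_mode(BLOCK, SINGLE_LAYER_MODES))
    print(choose_mode(BLOCK, SCALABLE_MODES))
```

Because the scalable candidate set is a superset of the single-layer set, the best cost in this sketch can never be worse with the extra option, but every added candidate is one more prediction the encoder has to evaluate.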
An important feature of SVC is that single-loop decoding is used. This is achieved by restricting the encoder to use, for inter-layer prediction, only lower layer data that are available directly by parsing the bitstream. In other words, a decoder decoding a higher layer does not have to fully decode a lower layer (i.e., reconstruct the actual pixels) but instead needs only to parse the lower layer's bitstream data. This significantly reduces the computational requirements of decoders compared to earlier scalable coding designs such as the one used in MPEG-2.
Even with single-loop decoding, however, the encoder's task in scalable coding is computationally demanding, as for every macroblock (MB) or macroblock partition of every layer it has to arrive at a decision in terms of the prediction mode, motion vector(s), and quantizer setting. The computational demand is even more pronounced in real-time applications such as videoconferencing, where pictures have to be processed within a given amount of time and with very little delay.
Consideration is now being given to improving scalable video coding systems so that computationally efficient encoding can be performed. In particular, attention is being directed to improving coding efficiency by appropriate signaling of prediction modes in the coded video signal.