1. Field of the Invention
This invention relates to a method and apparatus for spatially scalable video compression and communication. The coding modules of scalable video compression impacted by this invention include resampling, prediction, quantization, and entropy coding.
2. Description of the Related Art
(Note: This application references a number of different publications as indicated throughout the specification by one or more reference numbers within brackets, e.g., [x]. A list of these different publications ordered according to these reference numbers can be found below in the section entitled “References.” Each of these publications is incorporated by reference herein.)
Scalable Video Coding (SVC) is an important technology extending the capabilities of video compression systems and standards. For example, it is the focus of the Annex G extension of the H.264/MPEG-4 AVC video compression standard. In SVC, a video sequence is encoded into a single bit-stream comprised of multiple layers with progressively higher spatial (screen size), temporal (frame rate), or quality (signal-to-noise or SNR) resolutions:
Spatial (display resolution/definition) scalability: video is coded at multiple spatial resolutions. The data and reconstructed samples of lower resolution layers can be used to predict data or samples at higher resolutions, in order to reduce the incremental bit rate needed to code the higher resolution layers.
Temporal (frame rate) scalability: the motion compensation dependencies are structured so that complete frames (i.e., the corresponding packets) can be dropped from the bitstream.
Quality (SNR) scalability: video may be coded at a single spatial resolution but at different levels of reconstruction quality. The data and reconstructed samples of lower quality layers can be used to predict data or samples at higher qualities, in order to reduce the incremental bit rate needed to code the higher quality layers.
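By way of illustration only, the frame-dropping property of temporal scalability can be sketched as follows. The dyadic layer assignment and the names temporal_layer and extract_substream are assumptions made for this sketch and are not taken from the standard or from the cited references.

```python
def temporal_layer(frame_index, num_layers):
    """Assign a temporal layer under an ASSUMED dyadic hierarchy.

    Layer 0 holds every 2**(num_layers - 1)-th frame; each higher layer
    doubles the frame rate. This structure is illustrative only.
    """
    period = 1 << (num_layers - 1)
    pos = frame_index % period
    if pos == 0:
        return 0
    layer = num_layers - 1
    # Each factor of two in the position moves the frame one layer down.
    while pos % 2 == 0:
        pos //= 2
        layer -= 1
    return layer


def extract_substream(frame_indices, num_layers, max_layer):
    """Keep only frames whose temporal layer does not exceed max_layer."""
    return [f for f in frame_indices
            if temporal_layer(f, num_layers) <= max_layer]


frames = list(range(8))
quarter_rate = extract_substream(frames, num_layers=3, max_layer=0)
half_rate = extract_substream(frames, num_layers=3, max_layer=1)
full_rate = extract_substream(frames, num_layers=3, max_layer=2)
```

In this sketch a frame is assumed to depend only on frames of the same or a lower layer, so discarding the higher layers leaves a decodable sequence at a reduced frame rate.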
The higher resolution layers typically benefit from differential coding relative to lower layers via inter-layer prediction. This yields significant bit-rate reduction as well as enhanced streaming flexibility, without the need to retain multiple independent bit-streams, each at a different spatial, temporal, or quality resolution. Thus, SVC is an attractive solution for multimedia streaming and storage in modern network infrastructures serving decoders of diverse display resolutions and channel capacities [1].
To better appreciate the shortcomings of the state of the art, some relevant background information regarding prior art in compression and networking technologies and, in particular, scalable video compression technology, is provided.
A wide range of multimedia applications, such as handheld playback devices, internet radio and television, online media streaming, gaming, and high fidelity teleconferencing, rely heavily on advances in video compression. Their success and proliferation have greatly benefited from current video coders, including the H.264/AVC standard.
H.264/AVC
H.264/AVC is a video compression codec that is widely deployed in today's market. It divides every frame into a grid of rectangular blocks, whose sizes vary from 4×4 to 16×16. Each block can be predicted either from previously reconstructed boundary pixels of the same frame (intra-frame mode), or from pixel blocks of previously reconstructed frames (inter-frame mode). The prediction error (or residual) block undergoes spatial transformation by the discrete cosine transform (DCT) to output a block of transform coefficients, which are then quantized. The quantization indices are entropy coded for transmission. A common entropy coder, called context-based adaptive binary arithmetic coding (CABAC), employs an adaptive probability model, conditioned on block size, prediction mode, and the spatially neighboring quantization indices, to compress the quantization indices of the current block.
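By way of illustration only, the predict, transform, and quantize steps described above can be sketched as follows. The sketch assumes a floating-point orthonormal DCT-II and a uniform quantizer with a single step size; the H.264/AVC standard itself specifies an integer approximation of the DCT, and the names dct_2d and encode_block are hypothetical.

```python
import math


def dct_2d(block):
    """Orthonormal 2-D DCT-II of a square block given as a list of rows."""
    n = len(block)

    def alpha(k):
        return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)

    out = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            s = 0.0
            for x in range(n):
                for y in range(n):
                    s += (block[x][y]
                          * math.cos((2 * x + 1) * u * math.pi / (2 * n))
                          * math.cos((2 * y + 1) * v * math.pi / (2 * n)))
            out[u][v] = alpha(u) * alpha(v) * s
    return out


def encode_block(original, prediction, step):
    """Return quantization indices for the residual of one block."""
    # Residual: difference between the source block and its prediction.
    residual = [[o - p for o, p in zip(row_o, row_p)]
                for row_o, row_p in zip(original, prediction)]
    coefficients = dct_2d(residual)
    # Uniform quantization (ASSUMED): nearest integer multiple of step.
    return [[round(c / step) for c in row] for row in coefficients]


original = [[104] * 4 for _ in range(4)]
prediction = [[100] * 4 for _ in range(4)]
indices = encode_block(original, prediction, step=4.0)
```

Here a flat residual of 4 concentrates all of its energy in the DC coefficient, so only indices[0][0] is non-zero. The entropy coding stage is omitted from the sketch.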
H.264/AVC Scalable Video Coding Extension (H.264/SVC)
A spatial SVC scheme comprises downsampling a high resolution video sequence to a lower resolution, and coding the two resolutions into separate layers. The lower resolution signal is coded into a base layer via a standard H.264/AVC codec, while the enhancement layer encodes the information necessary to reconstruct the sequence at a higher spatial resolution than the base layer. At the enhancement layer, the current video frame can be predicted from a combination of its reconstruction at the base layer and a motion-compensated reference from previously coded enhancement layer frames. For instance, in the multi-loop design [14], employed in a variety of existing codecs, the prediction mode is selected from the two sources such that the rate-distortion cost is minimized. More details on existing spatial SVC approaches are provided in [2]. Note that the encoder effectively subsumes a decoder to generate the reconstructions of the base layer and prior enhancement layer frames. Therefore, once the bitstream is received, a decoder can generate the same prediction, given the encoding decisions transmitted in the bitstream, and using the same reconstructions of the base layer and prior frames as were used by the encoder.
Single-Loop Prediction in H.264/SVC Standard
The standard SVC coder spatially downsamples the original input sequence, and the resultant lower dimension frames are coded by a standard single-layer codec into the base layer. The choice of the down-sampler is not part of the standard, and commonly employed strategies include, for example, the windowed sinc filter and pixel decimation. The enhancement layer prediction of the standard codec follows the single-loop design [2], where the prediction modes include inter-frame motion compensation, a sum of the motion-compensated reference and the upsampled reconstruction of the base layer residual, or only the upsampled base layer reconstruction (when the base layer block is intra-coded). The encoder selects, per block, the mode that minimizes the rate-distortion cost among all possible modes.
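By way of illustration only, the per-block mode decision can be sketched as follows. The Lagrangian form of the cost, J = D + λR, is an assumption commonly made in encoder implementations rather than a requirement stated above, and the mode names, distortion values, and rate values are hypothetical.

```python
def select_mode(candidates, lam):
    """Return the mode with the lowest ASSUMED Lagrangian cost D + lam * R.

    candidates maps a mode name to a (distortion, rate_in_bits) pair.
    """
    return min(candidates,
               key=lambda mode: candidates[mode][0] + lam * candidates[mode][1])


# Hypothetical measurements for one enhancement layer block.
candidates = {
    "inter": (100.0, 20),
    "inter_plus_base_residual": (60.0, 35),
    "base_intra_upsampled": (90.0, 10),
}
low_lambda_choice = select_mode(candidates, lam=1.0)
high_lambda_choice = select_mode(candidates, lam=5.0)
```

A larger multiplier penalizes rate more heavily, so in this sketch the choice shifts from the lowest-distortion mode to the cheapest mode as the multiplier grows.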
An illustration of the process is provided by FIG. 2, which shows the enhancement layer 201 and base layer 202 of frame n−1 203 and frame n 204. To encode block 213 at the enhancement layer 201, the coder performs a motion search over previously reconstructed frames in the same layer to generate a motion-compensated reference block 211. It then calculates the position of the base layer block 214 obtained by downsampling the region 212. A separable four-tap polyphase interpolation filter 221, in conjunction with the deblocking operation, is employed in the standard to upsample the base layer reconstruction of 214 to a block 215 of the same spatial dimension as 212. The subblock 216 in the resultant interpolation is collocated with block 213. Either block 211 or block 216 could be used as the enhancement layer prediction, and both are tested by the encoder to find the one that minimizes the rate-distortion cost. Here, for the purpose of illustration, we have implicitly assumed that the base layer block 214 is intra-coded. If block 214 were instead inter-coded, the decoded residuals for the block would be interpolated and summed with reference block 211 to obtain yet another optional prediction for block 213. A more detailed reference on the single-loop design can be found in [2].
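By way of illustration only, the construction of the candidate predictions described above can be sketched in one dimension as follows. The sketch substitutes a simple two-tap linear interpolator for the four-tap polyphase filter of the standard and omits deblocking; the names upsample_2x and candidate_predictions are hypothetical.

```python
def upsample_2x(samples):
    """Double the length of a 1-D signal by linear interpolation.

    This is an ASSUMED stand-in for the standard's four-tap polyphase
    filter; the last sample is repeated at the right edge.
    """
    out = []
    for i, s in enumerate(samples):
        nxt = samples[min(i + 1, len(samples) - 1)]
        out.append(s)
        out.append((s + nxt) / 2)
    return out


def candidate_predictions(mc_reference, base_recon, base_residual,
                          base_is_intra):
    """Build the enhancement layer prediction candidates for one block."""
    # The motion-compensated reference is always available.
    candidates = {"inter": list(mc_reference)}
    if base_is_intra:
        # Intra-coded base block: its upsampled reconstruction is a candidate.
        candidates["base_upsampled"] = upsample_2x(base_recon)
    else:
        # Inter-coded base block: add the upsampled base layer residual
        # to the motion-compensated reference.
        up_res = upsample_2x(base_residual)
        candidates["inter_plus_base_residual"] = [
            m + r for m, r in zip(mc_reference, up_res)]
    return candidates


intra_case = candidate_predictions([1, 2, 3, 4], [10, 20], [2, 4], True)
inter_case = candidate_predictions([1, 2, 3, 4], [10, 20], [2, 4], False)
```

The encoder would then evaluate each candidate and retain the one with the lowest rate-distortion cost.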
Multi-Loop Prediction in SVC
Another popular alternative is the multi-loop design where, in addition to the modes available in the single-loop design, the base layer reconstructed pixels could be used for enhancement layer prediction even when the base layer block is inter-coded. In other words, the multi-loop design requires full reconstruction of the base layer at the decoder, while the single-loop design could forgo various base layer operations if only the enhancement layer reconstruction is desired. In [4] a variant of the multi-loop design was proposed where enhancement layer prediction employs one of the following modes:
Inter-frame prediction from a motion compensated enhancement layer reference;
Intra-frame prediction from spatially neighboring reconstructed pixels;
Pyramid prediction, or subband prediction (a linear combination of the high-pass filtered motion-compensated enhancement layer reference and the upsampled base layer reconstruction). Effectively, the subband prediction mode uses the base layer reconstruction as prediction for low frequency transform coefficients, and the motion-compensated enhancement layer reference as prediction for high frequency transform coefficients.
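By way of illustration only, the subband prediction mode can be sketched in one dimension as follows. The pair-averaging downsampler, the sample-repetition upsampler, and the unit combination weights are assumptions chosen for simplicity; they are not the filters of [4], and the function names are hypothetical.

```python
def downsample_2x(samples):
    """Average adjacent pairs (ASSUMED low-pass filter plus decimation)."""
    return [(samples[i] + samples[i + 1]) / 2
            for i in range(0, len(samples) - 1, 2)]


def upsample_2x(samples):
    """Repeat every sample twice (ASSUMED interpolator)."""
    out = []
    for s in samples:
        out.extend([s, s])
    return out


def subband_prediction(mc_reference, base_recon):
    """Low band from the base layer, high band from the reference."""
    # High-pass part of the reference: what is lost by down/upsampling.
    low = upsample_2x(downsample_2x(mc_reference))
    high = [r - l for r, l in zip(mc_reference, low)]
    # Low-pass part: the upsampled base layer reconstruction.
    up_base = upsample_2x(base_recon)
    return [b + h for b, h in zip(up_base, high)]


mc_reference = [10, 14, 20, 28]   # even length is assumed
base_recon = [11, 25]
prediction = subband_prediction(mc_reference, base_recon)
```

With these particular filters, downsampling the resulting prediction returns the base layer reconstruction exactly, while the sample-to-sample detail is inherited from the motion-compensated reference.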
The approach in [4] is reported to provide notable gains over single-loop prediction. In both approaches, multi-loop prediction and single-loop prediction, encoding decisions such as the prediction mode (inter-frame, pyramid prediction, etc.) are transmitted in the bitstream, and a decoder generates the same enhancement layer prediction as the encoder by combining or selecting reconstructions in the same way as the encoder.
Details regarding the prediction tools in the H.264/SVC standard and other leading competitors are described in further detail in the provisional applications cross-referenced above and incorporated by reference herein. Note that none of the above described prediction schemes in SVC fully utilizes all the information available for enhancement layer prediction. For instance, these prediction modes do not exploit information available from the base layer due to the workings of its quantization operation, which determines an interval in which the transform coefficient must lie. This interval information encapsulates all base layer information on the transform coefficient, and hence all the information made available by the base layer for enhancement layer prediction. Note, in particular, that downsampling, upsampling, and prediction are performed in the pixel domain, thus precluding any attempt to optimally utilize such interval information, which is only accessible in the transform domain.
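By way of illustration only, the interval implied by a quantization index can be sketched as follows. The sketch assumes a uniform quantizer with rounding to the nearest level and no dead zone, which is a simplification of practical quantizers; the names quantize and quantization_interval are hypothetical.

```python
import math


def quantize(coefficient, step):
    """Uniform quantizer (ASSUMED): index of the nearest multiple of step."""
    return math.floor(coefficient / step + 0.5)


def quantization_interval(index, step):
    """Half-open interval [low, high) of coefficients mapping to index."""
    return ((index - 0.5) * step, (index + 0.5) * step)


step = 4.0
coefficient = 13.0                  # true transform coefficient (encoder side)
index = quantize(coefficient, step)
low, high = quantization_interval(index, step)
```

A decoder that knows only the index and the step size can still conclude that the original coefficient lies between low and high, which is the interval information referred to above.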