The goal of Multiple Description Coding (as described e.g. in V. K. Goyal, “Multiple Description Coding: Compression Meets the Network”, IEEE Signal Processing Magazine, September 2001, pp. 74-93) is to create several independent bitstreams using an existing video codec (i.e. coder-decoder). The bitstreams can be decoded independently or jointly: the larger the number of bitstreams decoded, the higher the quality of the output video signal.
Multiple Description Coding (MDC) requires a pre-processing stage upstream of the encoder, to split the video sequence and control the redundancy among subsequences. It also requires a post-processing stage downstream of the decoder, to merge the received and successfully decoded substreams. Multiple Description Coding greatly improves error resilience, because each bitstream can be decoded independently. Variable bandwidth/throughput requirements can also be managed by transmitting a suitable number of descriptions. However, coding efficiency is reduced to an extent that depends on the amount of redundancy left among subsequences.
Multiple Description Coding is essentially analogous to Scalable Coding (also known as Layered Coding). The difference lies in the dependency among bitstreams. The simplest case is when two bitstreams are created. In the case of Scalable Coding they are referred to as “base layer” and “enhancement layer”, respectively. The enhancement layer depends on the base layer and cannot be decoded independently of it. In the case of Multiple Description Coding, on the other hand, each description can be decoded on its own to obtain a base-quality video. As with Scalable Coding, there can be spatial, temporal or SNR (Signal-to-Noise Ratio) Multiple Descriptions (MD).
Replicated headers/syntax and replicated motion vectors among bitstreams greatly impede coding efficiency in SNR MD. Replicated headers/syntax also hinder temporal MD, and motion compensation is less effective there because of the increased temporal distance between frames. Spatial MD is also penalized by replicated headers/syntax. Unlike temporal MD, however, motion compensation is not affected, particularly when 8×8 blocks are split into smaller blocks, as in the latest H.264 codec. For this reason, spatial MD coding is usually regarded as the best choice for video coding.
The underlying video codec can be either one of the traditional codecs based on the DCT (Discrete Cosine Transform) and motion compensation (e.g. MPEG-x, H.26x), or one of the more recent codecs based on the 3D wavelet transform (e.g. SPIHT). Several schemes exist: overlapping quantization (MDSQ or MDVQ), correlated predictors, overlapped orthogonal transforms, correlating linear transforms (MDTC, e.g. PCT, the pairwise correlating transform for 2 MD), correlating filter banks, interleaved spatial-temporal sampling (e.g. video redundancy coding in H.263/H.263+), spatial-temporal polyphase downsampling (PDMD, see below), domain-based partitioning (in the signal domain or in a transform domain), and FEC-based MDC (e.g. using Reed-Solomon codes).
A simple scheme for SNR MD is the coding of independent video streams created by means of MD quantizers, either scalar or vector (MDSQ, MDVQ). The structure of the MD quantizer controls the redundancy. A simple scheme for Spatial/Temporal MD is the coding of independent video streams created by means of Spatial or Temporal Polyphase Downsampling (PDMD). A programmable Spatial or Temporal low-pass filter controls the redundancy.
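To make the idea of an MD quantizer concrete, here is a minimal sketch for two descriptions. It assumes the simplest possible structure: two uniform scalar quantizers with the same step, the second offset by half a step (staggered quantizers). This is not the general index-assignment MDSQ, and all function names are hypothetical. Each description alone locates a value within a cell of width step. Both descriptions together locate it within the half-step overlap of their cells.

```python
import math


def quantizer_cell(x, step, offset):
    """Index of the uniform cell [offset + i*step, offset + (i+1)*step) that contains x."""
    return math.floor((x - offset) / step)


def cell_bounds(index, step, offset):
    lo = offset + index * step
    return lo, lo + step


def md_encode(x, step):
    # Two descriptions: same step size, second quantizer staggered by half a step.
    return quantizer_cell(x, step, 0.0), quantizer_cell(x, step, step / 2)


def side_decode(index, step, offset):
    # Only one description received: reconstruct at the midpoint of its cell.
    lo, hi = cell_bounds(index, step, offset)
    return (lo + hi) / 2


def central_decode(i1, i2, step):
    # Both descriptions received: the two cells overlap on a half-step interval,
    # so reconstruct at the midpoint of that overlap.
    lo1, hi1 = cell_bounds(i1, step, 0.0)
    lo2, hi2 = cell_bounds(i2, step, step / 2)
    return (max(lo1, lo2) + min(hi1, hi2)) / 2


step = 1.0
i1, i2 = md_encode(0.3, step)              # (0, -1)
print(side_decode(i1, step, 0.0))          # 0.5  (error 0.2, at most step/2)
print(side_decode(i2, step, step / 2))     # 0.0  (error 0.3, at most step/2)
print(central_decode(i1, i2, step))        # 0.25 (error 0.05, at most step/4)
```

In this staggered design the descriptions are highly redundant. Each one costs as many bits as a single quantizer with the same step, yet together they only halve the cell size. Index-assignment MDSQ designs trade side quality against central quality to reduce that redundancy; this sketch does not attempt that.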
As an example, Temporal MD can be achieved by separating odd and even frames, creating two subsequences. Alternatively, odd and even fields can be separated. Spatial MD is achieved by separating the pixels of 2×1 blocks, so that two subsequences are created. Alternatively, four subsequences can be created by separating the pixels of 2×2 blocks. The two techniques can be combined. Each subsequence is then fed into a standard video encoder.
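The pre-processing (split) and post-processing (merge) stages for these polyphase schemes can be sketched as follows. The sketch makes several assumptions. A frame is a plain list of pixel rows. A "2×1 block" is taken to mean two horizontally adjacent pixels. The function names are hypothetical. The sketch shows only the sample rearrangement and does not include any encoder.

```python
def temporal_split(frames):
    # Temporal MD: even-indexed frames -> description 0, odd-indexed -> description 1.
    return [frames[0::2], frames[1::2]]


def temporal_merge(even, odd):
    # Re-interleave the two frame subsequences in display order.
    return [even[i // 2] if i % 2 == 0 else odd[i // 2]
            for i in range(len(even) + len(odd))]


def spatial_split(frame, bh, bw):
    # Spatial MD: pixel (dy, dx) of every bh x bw block goes to description dy*bw + dx.
    # (bh, bw) = (1, 2) gives two descriptions, (2, 2) gives four.
    return [[row[dx::bw] for row in frame[dy::bh]]
            for dy in range(bh) for dx in range(bw)]


def spatial_merge(descs, bh, bw, height, width):
    # Put every pixel of every description back at its original position.
    frame = [[0] * width for _ in range(height)]
    for d, sub in enumerate(descs):
        dy, dx = divmod(d, bw)
        for r, row in enumerate(sub):
            for c, value in enumerate(row):
                frame[r * bh + dy][c * bw + dx] = value
    return frame


frame = [[r * 10 + c for c in range(4)] for r in range(4)]
four = spatial_split(frame, 2, 2)
print(four[3])                                  # [[11, 13], [31, 33]]
print(spatial_merge(four, 2, 2, 4, 4) == frame)  # True
```

When all descriptions arrive, the merge is lossless apart from coding error. When some descriptions are missing, a real post-processor would have to conceal the missing pixels or frames, for example by interpolating from the received neighbours; this sketch leaves them at 0.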
Polyphase downsampling (PDMD), for instance, is based on downsampling the pixels of a picture of the video signal. The number of pixels in a row is proportional to a horizontal sampling frequency Fsh, while the number of pixels in a column is proportional to a vertical sampling frequency Fsv. FIG. 1 shows a diagram of the power spectrum, i.e. the power P as a function of frequency: given a certain sampling frequency Fs, the spectrum S of the data extends from the 0 frequency up to the Nyquist frequency, that is Fs/2. It must be understood that the spectrum S of FIG. 1 is simplified, since for pictures the spectrum is two-dimensional and extends from 0 up to Fsh/2 and from 0 up to Fsv/2.
As can be seen from FIG. 1 (a), the spectrum S is subdivided into a high frequency part HS, corresponding to a high frequency range HR, i.e. the Fs/4 . . . Fs/2 range, and a low frequency part LS, corresponding to a low frequency range LR, i.e. the 0 . . . Fs/4 range. A downsampling operation DS is performed on the spectrum S. In general, a downsampling operation discards some of the samples: performing an N:1 downsampling means that only one sample out of every N survives. In the frequency domain, the downsampling operation corresponds to folding the spectrum around the new Nyquist frequency.
By way of example, when a 2:1 downsampling operation DS is performed, as represented in FIG. 1 (b), a folded spectrum Sf is generated and the sampling frequency is halved from Fs to Fs/2. Therefore the Nyquist frequency is reduced from Fs/2 down to Fs/4. The high frequency part HS of the spectrum S, which lay in the high frequency range HR, is folded into the low frequency range LR, i.e. the 0 . . . Fs/4 range. In particular, the frequencies located near the Nyquist frequency Fs/2 in the original spectrum S are folded near the 0 frequency in the folded spectrum Sf. It must be noted that in FIG. 1 (b), as in FIG. 1 (c), which will be described in the following, two folded spectra Sf are shown, since the 2:1 downsampling operation DS originates two descriptions.
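The folding effect of a 2:1 downsampling can be checked numerically on a one-dimensional signal. The following sketch is not part of the described method. It uses a hypothetical test tone at 30/64 of Fs, close to the Nyquist frequency, and locates its spectral peak with a naive DFT before and after keeping every other sample. Frequencies are expressed as fractions of the original Fs.

```python
import cmath
import math


def peak_frequency(x, fs):
    """Frequency of the strongest DFT bin in 0 .. fs/2 (naive O(N^2) DFT)."""
    n = len(x)
    mags = [abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
            for k in range(n // 2 + 1)]
    k_peak = max(range(len(mags)), key=mags.__getitem__)
    return k_peak * fs / n


fs = 1.0  # frequencies are expressed as fractions of the original Fs
n = 64
# Hypothetical test signal: a tone at 30/64 * Fs = 0.46875 Fs, close to Fs/2.
x = [math.cos(2 * math.pi * 30 * t / n) for t in range(n)]
y = x[0::2]  # 2:1 downsampling DS: keep one sample out of two, new rate Fs/2

print(peak_frequency(x, fs))      # 0.46875
print(peak_frequency(y, fs / 2))  # 0.03125 = Fs/2 - 0.46875 Fs, folded next to 0
```

The component that sat just below Fs/2 reappears just above 0, mirrored about the new Nyquist frequency Fs/4.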
As another example, if a 3:1 downsampling is performed, the sampling frequency is reduced from Fs down to Fs/3 and the Nyquist frequency is reduced to Fs/6. Frequencies above Fs/6 are folded as follows: frequencies at Fs/2 = 3*Fs/6 are folded to Fs/6, and frequencies at 2*Fs/6 are folded to the 0 frequency. In general, when an N:1 downsampling is performed, the sampling frequency is reduced to Fs/N and the Nyquist frequency to Fs/(2N). The frequencies of the spectrum above Fs/(2N) are folded into the allowed range 0 . . . Fs/(2N): frequencies at n*Fs/(2N), where n is an odd integer, are thus placed at Fs/(2N), while frequencies at n*Fs/(2N), where n is an even integer, are placed at 0.
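The general N:1 folding rule can be written as a small closed-form function. The sketch below assumes a real-valued signal, whose spectrum is symmetric, and a component frequency f between 0 and Fs/2. The function name is hypothetical. Aliases repeat every Fs/N and are then mirrored about the new Nyquist frequency Fs/(2N).

```python
def folded_frequency(f, fs, n):
    """Frequency at which a component at f (0 <= f <= fs/2) appears after n:1 downsampling.

    The new sampling frequency is fs/n and the new Nyquist frequency is fs/(2n).
    """
    new_fs = fs / n
    r = f % new_fs             # aliases repeat every new_fs
    return min(r, new_fs - r)  # a real signal's spectrum is mirrored about new_fs/2


# The 3:1 example from the text, with Fs = 6 (arbitrary units), so that Fs/6 = 1:
print(folded_frequency(3.0, 6.0, 3))  # 1.0 -> Fs/2 = 3*Fs/6 folds to Fs/6
print(folded_frequency(2.0, 6.0, 3))  # 0.0 -> 2*Fs/6 folds to 0
```

For f = n*Fs/(2N), the remainder modulo Fs/N is Fs/(2N) when n is odd and 0 when n is even, which reproduces the rule stated above.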
It follows from the above discussion that, when the PDMD procedure is applied, the high frequencies of the spectrum of the picture are folded over the low frequencies. When standard video codecs are used to compress the generated descriptions, the compression operation CM introduces a quantization error Qerr, as shown in FIG. 1 (c), that affects the high portion of the folded spectrum Sf. In other words, the codec quantizes the high portion of the folded spectrum Sf more coarsely than its low portion, since high frequencies are less important from a perceptual point of view. However, when the folded spectrum Sf is unfolded into an unfolded spectrum Su by a merging operation US on the decompressed descriptions at the receiver side, as shown in FIG. 1 (d), the quantization error Qerr ends up in the middle of the unfolded spectrum Su, near the Fs/4 frequency, where its effects are quite noticeable.
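Why the error lands near Fs/4 can be illustrated in one dimension. The sketch below rests on a modelling assumption that is not from the text. It treats the codec's quantization error in each of two descriptions as a pure component at that description's own Nyquist frequency, the very top of the folded spectrum. It then merges the descriptions by re-interleaving them and finds the dominant frequency of the resulting full-rate error. The function names are hypothetical.

```python
import cmath
import math


def dft_magnitudes(x):
    """Magnitudes of DFT bins 0 .. len(x)//2 (naive O(N^2) DFT; fine for a short demo)."""
    n = len(x)
    return [abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
            for k in range(n // 2 + 1)]


def merged_error_peak(n=64, amplitude=0.1):
    """Dominant frequency (as a fraction of Fs) of the error after merging two descriptions."""
    half = n // 2  # samples per description after 2:1 polyphase downsampling
    # Assumed error model: a component at each description's Nyquist frequency.
    err_even = [amplitude * (-1) ** m for m in range(half)]
    err_odd = [amplitude * (-1) ** m for m in range(half)]
    # Merging = re-interleaving the two decoded descriptions at the full rate Fs.
    merged = [0.0] * n
    merged[0::2] = err_even
    merged[1::2] = err_odd
    mags = dft_magnitudes(merged)
    peak_bin = max(range(len(mags)), key=mags.__getitem__)
    return peak_bin / n


print(merged_error_peak())  # 0.25, i.e. Fs/4: the middle of the unfolded spectrum
```

The merged error follows the pattern a, a, -a, -a, ..., which has a period of 4 samples. An error at the top of each description's band therefore becomes a mid-band error at Fs/4 after unfolding.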
Further, it must be noted that each generated description, as a result of the downsampling operation, has a folded spectrum with a significant amount of energy in its high portion. This makes the task of standard video encoders more difficult: the high-frequency coefficients will not be small after the transform, the quantized coefficients will probably not be zero, and the entropy coding of the quantized coefficients will therefore be inefficient. As a result, compression efficiency is low, i.e. the quality obtained for a given bitrate is low.
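The shift of energy toward the top of each description's band can be quantified with a simple measure. The sketch uses a DFT as a stand-in for the codec's block transform, which is an assumption, since real codecs use a DCT or similar. It measures the fraction of spectral energy between a quarter and a half of the sampling rate, before and after a 2:1 polyphase split. The test tone and the function name are hypothetical.

```python
import cmath
import math


def upper_band_energy_fraction(x):
    """Fraction of spectral energy between a quarter and a half of the sampling rate."""
    n = len(x)
    power = [abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))) ** 2
             for k in range(n // 2 + 1)]  # naive DFT, bins 0 .. n//2
    return sum(power[n // 4:]) / sum(power)


n = 64
# Hypothetical test signal: a tone at 12/64 * Fs = 0.1875 Fs, inside the low range LR.
x = [math.cos(2 * math.pi * 12 * t / n) for t in range(n)]
description = x[0::2]  # one description of a 2:1 polyphase split

print(round(upper_band_energy_fraction(x), 6))            # 0.0
print(round(upper_band_energy_fraction(description), 6))  # 1.0
```

A component that was comfortably in the low half of the original band (0.1875 Fs) sits in the upper half of the description's band (0.375 of its own rate). A transform coder would see this as large high-frequency coefficients.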
The topics considered in the foregoing form the subject of extensive technical literature, as evidenced e.g. by: P. C. Cosman, R. M. Gray, M. Vetterli, “Vector Quantization of Image Subbands: a Survey”, September 1995; Robert Swann, “MPEG-2 Video Coding over Noisy Channels”, Signal Processing and Communication Lab, University of Cambridge, March 1998; Robert M. Gray, “Quantization”, IEEE Transactions on Information Theory, vol. 44, n. 6, October 1998; Vivek K. Goyal, “Beyond Traditional Transform Coding”, University of California, Berkeley, Fall 1998; Jelena Kovačević, Vivek K. Goyal, “Multiple Descriptions: Source-Channel Coding Methods for Communications”, Bell Labs, Innovation for Lucent Technologies, 1998; Jelena Kovačević, Vivek K. Goyal, Ramon Arean, Martin Vetterli, “Multiple Description Transform Coding of Images”, Proceedings of IEEE Conf. on Image Proc., Chicago, October 1998; Sergio Daniel Servetto, “Compression and Reliable Transmission of Digital Image and Video Signals”, University of Illinois at Urbana-Champaign, 1999; Benjamin W. Wah, Xiao Su, Dong Lin, “A survey of error-concealment schemes for real-time audio and video transmission over internet”, Proceedings of IEEE International Symposium on Multimedia Software Engineering, December 2000; John Apostolopoulos, Susie Wee, “Unbalanced Multiple Description Video Communication using Path Diversity”, IEEE International Conference on Image Processing (ICIP), Thessaloniki, Greece, October 2001; John Apostolopoulos, Wai-Tian Tan, Susie Wee, Gregory W. Wornell, “Modeling Path Diversity for Multiple Description Video Communication”, ICASSP, May 2002; John Apostolopoulos, Tina Wong, Wai-Tian Tan, Susie Wee, “On Multiple Description Streaming with Content Delivery Networks”, HP Labs, Palo Alto, February 2002; and John Apostolopoulos, Wai-Tian Tan, Susie J. Wee, “Video Streaming: Concepts, Algorithms and Systems”, HP Labs, Palo Alto, September 2002.