Traditional video coding algorithms such as H.264, MPEG-2, and MPEG-4 are suited to broadcast situations, where extensive encoding is done with complex equipment at the broadcast center while relatively simple decoding is done at the user end. These algorithms are less suitable for situations where the encoding is done at the user end, which cannot host a computationally expensive encoder. Examples of such situations include wireless video sensors for surveillance, wireless PC cameras, mobile camera phones, and disposable video cameras. In particular, video sensor networks have been envisioned for many applications, such as security surveillance, monitoring of disaster zones, domestic monitoring, and realistic entertainment systems involving multiple parties connected through the network. Rapidly growing video conferencing, in which a large number of parties communicate over mobile links, is another example.
The situations listed above require a distributed video coding system having a large number of low-complexity encoders but only one or a few high-complexity decoders. Wyner-Ziv coding of source video data is among the most promising distributed video coding solutions because it allows light-weight encoders to be paired with complex decoders. While Wyner-Ziv decoding is vastly more complex than conventional decoding, the corresponding Wyner-Ziv encoding may be simple.
Wyner-Ziv coding finds its origin in the Slepian-Wolf theorem, which states that two correlated independent identically distributed (i.i.d.) sequences X and Y can be encoded separately and losslessly at the same total rate as joint encoding, as long as collaborative (joint) decoding is employed. Wyner and Ziv extended this theorem to the lossy coding of continuous-valued sources. According to the Slepian-Wolf and Wyner-Ziv theorems, it is possible to exploit the correlation only at the decoder. For example, the temporal correlation in video sequences can be exploited by shifting motion estimation from the encoder to the decoder, and low-complexity video encoding is thus made possible.
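The rate statement of the Slepian-Wolf theorem can be illustrated numerically. The following is a minimal sketch under an assumed toy model that does not appear in the text above: X is a uniform binary source and Y equals X with each bit flipped independently with probability `crossover`. The function names and the model are hypothetical and chosen only for illustration.

```python
import math


def binary_entropy(p: float) -> float:
    """Entropy in bits of a Bernoulli(p) source."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)


def slepian_wolf_rates(crossover: float) -> dict:
    """Rates in bits per symbol for the assumed toy model.

    Assumption: X ~ Bernoulli(0.5) and Y = X XOR N, with N ~ Bernoulli(crossover)
    independent of X, so Y is also uniform.
    """
    h_x = 1.0                                # H(X) for a uniform binary source
    h_y = 1.0                                # H(Y), uniform as well
    h_x_given_y = binary_entropy(crossover)  # H(X|Y) = H(N)
    return {
        # Cost of coding X and Y while ignoring their correlation.
        "ignoring_correlation": h_x + h_y,
        # Joint entropy H(X,Y) = H(Y) + H(X|Y): the total rate of joint encoding,
        # which separate encoding with joint decoding can also reach.
        "joint_entropy": h_y + h_x_given_y,
        # Rate sufficient for X when Y is available only at the decoder.
        "x_with_decoder_side_info": h_x_given_y,
    }


rates = slepian_wolf_rates(0.1)
print(rates)
```

With a crossover probability of 0.1, about 0.47 bits per symbol suffice for X when Y is known at the decoder, compared with 1 bit when the correlation is ignored.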
Moreover, owing to separate encoding and joint decoding, a Wyner-Ziv bit stream can compensate for the mismatch between the encoder and the decoder, which is a key issue in a video coding system. A Wyner-Ziv coding scheme is thus desirable in distributed video coding systems such as those involving mobile devices and wireless sensor networks. The main challenge of distributed video coding using a Wyner-Ziv coding scheme is how to exploit the correlation at the decoder.
Some practical Wyner-Ziv coding methods have been proposed for video coding in the past. In existing turbo-code based Wyner-Ziv video coding schemes performed in the pixel domain, the Wyner-Ziv frame is encoded with a turbo encoder. The delivered parity bits are decoded jointly with temporal side information generated from the previously reconstructed adjacent frames. The temporal side information is obtained by interpolating the adjacent frames. Wyner-Ziv coding has also been applied in a transform domain, such as the discrete cosine transform (DCT) or wavelet domain. When Wyner-Ziv coding is applied in a transform domain such as the DCT domain, the spatial redundancy is removed by the encoder at the expense of complexity. Some extra information extracted from the original frame, such as high-frequency DCT coefficients used as hash words, is transmitted to the decoder to assist motion estimation, so that better prediction can be made and the quality of the side information can be enhanced.
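The generation of temporal side information by interpolating adjacent frames can be sketched as follows. This is a minimal illustration that assumes the simplest possible interpolation, a pixel-wise average of the two reconstructed neighboring frames with no motion compensation; the function name and the frame representation (lists of rows of integer pixel values) are hypothetical and not taken from any particular scheme.

```python
def interpolate_side_information(prev_frame, next_frame):
    """Estimate the frame between two reconstructed neighbors.

    Assumption: plain pixel-wise averaging with rounding. A practical decoder
    would more likely interpolate along estimated motion trajectories.
    """
    if len(prev_frame) != len(next_frame):
        raise ValueError("frames must have the same number of rows")
    side_info = []
    for row_prev, row_next in zip(prev_frame, next_frame):
        if len(row_prev) != len(row_next):
            raise ValueError("frames must have the same number of columns")
        # Integer average, rounding halves up.
        side_info.append([(a + b + 1) // 2 for a, b in zip(row_prev, row_next)])
    return side_info


previous = [[10, 20], [30, 40]]
following = [[20, 40], [30, 41]]
side_information = interpolate_side_information(previous, following)
print(side_information)  # [[15, 30], [30, 41]]
```

The decoder would then treat `side_information` as a noisy version of the Wyner-Ziv frame and use the received parity bits to correct it.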
Besides the turbo-code based Distributed Video Coding (DVC) schemes, a DVC system based on syndrome coding has also been proposed. In that system, the decoder selects a codeword from within the coset specified by the received syndrome bits, depending on the side information derived from the adjacent decoded frames.
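The coset selection described above can be illustrated with a small toy example. The sketch below assumes a Hamming(7,4) code and side information that differs from the source word in at most one bit; neither assumption comes from the proposed system, and all function names are hypothetical. The encoder sends only the 3-bit syndrome of a 7-bit source word, and the decoder picks the word in the indicated coset that is closest to its side information.

```python
def syndrome(bits):
    """3-bit syndrome of a 7-bit word under the Hamming(7,4) parity-check matrix.

    Column j (1-based) of the assumed parity-check matrix is the binary
    representation of j, so the syndrome is the XOR of the positions of the
    bits that are set.
    """
    if len(bits) != 7:
        raise ValueError("expected a 7-bit word")
    value = 0
    for position, bit in enumerate(bits, start=1):
        if bit:
            value ^= position
    return value


def encode(source_bits):
    """Encoder side: transmit only the syndrome (3 bits instead of 7)."""
    return syndrome(source_bits)


def decode(received_syndrome, side_info):
    """Decoder side: choose the coset member nearest to the side information.

    Assumption: side_info differs from the source word in at most one bit.
    """
    difference = received_syndrome ^ syndrome(side_info)
    estimate = list(side_info)
    if difference:
        # A nonzero difference is the 1-based position of the mismatched bit.
        estimate[difference - 1] ^= 1
    return estimate


source = [1, 0, 1, 1, 0, 0, 1]
side_information = [1, 0, 1, 1, 1, 0, 1]  # differs from source in bit 5
recovered = decode(encode(source), side_information)
print(recovered == source)  # True
```

The example works because the syndrome is linear: a single-bit mismatch at position j changes the syndrome by exactly j, which tells the decoder which bit of the side information to flip.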
In the existing Wyner-Ziv coding schemes, only a single source of side information is considered in the decoding process. Most efforts have been devoted to choosing or generating more highly correlated side information. Furthermore, most of the existing schemes focus on removing the spatial redundancy by applying a transform at the encoder. Some existing approaches model a frame as a two-dimensional stationary Markov random field and utilize the spatial correlation at the decoder. For a continuous-tone source, however, these approaches require computation on a trellis with too many states to be used in practice.