Conventional image and video coding schemes, such as schemes according to the MPEG and ITU series of video coding standards, are well suited for broadcast video and stored media distribution in which there are a huge number of low-complexity receivers (TVs) with decoders, but only a few high-complexity transmitters with encoders.
With such video distribution models, computationally demanding motion estimation techniques are employed in the encoder to exploit temporal correlation among video frames. That process of exploiting temporal redundancy before transmission yields excellent compression efficiency.
FIG. 1 shows a conventional encoder 100 with motion estimation 110. An input video 101 is processed one block at a time. A motion estimator 110 determines a best matching block of a reference frame stored in a frame memory 111 for a current block to be encoded. This best matching block serves as a prediction of the current block. A corresponding motion vector 112 is entropy encoded 150. A difference 120 between the current block of the input video and the predicted block 121, which is generated by a motion-compensated predictor 130, is obtained. The difference signal then undergoes a transform/quantization process 140 to yield a set of quantized transform coefficients 141. These coefficients are entropy encoded 150 to yield a compressed bitstream 109. Performing an inverse transform/quantization 160 on the quantized transform coefficients 141 and adding 170 this result to the motion compensated prediction 121 generates the reference frame, which is stored in the frame memory 111 and used for predicting 130 successive frames of the input video 101. The output bitstream 109 is generated based on the entropy encoding 150 of motion 112 and texture 141 information.
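By way of illustration only, the search performed by the motion estimator 110 can be sketched as an exhaustive block-matching search. The sum-of-absolute-differences cost, the search range, and the names sad and best_motion_vector are assumptions made for this sketch; the description above does not specify a matching criterion.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between the n x n current block at
    (bx, by) and the reference block displaced by (dx, dy)."""
    total = 0
    for y in range(n):
        for x in range(n):
            total += abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
    return total


def best_motion_vector(cur, ref, bx, by, n=4, search=2):
    """Exhaustive search over displacements in [-search, search].

    Assumes the current block lies inside the frame, so the zero
    displacement is always a valid candidate.
    Returns ((dx, dy), cost) for the lowest-cost candidate.
    """
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            # Skip candidates that fall outside the reference frame.
            if by + dy < 0 or by + dy + n > height:
                continue
            if bx + dx < 0 or bx + dx + n > width:
                continue
            cost = sad(cur, ref, bx, by, dx, dy, n)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    cost, dx, dy = best
    return (dx, dy), cost
```

The nested loops over every displacement and every pixel indicate why this step dominates the complexity of a conventional encoder.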
FIG. 2 shows a conventional decoder 200. An input bitstream 201 is first subject to an entropy decoder 210 that yields both quantized transform coefficients 211 as well as corresponding motion vectors 212. The motion vectors are used by a motion compensated predictor 220 to yield a prediction signal 221. The quantized transform coefficients 211 are inverse transform/quantized 230 and added 240 to the prediction signal 221 to yield the reconstructed video 209. Frames of the reconstructed video, which are used for decoding successive frames, are stored to a frame memory 250.
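The addition 240 of the prediction signal 221 to the decoded residual can be sketched as follows. This is a simplified sketch that assumes the residual has already been inverse transformed and inverse quantized, that the motion vector has integer-pixel precision, and that samples are clipped to an 8-bit range; the function name reconstruct_block and these choices are illustrative assumptions.

```python
def reconstruct_block(ref, residual, bx, by, mv, max_value=255):
    """Add a decoded residual block to its motion-compensated prediction.

    ref      : reference frame (list of rows) from the frame memory
    residual : n x n residual block, already inverse transformed
    (bx, by) : top-left position of the block in the current frame
    mv       : (dx, dy) integer motion vector
    """
    dx, dy = mv
    n = len(residual)
    block = []
    for y in range(n):
        row = []
        for x in range(n):
            prediction = ref[by + dy + y][bx + dx + x]
            # Clip to the valid sample range (assumed 8-bit here).
            row.append(min(max_value, max(0, prediction + residual[y][x])))
        block.append(row)
    return block
```

In contrast to the encoder-side search, the decoder performs a single lookup and addition per pixel, which is why conventional decoders are inexpensive.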
The above scheme achieves excellent compression efficiency, but has considerable processing and power costs, which are not a problem in large scale commercial applications, such as film and broadcast studios with nearly unlimited resources. However, there is an increasing number of applications in which the capture and encoding of images and video is done with devices that have limited battery and processing power, and limited storage and bandwidth, e.g., cellular telephones, PDAs, environmental sensors, and simple digital cameras. Typically, these devices use a simple microprocessor or microcontroller, and batteries.
Therefore, there is a need for a low complexity encoder, which can provide good compression efficiency and high quality images at a decoder. This paradigm shift in the requirements that video applications place on video compression is described by R. Puri and K. Ramchandran in “PRISM: A New Robust Video Coding Architecture Based on Distributed Compression Principles,” Proc. 40th Allerton Conference on Communication, Control and Computing, October 2002. In that work, they apply to video coding the syndrome encoders and decoders, based on trellis codes, that were previously developed by S. S. Pradhan and K. Ramchandran, “Distributed Source Coding Using Syndromes (DISCUS): Design and Construction,” IEEE Transactions on Information Theory, vol. 49, pp. 626-643, March 2003.
FIG. 3 shows such a prior art low complexity PRISM encoder 300. An input video 301 is classified 310. The classifier estimates a degree of spatio-temporal correlation for each block in a current frame. Based on a squared error difference between the block to be encoded and the co-located block in a previous encoded frame, a class is determined. For instance, a ‘SKIP’ class indicates that the correlation is very high and the current block does not need to be encoded at all, while an ‘INTRA’ class indicates that the correlation is very low and the current block is best encoded using a conventional intra-coding scheme. For correlations between these two extremes, the prior art describes a syndrome-based coding scheme. The idea of syndrome-based coding dates back to 1974, see A. D. Wyner, “Results in the Shannon Theory,” IEEE Transactions on Information Theory, vol. 20, pp. 2-10, 1974.
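The classification 310 can be sketched as a threshold test on the squared error between the current block and the co-located block. The two threshold parameters and the single intermediate label 'SYNDROME' are assumptions made for illustration; the description above does not give threshold values or the number of intermediate classes.

```python
def classify_block(cur_block, prev_block, skip_threshold, intra_threshold):
    """Classify a block by its squared error against the co-located block.

    skip_threshold and intra_threshold are hypothetical tuning values.
    Returns 'SKIP', 'INTRA', or 'SYNDROME' (a stand-in for whatever
    intermediate classes a real classifier would distinguish).
    """
    squared_error = sum(
        (c - p) ** 2
        for cur_row, prev_row in zip(cur_block, prev_block)
        for c, p in zip(cur_row, prev_row)
    )
    if squared_error <= skip_threshold:
        return "SKIP"      # very high correlation: nothing to encode
    if squared_error >= intra_threshold:
        return "INTRA"     # very low correlation: conventional intra coding
    return "SYNDROME"      # in between: syndrome-based coding
```

Because only the co-located block is examined, no motion search is needed at the encoder.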
In the next step, a block transform 320, such as a discrete cosine transform (DCT), is applied to decorrelate the data. The transform coefficients are then subject to a zig-zag scan 330 to order the coefficients into a 1D vector of decreasing energy.
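The zig-zag scan 330 can be sketched as a traversal of the anti-diagonals of the coefficient block. The sketch assumes the block has already been transformed and follows the traversal direction commonly used in JPEG-style coders; the function names are illustrative.

```python
def zigzag_order(n):
    """Return the (row, col) positions of an n x n block in zig-zag order."""
    order = []
    for s in range(2 * n - 1):
        # All positions on the anti-diagonal row + col == s, top to bottom.
        diagonal = [(r, s - r) for r in range(n) if 0 <= s - r < n]
        # Even diagonals are traversed from bottom-left to top-right.
        if s % 2 == 0:
            diagonal.reverse()
        order.extend(diagonal)
    return order


def zigzag_scan(block):
    """Flatten a square coefficient block into a 1D list in zig-zag order."""
    return [block[r][c] for r, c in zigzag_order(len(block))]
```

Because low-frequency coefficients sit near the top-left corner of the block, they appear first in the resulting vector.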
A small fraction of the coefficients, which correspond to low-frequency coefficients 331, e.g., approximately 20% of the total coefficients, are subject to a base quantization 340. The quantized coefficients are then input to a syndrome encoder 370 to produce syndrome bits 371. In that particular scheme, a ½-rate trellis code is used for the syndrome coding. A refinement quantization 360 is performed to achieve a target quality for the coefficients that have been syndrome encoded. This operation is just a progressive sub-dividing of the base quantization interval into intervals of size equal to the target quantization step size, where an index 361 of the refinement quantization interval inside the base quantization interval is eventually transmitted to a decoder.
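The relation between the base quantization 340 and the refinement quantization 360 can be sketched with uniform integer quantizers. The sketch assumes integer coefficients, a floor-based uniform quantizer, and a base step that is an integer multiple of the target step; these choices and the function names are assumptions, not details taken from the prior art scheme.

```python
def base_and_refinement(coef, base_step, target_step):
    """Split a coefficient into a base interval index and the index of
    the target-size sub-interval inside that base interval."""
    if base_step % target_step != 0:
        raise ValueError("base_step must be a multiple of target_step")
    # divmod floors toward negative infinity, so offset is in [0, base_step).
    base_index, offset = divmod(coef, base_step)
    refine_index = offset // target_step
    return base_index, refine_index


def reconstruct(base_index, refine_index, base_step, target_step):
    """Reconstruct at the midpoint of the refinement sub-interval."""
    return base_index * base_step + refine_index * target_step + target_step / 2
```

For example, with a base step of 16 and a target step of 4, the coefficient 37 falls in base interval 2 and refinement interval 1, and is reconstructed as 38.0. In this sketch, only the base index would be passed to the syndrome encoder, while the refinement index is sent as is.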
A large fraction of the coefficients, which correspond to higher-frequency coefficients 332, e.g., the remaining 80% of coefficients, are subject to a conventional intra coding, in which the coefficients are subject to conventional quantization 350 and entropy encoding 380 operations as described above.
In addition to the above, a cyclic redundancy check (CRC) of the quantized codeword sequence is calculated by CRC generator 390 to produce CRC bits 391, which are also sent to the decoder. The CRC bits 391 are used at the decoder to determine the best predictor among several candidate predictors. The CRC bits 391 are combined 399 with the outputs from blocks 360, 370, and 380 to produce the output bitstream 309.
FIG. 4 shows the corresponding decoder 400. After deinterleaving 410 an input bitstream 401, the decoder performs motion estimation 405, which outputs a predictor 407 consisting of spatially shifted pixels from the frame memory 406. Multiple predictors with different spatial shifts are generated. A syndrome decoder 440 generates a sequence of quantized coefficients based on the received syndrome bits for each predictor. Because the syndrome encoding is based on trellis codes, a Viterbi process is used to identify the sequence of coefficients that is nearest to the candidate predictor. If the decoded coefficients match the CRC by means of the CRC check 445, then the decoding is declared to be successful. Given the decoded (syndrome) coefficients and the index of the refinement quantization interval sent by the encoder, the inverse base quantization and refinement 420 can be performed to yield a reconstructed set of low-frequency coefficients. The higher-frequency coefficients are recovered through an entropy decoding 450 and inverse quantization operation 460. Both sets of coefficients are then subject to the inverse scan 430 and inverse block transform 470 to yield the reconstructed video 409. The reconstructed frames 408 are also stored into frame memory 406 for the decoding of successive frames.
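The interplay between candidate predictors, syndrome decoding, and the CRC check 445 can be sketched with a deliberately simplified stand-in. The sketch replaces the trellis code and Viterbi process with a scalar modulo coset code, and uses the 32-bit CRC from the Python standard library; the modulus, the 2-byte serialization, and all function names are assumptions made for illustration only.

```python
import zlib


def crc_of(indices):
    """CRC-32 over the quantized indices, each serialized as 2 signed bytes."""
    data = b"".join(int(i).to_bytes(2, "big", signed=True) for i in indices)
    return zlib.crc32(data)


def encode(indices, modulus):
    """Encoder side: send only the coset (index mod modulus) and a CRC."""
    return [i % modulus for i in indices], crc_of(indices)


def coset_decode(syndromes, predictor, modulus):
    """For each position, pick the member of the signalled coset that is
    nearest to the predictor value (stand-in for the Viterbi search)."""
    decoded = []
    for s, p in zip(syndromes, predictor):
        k = round((p - s) / modulus)
        decoded.append(s + k * modulus)
    return decoded


def decode_with_candidates(syndromes, crc, candidates, modulus):
    """Try each candidate predictor until the decoded sequence matches the CRC."""
    for predictor in candidates:
        decoded = coset_decode(syndromes, predictor, modulus)
        if crc_of(decoded) == crc:
            return decoded   # decoding declared successful
    return None              # no candidate passed the CRC check
```

A predictor that is close enough to the original indices yields the correct sequence, and the CRC confirms it. A predictor that is too far away lands on the wrong coset member, and the CRC mismatch causes the decoder to move on to the next candidate.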
There are several disadvantages with the above scheme. First, a majority of the transform coefficients, i.e., the high-frequency coefficients, are encoded using conventional quantization 350 and entropy encoding 380 techniques. Complex scenes contain a substantial amount of high-frequency information. Therefore, the prior art scheme has a considerable amount of overhead and leads to loss of efficiency. Second, the prior art syndrome encoding is based on relatively small 8×8 blocks, which decreases an overall rate of compression. Third, the CRC needs to be sufficiently ‘strong’ to reliably reflect the coefficients. Not only is this an overhead for every block, but also, there is no guarantee that the decoding will perform correctly.
Another coding scheme is described by Aaron, et al., in “Towards practical Wyner-Ziv coding of video,” Proc. IEEE International Conference on Image Processing, September 2003. That scheme can operate in the pixel or transform domain.
As shown in the prior art encoder 500 of FIG. 5, an input video 501 is partitioned, using a switch 510, into two types of frames: key-frames 511 and Wyner-Ziv frames 512. The key frames are regularly spaced frames. These frames are encoded using conventional intra-frame encoding 520, e.g., DCT, quantization and entropy coding, and coded at the target quality level. The Wyner-Ziv frames 512 are subject to a scalar quantization 513 and a turbo encoder 530, which is one form of syndrome coding. The output bitstream 509 is a combination 540 of bits corresponding to both encoded key-frames and Wyner-Ziv frames. It is noted that in that prior art scheme, syndrome bits are generated only for Wyner-Ziv frames, and not for key-frames, and the intra-encoding is conventional, i.e., both low and high frequency coefficients are encoded.
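The action of the switch 510 can be sketched as a split of the frame sequence by index. The parameter key_interval, which sets the regular spacing of the key-frames, and the function name are hypothetical; the description above does not state a particular spacing.

```python
def partition_frames(frames, key_interval):
    """Split frames into regularly spaced key-frames and Wyner-Ziv frames.

    Returns two lists of (index, frame) pairs so that the original
    display order can be restored at the decoder.
    """
    if key_interval < 1:
        raise ValueError("key_interval must be at least 1")
    key_frames, wyner_ziv_frames = [], []
    for index, frame in enumerate(frames):
        if index % key_interval == 0:
            key_frames.append((index, frame))        # intra-frame encoded
        else:
            wyner_ziv_frames.append((index, frame))  # quantized, turbo encoded
    return key_frames, wyner_ziv_frames
```

A larger key_interval sends fewer intra-coded key-frames, at the cost of leaving more frames to be recovered from side information at the decoder.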
FIG. 6 shows the corresponding prior art decoder 600. The input bitstream 601 includes encoded key-frames and Wyner-Ziv frames. The encoded key-frames are decoded using an intra-frame decoder 610 to yield a reconstructed key-frame 611, while the Wyner-Ziv frames are first subject to a turbo decoder 620 to yield a set of syndrome coefficients, which then undergo a reconstruction process 630. The switch 660 selects between the reconstructed key-frames and the reconstructed Wyner-Ziv frames to yield the final reconstructed video 609. The reconstruction is based on the coefficients output by the turbo decoder, as well as interpolated 640 frame data. The reconstructed Wyner-Ziv and key-frames are stored into a frame memory 650 for the decoding of successive frames.
The two main disadvantages of that scheme are the overhead introduced in sending high-quality key-frames, as well as a delay incurred by sending future key-frames that are required for decoding past frames. In terms of conventional coding schemes, the key frames are I-frames and the Wyner-Ziv frames are analogous to B-frames. As with other conventional coding schemes, a distance between the I-frames indicates the amount of the delay. Assuming that a high delay can be tolerated, placing key frames further apart lowers the amount of overhead. However, doing so also lowers the quality of the interpolation, which, in effect, lowers the overall coding efficiency because more syndrome bits are needed to recover from errors in the interpolation.
Clearly, it is desirable to have a coding scheme with low encoding complexity, i.e., similar to intra-only coding, but with high coding efficiency, i.e., closer to that of the best inter-frame coding schemes, that overcomes the disadvantages of the prior art.