In the transmission of video streams, efforts are continually being made to reduce the amount of data that needs to be transmitted whilst still allowing the moving images to be adequately recreated at the receiving end of the transmission. A video encoder receives an input video stream comprising a sequence of “raw” video frames to be encoded, each representing an image at a respective moment in time. The encoder then encodes each input frame into one of two types of encoded frame: either an intra frame (also known as a key frame), or an inter frame. The purpose of the encoding is to compress the video data so as to incur fewer bits when transmitted over a transmission medium or stored on a storage medium.
An intra frame is compressed using data only from the current video frame being encoded, typically using intra frame prediction coding whereby one image portion within the frame is encoded and signalled relative to another image portion within that same frame. This is similar to static image coding. An inter frame on the other hand is compressed using knowledge of a preceding frame (a reference frame) and allows for transmission of only the differences between that reference frame and the current frame which follows it in time. This allows for much more efficient compression, particularly when the scene has relatively few changes. Inter frame prediction typically uses motion estimation to encode and signal the video in terms of motion vectors describing the movement of image portions between frames, and then motion compensation to predict that motion at the receiver based on the signalled vectors. Various international standards for video communications such as MPEG 1, 2 & 4, and H.261, H.263 & H.264 employ motion estimation and compensation based on regular block-based partitions of source frames. Depending on the resolution, frame rate, bit rate and scene, an intra frame can be 20 to 100 times larger than an inter frame. On the other hand, an inter frame depends on the chain of preceding frames back to and including the most recent intra frame. If any of those frames is missing, decoding the current inter frame may result in errors and artefacts.
These techniques are used for example in the H.264/AVC standard (see T. Wiegand, G. J. Sullivan, G. Bjontegaard, A. Luthra: “Overview of the H.264/AVC video coding standard,” in IEEE Transactions on Circuits and Systems for Video Technology, Volume: 13, Issue: 7, page(s): 560-576, July 2003).
FIG. 7 illustrates a known video encoder for encoding a video stream into a stream of inter frames and interleaved intra frames, e.g. in accordance with the basic coding structure of H.264/AVC. The encoder receives an input video stream comprising a sequence of frames to be encoded (each divided into constituent macroblocks and subdivided into blocks), and outputs quantized transform coefficients and motion data which can then be transmitted to the decoder. The encoder comprises an input 70 for receiving an input macroblock of a video image, a subtraction stage 72, a forward transform stage 74, a forward quantization stage 76, an inverse quantization stage 78, an inverse transform stage 80, an intra frame prediction coding stage 82, a motion estimation & compensation stage 84, and an entropy encoder 86.
The subtraction stage 72 is arranged to receive the input signal comprising a series of input macroblocks, each corresponding to a portion of a frame. From each, the subtraction stage 72 subtracts a prediction of that macroblock so as to generate a residual signal (also sometimes referred to as the prediction error). In the case of intra prediction, the prediction of the block is supplied from the intra prediction stage 82 based on one or more neighbouring regions of the same frame (after feedback via the inverse quantization stage 78 and inverse transform stage 80). In the case of inter prediction, the prediction of the block is provided from the motion estimation & compensation stage 84 based on a selected region of a preceding frame (again after feedback via the inverse quantization stage 78 and inverse transform stage 80). For motion estimation the selected region is identified by means of a motion vector describing the offset between the position of the selected region in the preceding frame and the position of the macroblock being encoded in the current frame.
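By way of illustration only, the operation of the subtraction stage in the inter prediction case may be sketched as follows. This is a minimal sketch, not the encoder of FIG. 7: frames are assumed to be lists of rows of integer pixel values, the motion vector is assumed to be an integer (dy, dx) offset, and the names `get_block` and `residual` are hypothetical.

```python
def get_block(frame, top, left, size):
    """Return a size x size block of pixel values from a frame (list of rows)."""
    return [row[left:left + size] for row in frame[top:top + size]]


def residual(current, reference, top, left, size, mv):
    """Subtract the motion-compensated prediction from the current block.

    mv = (dy, dx) is the assumed integer offset from the block's position in
    the current frame to the selected region in the reference frame.
    """
    dy, dx = mv
    cur = get_block(current, top, left, size)
    pred = get_block(reference, top + dy, left + dx, size)
    return [[c - p for c, p in zip(cur_row, pred_row)]
            for cur_row, pred_row in zip(cur, pred)]
```

When the motion vector points at a region of the reference frame that matches the current block, the residual is zero or close to zero, which is what makes it cheap to encode.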
The forward transform stage 74 then transforms the residuals of the blocks from a spatial domain representation into a transform domain representation, e.g. by means of a discrete cosine transform (DCT). That is to say, it transforms each residual block from a set of pixel values at different Cartesian x and y coordinates to a set of coefficients representing different spatial frequency terms with different wavenumbers kx and ky (having dimensions of 1/wavelength). The forward quantization stage 76 then quantizes the transform coefficients, and outputs quantized and transformed coefficients of the residual signal to be encoded into the video stream via the entropy encoder 86, to thus form part of the encoded video signal for transmission to one or more recipient terminals.
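The transform and quantization steps may be illustrated by the following sketch. It assumes a floating-point orthonormal two-dimensional DCT and a single uniform quantization step size, both chosen for clarity; practical codecs such as H.264 use integer approximations of the transform and more elaborate quantizers. The names `dct_2d` and `quantize` are hypothetical.

```python
import math


def dct_2d(block):
    """Orthonormal 2-D DCT-II of a square block (list of rows)."""
    n = len(block)

    def alpha(k):
        # Normalisation factor for frequency index k.
        return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)

    out = [[0.0] * n for _ in range(n)]
    for ky in range(n):          # vertical spatial frequency index
        for kx in range(n):      # horizontal spatial frequency index
            s = 0.0
            for y in range(n):
                for x in range(n):
                    s += (block[y][x]
                          * math.cos((2 * y + 1) * ky * math.pi / (2 * n))
                          * math.cos((2 * x + 1) * kx * math.pi / (2 * n)))
            out[ky][kx] = alpha(ky) * alpha(kx) * s
    return out


def quantize(coeffs, step):
    """Uniform quantization: divide by the step size and round to an integer."""
    return [[int(round(c / step)) for c in row] for row in coeffs]
```

For a flat residual block all of the energy falls into the single zero-frequency (DC) coefficient, and the remaining coefficients quantize to zero.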
Furthermore, the output of the forward quantization stage 76 is also fed back via the inverse quantization stage 78 and inverse transform stage 80. The inverse transform stage 80 transforms the residual coefficients from the frequency domain back into spatial domain values, which are supplied to the intra prediction stage 82 (for intra frames) or the motion estimation & compensation stage 84 (for inter frames). These stages use the inverse transformed and inverse quantized residual signal along with knowledge of the input video stream in order to produce local predictions of the intra and inter frames (including the distorting effect of having been forward and inverse transformed and quantized as would be seen at the decoder). This local prediction is fed back to the subtraction stage 72 which produces the residual signal representing the difference between the input signal and the output of either the local intra frame prediction stage 82 or the local motion estimation & compensation stage 84. After transformation, the forward quantization stage 76 quantizes this residual signal, thus generating the quantized, transformed residual coefficients for output to the entropy encoder 86. The motion estimation stage 84 also outputs the motion vectors via the entropy encoder 86 for inclusion in the encoded bitstream.
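The purpose of the feedback path may be illustrated by the following simplified sketch, in which the encoder reconstructs each block from the quantized residual so that its reference data matches what the decoder will hold. The transform is omitted for brevity, blocks are assumed to be flat lists of pixel values, a single uniform step size is assumed, and all function names are hypothetical.

```python
def quantize(values, step):
    """Uniform quantization of a list of residual values."""
    return [int(round(v / step)) for v in values]


def dequantize(levels, step):
    """Inverse quantization: scale the integer levels back up."""
    return [level * step for level in levels]


def encode_block(source, prediction, step):
    """Return the quantized residual and the encoder's local reconstruction."""
    residual = [s - p for s, p in zip(source, prediction)]
    levels = quantize(residual, step)
    # The reconstruction includes the quantization distortion, exactly as
    # the decoder will see it.
    recon = [p + r for p, r in zip(prediction, dequantize(levels, step))]
    return levels, recon


def decode_block(levels, prediction, step):
    """Decoder side: add the dequantized residual to the same prediction."""
    return [p + r for p, r in zip(prediction, dequantize(levels, step))]
```

Because later predictions are formed from the reconstruction rather than from the original source, the encoder and decoder stay in step even though quantization is lossy.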
When performing intra frame encoding, the idea is to only encode and transmit a measure of how a portion of image data within a frame differs from another portion within that same frame. That portion can then be predicted at the decoder (given some absolute data to begin with), and so it is only necessary to transmit the difference between the prediction and the actual data rather than the actual data itself. The difference signal is typically smaller in magnitude, so takes fewer bits to encode.
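As an illustrative sketch of this idea, the following code predicts each row of a block from the already-decoded pixel immediately to its left and keeps only the difference. This corresponds to just one simple prediction direction; standards such as H.264 define several intra prediction modes, and the function names here are hypothetical.

```python
def horizontal_prediction(left_column, size):
    """Predict a size x size block by repeating each left-neighbour pixel along its row."""
    return [[left_column[y]] * size for y in range(size)]


def intra_residual(block, left_column):
    """Difference between the actual block and its horizontal prediction."""
    size = len(block)
    pred = horizontal_prediction(left_column, size)
    return [[b - p for b, p in zip(block_row, pred_row)]
            for block_row, pred_row in zip(block, pred)]
```

For a block [[100, 101], [90, 92]] with left neighbours [100, 90], the residual is [[0, 1], [0, 2]], which is much smaller in magnitude than the pixel values themselves.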
In the case of inter frame encoding, the motion compensation stage 84 is switched into the feedback path in place of the intra frame prediction stage 82, and a feedback loop is thus created between blocks of one frame and another in order to encode the blocks of the inter frame relative to those of a preceding frame. This typically takes even fewer bits to encode than an intra frame.
FIG. 8 illustrates a corresponding decoder which comprises an entropy decoder 90 for receiving the encoded video stream into a recipient terminal, an inverse quantization stage 92, an inverse transform stage 94, an intra prediction stage 96 and a motion compensation stage 98. The outputs of the intra prediction stage and the motion compensation stage are summed at a summing stage 100.
There are many known motion estimation techniques. Generally they rely on comparison of a block with one or more other image portions from a preceding frame (the reference frame). Each block is predicted from an area of the same size and shape as the block, but offset by any number of pixels in the horizontal or vertical direction or even a fractional number of pixels. The identity of the area used is signalled as overhead (“side information”) in the form of a motion vector. A good motion estimation technique has to balance the requirements of low complexity with high quality video images. It is also desirable that it does not require too much overhead information.
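One of the simplest such techniques, an exhaustive ("full search") block match using the sum of absolute differences (SAD) as the comparison metric, may be sketched as follows. It is given only as an example of the general approach: it assumes integer-pixel offsets, frames stored as lists of rows, and a square search window, and the names `sad` and `full_search` are hypothetical.

```python
def sad(current, reference, top, left, size, dy, dx):
    """Sum of absolute differences between a block and an offset reference region."""
    total = 0
    for y in range(size):
        for x in range(size):
            total += abs(current[top + y][left + x]
                         - reference[top + dy + y][left + dx + x])
    return total


def full_search(current, reference, top, left, size, search_range):
    """Try every integer offset in the window and return (best_mv, best_cost)."""
    height, width = len(reference), len(reference[0])
    best_mv, best_cost = None, None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            r, c = top + dy, left + dx
            # Skip candidate regions that fall outside the reference frame.
            if r < 0 or c < 0 or r + size > height or c + size > width:
                continue
            cost = sad(current, reference, top, left, size, dy, dx)
            if best_cost is None or cost < best_cost:
                best_mv, best_cost = (dy, dx), cost
    return best_mv, best_cost
```

The number of candidates grows with the square of the search range, which illustrates the trade-off between complexity and prediction quality noted above.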
In the standard system described above, it will be noted that the intra prediction coding and inter prediction coding (motion estimation) are performed in the unquantized spatial domain.
More recently, motion estimation techniques operating in the transform domain have attracted attention. However, none of the existing techniques is able to combine low complexity (and thus low computational overhead) with high quality. Hence no frequency domain techniques for motion estimation are currently in practical use.
The VC-1 video codec has an intra prediction mode which operates in the frequency domain, in which the first column and/or first row of AC coefficients in the DCT (Discrete Cosine Transform) domain are predicted from the first column (or first row) of the DCT blocks located immediately to the left or above the processed block. That is to say, coefficients lying at the edge of one block are predicted from the direct spatial neighbours in an adjacent block. For reference see “The VC-1 and H.264 Video Compression Standards for Broadband Video Services”, Hari Kalva, Jae-Beom Lee, p. 251.
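The general principle of such frequency domain prediction may be illustrated by the following sketch, in which the first row or first column of AC coefficients of a block is replaced by its difference from the corresponding coefficients of a neighbouring block. This is a deliberately simplified illustration and not the VC-1 algorithm itself: the selection of the prediction direction and any scaling of coefficients are omitted, and the name `predict_ac` is hypothetical.

```python
def predict_ac(block, neighbour, direction):
    """Return a copy of block with one edge of AC coefficients differentially coded.

    direction "above": first row is predicted from the first row of the block above.
    direction "left":  first column is predicted from the first column of the block to the left.
    The DC coefficient at index [0][0] is left untouched in this sketch.
    """
    if direction not in ("above", "left"):
        raise ValueError("direction must be 'above' or 'left'")
    n = len(block)
    out = [row[:] for row in block]
    for k in range(1, n):  # k = 0 is the DC coefficient, which is skipped
        if direction == "above":
            out[0][k] = block[0][k] - neighbour[0][k]
        else:
            out[k][0] = block[k][0] - neighbour[k][0]
    return out
```

Where neighbouring blocks have similar edge coefficients, the differences are small and therefore cheaper to encode than the coefficients themselves.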