Digital video sequences, like ordinary motion pictures recorded on film, comprise a sequence of still images, the illusion of motion being created by displaying the images one after the other at a relatively fast rate, typically 15 to 30 frames per second. Because of the relatively fast display rate, images in consecutive frames tend to be quite similar and thus contain a considerable amount of redundant information. For example, a typical scene may comprise some stationary elements, such as background scenery, and some moving areas, which may take many different forms, for example the face of a newsreader, moving traffic and so on. Alternatively, the camera recording the scene may itself be moving, in which case all elements of the image have the same kind of motion. In many cases, this means that the overall change between one video frame and the next is rather small.
Each frame of an uncompressed digital video sequence comprises an array of image pixels. For example, in a commonly used digital video format, known as the Quarter Common Interchange Format (QCIF), a frame comprises an array of 176×144 pixels, in which case each frame has 25,344 pixels. In turn, each pixel is represented by a certain number of bits which carry information about the luminance and/or colour content of the region of the image corresponding to the pixel. Commonly, a so-called YUV colour model is used to represent the luminance and chrominance content of the image. The luminance, or Y, component represents the intensity (brightness) of the image, while the colour content of the image is represented by two chrominance or colour difference components, labelled U and V.
Colour models based on a luminance/chrominance representation of image content provide certain advantages compared with colour models that are based on a representation involving primary colours (that is Red, Green and Blue, RGB). The human visual system is more sensitive to intensity variations than it is to colour variations and YUV colour models exploit this property by using a lower spatial resolution for the chrominance components (U, V) than for the luminance component (Y). In this way, the amount of information needed to code the colour information in an image can be reduced with an acceptable reduction in image quality.
The lower spatial resolution of the chrominance components is usually attained by sub-sampling. Typically, each frame of a video sequence is divided into so-called ‘macroblocks’, which comprise luminance (Y) information and associated chrominance (U, V) information which is spatially sub-sampled. FIG. 3 illustrates one way in which macroblocks can be formed. FIG. 3a shows a frame of a video sequence represented using a YUV colour model, each component having the same spatial resolution. Macroblocks are formed by representing a region of 16×16 image pixels in the original image (FIG. 3b) as four blocks of luminance information, each luminance block comprising an 8×8 array of luminance (Y) values and two spatially corresponding chrominance components (U and V) which are sub-sampled by a factor of two in the x and y directions to yield corresponding arrays of 8×8 chrominance (U, V) values (see FIG. 3c). According to certain video coding recommendations, such as International Telecommunication Union (ITU-T) recommendation H.26L, the fundamental block size used within the macroblocks can be other than 8×8, for example 4×8 or 4×4. (See G. Bjontegaard, “H.26L Test Model Long Term Number 8 (TML-8) draft 0”, VCEG-N10, June 2001, section 2.3).
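By way of illustration only, the formation of a macroblock described above can be sketched as follows. The function names `subsample_2x2` and `make_macroblock` are hypothetical, and the use of 2×2 averaging is an assumption: the text states only that the chrominance components are sub-sampled by a factor of two in each direction, not how the sub-sampled values are derived.

```python
def subsample_2x2(plane):
    """Halve a chrominance plane in x and y.

    Assumption: each output value is the rounded average of a 2x2
    neighbourhood. Other sub-sampling filters are equally possible.
    """
    rows, cols = len(plane), len(plane[0])
    return [
        [
            (plane[y][x] + plane[y][x + 1]
             + plane[y + 1][x] + plane[y + 1][x + 1] + 2) // 4
            for x in range(0, cols, 2)
        ]
        for y in range(0, rows, 2)
    ]


def make_macroblock(y_plane, u_plane, v_plane, mb_x, mb_y):
    """Form one macroblock from full-resolution Y, U and V planes.

    Returns four 8x8 luminance blocks (raster order within the 16x16
    region) and one 8x8 block for each chrominance component.
    """
    x0, y0 = 16 * mb_x, 16 * mb_y

    def region(plane):
        # The 16x16 area of the plane covered by this macroblock.
        return [row[x0:x0 + 16] for row in plane[y0:y0 + 16]]

    y16 = region(y_plane)
    luma_blocks = [
        [row[bx:bx + 8] for row in y16[by:by + 8]]
        for by in (0, 8)
        for bx in (0, 8)
    ]
    return {
        "Y": luma_blocks,
        "U": subsample_2x2(region(u_plane)),
        "V": subsample_2x2(region(v_plane)),
    }
```

With 8×8 blocks, each macroblock therefore carries 4×64 luminance values and 2×64 chrominance values.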
A QCIF image comprises 11×9 macroblocks. If the luminance blocks and chrominance blocks are represented with 8 bit resolution (that is by numbers in the range 0 to 255), the total number of bits required per macroblock is (16×16×8)+2×(8×8×8)=3072 bits. The number of bits needed to represent a video frame in QCIF format is thus 99×3072=304,128 bits. This means that the amount of data required to transmit/record/display an uncompressed video sequence in QCIF format, represented using a YUV colour model, at a rate of 30 frames per second, is more than 9 Mbps (million bits per second). This is an extremely high data rate and is impractical for use in video recording, transmission and display applications because of the very large storage capacity, transmission channel capacity and hardware performance required.
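The arithmetic above can be checked with a few lines of code. The function name `qcif_uncompressed_bitrate` is hypothetical; the figures it uses (176×144 pixels, 16×16 macroblocks, two 8×8 chrominance blocks per macroblock, 8 bits per value, 30 frames per second) are those given in the text.

```python
def qcif_uncompressed_bitrate(fps=30, bits_per_sample=8):
    """Return (bits per macroblock, bits per frame, bits per second) for QCIF."""
    mb_cols, mb_rows = 176 // 16, 144 // 16          # 11 x 9 macroblocks
    samples_per_mb = 16 * 16 + 2 * (8 * 8)           # Y plus sub-sampled U and V
    bits_per_mb = samples_per_mb * bits_per_sample
    bits_per_frame = mb_cols * mb_rows * bits_per_mb
    return bits_per_mb, bits_per_frame, bits_per_frame * fps


print(qcif_uncompressed_bitrate())  # (3072, 304128, 9123840)
```

The last figure, 9,123,840 bits per second, is the "more than 9 Mbps" referred to above.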
If video data is to be transmitted in real-time over a fixed line network such as an ISDN (Integrated Services Digital Network) or a conventional PSTN (Public Switched Telephone Network), the available data transmission bandwidth is typically of the order of 64 kbits/s. In mobile videotelephony, where transmission takes place at least in part over a radio communications link, the available bandwidth can be as low as 20 kbits/s. This means that a significant reduction in the amount of information used to represent video data must be achieved in order to enable transmission of digital video sequences over low bandwidth communication networks. For this reason, video compression techniques have been developed which reduce the amount of information transmitted while retaining an acceptable image quality.
Video compression methods are based on reducing the redundant and perceptually irrelevant parts of video sequences. The redundancy in video sequences can be categorised into spatial, temporal and spectral redundancy. ‘Spatial redundancy’ is the term used to describe the correlation (similarity) between neighbouring pixels within a frame. The term ‘temporal redundancy’ expresses the fact that objects appearing in one frame of a sequence are likely to appear in subsequent frames, while ‘spectral redundancy’ refers to the correlation between different colour components of the same image.
Sufficiently efficient compression cannot usually be achieved by simply reducing the various forms of redundancy in a given sequence of images. Thus, most current video encoders also reduce the quality of those parts of the video sequence which are subjectively the least important. In addition, the redundancy of the compressed video bit-stream itself is reduced by means of efficient loss-less encoding. Generally, this is achieved using a technique known as entropy coding.
There is often a significant amount of spatial redundancy between the pixels that make up each frame of a digital video sequence. In other words, the value of any pixel within a frame of the sequence is substantially the same as the value of other pixels in its immediate vicinity. Typically, video coding systems reduce spatial redundancy using a technique known as ‘block-based transform coding’, in which a mathematical transformation is applied to the pixels of an image, on a macroblock-by-macroblock basis. Transform coding translates the image data from a representation comprising pixel values to a form comprising a set of coefficient values, each of which is a weighting factor (multiplier) for a basis function of the transform in question. By using certain mathematical transformations, such as the two-dimensional Discrete Cosine Transform (DCT), the spatial redundancy within a frame of a digital video sequence can be significantly reduced, thereby producing a more compact representation of the image data.
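The compaction achieved by a transform of this kind can be illustrated with a direct (unoptimised) two-dimensional DCT. This is a minimal sketch assuming the orthonormal DCT-II definition; the function name `dct_2d` is hypothetical and practical coders use fast, often integer, approximations instead.

```python
import math


def dct_2d(block):
    """Orthonormal two-dimensional DCT-II of a square block of values.

    coeffs[u][v] weights the basis function with vertical frequency u
    and horizontal frequency v.
    """
    n = len(block)

    def scale(k):
        return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)

    def basis(k, i):
        return math.cos((2 * i + 1) * k * math.pi / (2 * n))

    return [
        [
            scale(u) * scale(v) * sum(
                block[y][x] * basis(u, y) * basis(v, x)
                for y in range(n)
                for x in range(n)
            )
            for v in range(n)
        ]
        for u in range(n)
    ]
```

For a perfectly flat 4×4 block in which every pixel has the value 100, the only non-zero coefficient is the lowest-frequency (DC) term, equal to 400: sixteen pixel values are represented by a single number.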
Frames of a video sequence which are compressed using block-based transform coding, without reference to any other frame within the sequence, are referred to as INTRA-coded or I-frames. Additionally, and where possible, blocks of INTRA-coded frames are predicted from previously coded blocks within the same frame. This technique, known as INTRA-prediction, has the effect of further reducing the amount of data required to represent an INTRA-coded frame.
Generally, video coding systems not only reduce the spatial redundancy within individual frames of a video sequence, but also make use of a technique known as ‘motion-compensated prediction’, to reduce the temporal redundancy in the sequence. Using motion-compensated prediction, the image content of some (often many) frames in a digital video sequence is ‘predicted’ from one or more other frames in the sequence, known as ‘reference’ or ‘anchor’ frames. Prediction of image content is achieved by tracking the motion of objects or regions of an image between a frame to be coded (compressed) and the reference frame(s) using ‘motion vectors’. In general, the reference frame(s) may precede the frame to be coded or may follow it in the video sequence. As in the case of INTRA-coding, motion compensated prediction of a video frame is typically performed macroblock-by-macroblock.
Frames of a video sequence which are compressed using motion-compensated prediction are generally referred to as INTER-coded or P-frames. Motion-compensated prediction alone rarely provides a sufficiently precise representation of the image content of a video frame and therefore it is typically necessary to provide a so-called ‘prediction error’ (PE) frame with each INTER-coded frame. The prediction error frame represents the difference between a decoded version of the INTER-coded frame and the image content of the frame to be coded. More specifically, the prediction error frame comprises values that represent the difference between pixel values in the frame to be coded and corresponding reconstructed pixel values formed on the basis of a predicted version of the frame in question. Consequently, the prediction error frame has characteristics similar to a still image and block-based transform coding can be applied in order to reduce its spatial redundancy and hence the amount of data (number of bits) required to represent it.
In order to illustrate the operation of a video coding system in greater detail, reference will now be made to FIGS. 1 and 2. FIG. 1 is a schematic diagram of a generic video encoder that employs a combination of INTRA- and INTER-coding to produce a compressed (encoded) video bit-stream. A corresponding decoder is illustrated in FIG. 2 and will be described later in the text.
The video encoder 100 comprises an input 101 for receiving a digital video signal from a camera or other video source (not shown). It also comprises a transformation unit 104 which is arranged to perform a block-based discrete cosine transform (DCT), a quantiser 106, an inverse quantiser 108, an inverse transformation unit 110, arranged to perform an inverse block-based discrete cosine transform (IDCT), combiners 112 and 116, and a frame store 120. The encoder further comprises a motion estimator 130, a motion field coder 140 and a motion compensated predictor 150. Switches 102 and 114 are operated co-operatively by control manager 160 to switch the encoder between an INTRA-mode of video encoding and an INTER-mode of video encoding. The encoder 100 also comprises a video multiplex coder 170 which forms a single bit-stream from the various types of information produced by the encoder 100 for further transmission to a remote receiving terminal or, for example, for storage on a mass storage medium, such as a computer hard drive (not shown).
Encoder 100 operates as follows. Each frame of uncompressed video provided from the video source to input 101 is received and processed macroblock by macroblock, preferably in raster-scan order. When the encoding of a new video sequence starts, the first frame to be encoded is encoded as an INTRA-coded frame. Subsequently, the encoder is programmed to code each frame in INTER-coded format, unless one of the following conditions is met: 1) it is judged that the current macroblock of the frame being coded is so dissimilar from the pixel values in the reference frame used in its prediction that excessive prediction error information is produced, in which case the current macroblock is coded in INTRA-coded format; 2) a predefined INTRA frame repetition interval has expired; or 3) feedback is received from a receiving terminal indicating a request for a frame to be provided in INTRA-coded format.
The occurrence of condition 1) is detected by monitoring the output of the combiner 116. The combiner 116 forms a difference between the current macroblock of the frame being coded and its prediction, produced in the motion compensated prediction block 150. If a measure of this difference (for example a sum of absolute differences of pixel values) exceeds a predetermined threshold, the combiner 116 informs the control manager 160 via a control line 119 and the control manager 160 operates the switches 102 and 114 via control line 113 so as to switch the encoder 100 into INTRA-coding mode. In this way, a frame which is otherwise encoded in INTER-coded format may comprise INTRA-coded macroblocks. Occurrence of condition 2) is monitored by means of a timer or frame counter implemented in the control manager 160, in such a way that if the timer expires, or the frame counter reaches a predetermined number of frames, the control manager 160 operates the switches 102 and 114 via control line 113 to switch the encoder into INTRA-coding mode. Condition 3) is triggered if the control manager 160 receives a feedback signal from, for example, a receiving terminal, via control line 121 indicating that an INTRA frame refresh is required by the receiving terminal. Such a condition may arise, for example, if a previously transmitted frame is badly corrupted by interference during its transmission, rendering it impossible to decode at the receiver. In this situation, the receiving decoder issues a request for the next frame to be encoded in INTRA-coded format, thus re-initialising the coding sequence.
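The three conditions can be summarised in a short sketch of the decision taken by the control manager 160. The function names, the threshold value and the repetition interval below are hypothetical and chosen only for illustration; the text specifies merely that a predetermined threshold and a predefined interval exist.

```python
def sad(block, prediction):
    """Sum of absolute differences between a block and its prediction."""
    return sum(
        abs(a - b)
        for row_a, row_b in zip(block, prediction)
        for a, b in zip(row_a, row_b)
    )


def choose_mode(block, prediction, frames_since_intra, intra_requested,
                sad_threshold=2000, intra_interval=132):
    """Return "INTRA" or "INTER" for the current macroblock.

    sad_threshold and intra_interval are illustrative values only.
    """
    if prediction is None:
        return "INTRA"                      # no reference frame available yet
    if intra_requested:
        return "INTRA"                      # condition 3: receiver feedback
    if frames_since_intra >= intra_interval:
        return "INTRA"                      # condition 2: repetition interval
    if sad(block, prediction) > sad_threshold:
        return "INTRA"                      # condition 1: poor prediction
    return "INTER"
```

Note that condition 1) is evaluated per macroblock, whereas conditions 2) and 3) cause a whole frame to be INTRA-coded.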
Operation of the encoder 100 in INTRA-coding mode will now be described. In INTRA-coding mode, the control manager 160 operates the switch 102 to accept video input from input line 118. The video signal input is received macroblock by macroblock from input 101 via the input line 118. As they are received, the blocks of luminance and chrominance values which make up the macroblock are passed to the DCT transformation block 104, which performs a 2-dimensional discrete cosine transform on each block of values, producing a 2-dimensional array of DCT coefficients for each block. DCT transformation block 104 produces an array of coefficient values for each block, the number of coefficient values depending on the nature of the blocks which make up the macroblock. For example, if the fundamental block size used in the macroblock is 4×4, DCT transformation block 104 produces a 4×4 array of DCT coefficients for each block. If the block size is 8×8, an 8×8 array of DCT coefficients is produced.
The DCT coefficients for each block are passed to the quantiser 106, where they are quantised using a quantisation parameter QP. Selection of the quantisation parameter QP is controlled by the control manager 160 via control line 115. Quantisation introduces a loss of information, as the quantised coefficients have a lower numerical precision than the coefficients originally generated by the DCT transformation block 104. This provides a further mechanism by which the amount of data required to represent each image of the video sequence can be reduced. However, unlike the DCT transformation, which is essentially lossless, the loss of information introduced by quantisation causes an irreversible degradation in image quality. The greater the degree of quantisation applied to the DCT coefficients, the greater the loss of image quality.
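The loss of precision can be illustrated with a simple uniform quantiser. This is a sketch under stated assumptions: the parameter `step` stands in for whatever step size a given coder derives from the quantisation parameter QP, a mapping which is codec-specific and not defined here, and the function names are hypothetical.

```python
def quantise(coeffs, step):
    """Map each coefficient to an integer level by dividing by the step size.

    Rounds magnitudes to the nearest integer (halves rounded up).
    """
    def level(c):
        magnitude = int(abs(c) / step + 0.5)
        return magnitude if c >= 0 else -magnitude

    return [[level(c) for c in row] for row in coeffs]


def dequantise(levels, step):
    """Inverse quantisation: scale the integer levels back up."""
    return [[lv * step for lv in row] for row in levels]


coeffs = [[400.0, -37.0], [12.0, 3.0]]
print(dequantise(quantise(coeffs, 10), 10))  # [[400, -40], [10, 0]]
print(dequantise(quantise(coeffs, 40), 40))  # [[400, -40], [0, 0]]
```

With the coarser step, more of the small coefficients become zero and cannot be recovered, which is the irreversible degradation referred to above.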
The quantised DCT coefficients for each block are passed from the quantiser 106 to the video multiplex coder 170, as indicated by line 125 in FIG. 1. The video multiplex coder 170 orders the quantised transform coefficients for each block using a zigzag scanning procedure. This operation converts the two-dimensional array of quantised transform coefficients into a one-dimensional array. Typical zigzag scanning orders, such as that for a 4×4 array shown in FIG. 4, order the coefficients approximately in ascending order of spatial frequency. This also tends to order the coefficients according to their values, such that coefficients positioned earlier in the one-dimensional array are more likely to have larger absolute values than coefficients positioned later in the array. This is because lower spatial frequencies tend to have higher amplitudes within the image blocks. Consequently, values occurring towards the end of the one-dimensional array of quantised transform coefficients tend to be zeros.
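A zigzag scan can be sketched as follows. The scanning order generated here is the conventional diagonal zigzag pattern and is assumed, not confirmed, to correspond to that of FIG. 4, which is not reproduced in this text; the function names are hypothetical.

```python
def zigzag_order(n):
    """Return the (row, column) positions of an n x n block in zigzag order.

    Positions are visited one anti-diagonal at a time, alternating the
    direction of travel along successive diagonals.
    """
    def key(pos):
        r, c = pos
        diagonal = r + c
        return (diagonal, r if diagonal % 2 else -r)

    return sorted(((r, c) for r in range(n) for c in range(n)), key=key)


def zigzag_scan(block):
    """Convert a two-dimensional block into a one-dimensional list."""
    return [block[r][c] for r, c in zigzag_order(len(block))]


raster = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
print(zigzag_scan(raster))
# [0, 1, 4, 8, 5, 2, 3, 6, 9, 12, 13, 10, 7, 11, 14, 15]
```

The printed list gives, for each position in the one-dimensional array, the raster index of the coefficient placed there.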
Typically, the video multiplex coder 170 represents each non-zero quantised coefficient in the one-dimensional array by two values, referred to as level and run. Level is the value of the quantised coefficient and run is the number of consecutive zero-valued coefficients preceding the coefficient in question. The run and level values for a given coefficient are ordered such that the level value precedes the associated run value. A level value equal to zero is used to indicate that there are no more non-zero coefficient values in the block. This 0-level value is referred to as an EOB (end-of-block) symbol.
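The level and run representation can be sketched as follows. The function name `run_level_encode` and the choice of a flat list as output are illustrative assumptions; the ordering (level before run) and the use of a zero level as the EOB symbol follow the description above.

```python
EOB = 0  # a level of zero marks the end of the block


def run_level_encode(scanned):
    """Convert a zigzag-scanned list of quantised coefficients into a
    flat list: level, run, level, run, ..., EOB.
    """
    symbols = []
    run = 0
    for value in scanned:
        if value == 0:
            run += 1                 # count zeros preceding the next non-zero value
        else:
            symbols.extend([value, run])
            run = 0
    symbols.append(EOB)              # trailing zeros are implied by the EOB symbol
    return symbols


scanned = [12, 0, -3, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print(run_level_encode(scanned))  # [12, 0, -3, 1, 1, 2, 0]
```

Sixteen coefficient values are thus represented by seven symbols, the run of ten trailing zeros costing nothing beyond the EOB symbol.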
The run and level values are further compressed in the video multiplex coder 170 using entropy coding. Entropy coding is a lossless operation, which exploits the fact that symbols within a data set to be coded generally have different probabilities of occurrence. Since certain values of levels and runs are more likely to occur than others, entropy coding techniques can be used effectively to reduce the number of bits required to code the run and level values which represent the quantised transform coefficients. A number of different methods can be used to implement entropy coding. One method commonly used in video coding systems is known as Variable Length Coding (VLC). Generally, the VLC codewords are sequences of bits (i.e. 0's and 1's) constructed so that the length of a given codeword corresponds to the frequency of occurrence of the symbol it represents. Thus, instead of using a fixed number of bits to represent each symbol to be coded, a variable number of bits is assigned such that symbols which are more likely to occur are represented with VLC codewords having fewer bits. As the lengths of the codewords may be (and generally are) different, they must also be constructed in such a way as to be uniquely decodable. In other words, if a valid sequence of bits having a certain finite length is received by a decoder, there should be only one possible input sequence of symbols corresponding to the received sequence of bits. In the video encoder shown in FIG. 1, entropy coding of the run and level parameters using variable length coding may be implemented by means of look-up tables which define the mapping between each possible symbol in the data set to be coded and its corresponding variable length code. Such look-up tables are often defined by statistical analysis of training material comprising symbols identical to those to be coded and having similar statistical properties.
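One way of deriving such a look-up table from training material is sketched below. The use of Huffman's construction is an assumption made for the purpose of illustration; the text does not state how the tables of any particular coder are built, and the function names and training data are hypothetical. Symbols are taken here to be (level, run) pairs plus an "EOB" marker.

```python
import heapq
from collections import Counter


def build_vlc_table(training_symbols):
    """Build a prefix-free codeword table from symbol counts (Huffman).

    Note: a training set with only one distinct symbol yields an empty
    codeword; real tables would treat that case separately.
    """
    counts = Counter(training_symbols)
    # Heap entries: (count, unique tie-breaker, {symbol: partial codeword}).
    heap = [(n, i, {sym: ""}) for i, (sym, n) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        n0, _, t0 = heapq.heappop(heap)
        n1, _, t1 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in t0.items()}
        merged.update({s: "1" + c for s, c in t1.items()})
        heapq.heappush(heap, (n0 + n1, tie, merged))
        tie += 1
    return heap[0][2]


def vlc_encode(symbols, table):
    """Concatenate the codewords of the given symbols into a bit string."""
    return "".join(table[s] for s in symbols)


def vlc_decode(bits, table):
    """Recover the symbols; unique because no codeword prefixes another."""
    inverse = {code: sym for sym, code in table.items()}
    symbols, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            symbols.append(inverse[current])
            current = ""
    if current:
        raise ValueError("trailing bits do not form a codeword")
    return symbols
```

Because no codeword is a prefix of another, the decoder can split the received bit string without any separators, which is the unique decodability property described above.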
An alternative method of entropy coding, known as arithmetic coding, can also be used to convert the run and level values into variable length codewords. In arithmetic coding a group of symbols, for example the run and level values for a block of quantised transform coefficients, are coded as a single floating point decimal number. This approach to entropy coding, in which a group of symbols is encoded using a single codeword, can lead to improved compression efficiency compared with methods such as variable length coding which represent each symbol independently. Further details concerning arithmetic coding can be found from Vasudev Bhaskaran and Konstantinos Konstantinides “Image and Video Compression Standards” 2nd Edition, Kluwer Academic Publishers, 1999, ISBN 0-7923-9952-8, Section 2.9, for example.
Once the run and level values have been entropy coded using an appropriate method, the video multiplex coder 170 further combines them with control information, also entropy coded using a variable length coding method appropriate for the kind of information in question, to form a single compressed bit-stream of coded image information 135. While entropy coding has been described in connection with operations performed by the video multiplex coder 170, it should be noted that in alternative implementations a separate entropy coding unit may be provided.
A locally decoded version of the macroblock is also formed in the encoder 100. This is done by passing the quantised transform coefficients for each block, output by quantiser 106, through inverse quantiser 108 and applying an inverse DCT transform in inverse transformation block 110. In this way a reconstructed array of pixel values is constructed for each block of the macroblock. The resulting decoded image data is input to combiner 112. In INTRA-coding mode, switch 114 is set so that the input to the combiner 112 via switch 114 is zero. In this way, the operation performed by combiner 112 is equivalent to passing the decoded image data unaltered.
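The local decoding path through inverse quantiser 108, inverse transformation block 110 and combiner 112 can be sketched as follows. This is a minimal illustration assuming a uniform quantiser step `step` and an orthonormal inverse DCT; the function names are hypothetical, and rounding the decoded values to integers is likewise an assumption.

```python
import math


def dequantise(levels, step):
    """Inverse quantisation: scale integer levels by the step size."""
    return [[lv * step for lv in row] for row in levels]


def idct_2d(coeffs):
    """Orthonormal two-dimensional inverse DCT of a square block."""
    n = len(coeffs)

    def scale(k):
        return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)

    def basis(k, i):
        return math.cos((2 * i + 1) * k * math.pi / (2 * n))

    return [
        [
            sum(
                scale(u) * scale(v) * coeffs[u][v] * basis(u, y) * basis(v, x)
                for u in range(n)
                for v in range(n)
            )
            for x in range(n)
        ]
        for y in range(n)
    ]


def local_decode(levels, step, prediction=None):
    """Reconstruct one block as the encoder's own decoder would.

    In INTRA-coding mode the combiner's second input is zero, modelled
    here by prediction=None.
    """
    n = len(levels)
    if prediction is None:
        prediction = [[0] * n for _ in range(n)]
    decoded = idct_2d(dequantise(levels, step))
    return [
        [int(round(decoded[y][x])) + prediction[y][x] for x in range(n)]
        for y in range(n)
    ]
```

For example, a 4×4 block whose only non-zero level is a DC level of 40, decoded with a step of 10, yields a flat block of value 100.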
As subsequent macroblocks of the current frame are received and undergo the previously described encoding and local decoding steps in blocks 104, 106, 108, 110 and 112, a decoded version of the INTRA-coded frame is built up in frame store 120. When the last macroblock of the current frame has been INTRA-coded and subsequently decoded, the frame store 120 contains a completely decoded frame, available for use as a prediction reference frame in coding a subsequently received video frame in INTER-coded format.
Operation of the encoder 100 in INTER-coding mode will now be described. In INTER-coding mode, the control manager 160 operates switch 102 to receive its input from line 117, which comprises the output of combiner 116. The combiner 116 receives the video input signal macroblock by macroblock from input 101. As combiner 116 receives the blocks of luminance and chrominance values which make up the macroblock, it forms corresponding blocks of prediction error information. The prediction error information represents the difference between the block in question and its prediction, produced in motion compensated prediction block 150. More specifically, the prediction error information for each block of the macroblock comprises a two-dimensional array of values, each of which represents the difference between a pixel value in the block of luminance or chrominance information being coded and a decoded pixel value obtained by forming a motion-compensated prediction for the block, according to the procedure described below. Thus, in a situation where each macroblock comprises, for example, an assembly of 4×4 blocks comprising luminance and chrominance values the prediction error information for each block of the macroblock similarly comprises a 4×4 array of prediction error values.
The prediction error information for each block of the macroblock is passed to DCT transformation block 104, which performs a two-dimensional discrete cosine transform on each block of prediction error values to produce a two-dimensional array of DCT transform coefficients for each block. DCT transformation block 104 produces an array of coefficient values for each prediction error block, the number of coefficient values depending on the nature of the blocks which make up the macroblock. For example, if the fundamental block size used in the macroblock is 4×4, DCT transformation block 104 produces a 4×4 array of DCT coefficients for each prediction error block. If the block size is 8×8, an 8×8 array of DCT coefficients is produced.
The transform coefficients for each prediction error block are passed to quantiser 106 where they are quantised using a quantisation parameter QP, in a manner analogous to that described above in connection with operation of the encoder in INTRA-coding mode. Again, selection of the quantisation parameter QP is controlled by the control manager 160 via control line 115.
The quantised DCT coefficients representing the prediction error information for each block of the macroblock are passed from quantiser 106 to video multiplex coder 170, as indicated by line 125 in FIG. 1. As in INTRA-coding mode, the video multiplex coder 170 orders the transform coefficients for each prediction error block using the previously described zigzag scanning procedure (see FIG. 4) and then represents each non-zero quantised coefficient as a level and a run value. It further compresses the run and level values using entropy coding, in a manner analogous to that described above in connection with INTRA-coding mode. Video multiplex coder 170 also receives motion vector information (described in the following) from motion field coding block 140 via line 126 and control information from control manager 160. It entropy codes the motion vector information and control information and forms a single bit-stream of coded image information 135, comprising the entropy coded motion vector, prediction error and control information.
The quantised DCT coefficients representing the prediction error information for each block of the macroblock are also passed from quantiser 106 to inverse quantiser 108. Here they are inverse quantised and the resulting blocks of inverse quantised DCT coefficients are applied to inverse DCT transform block 110, where they undergo inverse DCT transformation to produce locally decoded blocks of prediction error values. The locally decoded blocks of prediction error values are then input to combiner 112. In INTER-coding mode, switch 114 is set so that the combiner 112 also receives predicted pixel values for each block of the macroblock, generated by motion-compensated prediction block 150. The combiner 112 combines each of the locally decoded blocks of prediction error values with a corresponding block of predicted pixel values to produce reconstructed image blocks and stores them in frame store 120.
As subsequent macroblocks of the video signal are received from the video source and undergo the previously described encoding and decoding steps in blocks 104, 106, 108, 110, 112, a decoded version of the frame is built up in frame store 120. When the last macroblock of the frame has been processed, the frame store 120 contains a completely decoded frame, available for use as a prediction reference frame in encoding a subsequently received video frame in INTER-coded format.
Formation of a prediction for a macroblock of the current frame will now be described. Any frame encoded in INTER-coded format requires a reference frame for motion-compensated prediction. This means, necessarily, that when encoding a video sequence, the first frame to be encoded, whether it is the first frame in the sequence, or some other frame, must be encoded in INTRA-coded format. This, in turn, means that when the video encoder 100 is switched into INTER-coding mode by control manager 160, a complete reference frame, formed by locally decoding a previously encoded frame, is already available in the frame store 120 of the encoder. In general, the reference frame is formed by locally decoding either an INTRA-coded frame or an INTER-coded frame.
The first step in forming a prediction for a macroblock of the current frame is performed by motion estimation block 130. The motion estimation block 130 receives the blocks of luminance and chrominance values which make up the current macroblock of the frame to be coded via line 128. It then performs a block matching operation in order to identify a region in the reference frame which corresponds substantially with the current macroblock. In order to perform the block matching operation, motion estimation block 130 accesses reference frame data stored in frame store 120 via line 127. More specifically, motion estimation block 130 performs block-matching by calculating difference values (e.g. sums of absolute differences) representing the difference in pixel values between the macroblock under examination and candidate best-matching regions of pixels from a reference frame stored in the frame store 120. A difference value is produced for candidate regions at all possible offsets within a predefined search region of the reference frame and motion estimation block 130 determines the smallest calculated difference value. The offset between the macroblock in the current frame and the candidate block of pixel values in the reference frame that yields the smallest difference value defines the motion vector for the macroblock in question.
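The block matching operation can be sketched as an exhaustive search over a single image plane. The function name `estimate_motion`, the default search range of ±7 pixels and the policy of skipping candidate regions that fall outside the reference frame are assumptions made for illustration; the sum of absolute differences is the measure named in the text.

```python
def estimate_motion(cur, ref, bx, by, size=16, search=7):
    """Full-search block matching.

    cur, ref : two-dimensional lists of pixel values (same dimensions)
    bx, by   : top-left corner of the block in the current frame
    Returns ((dx, dy), sad) for the offset with the smallest SAD.
    """
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            # Skip candidates extending beyond the reference frame.
            if not (0 <= by + dy and by + dy + size <= height
                    and 0 <= bx + dx and bx + dx + size <= width):
                continue
            cost = sum(
                abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
                for y in range(size)
                for x in range(size)
            )
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    cost, dx, dy = best
    return (dx, dy), cost
```

The cost of this exhaustive approach grows with the square of the search range, which is why practical encoders often restrict the search region or use faster search strategies.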
Once the motion estimation block 130 has produced a motion vector for the macroblock, it outputs the motion vector to the motion field coding block 140. The motion field coding block 140 approximates the motion vector received from motion estimation block 130 using a motion model comprising a set of basis functions and motion coefficients. More specifically, the motion field coding block 140 represents the motion vector as a set of motion coefficient values which, when multiplied by the basis functions, form an approximation of the motion vector. Typically, a translational motion model having only two motion coefficients and basis functions is used, but motion models of greater complexity may also be used.
The motion coefficients are passed from motion field coding block 140 to motion compensated prediction block 150. Motion compensated prediction block 150 also receives the best-matching candidate region of pixel values identified by motion estimation block 130 from frame store 120. Using the approximate representation of the motion vector generated by motion field coding block 140 and the pixel values of the best-matching candidate region of pixels from the reference frame, motion compensated prediction block 150 generates an array of predicted pixel values for each block of the macroblock. Each block of predicted pixel values is passed to combiner 116 where the predicted pixel values are subtracted from the actual (input) pixel values in the corresponding block of the current macroblock. In this way a set of prediction error blocks for the macroblock is obtained.
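For the simple translational case, the formation of a prediction and of the corresponding prediction error block can be sketched as follows. The motion vector is taken to be an integer pixel offset `(dx, dy)`, which is an assumption; the function names are hypothetical, and `reconstruct` models the addition performed by combiner 112 during local decoding.

```python
def predict_block(ref, bx, by, mv, size):
    """Fetch the block of the reference frame displaced by the motion vector."""
    dx, dy = mv
    return [row[bx + dx:bx + dx + size] for row in ref[by + dy:by + dy + size]]


def prediction_error(block, predicted):
    """Combiner 116: actual pixel values minus predicted pixel values."""
    return [[a - p for a, p in zip(row_a, row_p)]
            for row_a, row_p in zip(block, predicted)]


def reconstruct(predicted, error):
    """Combiner 112: predicted pixel values plus (decoded) prediction error."""
    return [[p + e for p, e in zip(row_p, row_e)]
            for row_p, row_e in zip(predicted, error)]
```

If the prediction error were transmitted without quantisation, `reconstruct(predicted, prediction_error(block, predicted))` would return the original block exactly; in practice the error is transformed and quantised first, so the reconstruction is only approximate.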
Operation of the video decoder 200, shown in FIG. 2, will now be described. The decoder 200 comprises a video multiplex decoder 270, which receives an encoded video bit-stream 135 from the encoder 100 and demultiplexes it into its constituent parts, an inverse quantiser 210, an inverse DCT transformer 220, a motion compensated prediction block 240, a frame store 250, a combiner 230, a control manager 260, and an output 280.
The control manager 260 controls the operation of the decoder 200 in response to whether an INTRA- or an INTER-coded frame is being decoded. An INTRA/INTER trigger control signal, which causes the decoder to switch between decoding modes is derived, for example, from picture type information provided in a header portion of each compressed video frame received from the encoder. The INTRA/INTER trigger control signal is extracted from the encoded video bit-stream by the video multiplex decoder 270 and is passed to control manager 260 via control line 215.
Decoding of an INTRA-coded frame is performed on a macroblock-by-macroblock basis, each macroblock being decoded substantially as soon as encoded information relating to it is received in the video bit-stream 135. The video multiplex decoder 270 separates the encoded information for the blocks of the macroblock from possible control information relating to the macroblock in question. The encoded information for each block of an INTRA-coded macroblock comprises variable length codewords representing the entropy coded level and run values for the non-zero quantised DCT coefficients of the block. The video multiplex decoder 270 decodes the variable length codewords using a variable length decoding method corresponding to the encoding method used in the encoder 100 and thereby recovers the level and run values. It then reconstructs the array of quantised transform coefficient values for each block of the macroblock and passes them to inverse quantiser 210. Any control information relating to the macroblock is also decoded in the video multiplex decoder using an appropriate decoding method and is passed to control manager 260. In particular, information relating to the level of quantisation applied to the transform coefficients is extracted from the encoded bit-stream by video multiplex decoder 270 and provided to control manager 260 via control line 217. The control manager, in turn, conveys this information to inverse quantiser 210 via control line 218. Inverse quantiser 210 inverse quantises the quantised DCT coefficients for each block of the macroblock according to the control information and provides the now inverse quantised DCT coefficients to inverse DCT transformer 220.
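The reconstruction of the array of quantised coefficients from recovered level and run values can be sketched as follows. The symbols are assumed to arrive as a flat list (level, run, level, run, ..., 0) after variable length decoding, and the conventional diagonal zigzag order is assumed to match that used by the encoder; both are illustrative assumptions, as are the function names.

```python
def zigzag_order(n):
    """(row, column) positions of an n x n block in conventional zigzag order."""
    def key(pos):
        r, c = pos
        diagonal = r + c
        return (diagonal, r if diagonal % 2 else -r)

    return sorted(((r, c) for r in range(n) for c in range(n)), key=key)


def run_level_decode(symbols, count):
    """Expand level/run symbols into a list of `count` coefficients."""
    scanned = []
    stream = iter(symbols)
    for level in stream:
        if level == 0:               # EOB: all remaining coefficients are zero
            break
        run = next(stream)
        scanned.extend([0] * run)    # zeros preceding this coefficient
        scanned.append(level)
    scanned.extend([0] * (count - len(scanned)))
    return scanned


def inverse_zigzag(scanned, n):
    """Place a one-dimensional list back into an n x n block."""
    block = [[0] * n for _ in range(n)]
    for value, (r, c) in zip(scanned, zigzag_order(n)):
        block[r][c] = value
    return block


levels = inverse_zigzag(run_level_decode([12, 0, -3, 1, 1, 2, 0], 16), 4)
print(levels)
# [[12, 0, 1, 0], [-3, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
```

The resulting block is what would be passed to inverse quantiser 210.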
Inverse DCT transformer 220 performs an inverse DCT transform on the inverse quantised DCT coefficients for each block of the macroblock to form a decoded block of image information comprising reconstructed pixel values. As motion-compensated prediction is not used in the encoding/decoding of INTRA-coded macroblocks, control manager 260 controls combiner 230 in such a way as to prevent any reference information being used in the decoding of the INTRA-coded macroblock. The reconstructed pixel values for each block of the macroblock are passed to the video output 280 of the decoder where, for example, they can be provided to a display device (not shown). The reconstructed pixel values for each block of the macroblock are also stored in frame store 250. As subsequent macroblocks of the INTRA-coded frame are decoded and stored, a decoded frame is progressively assembled in the frame store 250 and thus becomes available for use as a reference frame for motion compensated prediction in connection with the decoding of subsequently received INTER-coded frames.
INTER-coded frames are also decoded macroblock by macroblock, each INTER-coded macroblock being decoded substantially as soon as encoded information relating to it is received in the bit-stream 135. The video multiplex decoder 270 separates the encoded prediction error information for each block of an INTER-coded macroblock from encoded motion vector information and possible control information relating to the macroblock in question. As explained in the foregoing, the encoded prediction error information for each block of the macroblock comprises variable length codewords representing the entropy coded level and run values for the nonzero quantised transform coefficients of the prediction error block in question. The video multiplex decoder 270 decodes the variable length codewords using a variable length decoding method corresponding to the encoding method used in the encoder 100 and thereby recovers the level and run values. It then reconstructs an array of quantised transform coefficient values for each prediction error block and passes them to inverse quantiser 210. Control information relating to the INTER-coded macroblock is also decoded in the video multiplex decoder 270 using an appropriate decoding method and is passed to control manager 260. Information relating to the level of quantisation applied to the transform coefficients of the prediction error blocks is extracted from the encoded bit-stream and provided to control manager 260 via control line 217. The control manager, in turn, conveys this information to inverse quantiser 210 via control line 218. Inverse quantiser 210 inverse quantises the quantised DCT coefficients representing the prediction error information for each block of the macroblock according to the control information and provides the now inverse quantised DCT coefficients to inverse DCT transformer 220. 
The inverse quantised DCT coefficients representing the prediction error information for each block are then inverse transformed in the inverse DCT transformer 220 to yield an array of reconstructed prediction error values for each block of the macroblock.
The encoded motion vector information associated with the macroblock is extracted from the encoded video bit-stream 135 by video multiplex decoder 270 and is decoded. The decoded motion vector information thus obtained is passed to motion compensated prediction block 240, which reconstructs a motion vector for the macroblock using the same motion model as that used to encode the INTER-coded macroblock in encoder 100. The reconstructed motion vector approximates the motion vector originally determined by motion estimation block 130 of the encoder. The motion compensated prediction block 240 of the decoder uses the reconstructed motion vector to identify the location of a region of reconstructed pixels in a prediction reference frame stored in frame store 250. The reference frame may be, for example, a previously decoded INTRA-coded frame, or a previously decoded INTER-coded frame. In either case, the region of pixels indicated by the reconstructed motion vector is used to form a prediction for the macroblock in question. More specifically, the motion compensated prediction block 240 forms an array of pixel values for each block of the macroblock by copying corresponding pixel values from the region of pixels identified in the reference frame. The prediction, that is, the blocks of pixel values derived from the reference frame, is passed from motion compensated prediction block 240 to combiner 230, where it is combined with the decoded prediction error information. In practice, the pixel values of each predicted block are added to corresponding reconstructed prediction error values output by inverse DCT transformer 220. In this way an array of reconstructed pixel values for each block of the macroblock is obtained. The reconstructed pixel values are passed to the video output 280 of the decoder and are also stored in frame store 250.
As subsequent macroblocks of the INTER-coded frame are decoded and stored, a decoded frame is progressively assembled in the frame store 250 and thus becomes available for use as a reference frame for motion-compensated prediction of other INTER-coded frames.
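By way of illustration only, the reconstruction of one block of an INTER-coded macroblock may be sketched as follows. The function names, the (dy, dx) motion vector convention, the restriction to whole-pixel displacements and the clipping of results to the range 0 to 255 are assumptions made for this sketch and are not taken from the decoder 200 described above.

```python
def predict_block(reference, top, left, mv, size):
    """Copy a size x size region from the reference frame.

    reference : list of rows of pixel values (the stored reference frame)
    top, left : position of the block in the frame being decoded
    mv        : (dy, dx) whole-pixel displacement (assumed convention)
    """
    dy, dx = mv
    rows = reference[top + dy: top + dy + size]
    return [row[left + dx: left + dx + size] for row in rows]


def reconstruct_block(prediction, pred_error):
    """Add prediction error values to the predicted pixel values.

    Clipping to 0..255 assumes 8-bit samples; this is an assumption.
    """
    return [
        [max(0, min(255, p + e)) for p, e in zip(p_row, e_row)]
        for p_row, e_row in zip(prediction, pred_error)
    ]


# Hypothetical 8x8 reference frame and 4x4 prediction error block.
reference = [[r * 8 + c for c in range(8)] for r in range(8)]
pred_error = [[1, 0, 0, 0], [0, -2, 0, 0], [0, 0, 0, 0], [0, 0, 0, 3]]

prediction = predict_block(reference, top=2, left=2, mv=(-1, 1), size=4)
reconstructed = reconstruct_block(prediction, pred_error)
```

In this sketch the reconstructed block would be both output and written back into the frame being assembled, so that it can serve as reference data for later frames.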
Entropy coding of the run and level values associated with the quantised transform coefficients using the technique of variable length coding (VLC) will now be examined in greater detail by means of an example. As explained in the foregoing, the two-dimensional array of quantised transform coefficients produced by transform coding and quantising a block of luminance/chrominance data (INTRA-coding mode) or prediction error data (INTER-coding mode) is first scanned using a zigzag scanning scheme to form an ordered one-dimensional array. A typical scanning order for a 4×4 array of coefficient values is illustrated in FIG. 4. It will be apparent to those skilled in the art that variations in the exact nature of the zigzag scanning order are possible. Furthermore, similar zigzag scanning schemes may also be applied to arrays of other than 4×4 coefficient values.
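A zigzag scan of a 4×4 array may be sketched as follows. Because FIG. 4 is not reproduced in this text, the scanning order used here is one conventional zigzag order and is an assumption; the names ZIGZAG_4X4, zigzag_scan and inverse_zigzag, and the example block, are likewise hypothetical.

```python
# One conventional zigzag order for a 4x4 block, as (row, column) pairs.
# Assumption: the order actually shown in FIG. 4 may differ.
ZIGZAG_4X4 = [
    (0, 0), (0, 1), (1, 0), (2, 0),
    (1, 1), (0, 2), (0, 3), (1, 2),
    (2, 1), (3, 0), (3, 1), (2, 2),
    (1, 3), (2, 3), (3, 2), (3, 3),
]


def zigzag_scan(block):
    """Form the ordered one-dimensional array from a 4x4 block."""
    return [block[r][c] for r, c in ZIGZAG_4X4]


def inverse_zigzag(sequence):
    """Rebuild the 4x4 block from the ordered one-dimensional array."""
    block = [[0] * 4 for _ in range(4)]
    for value, (r, c) in zip(sequence, ZIGZAG_4X4):
        block[r][c] = value
    return block


# Hypothetical block of quantised coefficients.
block = [
    [0, 1, 0, -1],
    [2, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
scanned = zigzag_scan(block)
```

The scan tends to place the low-frequency coefficients, which are the ones most likely to be non-zero, at the start of the one-dimensional array, leaving a long run of zeros at the end.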
The ordered one-dimensional array produced as a result of the zigzag scanning operation is then examined and each non-zero coefficient is represented by a run value and a level value. As previously explained, the run value represents the number of consecutive zero coefficients preceding the coefficient in question. It thus provides an indication of the position of the non-zero coefficient in the scan. The level value is the coefficient's value. An End-Of-Block (EOB) symbol, typically a level value equal to zero, is used to indicate that there are no more non-zero coefficients in the block.
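The conversion between the ordered one-dimensional array and its (run, level) representation may be sketched as follows. The function names, the use of the string "EOB" as the end-of-block marker and the example sequence are assumptions made for illustration.

```python
EOB = "EOB"  # hypothetical marker standing in for the End-Of-Block symbol


def run_level_encode(sequence):
    """Represent each non-zero coefficient as a (run, level) pair."""
    symbols = []
    run = 0
    for coefficient in sequence:
        if coefficient == 0:
            run += 1          # count zeros preceding the next non-zero value
        else:
            symbols.append((run, coefficient))
            run = 0
    symbols.append(EOB)       # trailing zeros are implied by the EOB symbol
    return symbols


def run_level_decode(symbols, length=16):
    """Rebuild the one-dimensional array from (run, level) pairs."""
    sequence = []
    for symbol in symbols:
        if symbol == EOB:
            break
        run, level = symbol
        sequence.extend([0] * run)
        sequence.append(level)
    sequence.extend([0] * (length - len(sequence)))  # pad with trailing zeros
    return sequence


# Hypothetical scanned sequence for a 4x4 block.
sequence = [3, 0, 0, -2, 1] + [0] * 11
symbols = run_level_encode(sequence)
```

Note that the zeros following the final non-zero coefficient are never coded explicitly; the decoder infers them from the block size once the EOB symbol is received.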
In an alternative scheme, each non-zero coefficient is represented by 3 values (run, level, last). In this representation, the level and run parameters serve the same purpose as explained in the previous paragraph. The last parameter indicates that there are no more non-zero coefficients in the scan. When this representation of the coefficients is used, a separate syntax element is used to indicate that a given block is coded and therefore there is no need for a separate EOB symbol.
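The (run, level, last) representation may be sketched as follows. The function name and the convention that last takes the value 1 for the final non-zero coefficient and 0 otherwise are assumptions made for illustration.

```python
def run_level_last_encode(sequence):
    """Represent each non-zero coefficient as a (run, level, last) triplet."""
    positions = [i for i, c in enumerate(sequence) if c != 0]
    triplets = []
    previous = -1
    for n, position in enumerate(positions):
        run = position - previous - 1              # zeros since the last non-zero value
        last = 1 if n == len(positions) - 1 else 0  # assumed 0/1 convention
        triplets.append((run, sequence[position], last))
        previous = position
    return triplets


# Hypothetical scanned sequence for a 4x4 block.
sequence = [3, 0, 0, -2, 1] + [0] * 11
triplets = run_level_last_encode(sequence)
empty = run_level_last_encode([0] * 16)
```

A block containing only zero coefficients produces no triplets at all in this sketch, which illustrates why a separate syntax element is needed to signal whether a block is coded.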
For the purposes of entropy coding, each (run, level) pair (or (run, level, last) triplet) is typically treated as a single symbol. Thus, VLC codewords are assigned to the different possible (run, level) pairs. A unique codeword is also assigned to the EOB symbol. Commonly, the mapping between the possible (run, level) pairs and the VLC codewords is implemented in the form of a fixed look-up table, known to (e.g. stored in) both the encoder and decoder. The VLC codewords are used to convert the symbols to a binary representation which is transmitted to the decoder and are designed in such a way as to be uniquely decodable. In practical terms this means that no VLC codeword may be the prefix for another codeword.
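The prefix condition can be checked mechanically. The following is a minimal sketch in which the function name and both example codeword sets are hypothetical.

```python
def is_prefix_free(codewords):
    """Return True if no codeword is a prefix of a different codeword."""
    words = list(codewords)
    for i, a in enumerate(words):
        for j, b in enumerate(words):
            if i != j and b.startswith(a):
                return False
    return True


# Hypothetical codeword sets.
good = ["0", "10", "110", "111"]   # every codeword can be recognised as soon as it ends
bad = ["0", "01", "11"]            # "0" is a prefix of "01", so decoding is ambiguous

good_ok = is_prefix_free(good)
bad_ok = is_prefix_free(bad)
```

With a set that satisfies the condition, a decoder reading the bit-stream one bit at a time can emit a symbol the moment the accumulated bits match a codeword, without looking ahead.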
Table 1 is a look-up table of the type just described, showing an exemplary mapping between specific (run, level) pairs and VLC codewords. In the example presented in Table 1 the EOB symbol is assigned the shortest codeword.
TABLE 1
Mapping between (run, level) pairs and VLC codewords

Run      Level    VLC index    VLC codeword
—        EOB      0            1
0        1        1            001
0        −1       2            011
1        1        3            00001
1        −1       4            00011
2        1        5            01001
2        −1       6            01011
0        2        7            0000001
0        −2       8            0000011
3        1        9            0001001
3        −1       10           0001011
4        1        11           0100001
4        −1       12           0100011
. . .    . . .    . . .        . . .
FIG. 5 shows an example of a 4×4 array of quantised transform coefficients, such as that generated in a video encoder for an image block in INTRA-coding mode or a block of prediction error values in INTER-coding mode. After applying the zigzag scanning scheme shown in FIG. 4, the ordered one-dimensional sequence of quantised coefficients thus produced has the following elements:    0, 1, 2, 0, 0, 0, −1, 0, 0, 0, 0, 0, 0, 0, 0, 0
This sequence can further be represented as the following set of (run, level) pairs terminated with an EOB symbol:    (1,1), (0,2), (3,−1), EOB.
Applying the mapping between (run, level) pairs and VLC codewords given in Table 1, the following sequence of bits is generated:    00001|0000001|0001011|1
As mentioned above, this is the binary representation of the quantised transform coefficients transmitted in the bit-stream from the encoder to the decoder. In order to correctly decode the bit-stream, the decoder is aware of the mapping between VLC codewords and the (run, level) pairs. In other words, both encoder and decoder use the same set of VLC codewords and the same assignment of symbols to VLC codewords.
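The encoding and decoding of the example may be sketched as follows, using the entries of Table 1. The dictionary and function names are hypothetical, and the bit-stream is modelled as a string of '0' and '1' characters purely for readability.

```python
# Entries of Table 1: symbol -> VLC codeword.
VLC_TABLE = {
    "EOB": "1",
    (0, 1): "001", (0, -1): "011",
    (1, 1): "00001", (1, -1): "00011",
    (2, 1): "01001", (2, -1): "01011",
    (0, 2): "0000001", (0, -2): "0000011",
    (3, 1): "0001001", (3, -1): "0001011",
    (4, 1): "0100001", (4, -1): "0100011",
}
# The decoder holds the same mapping, inverted.
DECODE_TABLE = {code: symbol for symbol, code in VLC_TABLE.items()}


def vlc_encode(symbols):
    """Concatenate the codewords assigned to the symbols."""
    return "".join(VLC_TABLE[s] for s in symbols)


def vlc_decode(bits):
    """Read bits one at a time, emitting a symbol whenever a codeword matches."""
    symbols = []
    current = ""
    for bit in bits:
        current += bit
        if current in DECODE_TABLE:
            symbol = DECODE_TABLE[current]
            symbols.append(symbol)
            current = ""
            if symbol == "EOB":
                break  # end of this block
    return symbols


symbols = [(1, 1), (0, 2), (3, -1), "EOB"]
bits = vlc_encode(symbols)
decoded = vlc_decode(bits)
```

Because no codeword in the table is a prefix of another, the bit-by-bit matching in vlc_decode recovers exactly the symbols that were encoded.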
In order to maximise the compression provided by variable length coding, those symbols which occur most frequently in the data to be coded should be assigned the shortest VLC codewords. However, in image coding, the frequency of occurrence (i.e. probability) of different transform coefficients and hence the probability of different (run, level) pairs changes depending on the image content and the type of the encoded image. Thus, if a single set of variable length codewords is used and only a single mapping between the data symbols to be encoded/decoded and the VLCs is provided, in general, optimum coding efficiency cannot be achieved.
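The effect can be illustrated with a small calculation of average codeword length. The four symbols, their probabilities under two kinds of content and the function name are hypothetical; the codeword lengths 1, 3, 3 and 5 are simply the lengths of the four shortest codewords of a table such as Table 1.

```python
def average_length(probabilities, codeword_lengths):
    """Expected number of bits per symbol for a given symbol-to-codeword mapping."""
    return sum(p * codeword_lengths[s] for s, p in probabilities.items())


# Fixed mapping: symbol A has the shortest codeword, D the longest.
fixed_lengths = {"A": 1, "B": 3, "C": 3, "D": 5}

# Hypothetical symbol probabilities for two different kinds of content.
content_1 = {"A": 0.6, "B": 0.15, "C": 0.15, "D": 0.1}
content_2 = {"A": 0.1, "B": 0.15, "C": 0.15, "D": 0.6}

# Alternative mapping matched to content_2: D now has the shortest codeword.
remapped_lengths = {"D": 1, "B": 3, "C": 3, "A": 5}

len_1 = average_length(content_1, fixed_lengths)
len_2 = average_length(content_2, fixed_lengths)
len_2_remapped = average_length(content_2, remapped_lengths)
```

Under these assumed figures the fixed mapping costs about 2 bits per symbol for the first kind of content but about 4 bits per symbol for the second, whereas a mapping matched to the second kind of content brings the cost back to about 2 bits per symbol.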
One solution to this problem is to transmit the variable length codewords and their assignment to the different data symbols as a part of the bit-stream. This possibility is included in the international still image compression standard ISO/IEC 10918-1 “Digital Compression and Coding of Continuous-Tone Still Images”/ITU-T recommendation T.81 developed by the Joint Photographic Experts Group and commonly referred to as the JPEG image coding standard. If this option is employed, the probabilities of different data symbols, for example the probabilities of different (run, level) pairs, are calculated for each image to be coded. This information is then used to create the VLC codewords and to define the mapping between the data symbols and the codewords. The codewords and the mapping information are, for example, included in the compressed file for a given image and are transmitted in the bit-stream from the encoder to the decoder. This solution allows the codewords and the mappings between the codewords and the data symbols to be constructed in a way that is adaptive to the nature/content of the image to be coded. In this way a level of data compression can be achieved which generally exceeds that which could be attained if fixed codewords and mappings were used. However, this approach has a number of technical disadvantages, which make it unsuitable for use in video applications. More specifically, a significant delay is introduced, as each image, or each part thereof, requires pre-processing before any of the image data can be encoded and transmitted. Furthermore, a large number of bits is required to specify information about the variable length codewords and their assignment to the data symbols. Additionally, error resilience is a significant problem.
If information relating to the codewords, or the mapping between the codewords and the data symbols, is lost or has residual errors after undergoing error correction at the decoder, the bit-stream comprising the encoded image data cannot be decoded correctly.
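The construction of image-specific codewords from measured symbol statistics may be sketched with a generic Huffman procedure, as follows. This is not the table-generation procedure specified in the JPEG standard; the function name, the symbol counts and the tie-breaking rule are assumptions made for illustration.

```python
import heapq
from collections import Counter


def build_huffman_code(symbols):
    """Derive a prefix-free code from the symbol counts of one image."""
    counts = Counter(symbols)              # the pre-processing pass over the data
    if len(counts) == 1:
        return {next(iter(counts)): "0"}
    # Heap entries: (count, tie_breaker, {symbol: partial codeword}).
    heap = [(n, i, {s: ""}) for i, (s, n) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        n0, _, code0 = heapq.heappop(heap)  # two least frequent subtrees
        n1, _, code1 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in code0.items()}
        merged.update({s: "1" + c for s, c in code1.items()})
        heapq.heappush(heap, (n0 + n1, tie, merged))
        tie += 1
    return heap[0][2]


# Hypothetical symbol statistics gathered from one image.
data = [(0, 1)] * 8 + [(0, -1)] * 4 + [(1, 1)] * 2 + ["EOB"] * 1
code = build_huffman_code(data)
total_bits = sum(len(code[s]) for s in data)
```

The sketch makes the drawbacks visible: all of the data must be examined before the first codeword can be emitted, and the resulting table must itself be conveyed to the decoder intact before any symbol can be decoded.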
In an alternative technique aimed at improving the data compression provided by variable length coding, known as adaptive VLC coding, initial VLC codes and mappings are calculated in both the encoder and the decoder based on a priori symbol probability estimates. In image coding applications these probability estimates may be calculated in advance, for example using a database of so-called ‘training’ images representative/typical of those to be encoded and transmitted. Subsequently, the symbol probability estimates are updated in the encoder and decoder as further encoded data symbols are transmitted. Using the updated probability estimates the encoder and decoder re-calculate the VLC codewords and their assignments. This re-calculation may be performed very frequently, for example after receiving each new symbol. The main drawbacks of this method are high computational complexity (particularly if the probability estimates are re-calculated very frequently) and poor error resilience. Incorrect decoding of one symbol causes a mismatch between the encoder and decoder symbol counts causing the VLC codes designed in the encoder and decoder to differ from that point onwards. This means that the probability counts should be reset at frequent intervals and this tends to decrease the coding efficiency achieved by using this method.
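The behaviour of such an adaptive scheme, including its sensitivity to errors, may be sketched as follows. For simplicity the sketch assumes a fixed, ordered list of codewords and adapts only the assignment of symbols to codeword indices, re-ranking the symbols by count after every symbol; the class name, function names and initial counts are hypothetical.

```python
class AdaptiveMapper:
    """Assigns codeword indices to symbols by descending symbol count."""

    def __init__(self, initial_counts):
        self.counts = dict(initial_counts)   # a priori estimates, shared by both ends
        self._rank()

    def _rank(self):
        # Most frequent symbol gets index 0; ties keep their earlier order.
        self.order = sorted(self.counts, key=lambda s: -self.counts[s])

    def update(self, symbol):
        self.counts[symbol] += 1
        self._rank()                         # re-calculation after each symbol


def adaptive_encode(symbols, initial_counts):
    mapper = AdaptiveMapper(initial_counts)
    indices = []
    for s in symbols:
        indices.append(mapper.order.index(s))
        mapper.update(s)
    return indices


def adaptive_decode(indices, initial_counts):
    mapper = AdaptiveMapper(initial_counts)
    symbols = []
    for i in indices:
        s = mapper.order[i]
        symbols.append(s)
        mapper.update(s)                     # must mirror the encoder exactly
    return symbols


initial = {"a": 3, "b": 2, "c": 1}
sent = ["c", "c", "c", "a"]
indices = adaptive_encode(sent, initial)
received_ok = adaptive_decode(indices, initial)

# Simulate a transmission error in the first index only.
corrupted = [1] + indices[1:]
received_bad = adaptive_decode(corrupted, initial)
```

In this example the single corrupted index causes the decoder's counts to diverge from the encoder's, so that a later symbol is decoded incorrectly even though its own index was received without error.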
As previously mentioned, modern video coding systems typically provide more than one method of entropy coding. For example, ITU-T recommendation H.26L, as described in G. Bjontegaard, “H.26L Test Model Long Term Number 8 (TML-8) draft 0”, VCEG-N10, June 2001, section 5, provides two alternative methods/modes of entropy coding. The first, default, method is based on variable length coding and the other is a form of arithmetic coding known as context-based binary arithmetic coding (or CABAC for short).
The variable length coding mode of H.26L provides a number of tables specifying VLC codewords and their assignment to data symbols. In the encoder, the particular table selected for use depends on the type of information to be encoded and transmitted. For example, separate VLC lookup tables are provided for the coding of data symbols (e.g. (run, level) pairs) associated with different types of coded image blocks (e.g. INTRA-coded (I) or INTER-coded (P) type blocks), different components of the colour model (luminance or chrominance components) or different values of quantisation parameter (QP). This approach offers a good trade-off between computational complexity and compression efficiency. However, its performance depends on how well the parameters used to switch between the tables characterise the statistical properties of the data symbols.
The context-based binary arithmetic coding mode of H.26L takes advantage of the inherently adaptive nature of arithmetic coding and generally provides improved compression efficiency compared with the default VLC coding mode. However, it has comparatively high computational complexity and its use in error prone environments is problematic. Specifically, it suffers technical shortcomings relating to the loss of synchronisation between encoder and decoder which can arise if transmission errors cause incorrect decoding of part of a codeword. Furthermore, the computational complexity of the CABAC method adopted in the H.26L recommendation is especially high on the decoder side where the time taken for symbol decoding may represent a large fraction of the total decoding time.
Because of the inherent problems of high computational complexity and sensitivity to transmission errors associated with arithmetic coding, variable length coding is still viewed as a powerful and efficient method of entropy coding for use in video coding systems. However, there is still a desire and need to improve the adaptability of VLC coding schemes to the type and statistical properties of the data symbols to be coded so that a high degree of data compression can be achieved consistently. This gives rise to a technical problem concerning the way in which improved adaptability and compression efficiency can be achieved without giving rise to a significant increase in computational complexity or sensitivity to transmission errors.