The invention relates to television systems, in particular the high-resolution systems known as high definition television (HDTV).
A television broadcast consists of a sequence of still frames displayed in rapid succession. The frame rate necessary to achieve proper motion rendition is usually high enough that there are only small variations from one frame to the next (i.e., there is a great deal of temporal redundancy among adjacent frames). Much of the variation between adjacent frames is due to object motion.
A known technique for exploiting this limited variation between frames is motion-compensated image coding. In such coding, the current frame is predicted from the previously encoded frame using motion estimation and compensation, and the difference between the actual current frame and the predicted current frame is coded. By coding only the difference, or residual, rather than the image frame itself, it is possible to improve image quality, because the residual tends to have lower amplitude than the image and can thus be coded with greater accuracy.
Motion estimation and compensation are discussed in Lim, J. S., Two-Dimensional Signal and Image Processing, Prentice Hall, pp. 497-507 (1990). A frame of estimated motion vectors is produced by comparing the current and previous frames. Typically, each motion vector is simply a pair of x and y values representing estimates of the horizontal and vertical displacement of the image from one frame to the next at a particular location. The motion vectors are coded as side information. In the decoder, the current image frame is computed by summing the decoded residual with a motion-compensated version of the prior image frame. Motion compensation is typically performed on each pixel of the prior frame using bilinear interpolation between nearest motion vectors.
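By way of illustration only, the prediction, residual, and reconstruction steps described above can be sketched as follows. The sketch assumes one integer motion vector per 2x2 block rather than the per-pixel bilinear interpolation between motion vectors mentioned above, clamps coordinates at the frame border, and uses hypothetical function names and pixel values throughout.

```python
def motion_compensate(prev, vectors, block=2):
    """Predict the current frame by displacing `prev` with one (dx, dy) vector per block.

    Assumption: integer block vectors, not bilinear interpolation between vectors.
    """
    h, w = len(prev), len(prev[0])
    pred = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dx, dy = vectors[y // block][x // block]
            # Fetch the pixel the content moved from, clamped to the frame border.
            sy = min(max(y - dy, 0), h - 1)
            sx = min(max(x - dx, 0), w - 1)
            pred[y][x] = prev[sy][sx]
    return pred


def subtract(a, b):
    """Element-wise difference of two frames (encoder side: the residual)."""
    return [[p - q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


def add(a, b):
    """Element-wise sum of two frames (decoder side: residual plus prediction)."""
    return [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


# Hypothetical 4x4 previous frame.
prev = [[10, 20, 30, 40],
        [50, 60, 70, 80],
        [15, 25, 35, 45],
        [55, 65, 75, 85]]

# Current frame: the previous frame moved one pixel to the right, plus one changed pixel.
cur = [[row[max(x - 1, 0)] for x in range(4)] for row in prev]
cur[2][3] += 5

# One motion vector (dx, dy) per 2x2 block; here every block moved right by one pixel.
vectors = [[(1, 0), (1, 0)], [(1, 0), (1, 0)]]

pred = motion_compensate(prev, vectors)
residual = subtract(cur, pred)   # coded and transmitted with the vectors
decoded = add(residual, pred)    # what the decoder reconstructs
```

In this example the residual is zero everywhere except at the single changed pixel, which reflects why the residual can usually be coded more accurately than the image itself.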
In a motion compensated television system, some means must be provided for initializing the television receiver, as otherwise it has no starting point from which to construct frames from the received residuals. One technique for initialization is to periodically (e.g., once per second) transmit an original image; the receiver simply waits until it receives an original image before providing a display. Another technique is to use as the predictor not the previous frame but only, say, 98% of the previous frame. This causes the residual to contain 2% of the original image (a so-called "leakage factor"), with the result that the receiver will initialize itself over a short period of time (e.g., a one-second time constant). In a television receiver that uses either of these techniques, there can be a noticeable delay before an image is available following a change of channel. Furthermore, injecting even 2% of the original image into the residual can significantly degrade performance, because it can substantially increase the energy of the residual and thereby reduce the accuracy with which the residual can be transmitted.
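The self-initializing behavior of the leakage technique can be illustrated for a single pixel as follows. This is a minimal sketch under simplifying assumptions: the source pixel is constant, the residual is transmitted without loss, and the receiver starts from an all-zero frame. The function names are hypothetical.

```python
import math

LEAK = 0.98  # fraction of the previous frame used as the predictor (value from the text)


def receiver_error(source, leak, frames):
    """Return |decoder - source| per frame for a decoder that starts from zero."""
    enc_prev, dec_prev = source, 0.0
    errors = []
    for _ in range(frames):
        residual = source - leak * enc_prev   # encoder predicts from its own history
        dec = residual + leak * dec_prev      # decoder predicts from whatever it has
        errors.append(abs(dec - source))
        enc_prev, dec_prev = source, dec
    return errors


def frames_to_decay(leak, target):
    """Number of frames until the initial error shrinks to the fraction `target`."""
    return math.ceil(math.log(target) / math.log(leak))


errors = receiver_error(100.0, LEAK, 50)
n_tau = frames_to_decay(LEAK, math.exp(-1))  # frames per time constant
```

Under these assumptions the receiver error shrinks by the factor 0.98 with each frame, so one time constant corresponds to roughly 50 frames.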
A related difficulty in motion compensated television systems is handling scene changes. In a typical scene change, there is little correlation between the current and previous frames, and thus motion estimation and compensation are not effective. A known technique for dealing with scene changes is simply to rely on the motion estimator to decide, on a block-by-block basis, whether the difference between adjacent image frames is so large that motion compensation should not be performed for that block of the image. In this way, scene changes are handled using the same local, block-by-block decisions that are used for dealing with other situations in which motion compensation fails locally (e.g., rapidly moving objects that exceed the dynamic range of the motion estimator). When it is determined that a block is not to be motion compensated but is instead to be sent as an original image, information indicating such treatment is sent in place of the motion vector for that block. The decoder in the receiver initializes that block with the received pixels for the block instead of doing a motion compensated prediction for the block.
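The block-by-block decision can be sketched as follows. The text does not specify the criterion used by the motion estimator, so the sum of absolute differences and the threshold below are assumptions chosen only for illustration, as are the mode labels and sample values.

```python
def choose_modes(cur_blocks, pred_blocks, threshold):
    """Label each block "mc" (motion compensated) or "intra" (sent as original pixels).

    Assumption: a block is sent as original pixels when the sum of absolute
    differences between it and its motion-compensated prediction exceeds `threshold`.
    """
    modes = []
    for cur, pred in zip(cur_blocks, pred_blocks):
        sad = sum(abs(c - p) for c, p in zip(cur, pred))
        modes.append("intra" if sad > threshold else "mc")
    return modes


# Two hypothetical 2x2 blocks, flattened to lists of four pixels each.
cur_blocks = [[10, 10, 10, 10], [200, 50, 90, 10]]
pred_blocks = [[10, 11, 10, 9], [10, 10, 10, 10]]

modes = choose_modes(cur_blocks, pred_blocks, threshold=40)
```

Here the first block is predicted well and remains motion compensated, while the second block, as would occur at a scene change, is flagged to be sent as original pixels.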
There are two principal techniques for coding images: waveform coding, in which intensity values are directly coded, and transform coding, in which the image frame is transformed to a domain significantly different from the image intensity domain, and the resulting transform "coefficients" are encoded. Transform coding is discussed in Lim, J. S., Two-Dimensional Signal and Image Processing, Prentice Hall, pp. 642-656 (1990). Typically, the image is divided into a plurality of blocks, and each block is separately transformed. A transform in common use is the discrete cosine transform (DCT). Objectionable "blocking" artifacts can occur in transform-coded images, particularly in those encoded with DCT. Alternatives such as the lapped orthogonal transform (LOT), in which blocks overlap, have been tried in an effort to mitigate such "blocking" artifacts.
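For illustration, a block transform of the kind described can be sketched in one dimension. The sketch assumes the orthonormal DCT-II applied to a single 8-sample block, whereas a practical image coder would apply the transform along both dimensions of each block. The function names and sample values are hypothetical.

```python
import math


def _scale(k, n):
    """Normalization that makes the DCT-II basis orthonormal."""
    return math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)


def dct(block):
    """Orthonormal 1-D DCT-II of one block of samples."""
    n = len(block)
    return [
        _scale(k, n) * sum(
            x * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
            for i, x in enumerate(block)
        )
        for k in range(n)
    ]


def idct(coeffs):
    """Inverse of `dct` (the transpose of the orthonormal basis)."""
    n = len(coeffs)
    return [
        sum(
            _scale(k, n) * c * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
            for k, c in enumerate(coeffs)
        )
        for i in range(n)
    ]


flat = dct([5.0] * 8)                      # a featureless block: only coefficient 0 is non-zero
block = [52.0, 55.0, 61.0, 66.0, 70.0, 61.0, 64.0, 73.0]
rebuilt = idct(dct(block))                 # the transform itself is lossless
```

The transform concentrates a smooth block into few coefficients; any loss arises from the subsequent quantization of the coefficients, not from the transform.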
One waveform coding technique is subband coding, in which the image is typically filtered by a bank of bandpass filters, each of essentially the same bandwidth. Each filtered image represents a different spatial frequency band. The filtered images are subsampled equally (in view of the equal bandwidths of the filters), with the result that the collection of filtered, subsampled "images" together occupy the same number of pixels as the original image.
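The property that the filtered, subsampled bands together occupy the same number of values as the original can be illustrated with the simplest possible case. The sketch assumes a one-dimensional signal and a two-band split using pairwise averages and differences (Haar-style filters), which is an assumption for illustration rather than the filter bank of any particular system.

```python
def analyze(signal):
    """Split a signal into a low band and a high band, each subsampled by two."""
    low = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    high = [(signal[i] - signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    return low, high


def synthesize(low, high):
    """Rebuild the signal from the two subsampled bands."""
    out = []
    for l, h in zip(low, high):
        out.extend([l + h, l - h])
    return out


signal = [4, 6, 10, 12, 8, 8, 2, 0]   # hypothetical samples, even length assumed
low, high = analyze(signal)
rebuilt = synthesize(low, high)
```

The two bands hold four values each, eight in total, which matches the length of the input, and the input is recovered exactly from them.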
It was shown in Baylon, D. M. and Lim, J. S., "Transform/Subband Analysis and Synthesis of Signals," pp. 540-544, 2nd Int. Symp. on Signal Processing and its Applications, Gold Coast, Australia (Aug. 24-30, 1990) that transform coding and subband coding are mathematically equivalent. The transform coefficients in block i,j of a transform frame can be made to correspond to the i,j values within each of the subbands in the subband frame by choosing the bandpass filters and transform operations consistently.
A variation on subband coding is Laplacian pyramid coding, as discussed in Lim, J. S., Two-Dimensional Signal and Image Processing, Prentice Hall, pp. 632-640 (1990). The original image f_0 (FIG. 4A) is successively lowpass filtered and subsampled, to produce a "pyramid" of successively lower frequency, subsampled images, e.g., f_1, f_2, f_3, and f_4 shown in FIG. 4A. The lowest frequency images have relatively fewer values, but the total number of values is greater than the number of pixels in the original image. In Laplacian pyramid coding, there is generated a difference "image", or high-frequency residual, e_k, consisting of the difference between the original image, f_k, and a predicted version of the original image, produced by interpolating the next lower band image f_{k+1}. The coded representation of the image consists of the series of difference "images" e_0, e_1, e_2, and e_3 and the lowest-subsampled image, f_4. At the decoder, the original image is rebuilt by starting with the lowest-subsampled image f_4, and the adjoining difference e_3, to create a prediction of the next higher subsampled image f_3, and the process is repeated until a prediction of f_0 is generated. Such pyramid coding can lead to lower bit rates, but the total number of values used for representation of the original image is greater than the number of pixels in the original image.
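The construction and reconstruction of such a pyramid can be sketched in one dimension. The sketch assumes a two-tap average as the lowpass filter and sample repetition as the interpolator, both chosen only for brevity, and uses hypothetical function names and sample values.

```python
def reduce_level(signal):
    """Lowpass filter (pairwise average) and subsample by two."""
    return [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]


def expand_level(signal):
    """Interpolate back to twice the length (sample repetition, an assumption)."""
    return [v for v in signal for _ in range(2)]


def encode(original, levels):
    """Return the difference signals (finest first) and the lowest-resolution signal."""
    f, diffs = list(original), []
    for _ in range(levels):
        lower = reduce_level(f)
        diffs.append([a - b for a, b in zip(f, expand_level(lower))])
        f = lower
    return diffs, f


def decode(diffs, lowest):
    """Rebuild the original by repeatedly interpolating and adding the difference."""
    f = lowest
    for e in reversed(diffs):
        f = [a + b for a, b in zip(expand_level(f), e)]
    return f


f0 = [3, 5, 8, 8, 10, 14, 9, 7, 6, 6, 4, 2, 1, 1, 0, 2]
diffs, f4 = encode(f0, 4)
total_values = sum(len(e) for e in diffs) + len(f4)   # 16 + 8 + 4 + 2 + 1
rebuilt = decode(diffs, f4)
```

For this 16-sample input the coded representation holds 31 values, which illustrates the overhead noted above: the pyramid is exactly invertible but uses more values than the original has samples.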
Another approach to pyramidal coding of images uses variable bandwidths for the subbands to produce the same number of values as there are pixels in the image (Adelson, Edward H., Simoncelli, Eero, and Hingorani, Rajesh, "Orthogonal pyramid transforms for image coding," Proceedings of SPIE, Oct. 1987). Three high-frequency subbands are transmitted, each containing one-fourth as many values as there are pixels in the original image. One subband contains high-frequency-vertical and high-frequency-horizontal information, and the other two contain low-frequency-vertical/high-frequency-horizontal and high-frequency-vertical/low-frequency-horizontal information. The remaining one-fourth of the values are similarly divided into narrower-bandwidth subbands; three of the subbands contain higher frequency information and occupy three-fourths of the remaining values; the remaining one-fourth is further subdivided in the same manner.
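The bookkeeping of this subdivision can be checked with a short calculation. The sketch below merely counts values per subband for a hypothetical 64x64 image and three levels of subdivision; it does not perform any filtering, and the function name is an assumption.

```python
def subband_sizes(width, height, levels):
    """Number of values in each subband, finest level first, lowest band last."""
    sizes = []
    w, h = width, height
    for _ in range(levels):
        w, h = w // 2, h // 2
        sizes.extend([w * h] * 3)   # three high-frequency subbands at this level
    sizes.append(w * h)             # the remaining low-frequency band
    return sizes


sizes = subband_sizes(64, 64, 3)
total = sum(sizes)
```

The sizes are 1024, 256, and 64 values (three subbands each) plus a final band of 64 values, for a total of 4096, which equals the number of pixels in the 64x64 image.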
The discussion so far has not distinguished between the luminance and chrominance components of the television signal. Typically, there are three components transmitted: Y, I, and Q. The Y, or luminance, component ("luma") represents the intensity of the image. The I and Q, or chrominance, components ("chroma") represent the color of the image. Higher resolution is normally reserved for the luma (e.g., about 85% of the bit rate), because the eye is ordinarily tolerant of high spatial frequency errors in the chroma. Chroma is normally filtered and subsampled (e.g., by a factor of 2×2 to 4×4), to eliminate the highs, to which the eye is not normally sensitive. This works well for natural images, but tends to fail for slowly-moving text and similar images. The low resolution of the chroma tends to produce undesirable artifacts such as "bleeding" of colors at the character edges. Text, graphics, synthetic imagery, and other high-resolution source material will likely be important sources of material for HDTV systems, and subsampling chroma will introduce inherent degradations.
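The "bleeding" artifact can be illustrated with a small example. The sketch assumes a box filter (block averaging) for the chroma filtering and sample repetition for the receiver's interpolation, and it assumes plane dimensions that are multiples of the subsampling factor. The sample values are hypothetical.

```python
def subsample(plane, factor=2):
    """Average each factor-by-factor cell of a chroma plane (box filter, an assumption)."""
    out = []
    for y in range(0, len(plane), factor):
        row = []
        for x in range(0, len(plane[0]), factor):
            cell = [plane[yy][xx]
                    for yy in range(y, y + factor)
                    for xx in range(x, x + factor)]
            row.append(sum(cell) / len(cell))
        out.append(row)
    return out


def upsample(plane, factor=2):
    """Restore the original size by repeating each value (an assumption)."""
    out = []
    for row in plane:
        wide = [v for v in row for _ in range(factor)]
        out.extend([list(wide) for _ in range(factor)])
    return out


# A sharp vertical color edge, as at the side of a text character.
chroma = [[0, 255, 255, 255],
          [0, 255, 255, 255]]

low = subsample(chroma)
restored = upsample(low)
```

After 2x2 subsampling, the edge that originally fell between the first and second columns is replaced by an intermediate value spread across both columns, which is the smearing of color at character edges described above.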
Known techniques for reducing the number of bits to be transmitted are runlength-amplitude representation and statistical coding. Runlength-amplitude representation takes advantage of the fact that coded images typically contain long strings of zeros, particularly when motion compensation is used and the quantity coded is the residual between the actual image and a motion-compensated prediction of the image.
Statistical coding (e.g., Huffman coding) relies on creation of a "codebook" relating possible transmitted signal values to the strings of bits that will represent them in the transmitted signal. To reduce, on average, the number of bits to be transmitted, the signal values most frequently transmitted are assigned to the shortest bit strings, and longer strings are used for less likely signal values, so that the length of the bit string is inversely related to the likelihood of occurrence of the signal value being transmitted.
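A codebook of this kind can be built with the standard Huffman construction, sketched below. The symbol statistics are hypothetical, and the helper name is an assumption; only the Python standard library is used.

```python
import heapq
from collections import Counter


def huffman_codebook(symbols):
    """Map each distinct symbol to a bit string using the Huffman construction."""
    freq = Counter(symbols)
    # Heap entries are (weight, unique id, partial codebook); the unique id breaks
    # ties so that the dictionaries themselves are never compared.
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    if len(heap) == 1:
        return {s: "0" for s in heap[0][2]}
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)   # the two least likely groups
        w2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]


# Hypothetical signal values: zero is by far the most frequent.
values = [0] * 8 + [1] * 4 + [-1] * 2 + [5] + [2]
book = huffman_codebook(values)
total_bits = sum(len(book[v]) for v in values)
```

For these 16 values the most frequent value receives a one-bit string and the rarest values receive four-bit strings, so the sequence occupies 30 bits, compared with 48 bits if each of the five distinct values were given a fixed three-bit string.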
Runlength-amplitude representation and statistical coding have been applied to transmission of transform-coded images. Each block of a DCT-transform-coded image is scanned to produce runlength-amplitude pairs, with one number of each pair representing the length of the string of zeros and the other number representing the non-zero value. A Huffman codebook is developed based on expected statistics of all such runlength-amplitude pairs, and the same codebook is used repeatedly for each block of the image.
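The conversion of a scanned block into runlength-amplitude pairs can be sketched as follows. The end-of-block marker used to stand for the trailing zeros is an assumption, since the text does not say how trailing zeros are signaled, and the coefficient values are hypothetical.

```python
EOB = "EOB"  # hypothetical end-of-block marker standing for the trailing zeros


def to_pairs(coeffs):
    """Convert scanned coefficients to (zero run length, non-zero value) pairs."""
    pairs, run = [], 0
    for c in coeffs:
        if c == 0:
            run += 1
        else:
            pairs.append((run, c))
            run = 0
    pairs.append(EOB)
    return pairs


def from_pairs(pairs, length):
    """Rebuild a block of `length` coefficients from its pairs."""
    coeffs = []
    for p in pairs:
        if p == EOB:
            break
        run, value = p
        coeffs.extend([0] * run)
        coeffs.append(value)
    coeffs.extend([0] * (length - len(coeffs)))   # trailing zeros implied by EOB
    return coeffs


scanned = [12, 0, 0, -3, 0, 0, 0, 1, 0, 0]
pairs = to_pairs(scanned)
rebuilt = from_pairs(pairs, len(scanned))
```

Each resulting pair, rather than each individual coefficient, would then be treated as one symbol when the statistical codebook is applied.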
Using statistical coding complicates the coding process in that it produces variability in the number of bits to be transmitted across what is ordinarily a fixed capacity channel. The conventional solution is to provide a large buffer (e.g., 10-20 frames in size) with feedback to the quantizer. As the buffer fills, the quantizer is made more coarse; this reduces the entropy of the quantizer output and avoids overflow. Similarly, as the buffer empties, the quantizer is made more fine. Some care is required to ensure stability and to ensure that the buffer can never overflow or underflow (by means of "last-ditch" quantizer modes and bit-stuffing, respectively).
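The feedback from buffer to quantizer can be sketched with a toy simulation. Everything numeric here is an assumption made for illustration: the linear mapping from buffer fullness to quantizer step size, the rate model in which the bits produced per frame equal a scene "complexity" divided by the step size, and the clamping that stands in for the last-ditch quantizer modes and bit-stuffing.

```python
def quantizer_step(fullness, capacity, min_step=1.0, max_step=32.0):
    """Coarser quantizer as the buffer fills (linear mapping, an assumption)."""
    frac = min(max(fullness / capacity, 0.0), 1.0)
    return min_step + frac * (max_step - min_step)


def simulate_buffer(complexities, channel_bits, capacity):
    """Return (step, fullness) per frame for a fixed-rate channel.

    Assumption: a frame of complexity c coded with step s produces c / s bits.
    """
    fullness = capacity / 2
    history = []
    for c in complexities:
        step = quantizer_step(fullness, capacity)
        produced = c / step
        fullness += produced - channel_bits
        # Stand-in for last-ditch quantizer modes (top) and bit-stuffing (bottom).
        fullness = min(max(fullness, 0.0), capacity)
        history.append((step, fullness))
    return history


busy = simulate_buffer([2000.0] * 200, channel_bits=100.0, capacity=1000.0)
quiet = simulate_buffer([400.0] * 200, channel_bits=100.0, capacity=1000.0)
```

Under this model, sustained high-complexity material drives the buffer fuller and the quantizer coarser, while low-complexity material lets the buffer drain and the quantizer become finer, in each case settling where the bits produced per frame match the channel rate.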