1. Field of the Invention
The present invention relates to the encoding and decoding of scalable digital video signals, and more particularly to a method of scaling interlaced images for compatible encoding and decoding of video signals of multiple scanning standards.
2. Description of the Prior Art
Numerous picture bit rate reduction coding schemes are known for compressing digitized video signals for transmission and storage at a reduced bit rate. Activities for developing these techniques are in progress. International standards have been created, while some are still under development, by organizations like the CCITT and ISO.
New coding schemes are required to maintain compatibility with existing coding schemes while including some degree of scalability where the reconstructed video signals can have a multiplicity of spatial resolutions. When a new standard decoder is able to decode pictures from the signal of an existing standard encoder, the scheme is known to be forward compatible. On the other hand, when an existing standard decoder is able to decode pictures from the signal of a new standard encoder, the new scheme is known to be backward compatible. The demand to satisfy both the forward and backward compatibility can be achieved by layered coding.
An encoding system for layered coding essentially consists of multiple layers of encoders coupled to each other. For simplicity, the description hereafter will be concentrated on, but not limited to, a two-layer system where the low layer processes the standard TV video signals (SDTV) while the high layer processes the high definition video signals (HDTV). In an alternative system, the low and high layers may be assigned to process low definition video signals (LDTV) and SDTV signals, respectively. The encoding system receives a sequence of HDTV images which are down-converted to SDTV resolution. The low layer encoder compresses and encodes the down-converted images to produce a low layer data stream. The compressed SDTV signals are locally decoded for use as predictors for the high layer encoder and a subsequent encoding process at the same layer. At the same time, the high layer encoder compresses and encodes the original HDTV images to produce a high layer data stream. Similarly, the compressed HDTV signals are locally decoded for use as predictors for a subsequent encoding process at the same layer. Hence, there are two predictors for the high layer encoder: one comes from the same (high) layer while the other comes from the low layer. The predictor from the same (high) layer, hereafter referred to as a "temporal predictor", is a past or future decompressed picture in the display order. The predictor from the low layer, hereafter referred to as a "compatible predictor", is spatially up-converted to HDTV resolution and used for compatible prediction. Both the temporal and compatible predictors may be used separately or together in a weighted average form. Finally, the low and high layer data streams are multiplexed for transmission and/or storage.
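The combination of the two predictors described above can be sketched as follows. This is a minimal illustration, not the encoder of any particular standard: the function names `combine_predictors` and `residual`, the `weight` parameter, and the list-of-rows block representation are all assumptions introduced here. The compatible predictor is assumed to have already been up-converted to the same size as the temporal predictor.

```python
def combine_predictors(temporal, compatible, weight=0.5):
    """Blend two equally sized predictor blocks pel by pel.

    weight = 1.0 uses the temporal predictor alone, weight = 0.0 uses the
    compatible predictor alone, and values in between give a weighted
    average. The parameter name and range are assumptions of this sketch.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    if len(temporal) != len(compatible):
        raise ValueError("predictor blocks must have the same number of lines")
    return [
        [weight * t + (1.0 - weight) * c for t, c in zip(t_row, c_row)]
        for t_row, c_row in zip(temporal, compatible)
    ]


def residual(block, prediction):
    """Difference between the source block and its prediction."""
    return [
        [x - p for x, p in zip(x_row, p_row)]
        for x_row, p_row in zip(block, prediction)
    ]


# One line of two pels from each predictor, blended with equal weights.
prediction = combine_predictors([[100, 120]], [[80, 100]], weight=0.5)
error = residual([[95, 108]], prediction)
```

Only the residual between the source block and the chosen prediction would then need to be compressed by the high layer encoder.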
A corresponding decoding system of a two-layer encoding system essentially consists of a low layer decoder and a high layer decoder. The transmitted signal is first demultiplexed to the low and high layer data streams. The low layer data stream is decompressed and decoded by the low layer decoder to produce reconstructed SDTV images. These reconstructed SDTV images are up-converted to HDTV resolution and used as compatible predictors for the high layer decoder. At the same time, the high layer data stream is decompressed and decoded by the high layer decoder, based on the temporal and compatible predictors, to reconstruct the HDTV images. The decoding system can, therefore, produce images at SDTV and HDTV resolutions, allowing some degree of scalability.
Efficient up- and down-conversions are crucial to the layered coding described above, especially when both the SDTV and HDTV images are interlaced. Early technologies adopted intra-field conversion methods where all the lines in an output field were derived from the lines in only one field of the input. For down-conversion, each field of the input interlaced frames is filtered and down-sampled independently, and for up-conversion, each field of the input interlaced frames is interpolated vertically from two or more lines in the same field. It is well recognized that intra-field down-conversion is inadequate for deriving interlaced lower resolution images from a higher resolution interlaced source. The problem becomes even worse when the down-converted images are coded and up-converted, based on an intra-field method, and used as a compatible prediction for layered coding.
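The intra-field conversions described above can be sketched as follows, for the vertical direction only and for a factor of two. The two-tap averaging filters, the function names, and the frame representation (a list of lines, each a list of pels, with even-indexed lines forming one field) are assumptions of this sketch. The vertical phase offset between the two fields is deliberately ignored.

```python
def split_fields(frame):
    """Separate an interlaced frame into its two fields."""
    return frame[0::2], frame[1::2]


def merge_fields(first, second):
    """Interleave two fields back into one interlaced frame."""
    frame = []
    for a, b in zip(first, second):
        frame.append(a)
        frame.append(b)
    return frame


def down_field(field):
    """Halve the line count of one field by averaging pairs of its lines."""
    return [
        [(a + b) / 2 for a, b in zip(field[i], field[i + 1])]
        for i in range(0, len(field) - 1, 2)
    ]


def up_field(field):
    """Double the line count of one field by interpolating between its lines."""
    out = []
    for i, line in enumerate(field):
        below = field[min(i + 1, len(field) - 1)]  # repeat the last line at the edge
        out.append(list(line))
        out.append([(a + b) / 2 for a, b in zip(line, below)])
    return out


def intrafield_down(frame):
    """Down-convert each field independently, then re-interleave."""
    first, second = split_fields(frame)
    return merge_fields(down_field(first), down_field(second))


def intrafield_up(frame):
    """Up-convert each field independently, then re-interleave."""
    first, second = split_fields(frame)
    return merge_fields(up_field(first), up_field(second))


# Eight lines of one pel each, forming a vertical ramp.
ramp = [[0], [1], [2], [3], [4], [5], [6], [7]]
small = intrafield_down(ramp)
restored = intrafield_up(small)
```

Each output field here depends on a single input field, which is the defining property of the intra-field approach.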
Further improvements could be achieved by employing temporal interpolation in addition to vertical interpolation, i.e., by using lines from more than one input field in deriving the lines in an output field. Temporal interpolation improves vertical definition on stationary images, but causes very noticeable impairment for scenes with fast movements. One way of solving the problem is to use an adaptive system in which the temporal interpolation is used for stationary and very slowly moving scenes and the vertical interpolation is used for scenes with faster movements. An example of the adaptive method can be found in UK Patent GB 2184628.
It was later realized that a non-adaptive spatio-temporal interpolation could perform as well as the adaptive system but with less complexity because no movement detector circuitry is needed. The details of the non-adaptive spatio-temporal interpolation are found in Devereux, V.G., "Standards conversion between 1250/50 and 625/50 TV systems", IBC 92 paper. According to this method, the interpolation of one field is derived from a plurality of lines in the same field and in the adjacent fields which come immediately before and after the target field. This is illustrated in FIG. 1, in which the vertical position of lines is plotted vertically against time horizontally. In the diagram, a pel at line 1e is derived by filtering the pels at lines 1b, 1c, 2a, 2b, 2c, 3a, 3b, 3c. Similarly, a pel at line 2e is derived by filtering the pels at lines 2b, 2c, 1b, 1c, 1d, 4b, 4c, 4d. The weights for filtering the lines in the adjacent fields always sum to zero. It is important to note that this method involves a three-field aperture filtering which extends to the neighboring frames in the display order.
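The three-field aperture can be sketched for a single output pel as follows. The tap counts follow the description above (two lines from the same field, three from each adjacent field), but the weight values themselves are hypothetical and are not taken from the cited paper. They are chosen only so that the same-field weights sum to one and the adjacent-field weights sum to zero.

```python
# Hypothetical weights, chosen only to satisfy the sum constraints.
CURRENT_WEIGHTS = (0.5, 0.5)               # same field: sums to 1
ADJACENT_WEIGHTS = (-0.125, 0.25, -0.125)  # each adjacent field: sums to 0


def three_field_pel(current, previous, following):
    """Filter one output pel from vertically neighboring pels in three fields.

    current   -- two pels from the target field (e.g. lines 1b, 1c)
    previous  -- three pels from the field immediately before (e.g. 2a, 2b, 2c)
    following -- three pels from the field immediately after (e.g. 3a, 3b, 3c)
    """
    if len(current) != len(CURRENT_WEIGHTS):
        raise ValueError("expected two pels from the current field")
    if len(previous) != len(ADJACENT_WEIGHTS) or len(following) != len(ADJACENT_WEIGHTS):
        raise ValueError("expected three pels from each adjacent field")
    same_field = sum(w * p for w, p in zip(CURRENT_WEIGHTS, current))
    before = sum(w * p for w, p in zip(ADJACENT_WEIGHTS, previous))
    after = sum(w * p for w, p in zip(ADJACENT_WEIGHTS, following))
    return same_field + before + after


# A flat area passes through unchanged: the adjacent fields contribute nothing.
flat = three_field_pel([100, 100], [100, 100, 100], [100, 100, 100])

# Vertical detail present only in the adjacent fields adds to the result.
detail = three_field_pel([100, 100], [100, 120, 100], [100, 120, 100])
```

Because the adjacent-field weights sum to zero, those fields add no net brightness to the output and contribute only vertical detail that the target field lacks.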
A coding scheme which adopts "bi-directional prediction", however, does not process the images according to the display order. A bidirectionally predicted picture is coded by referring to both past and future pictures in the display order. The encoding/decoding order of a single layer system is illustrated in FIGS. 2(a)-2(d), in which the position of the frames, each consisting of two fields, is plotted against time. In FIG. 2(a), frames #1 and #2 are predicted from frames #0 and #3. Therefore, frame #3 has to be coded before frames #1 and #2, as depicted in FIG. 2(b). The re-ordering of the input images demands four frame memories for storing the input images and incurs a three-frame delay between the input and output of the encoder. At the decoder, the reconstructed images have to be re-ordered back to the original order for display, which incurs another three-frame delay. This is shown in FIG. 2(c). When this bi-directional prediction scheme is applied to the two-layer encoding/decoding system which adopts the three-field aperture interpolation method for up- and down-conversion, the frame memories and delay will at least double. The reason is that compatible predictors from the low layer have to be re-ordered back to the display order before the three-field aperture interpolation is applied to up-convert the predictors to HDTV resolution. The number of frame memories and the delay involved will grow further as the number of layers increases.
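The re-ordering between display order and coding order can be sketched as follows. The function names and the `ref_spacing` parameter are assumptions of this sketch. It also assumes, as in the example above, that every third frame is a reference frame and that the frames between two references are bidirectionally predicted from them.

```python
def to_coding_order(num_frames, ref_spacing=3):
    """Return display-order frame indices arranged in coding order.

    Each reference frame is coded before the bidirectionally predicted
    frames that precede it in display order.
    """
    if num_frames <= 0:
        return []
    order = [0]
    prev_ref = 0
    for ref in range(ref_spacing, num_frames, ref_spacing):
        order.append(ref)                        # future reference first
        order.extend(range(prev_ref + 1, ref))   # then the frames between
        prev_ref = ref
    # Assumption: trailing frames with no future reference keep display order.
    order.extend(range(prev_ref + 1, num_frames))
    return order


def to_display_order(coded):
    """Restore display order at the decoder by sorting on the frame index."""
    return sorted(coded)


coded = to_coding_order(7)
displayed = to_display_order(coded)
```

For four frames this gives the order 0, 3, 1, 2, matching the single-layer example in which frame #3 is coded before frames #1 and #2.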