The present invention relates to a method and apparatus for providing temporal and spatial scaling of video images including video object planes in a digital video sequence. In particular, a motion compensation scheme is presented which is suitable for use with scaled frame mode or field mode video. A scheme for adaptively compressing field mode video using a spatial transformation such as the Discrete Cosine Transformation (DCT) is also presented.
The invention is particularly suitable for use with various multimedia applications, and is compatible with the MPEG-4 Verification Model (VM) 7.0 standard described in document ISO/IEC/JTC1/SC29/WG11 N1642, entitled "MPEG-4 Video Verification Model Version 7.0", April 1997, incorporated herein by reference. The invention can further provide coding of stereoscopic video, picture-in-picture, preview access channels, and asynchronous transfer mode (ATM) communications.
MPEG-4 is a new coding standard which provides a flexible framework and an open set of coding tools for communication, access, and manipulation of digital audio-visual data. These tools support a wide range of features. The flexible framework of MPEG-4 supports various combinations of coding tools and their corresponding functionalities for applications required by the computer, telecommunication, and entertainment (i.e., TV and film) industries, such as database browsing, information retrieval, and interactive communications.
MPEG-4 provides standardized core technologies allowing efficient storage, transmission and manipulation of video data in multimedia environments. MPEG-4 achieves efficient compression, object scalability, spatial and temporal scalability, and error resilience.
The MPEG-4 video VM coder/decoder (codec) is a block- and object-based hybrid coder with motion compensation. Texture is encoded with an 8×8 DCT utilizing overlapped block-motion compensation. Object shapes are represented as alpha maps and encoded using a Content-based Arithmetic Encoding (CAE) algorithm or a modified DCT coder, both using temporal prediction. The coder can handle sprites as they are known from computer graphics. Other coding methods, such as wavelet and sprite coding, may also be used for special applications.
Motion compensated texture coding is a well known approach for video coding. Such an approach can be modeled as a three-stage process. The first stage is signal processing, which includes motion estimation and compensation (ME/MC) and a 2-D spatial transformation; it is followed by quantization and entropy coding. The objective of ME/MC and the spatial transformation is to take advantage of temporal and spatial correlations in a video sequence to optimize the rate-distortion performance of quantization and entropy coding under a complexity constraint. The most common technique for ME/MC has been block matching, and the most common spatial transformation has been the DCT. However, special concerns arise for ME/MC and DCT coding of the boundary blocks of an arbitrarily shaped video object plane (VOP).
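As an illustration only, block matching can be sketched as a full search that minimizes the sum of absolute differences (SAD) between a block of the current frame and displaced candidate blocks of a reference frame. The function names, the 4×4 block size, the ±2 pixel search range, and the toy frames below are assumptions made for this sketch and are not taken from the MPEG-4 VM.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between the n x n block of `cur` at
    (bx, by) and the block of `ref` displaced by (dx, dy)."""
    return sum(
        abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
        for y in range(n)
        for x in range(n)
    )


def block_match(cur, ref, bx, by, n=4, search=2):
    """Full-search block matching; returns (dx, dy, cost) of the best match."""
    height, width = len(ref), len(ref[0])
    best = (0, 0, sad(cur, ref, bx, by, 0, 0, n))
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            # Skip candidates whose reference block falls outside the frame.
            if not (0 <= bx + dx and bx + dx + n <= width):
                continue
            if not (0 <= by + dy and by + dy + n <= height):
                continue
            cost = sad(cur, ref, bx, by, dx, dy, n)
            if cost < best[2]:
                best = (dx, dy, cost)
    return best


# Toy frames (hypothetical data): the current frame is the reference frame
# shifted right by one pixel, so the best match lies one pixel to the left.
ref = [[x * x + 10 * y for x in range(8)] for y in range(8)]
cur = [[ref[y][max(x - 1, 0)] for x in range(8)] for y in range(8)]
dx, dy, cost = block_match(cur, ref, bx=2, by=2)
```

In this sketch the returned vector points from the block in the current frame to its best match in the reference frame. A practical coder would then transform and quantize the difference between the block and its motion-compensated prediction.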
The MPEG-2 Main Profile is a precursor to the MPEG-4 standard, and is described in document ISO/IEC JTC1/SC29/WG11 N0702, entitled "Information Technology--Generic Coding of Moving Pictures and Associated Audio, Recommendation H.262," March 25, 1994, incorporated herein by reference. Scalability extensions to the MPEG-2 Main Profile have been defined which provide for two or more separate bitstreams, or layers. The layers can be combined to form a single high-quality signal. For example, the base layer may provide a lower quality video signal, while the enhancement layer provides additional information that can enhance the base layer image.
In particular, spatial and temporal scalability can provide compatibility between different video standards or decoder capabilities. With spatial scalability, the base layer video may have a lower spatial resolution than an input video sequence, in which case the enhancement layer carries information which can restore the resolution of the base layer to the input sequence level. For instance, an input video sequence which corresponds to the International Telecommunications Union--Radio Sector (ITU-R) 601 standard (with a resolution of 720×576 pixels) may be carried in a base layer which corresponds to the Common Interchange Format (CIF) standard (with a resolution of 360×288 pixels). The enhancement layer in this case carries information which is used by a decoder to restore the base layer video to the ITU-R 601 standard. Alternatively, the enhancement layer may have a reduced spatial resolution.
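The relationship between the two layers in spatial scalability can be illustrated with a minimal sketch. It assumes 2:1 downsampling by averaging, upsampling by pixel replication, and an enhancement layer that carries the unquantized difference between the input frame and the upsampled base layer. These filters and the function names are illustrative assumptions, not the method of any particular standard.

```python
def downsample(frame):
    """Halve the resolution by averaging each 2 x 2 neighbourhood.
    Assumes the frame has an even width and height."""
    return [
        [
            (frame[y][x] + frame[y][x + 1]
             + frame[y + 1][x] + frame[y + 1][x + 1]) // 4
            for x in range(0, len(frame[0]), 2)
        ]
        for y in range(0, len(frame), 2)
    ]


def upsample(frame):
    """Double the resolution by pixel replication."""
    out = []
    for row in frame:
        wide = [p for p in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def encode_layers(frame):
    """Return (base layer, enhancement-layer residual) for one frame."""
    base = downsample(frame)
    predicted = upsample(base)
    residual = [
        [a - b for a, b in zip(row, pred_row)]
        for row, pred_row in zip(frame, predicted)
    ]
    return base, residual


def decode_layers(base, residual):
    """Restore the full-resolution frame from both layers."""
    predicted = upsample(base)
    return [
        [a + b for a, b in zip(pred_row, res_row)]
        for pred_row, res_row in zip(predicted, residual)
    ]


# Hypothetical 4 x 4 input frame.
frame = [[10 * y + x for x in range(4)] for y in range(4)]
base, residual = encode_layers(frame)
restored = decode_layers(base, residual)
```

A decoder that receives only `base` obtains the lower-resolution video, while a decoder that also receives `residual` recovers the input resolution. Reconstruction is exact here only because the residual is not quantized; a practical coder would transform and quantize it.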
With temporal scalability, the base layer can have a lower temporal resolution (i.e., frame rate) than the input video sequence, while the enhancement layer carries the missing frames. When the two layers are combined at a decoder, the original frame rate is restored.
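A minimal sketch of this frame partitioning follows, assuming the base layer keeps every `factor`-th frame and the enhancement layer carries the rest. The function names and the fixed subsampling pattern are assumptions for illustration, and the sketch ignores the prediction between layers that an actual coder would use.

```python
def split_temporal(frames, factor=2):
    """Split a frame sequence into a base layer (every `factor`-th frame)
    and an enhancement layer (all remaining frames, in display order)."""
    base = frames[::factor]
    enhancement = [f for i, f in enumerate(frames) if i % factor != 0]
    return base, enhancement


def merge_temporal(base, enhancement, factor=2):
    """Interleave the two layers to restore the original frame order."""
    merged = []
    enh = iter(enhancement)
    for i in range(len(base) + len(enhancement)):
        if i % factor == 0:
            merged.append(base[i // factor])
        else:
            merged.append(next(enh))
    return merged


# Frames are represented by their indices for brevity.
frames = list(range(7))
base, enhancement = split_temporal(frames)
merged = merge_temporal(base, enhancement)
```

With `factor=2`, the base layer runs at half the input frame rate, and merging both layers at the decoder restores the full rate.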
Accordingly, it would be desirable to provide temporal and spatial scalability functions for coding of video signals which include video object planes (VOPs) such as those used in the MPEG-4 standard. It would be desirable to have the capability for coding of stereoscopic video, picture-in-picture, preview access channels, and asynchronous transfer mode (ATM) communications.
It would further be desirable to have a relatively low complexity and low cost codec design where the size of the search range is reduced for motion estimation of enhancement layer prediction coding of bi-directionally predicted VOPs (B-VOPs). It would also be desirable to efficiently code an interlaced video input signal which is scaled to base and enhancement layers by adaptively reordering pixel lines of an enhancement layer VOP prior to determining a residue and spatially transforming the data. The present invention provides a system having the above and other advantages.