1. Field of the Invention
This invention relates to coding of digital video signals using mesh or wireframe modeling. More particularly, the invention relates to a coding scheme that codes video data as a base layer of coded data and a second, supplementary layer of mesh node coded data. The mesh node coding permits decoders to apply enhanced functionalities to elements of the video image.
2. Related Art
Video coding techniques are known. Typically, they reduce video data from a first data rate to a second, lower data rate. Such coding often is necessary to transmit the video information through a channel, which may be a radio channel, a data link of a computer network, or a storage element such as an optical or magnetic memory. Video coding reduces the capacity requirements of channels and permits the video information to be reconstructed at a decoder for display or manipulation.
Different coding applications have different objectives. Some desire only to code and decode video data. Others, however, particularly those that code synthetic video data, desire to attach functionalities to the video. Functionalities may include: motion tracking of moving objects, temporal interpolation of objects, modification of video objects (such as warping an image upon a video object), and manipulation of size, orientation or texture of objects in a scene. Often, such operations must be performed on individual objects in a scene, some of which may be synthetic and others of which may be natural.
One proposed standard for video coding has been made in the MPEG-4 Video Verification Model Version 5.1, ISO/IEC JTC1/SC29/WG11 N1469 Rev., December 1996 ("MPEG-4, V.M. 5.1"). According to MPEG-4, V.M. 5.1, encoders identify "video objects" from a scene to be coded. Individual frames of the video object are coded as "video object planes" or VOPs. The spatial area of each VOP is organized into blocks or macroblocks of data, which typically are 8 pixel by 8 pixel (blocks) or 16 pixel by 16 pixel (macroblocks) rectangular areas. A macroblock typically is a grouping of four blocks. For simplicity, reference herein is made to blocks and "block based coding" but it should be understood that such discussion applies equally to macroblocks and macroblock based coding. Image data of the blocks are coded by an encoder, transmitted through a channel and decoded by a decoder.
Under MPEG-4, V.M. 5.1 coding, block data of most VOPs are not coded individually. As shown in FIG. 1A, image data of a block from one VOP may be used as a basis for predicting the image data of a block in another VOP. Coding begins by coding an initial VOP, an "I-VOP", without prediction. The I-VOP data may then be used to predict data of a second VOP, a "P-VOP". Blocks of the second VOP are coded based on differences between the actual data and the predicted data from blocks of the I-VOP. Finally, image data of a third type of VOP may be predicted from two previously coded VOPs. The third VOP is a "bidirectional VOP" or B-VOP. As is known, the B-VOP typically is coded after the I-VOP and P-VOP are coded. However, the different types of VOPs may be (and typically are) coded in an order that is different from the order in which they are displayed. Thus, as shown in FIG. 1A, the P-VOP is coded before the B-VOP even though it appears after the B-VOP in display order. Other B-VOPs may appear between the I-VOP and the P-VOP.
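The difference between display order and coding order described above may be illustrated with a minimal sketch. The function name `coding_order` and the single-letter VOP labels are hypothetical and are not taken from MPEG-4, V.M. 5.1. The sketch assumes that each B-VOP is deferred until the next I-VOP or P-VOP in display order has been coded, and that trailing B-VOPs with no later anchor are simply coded last.

```python
def coding_order(display_types):
    """Return display indices of VOPs in the order they would be coded.

    display_types: list of "I", "P" or "B" labels in display order.
    Assumption: a B-VOP is coded only after the next I-VOP or P-VOP
    (its later reference) has been coded.
    """
    order = []
    pending_b = []  # B-VOPs waiting for their later reference VOP
    for idx, vop_type in enumerate(display_types):
        if vop_type == "B":
            pending_b.append(idx)
        else:
            order.append(idx)        # code the I-VOP or P-VOP first
            order.extend(pending_b)  # then the B-VOPs that precede it
            pending_b = []
    order.extend(pending_b)  # assumption: trailing B-VOPs are coded last
    return order


# Display order I, B, P is coded as I, P, B.
print(coding_order(["I", "B", "P"]))
```

For the three-VOP arrangement discussed above, the sketch places the P-VOP (display index 2) ahead of the B-VOP (display index 1).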
Where prediction is performed (P-VOP and B-VOP), image data of blocks are coded as motion vectors and residual texture information. Blocks may be thought to "move" from frame to frame (VOP to VOP). Thus, MPEG-4 codes motion vectors for each block. The motion vector, in effect, tells a decoder to predict the image data of a current block by moving image data of blocks from one or more previously coded VOPs to the current block. However, because such prediction is imprecise, the encoder also transmits residual texture data representing changes that must be made to the predicted image data to generate accurate image data. Encoding of image data using block based motion vectors and texture data is known as "motion compensated transform encoding."
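The relationship between a motion vector, the predicted block and the residual texture data may be illustrated with a simplified sketch. All function names here are hypothetical. The sketch predicts from a single reference VOP, assumes the displaced block lies entirely inside that reference, and omits the transform and quantization steps that an actual motion compensated transform encoder applies to the residual, so reconstruction in this sketch is exact.

```python
def predict_block(reference, top, left, mv, size):
    """Copy the block that the motion vector points to in the reference VOP."""
    dy, dx = mv
    return [[reference[top + dy + r][left + dx + c] for c in range(size)]
            for r in range(size)]


def encode_block(current, reference, top, left, mv, size):
    """Return the residual: actual block data minus predicted block data."""
    pred = predict_block(reference, top, left, mv, size)
    return [[current[top + r][left + c] - pred[r][c] for c in range(size)]
            for r in range(size)]


def decode_block(reference, top, left, mv, residual, size):
    """Rebuild the block by adding the residual to the prediction."""
    pred = predict_block(reference, top, left, mv, size)
    return [[pred[r][c] + residual[r][c] for c in range(size)]
            for r in range(size)]


# A 4x4 reference VOP and a current VOP whose 2x2 block at (1, 1)
# resembles the reference block at (2, 2), except for one pixel.
reference = [[r * 4 + c for c in range(4)] for r in range(4)]
current = [row[:] for row in reference]
current[1][1], current[1][2] = 10, 12
current[2][1], current[2][2] = 14, 15

residual = encode_block(current, reference, 1, 1, (1, 1), 2)
rebuilt = decode_block(reference, 1, 1, (1, 1), residual, 2)
```

In this example only one residual sample is nonzero, which reflects why coding a motion vector plus a residual typically costs less than coding the block directly.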
Coding according to the MPEG-4 V.M. 5.1 is useful to code video data efficiently. Further, it provides for relatively simple decoding, permitting viewers to access coded video data with low-cost, low-complexity decoders. The coding proposal is limited, however, because it does not provide for functionalities to be attached to video objects.
As the MPEG-4, V.M. 5.1 coding standard evolved, a proposal was made to integrate functionalities. The proposed system, a single layer coding system, is shown in FIG. 1B. There, video data is subject to two types of coding. According to the proposal, texture information in VOPs is coded on a block basis according to motion compensated transform encoding. Motion vector information would be coded according to a different technique, mesh node motion encoding. Thus, encoded data output from an encoder 110 includes block based texture data and mesh node based motion vectors.
Mesh node modeling is a well known tool in the area of computer graphics for generating synthetic scenes. Mesh modeling maps artificial or real texture to wireframe models and may provide animation of such scenes by moving the nodes or node sets. Thus, in computer graphics, mesh node modeling represents and animates synthetic content. Mesh modeling also finds application when coding natural scenes, such as in computer vision applications. Natural image content is captured by a computer, broken down into individual components and coded via mesh modeling. As is known in the field of synthetic video, mesh modeling provides significant advantages in attaching functionalities to video objects. Details of known mesh node motion estimation and decoding can be found in: Nakaya, et al., "Motion Compensation Based on Spatial Transformations," IEEE Trans. Circuits and Systems for Video Technology, pp. 339-356, June 1994; Tekalp, et al., "Core experiment M2: Updated description," ISO/IEC JTC1/SC29/WG11 MPEG96/1329, September 1996; and Tekalp, et al., "Revised syntax and results for CE M2 (Triangular mesh-based coding)," ISO/IEC JTC1/SC29/WG11 MPEG96/1567, November 1996.
A multiplexer 120 at the encoder merges the data with other data necessary to provide for complete encoding such as administrative overhead data, possibly audio data or data from other video objects. The merged coded data is output to the channel 130. A decoder includes a demultiplexer 140 and a VOP decoder 150 that inverts the coding process applied at the encoder. The texture data and motion vector data of a particular VOP are decoded by the decoder 150 and output to a compositor 160. The compositor 160 assembles the decoded information with other data to form a video data stream for display.
By coding image motion according to mesh node notation, the single layer system of FIG. 1B permits decoders to apply functionalities to a decoded image. However, it also suffers from an important disadvantage: All decoders must decode mesh node motion vectors. Decoding of mesh node motion vectors is computationally more complex than decoding of block based motion vectors. The decoders of the system of FIG. 1B are more costly because they must meet higher computational requirements. Imposing such cost requirements is disfavored, particularly for general purpose coding protocols where functionalities are used in a limited number of coding applications.
Thus, there is a need in the art for a video coding protocol that permits functionalities to be attached to video objects. Further, there is a need for such a coding protocol that is inter-operable with simple decoders. Additionally, there is a need for such a coding protocol that provides coding for the functionalities in an efficient manner.
The disadvantages of the prior art are alleviated to a great extent by a method and apparatus for coding video data as base layer data and enhancement layer data. The base layer data includes conventional motion compensated transform encoded texture and motion vector data. Optional enhancement layer data contains mesh node vector data. Mesh node vector data of the enhancement layer may be predicted based on motion vectors of the base layer. Thus, simple decoders may decode the base layer data alone and obtain a basic representation of the coded video data. More powerful decoders, however, may decode both the base layer and the enhancement layer to obtain decoded video permitting functionalities.
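The prediction of enhancement layer mesh node vectors from base layer motion vectors may be illustrated with a minimal sketch. The summary above does not specify how the prediction is formed, so the sketch makes an assumption for illustration only: the predictor for each mesh node is the base layer motion vector of the block that contains the node, and the enhancement layer carries only the difference. All function names, the data layout and the 8 pixel block size default are hypothetical.

```python
def block_mv_at(base_mvs, x, y, block_size=8):
    """Base layer motion vector of the block containing pixel (x, y)."""
    return base_mvs[y // block_size][x // block_size]


def encode_mesh_nodes(nodes, node_mvs, base_mvs, block_size=8):
    """Enhancement layer data: node vector minus its base layer predictor."""
    residuals = []
    for (x, y), (mvx, mvy) in zip(nodes, node_mvs):
        px, py = block_mv_at(base_mvs, x, y, block_size)
        residuals.append((mvx - px, mvy - py))
    return residuals


def decode_mesh_nodes(nodes, residuals, base_mvs, block_size=8):
    """Enhanced decoder: add each residual back to its base layer predictor."""
    node_mvs = []
    for (x, y), (rx, ry) in zip(nodes, residuals):
        px, py = block_mv_at(base_mvs, x, y, block_size)
        node_mvs.append((px + rx, py + ry))
    return node_mvs


# A 2x2 grid of base layer block motion vectors and three mesh nodes.
base_mvs = [[(1, 0), (2, 0)],
            [(0, 1), (0, 2)]]
nodes = [(3, 3), (12, 3), (5, 10)]
node_mvs = [(1, 1), (2, -1), (0, 1)]

enhancement = encode_mesh_nodes(nodes, node_mvs, base_mvs)
decoded = decode_mesh_nodes(nodes, enhancement, base_mvs)
```

Under this assumption, a simple decoder would use `base_mvs` alone and never read `enhancement`, while a more powerful decoder would call `decode_mesh_nodes` to recover the mesh node vectors.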
An embodiment of the present invention provides a back channel that permits a decoder to affect how mesh node coding is performed in the encoder. The decoder may command the encoder to reduce or eliminate encoding of mesh node motion vectors. The back channel finds application in single layer systems and two layer systems.