This invention relates generally to encoding and transcoding multiple video objects, and more particularly to a system that controls the encoding and transcoding of multiple video objects with variable temporal resolutions.
Recently, a number of standards have been developed for communicating encoded information. For video sequences, the most widely used standards include MPEG-1 (for storage and retrieval of moving pictures), MPEG-2 (for digital television) and H.263, see ISO/IEC JTC1 CD 11172, MPEG, "Information Technology - Coding of Moving Pictures and Associated Audio for Digital Storage Media up to about 1.5 Mbit/s - Part 2: Coding of Moving Pictures Information," 1991; LeGall, "MPEG: A Video Compression Standard for Multimedia Applications," Communications of the ACM, Vol. 34, No. 4, pp. 46-58, 1991; ISO/IEC DIS 13818-2, MPEG-2, "Information Technology - Generic Coding of Moving Pictures and Associated Audio Information - Part 2: Video," 1994; ITU-T SG XV, DRAFT H.263, "Video Coding for Low Bitrate Communication," 1996; and ITU-T SG XVI, DRAFT13 H.263+Q15-A-60 rev.0, "Video Coding for Low Bitrate Communication," 1997.
These standards are relatively low-level specifications that primarily deal with the spatial and temporal compression of video sequences. As a common feature, these standards perform compression on a per frame basis. With these standards, one can achieve high compression ratios for a wide range of applications.
Newer video coding standards, such as MPEG-4 (for multimedia applications), see "Information Technology - Generic coding of audio/visual objects," ISO/IEC FDIS 14496-2 (MPEG4 Visual), Nov. 1998, allow arbitrary-shaped objects to be encoded and decoded as separate video object planes (VOPs). The objects can be visual, audio, natural, synthetic, primitive, compound, or combinations thereof. Video objects are composed to form compound objects or "scenes."
The emerging MPEG-4 standard is intended to enable multimedia applications, such as interactive video, where natural and synthetic materials are integrated, and where access is universal. MPEG-4 allows for content-based interactivity. For example, one might want to "cut-and-paste" a moving figure or object from one video to another. In this type of application, it is assumed that the objects in the multimedia content have been identified through some type of segmentation process, see for example, U.S. patent application Ser. No. 09/326,750 "Method for Ordering Image Spaces to Search for Object Surfaces" filed on Jun. 4, 1999 by Lin et al.
In the context of video transmission, these compression standards are needed to reduce the amount of bandwidth (available bit rate) that is required by the network. The network can represent a wireless channel or the Internet. In any case, the network has limited capacity and a contention for its resources must be resolved when the content needs to be transmitted.
Over the years, a great deal of effort has been placed on architectures and processes that enable devices to transmit the video content robustly and to adapt the quality of the content to the available network resources. Rate control is used to allocate the number of bits per coding time instant. Rate control ensures that the bitstream produced by an encoder satisfies buffer constraints.
Rate control processes attempt to maximize the quality of the encoded signal, while providing a constant bit rate. For frame-based encoding, such as MPEG-2, see U.S. Pat. No. 5,847,761, "Method for performing rate control in a video encoder which provides a bit budget for each frame while employing virtual buffers and virtual buffer verifiers," issued to Uz, et al. on Dec. 8, 1998. For object-based encoding, such as MPEG-4, see U.S. Pat. No. 5,969,764, "Adaptive video coding method," issued to Sun and Vetro on Oct. 19, 1999.
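The buffer constraint mentioned above can be illustrated with a minimal sketch. It assumes a simple leaky-bucket model in which the encoder writes each coded frame into a buffer that the channel drains at a constant rate; the function name `simulate_buffer`, its parameters, and the bit counts are hypothetical and are not taken from the cited patents.

```python
def simulate_buffer(frame_bits, drain_per_frame, buffer_size):
    """Track encoder buffer fullness under a constant-rate channel (illustrative model)."""
    fullness = 0
    trace = []
    overflowed = False
    for bits in frame_bits:
        fullness += bits                      # encoder writes the coded frame
        if fullness > buffer_size:
            overflowed = True                 # the bit allocation violated the constraint
        fullness = max(0, fullness - drain_per_frame)  # channel drains a fixed amount
        trace.append(fullness)
    return trace, overflowed


# Hypothetical bit allocations for four coding time instants.
trace, overflowed = simulate_buffer(
    [40000, 10000, 10000, 25000], drain_per_frame=20000, buffer_size=60000
)
print(trace, overflowed)
```

A rate control process would choose the per-frame allocations so that a trace of this kind never exceeds the buffer size.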
When the content has already been encoded, it is sometimes necessary to further convert the already compressed bitstream before the stream is transmitted through the network to accommodate, for example, a reduction in the available bit rate. Bitstream conversion or "transcoding" can be classified as bit rate conversion, resolution conversion, and syntax conversion. Bit rate conversion includes bit rate scaling and conversion between a constant bit rate (CBR) and a variable bit rate (VBR). The basic function of bit rate scaling is to accept an input bitstream and produce a scaled output bitstream that meets new load constraints of a receiver. A bitstream scaler is a transcoder, or filter, that provides a match between a source bitstream and the receiving load.
As shown in FIG. 1, scaling can typically be accomplished by a transcoder 100. In a brute force case, the transcoder includes a decoder 110 and encoder 120. A compressed input bitstream 101 is fully decoded at an input rate Rin, then encoded at a new output rate Rout 102 to produce the output bitstream 103. Usually, the output rate is lower than the input rate. In practice, however, full decoding and full encoding in a transcoder are not done because of the high complexity of encoding the decoded bitstream; instead, the transcoding is done on a compressed or partially decoded bitstream.
Earlier work on MPEG-2 transcoding has been published by Sun et al., in "Architectures for MPEG compressed bitstream scaling," IEEE Transactions on Circuits and Systems for Video Technology, April 1996. There, four methods of rate reduction, with varying complexity and architecture, were presented.
FIG. 2 shows an example method. In this architecture, the video bitstream is only partially decoded. More specifically, macroblocks of the input bitstream 201 are variable-length decoded (VLD) 210. The input bitstream is also delayed 220 and inverse quantized (IQ) 230 to yield discrete cosine transform (DCT) coefficients. Given the desired output bit rate, the partially decoded data are analyzed 240 and a new set of quantizers is applied at 250 to the DCT macroblocks. These re-quantized macroblocks are then variable-length coded (VLC) 260 and a new output bitstream 203 at a lower rate can be formed. This scheme is much simpler than the scheme shown in FIG. 1 because the motion vectors are re-used and an inverse DCT operation is not needed.
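The re-quantization step of FIG. 2 can be sketched as follows. This is a simplified illustration that assumes a uniform scalar quantizer with a single step size per block; actual MPEG-2 quantization also involves weighting matrices, which are omitted here. The function name `requantize` and the sample values are hypothetical.

```python
def requantize(levels, q_in, q_out):
    """Re-quantize quantized DCT levels from step q_in to a coarser step q_out."""
    out = []
    for level in levels:
        coeff = level * q_in                  # inverse quantization (IQ)
        sign = -1 if coeff < 0 else 1
        # Quantize the magnitude with the new step, rounding to the nearest level.
        out.append(sign * ((abs(coeff) + q_out // 2) // q_out))
    return out


# Hypothetical levels of one block after variable-length decoding (VLD).
block = [12, -5, 3, 0, 1, 0, 0, -1]
coarse = requantize(block, q_in=4, q_out=10)
print(coarse)
```

The coarser step drives small coefficients to zero, so the variable-length coder has fewer nonzero levels to code, which is what lowers the output rate.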
More recent work by Assuncao et al., in "A frequency domain video transcoder for dynamic bit-rate reduction of MPEG-2 bitstreams," IEEE Transactions on Circuits and Systems for Video Technology, pp. 953-957, December 1998, describes a simplified architecture for the same task. They use a motion compensation (MC) loop, operating in the frequency domain, for drift compensation. Approximate matrices are derived for fast computation of the MC macroblocks in the frequency domain. A Lagrangian optimization is used to calculate the best quantizer scales for transcoding.
Other work by Sorial et al., "Joint transcoding of multiple MPEG video bitstreams," Proceedings of the International Symposium on Circuits and Systems, May 1999, presents a method of jointly transcoding multiple MPEG-2 bitstreams, see also U.S. patent application Ser. No. 09/410,552 "Estimating Rate-Distortion Characteristics of Binary Shape Data," filed Oct. 1, 1999 by Vetro et al.
According to prior art compression standards, the number of bits allocated for encoding texture information is controlled by a quantization parameter (QP). The above papers take a similar approach: changing the QP on the basis of information contained in the original bitstream reduces the rate of texture bits. For an efficient implementation, the information is usually extracted directly in the compressed domain and can include measures that relate to the motion of macroblocks or residual energy of DCT macroblocks. This type of analysis can be found in the bit allocation analyzer 240 of FIG. 2.
In addition to the above classical methods of transcoding, some new methods of transcoding have been described, see, for example, U.S. patent application Ser. No. 09/504,323 "Object-Based Bitstream Transcoder," filed by Vetro et al. on Feb. 14, 2000. There, information delivery systems that overcome limitations of conventional transcoding systems were described. The conventional systems were somewhat bounded in the amount of rate that could be reduced. The conventional systems also did not consider the overall perceptual quality; rather, objective measures, such as the peak signal-to-noise ratio (PSNR), have dominated.
In the systems described by Vetro, et al., conversion is more flexible and the measure of quality can deviate from classical bit-by-bit differences.
Vetro summarizes video content in unique ways. Within the object-based framework, individual video objects are transcoded with different qualities. The difference in quality can be related to either the spatial quality or the temporal resolution (quality).
If the temporal resolution is varied among objects in a scene, it is important that all objects maintain some type of temporal synchronization with each other. When temporal synchronization is maintained, the receiver can compose the objects so that all pixels within a reconstructed scene are defined.
Undefined pixels in the scene can result from background and foreground objects, or overlapping objects, being sampled at different temporal resolutions so that "holes" appear in the re-composed scene. Therefore, when varying the temporal resolution of multiple objects during encoding or transcoding, it is critical that synchronization be maintained.
To illustrate this point further, consider a scene where there is a relatively stationary background object, e.g., a blank wall, and a more active foreground object such as a moving person. The background can be encoded at a relatively low temporal resolution, say ten frames per second. The foreground object is encoded at a higher temporal resolution of thirty frames per second. This is fine as long as the foreground object does not move a lot. However, should the foreground object move with respect to the background, a "hole" will appear in that portion of the background that is no longer occluded by the foreground object.
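The scenario above can be reproduced with a small sketch. It assumes a one-dimensional scanline, a foreground object that moves one pixel per frame, and a background object whose segmented shape excludes the pixels occluded by the foreground at the time it was coded. All names (`foreground_mask`, `background_mask`, `compose`) and the geometry are hypothetical. The background is refreshed every third frame (ten versus thirty frames per second) and held in between.

```python
WIDTH = 12  # pixels in the illustrative scanline


def foreground_mask(t):
    """Foreground covers three pixels and moves right one pixel per frame."""
    return [t + 2 <= x < t + 5 for x in range(WIDTH)]


def background_mask(t):
    """Background as segmented at time t: every pixel not occluded by the foreground."""
    return [not covered for covered in foreground_mask(t)]


def compose(t, bg_period, fg_period):
    """Return pixels left undefined when each object is held from its last coded frame."""
    bg = background_mask(t - t % bg_period)
    fg = foreground_mask(t - t % fg_period)
    return [x for x in range(WIDTH) if not (bg[x] or fg[x])]


holes_sync = compose(2, bg_period=1, fg_period=1)   # both objects coded every frame
holes_async = compose(2, bg_period=3, fg_period=1)  # background coded every third frame
print(holes_sync, holes_async)
```

With equal temporal resolutions every pixel is defined. With the background held from frame 0, the pixels the foreground has vacated are covered by neither object, which is the "hole" described above.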
It is an object of the invention to correct this problem and to enable encoding and transcoding of multiple video objects with variable temporal resolutions.
The most recent standardization effort taken on by the MPEG standard committee is that of MPEG-7, formally called "Multimedia Content Description Interface," see "MPEG-7 Context, Objectives and Technical Roadmap," ISO/IEC N2861, July 1999. Essentially, this standard plans to incorporate a set of descriptors and description schemes that can be used to describe various types of multimedia content. The descriptors and description schemes are associated with the content itself and allow for fast and efficient searching of material that is of interest to a particular user. It is important to note that this standard is not meant to replace previous coding standards. Rather, it builds on other standard representations, especially MPEG-4, because the multimedia content can be decomposed into different objects and each object can be assigned a unique set of descriptors. Also, the standard is independent of the format in which the content is stored.
The primary application of MPEG-7 is expected to be search and retrieval applications, see "MPEG-7 Applications," ISO/IEC N2861, July 1999. In a simple application, a user specifies some attributes of a particular object. At this low level of representation, these attributes can include descriptors that describe the texture, motion and shape of the particular object. A method of representing and comparing shapes has been described in U.S. patent application Ser. No. 09/326,759 "Method for Ordering Image Spaces to Represent Object Shapes" filed on Jun. 4, 1999 by Lin et al., and a method for describing the motion activity has been described in U.S. patent application Ser. No. 09/406,444 "Activity Descriptor for Video Sequences" filed on Sep. 27, 1999 by Divakaran et al. To obtain a higher level of representation, one can consider more elaborate description schemes that combine several low-level descriptors. In fact, these description schemes can even contain other description schemes, see "MPEG-7 Multimedia Description Schemes WD (V1.0)," ISO/IEC N3113, December 1999 and U.S. patent application Ser. No. 09/385,169 "Method for representing and comparing multimedia content," filed Aug. 30, 1999 by Lin et al.
These descriptors and description schemes allow a user to access properties of the video content that are not traditionally derived by an encoder or transcoder. For example, these properties can represent look-ahead information that was assumed to be inaccessible to the transcoder. The encoder or transcoder has access to these properties only because the properties were extracted from the content at an earlier time, i.e., the content was pre-processed and stored in a database with its associated meta-data.
The information itself can be either syntactic or semantic, where syntactic information refers to the physical and logical signal aspects of the content, while the semantic information refers to the conceptual meaning of the content. For a video sequence, the syntactic elements can be related to the color, shape and motion of a particular object. On the other hand, the semantic elements can refer to information that cannot be extracted from low-level descriptors, such as the time and place of an event or the name of a person in a video sequence.
It is desired to maintain synchronization in an object-based encoder or transcoder for video objects in a scene having variable temporal resolutions. Moreover, it is desired that such variation is identified with video content meta-data.
The present invention provides an apparatus and method for coding a video. The coding according to the invention can be performed by an encoder or a transcoder. The video is first partitioned into video objects. In the case of the encoder, the partitioning is done with segmentation planes, and in the case of the transcoder, a demultiplexer is used. Over time, shape features are extracted from each object. The shape features can be obtained by measuring how the shape of each object evolves over time. A Hamming or Hausdorff distance measure can be used. The extracted shape features are combined in a rate or transcoder control unit to determine a temporal resolution for each object over time. The temporal resolutions are used to encode the various video objects. Optionally, motion features and coding complexity can also be considered while making trade-offs in temporal resolution determinations.
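The two distance measures named above can be sketched for binary shape masks as follows. This is an illustrative implementation under stated assumptions, not the invention's own: shapes are given as lists of 0/1 rows, the Hausdorff distance is taken over all shape pixels (an implementation might instead restrict it to boundary pixels), and the function names and sample masks are hypothetical.

```python
import math


def hamming_distance(mask_a, mask_b):
    """Count the pixels at which two equally sized binary masks differ."""
    return sum(a != b for row_a, row_b in zip(mask_a, mask_b)
               for a, b in zip(row_a, row_b))


def shape_points(mask):
    """Collect the (row, column) coordinates of the pixels inside the shape."""
    return [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]


def hausdorff_distance(mask_a, mask_b):
    """Largest distance from a pixel of one shape to the nearest pixel of the other."""
    pts_a, pts_b = shape_points(mask_a), shape_points(mask_b)

    def directed(src, dst):
        return max(min(math.dist(p, q) for q in dst) for p in src)

    return max(directed(pts_a, pts_b), directed(pts_b, pts_a))


# Shape of one object at two time instants: a 2x2 square shifted right by one pixel.
prev_shape = [[0, 0, 0, 0, 0],
              [0, 1, 1, 0, 0],
              [0, 1, 1, 0, 0],
              [0, 0, 0, 0, 0]]
curr_shape = [[0, 0, 0, 0, 0],
              [0, 0, 1, 1, 0],
              [0, 0, 1, 1, 0],
              [0, 0, 0, 0, 0]]

hamming = hamming_distance(prev_shape, curr_shape)
hausdorff = hausdorff_distance(prev_shape, curr_shape)
print(hamming, hausdorff)
```

Either value measures how much an object's shape has changed between two time instants, which is the kind of shape feature a rate or transcoder control unit could weigh when choosing a temporal resolution for that object.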
In the case where the video is uncompressed data, the partitioning, combining, and coding are performed in an encoder. For a compressed video, the demultiplexing, combining, and coding are performed in a transcoder. In the latter case, boundary blocks of the objects in the compressed video are used for extracting the shape features. In one aspect of the invention, different objects can have different temporal resolutions or frame rates.