Video compositing is the process of combining multiple video signals, originating from different sources, to produce a single composited scene. Video compositing is commonly performed in multipoint video conferencing, video editing, and multimedia applications.
Typical video compositing operations include processing of individual video objects by general geometrical transformation or filtering, or combining multiple video objects by opaque or semi-transparent overlap. In general, video objects are semi-transparent, arbitrarily-shaped, and arbitrarily-positioned. Different compositing operations have different complexity. Most applications only implement a subset of the compositing operations.
For network video applications, video compositing can be implemented at different locations, including the video source sites, intermediate nodes within or outside networks, and users' display sites. Although video compositing can be implemented in the uncompressed spatial domain, where operations are performed pixel by pixel, most video data transmitted through networks, or stored in video servers, is represented in some compressed format. Therefore, it would be desirable to perform video compositing operations on data which is still in a compressed format.
Many video applications utilize data compression. More particularly, many video applications utilize transform-coded compressed domain formats (referred to herein as "transform domain" formats), which include the Discrete Cosine Transform (DCT) format; the interframe predictive code format, such as the Motion Compensation (MC) algorithm, which may be used in conjunction with the DCT format; and hybrid compressed formats. The DCT format is used in JPEG, the compression standard for still images (Standard Draft, JPEG-9-R7, Feb. 1991). The combination of the Motion Compensation and Discrete Cosine Transform compression algorithms (MC/DCT) is used in a number of standards, including the compression standard for motion pictures (MPEG: Standard Draft, MPEG Video Committee Draft, MPEG 90/176 Rev. 2, Dec. 1990), the standard for video conferencing (CCITT: Recommendation H.261, Video Codec for Audiovisual Services at p×64 kbit/s), and some High Definition Television proposals.
FIG. 1 is a block diagram depicting the processing associated with the DCT format. At the encoder 20, the DCT format is established by DCT formulator 22. Specifically, the video data is segmented into blocks of data, or pixel blocks, each of which can be represented by an $N \times N$ matrix, referred to as $A$. The matrix $A$ of video input data can be transformed to the DCT domain through the following operation:

$$A_c = C A C^T$$

where $C$ is the $N \times N$ DCT transform matrix. The $C^T$ term refers to the transposition of the $C$ matrix, such that the columns of $C$ become rows in $C^T$ and the rows in $C$ become the columns in $C^T$. The elements $A_c(i,j)$ represent the spectrum of the original image block, $A$, at different spatial frequencies. Elements with larger index values represent the higher-frequency components.
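The transform of a pixel block into the DCT domain and back can be illustrated with a short sketch. The text does not spell out the entries of C, so the orthonormal DCT-II basis used here is an assumption, and the helper names (`dct_matrix`, `dct2`, `idct2`) are hypothetical.

```python
import math

def dct_matrix(n):
    """N x N DCT basis matrix C (orthonormal DCT-II; an assumed normalization)."""
    c = []
    for i in range(n):
        scale = math.sqrt(1.0 / n) if i == 0 else math.sqrt(2.0 / n)
        c.append([scale * math.cos((2 * j + 1) * i * math.pi / (2 * n))
                  for j in range(n)])
    return c

def matmul(x, y):
    """Plain matrix product of two lists of rows."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]

def transpose(x):
    return [list(row) for row in zip(*x)]

def dct2(block):
    """Forward transform: A_c = C A C^T."""
    c = dct_matrix(len(block))
    return matmul(matmul(c, block), transpose(c))

def idct2(coeffs):
    """Inverse transform: A = C^T A_c C (valid because C is orthonormal here)."""
    c = dct_matrix(len(coeffs))
    return matmul(matmul(transpose(c), coeffs), c)
```

With this normalization, a flat 4 x 4 block whose pixels all equal 10 transforms to a single nonzero coefficient at index (0,0) with value 40, which illustrates how the lowest-index element carries the lowest spatial frequency.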
Typically, the DCT coefficients $A_c(i,j)$ are then quantized at quantizer 24. That is, the individual coefficients of the resultant $N \times N$ matrix are each transformed to a finite number of bits. Usually, the number of bits is a function of the rate and quality requirement of the video service.
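As a minimal sketch of the quantization step, the following assumes a single uniform step size for every coefficient. That is a simplification for illustration only (practical coders commonly vary the step per coefficient), and the function names are hypothetical.

```python
def quantize(coeffs, step):
    """Map each DCT coefficient to an integer level (uniform step, assumed)."""
    return [[int(round(v / step)) for v in row] for row in coeffs]

def dequantize(levels, step):
    """Inverse quantizer: scale integer levels back to coefficient values."""
    return [[q * step for q in row] for row in levels]
```

A larger step yields smaller integer levels, and therefore fewer bits, at the cost of a reconstruction error of up to half a step per coefficient. This is the rate versus quality trade-off mentioned above.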
Typically, the quantized values are further encoded by the variable length coder 26, which often includes a run-length coding function and an entropy coding function. The run-length coding function uses a single code to represent a long run of zero values. The entropy coding function assigns shorter bit patterns to samples with a higher probability of occurrence, and longer bit patterns to samples with a lower probability of occurrence. Consequently, the overall average data amount can be reduced. The variable length coded data is then conveyed to transmission channel 28.
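The run-length function can be sketched as follows. The (zero run, value) pair layout and the "EOB" end-of-block marker are assumptions chosen for illustration, not the symbol format of any particular standard, and the entropy coding of the resulting pairs is omitted.

```python
def run_length_encode(values):
    """Encode a coefficient sequence as (zero_run, value) pairs plus an end marker."""
    pairs = []
    run = 0
    for v in values:
        if v == 0:
            run += 1            # extend the current run of zeros
        else:
            pairs.append((run, v))
            run = 0
    pairs.append("EOB")         # trailing zeros are implied by the marker
    return pairs

def run_length_decode(pairs, length):
    """Rebuild the original sequence of the given length."""
    out = []
    for p in pairs:
        if p == "EOB":
            break
        run, v = p
        out.extend([0] * run)
        out.append(v)
    out.extend([0] * (length - len(out)))
    return out
```

For example, the sequence 10, 0, 0, -2, 1, 0, 0, 0 becomes three pairs followed by the end marker, so the eight input values are carried by four symbols.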
As depicted in FIG. 1, inverse operations are performed at decoder 30 to return to an uncompressed domain. In particular, inverse variable length coding is performed by inverse variable length coder 32, an inverse quantization step is performed by inverse quantizer 34, and an inverse DCT operation is performed by inverse DCT formulator 36. This processing reconstructs the video data in the uncompressed domain; because quantization discards information, the reconstruction approximates the original video data.
Usually, the DCT-compressed video signal has much less data than the original uncompressed spatial-domain video signal. Therefore, given multiple DCT-compressed video signals, compositing directly in the DCT-compressed domain is a potentially efficient approach. That is, given DCT-compressed input video streams, it would be desirable to perform compositing operations directly on the DCT-compressed data, without conversion to the uncompressed spatial domain. However, there are complications associated with transform domain compositing operations.
In many of the compression standards previously mentioned, the DCT algorithm is accompanied by the interframe motion compensation (MC) algorithm which exploits the temporal redundancy in the video sequence. As previously stated, this data format is referred to as MC/DCT. The MC algorithm includes two types of data: motion vector data and error prediction data.
The MC algorithm searches over a fixed-size area, called the motion area, and identifies the optimal reference block in a previous video frame which may be used to construct an image for a block within a present video frame. FIG. 2A depicts the location of a reference block 40 from a previous video frame which may be used to construct a current block 42 in a present frame. The motion vector defines the difference between the position of the reference block 40 and the position of the current block 42. The prediction error defines the difference in image content between the reference block and the current block.
The MC encoding process may be characterized by the following equation:

$$e(t,x,y) = P(t,x,y) - P_{rec}(t-1,\ x-d_x,\ y-d_y)$$

where $e(t,x,y)$ is the prediction error at time $t$ and coordinates $(x,y)$, $P$ is the current image block, $P_{rec}$ is the reconstructed image, and $d = (d_x, d_y)$ is the motion vector.
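The search for a reference block and the resulting prediction error can be sketched as below. The text does not define what makes a reference block "optimal"; the sum of absolute differences (SAD) used here is an assumed matching criterion, and the function names and the square search window of `radius` pixels are hypothetical. Frames are lists of rows, with x as the column index and y as the row index.

```python
def block_at(frame, top, left, size):
    """Extract a size x size block whose top-left pixel is (left, top)."""
    return [row[left:left + size] for row in frame[top:top + size]]

def sad(a, b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))

def motion_search(cur, prev_rec, top, left, size, radius):
    """Return (d_x, d_y, error_block) for the best match within +/- radius."""
    target = block_at(cur, top, left, size)
    height, width = len(prev_rec), len(prev_rec[0])
    best = None
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            ry, rx = top - dy, left - dx      # reference sits at (x - d_x, y - d_y)
            if ry < 0 or rx < 0 or ry + size > height or rx + size > width:
                continue                      # candidate falls outside the frame
            ref = block_at(prev_rec, ry, rx, size)
            cost = sad(target, ref)
            if best is None or cost < best[0]:
                best = (cost, dx, dy, ref)
    _, dx, dy, ref = best
    # e(t,x,y) = P(t,x,y) - P_rec(t-1, x-d_x, y-d_y)
    error = [[p - q for p, q in zip(rt, rr)] for rt, rr in zip(target, ref)]
    return dx, dy, error
```

The nested loops visit (2·radius + 1)² candidate positions for every block, which suggests why the search tends to dominate the cost of MC encoding.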
FIG. 2B illustrates that the reference block 40 may form a portion of four video blocks, namely, $B_1$, $B_2$, $B_3$, and $B_4$. The problems associated with converting such a reference block to a current block will be discussed below.
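To make the geometry concrete, the sketch below lists which blocks of the regular block grid a reference block covers and how large each overlap is. The function name, the tuple layout, and the assumption of non-negative pixel coordinates are illustrative choices; the order of the returned entries is not meant to match the B1 to B4 labels of the figure.

```python
def overlapped_blocks(x, y, n):
    """For an n x n reference block with top-left pixel (x, y), return
    (block_col, block_row, width, height) for each grid block it overlaps."""
    col, row = x // n, y // n          # grid block containing the top-left pixel
    off_x, off_y = x % n, y % n        # offset of the block inside that grid cell
    widths = [(col, n - off_x)] + ([(col + 1, off_x)] if off_x else [])
    heights = [(row, n - off_y)] + ([(row + 1, off_y)] if off_y else [])
    return [(c, r, w, h) for r, h in heights for c, w in widths]
```

For instance, with 8 x 8 blocks a reference block at pixel (11, 5) overlaps four grid blocks, whereas a block aligned with the grid, such as one at (16, 8), overlaps exactly one.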
FIG. 3 illustrates a problem associated with the MC algorithm when a foreground object 37 partially overlaps a background object 38. Assume that the foreground object is totally opaque and the block boundary positions for these two objects match. The striped area 39 will be referred to as the "directly affected area" because part of the motion area is directly replaced by the foreground object 37. It is necessary to check every current block in the directly affected area to see if its reference block is replaced by the foreground object 37. If so, the MC data of the current block, including a new motion vector and new prediction errors, must be recalculated. The remaining part of the background object 38, outside the directly affected area, is called the "indirectly affected area" 41, because its reference blocks could be located in the "directly affected area" 39, whose MC data may need to be recalculated. In this case, the motion vector is unchanged, but the prediction errors may have minor differences.
Another example problem with compositing MC/DCT data relates to scaling. Suppose it is desirable to scale down an image by a ratio of 2 to 1 on each side. As shown in FIG. 4, four neighboring current blocks 42A, 42B, 42C, 42D will be scaled down to become a new block 43. Each of the four current blocks has its own reference block 40A, 40B, 40C, 40D from the previous image frame. As shown in FIG. 4, the reference blocks do not necessarily form the same 2 by 2 area as the current blocks. Thus, after the scaling operation, the reference blocks are mapped to different current blocks. Therefore, recalculation of the MC data of the new down-scaled image 43 is necessary.
As described, given MC data, including motion vector and prediction errors, for the input video signals, one is not able to calculate new MC data for the composited video signal directly. It is necessary to reconstruct the video signals back in the uncompressed spatial domain or the DCT compressed domain, perform compositing there, and calculate the new MC data for the composited video signals. As mentioned earlier, the DCT compressed domain has a lower data rate and thus potentially offers a more efficient approach than the spatial domain. However, techniques for converting video signals from the MC/DCT format to the DCT format are not known.
One skilled in the art will recognize that the recalculation of the MC data, especially searching for the reference block in the previous frame, requires more computations than the process to reconstruct the video signal from the MC data. Thus, it would be desirable to provide a method and apparatus for reducing computations associated with the recalculation of MC data.
In view of these problems, prior art MC/DCT data processing apparatus rely upon full conversion to the uncompressed spatial domain for compositing operations. One such compositing apparatus is depicted in FIG. 5. A first input video signal is conveyed to a first MC/DCT encoder 50, and a second input video signal is conveyed to a second MC/DCT encoder 50 of the same construction, as will be described below. Each MC/DCT encoder 50 produces an error signal, in the DCT compressed format, and a motion vector signal. Both signals are then conveyed over a transmission channel 28B.
After transmission, the signals are received by MC/DCT decoders 52, which will be described below. The MC/DCT decoders 52 transform the signals into the spatial domain. This is necessary, in accordance with the prior art, in order to combine, or composite, the signals. Compositing is performed by a spatial domain compositing unit 54 which performs compositing operations on a pixel by pixel basis. The output of the spatial domain compositing unit 54 is a single composited signal, incorporating the two sets of image data.
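Pixel-by-pixel compositing in the spatial domain can be sketched as follows for a rectangular foreground placed over a background. The function name and the single global `alpha` parameter are assumptions made for illustration: an `alpha` of 1.0 gives an opaque overlap, and smaller values give a semi-transparent one.

```python
def composite(background, foreground, top, left, alpha=1.0):
    """Overlay `foreground` on a copy of `background` with its top-left at (left, top)."""
    out = [row[:] for row in background]          # leave the input frame untouched
    for r, fg_row in enumerate(foreground):
        for c, fg in enumerate(fg_row):
            y, x = top + r, left + c
            if 0 <= y < len(out) and 0 <= x < len(out[0]):   # clip to the frame
                out[y][x] = alpha * fg + (1.0 - alpha) * out[y][x]
    return out
```

Every covered pixel is visited once, so the cost grows with the number of decoded pixels, which is the overhead that compositing in a compressed domain would aim to reduce.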
The resultant composited video signal can be displayed on a video display device at the spatial domain compositing unit 54. If the compositing unit 54 is at some intermediate node, the composited video signal must be re-transformed to the compressed format by MC/DCT encoder 50. The MC/DCT encoder 50 receives the composited signal and re-transforms it into the MC/DCT domain, thereby generating error and motion vector signals. The DCT compressed error and motion vector signals are then transmitted through transmission channel 28C. After transmission to the ultimate user destination, the error and motion vector signals are transformed into a reconstructed video signal by MC/DCT decoder 52. The reconstructed video signal is then projected on a video display device 56.
FIG. 6 depicts a video display device 56C, projecting a number of images 57, typically seen in a video conferencing application. The individual images 57A, 57B, 57C, 57D, 57E, in sum, form a single composited image.
FIG. 7 depicts an MC/DCT encoder 50 in accordance with the prior art. The input video signal is conveyed to a motion vector calculator 53. The motion vector calculator 53 receives the previous frame data from frame memory 58. The motion vector calculator 53 compares the previous frame data with the present frame data to identify a reference block (as block 40 in FIG. 2A). The motion vector indicates the positional difference between the current image block and the reference image block. This motion vector forms one output of the MC/DCT encoder 50.
The motion vector calculator 53 also generates the reference image block which is subtracted from the input video frame to produce an error signal. The error signal corresponds to the difference in content between the reference block and the current block.
The error signal is conveyed to DCT formulator 22B and quantizer 24B to convert it to the DCT domain. The result of this processing is an error signal at the output of the MC/DCT encoder 50.
The error signal is also conveyed along a feedback path, as reflected in FIG. 7. In the feedback path, an inverse quantizer 34B and an inverse DCT formulator 36B are used to convert the error signal back to the spatial domain. The spatial domain error signal is then added to the previous reference block to form reconstructed video data. The reconstructed video data is conveyed to the frame memory 58 and will be used as the "previous frame" in the next frame cycle.
It should be recognized that the outputs of the MC/DCT encoder 50 may be further encoded with a variable length coder 26. For the sake of simplicity, variable length coders 26 will be omitted from future figures, although it will be appreciated that they may be used in accordance with the invention.
FIG. 8 depicts the operation of an MC/DCT decoder 52 in accordance with the prior art. The inputs to the MC/DCT decoder 52 are the error signal and the motion vector data. The error signal is processed by an inverse quantizer 34C and an inverse DCT formulator 36C to produce an error signal within the spatial domain. The spatial domain error signal is added to reference block data from the frame memory 59. The combination of the reference data and the spatial domain error signal results in the reconstructed video, which is the output of the MC/DCT decoder 52. The reconstructed video is also conveyed to the frame memory for use in the next frame cycle.
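The reconstruction step of the decoder can be sketched as follows, assuming the error block has already been returned to the spatial domain and that the sign convention of the MC equation given earlier applies, with the reference block at (x - d_x, y - d_y). The function names are hypothetical, and frames are lists of rows.

```python
def mc_decode_block(frame_memory, error, top, left, dx, dy):
    """Reconstruct one block: reference block from frame memory plus spatial error."""
    size = len(error)
    ref = [row[left - dx:left - dx + size]
           for row in frame_memory[top - dy:top - dy + size]]
    return [[e + p for e, p in zip(er, rr)] for er, rr in zip(error, ref)]

def store_block(frame, block, top, left):
    """Write a reconstructed block into the frame kept for the next frame cycle."""
    for r, row in enumerate(block):
        frame[top + r][left:left + len(row)] = row
```

Unlike the encoder, the decoder performs no search: each block costs one fetch from frame memory and one addition per pixel.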