Video editing capability is an increasingly requested feature in video playing and/or capturing devices. Transitional effects between different video sequences, logo insertion and overlaying of sequences are among the most widely used editing operations. Video editing tools enable users to apply a set of effects to their video clips, aiming to produce a functionally and aesthetically better representation of their video.
Several commercial products exist for applying video editing effects to video sequences. These software products are targeted mainly at the PC platform. Because processing power, storage and memory are not significant constraints on the PC platform today, the techniques utilized in such video-editing products operate on the video sequences mostly in their raw format in the spatial domain. With such techniques, the compressed video is first decoded, the editing effects are then introduced in the spatial domain, and finally the video is encoded again. This is known as spatial domain video editing operation.
For devices with low resources in processing power, storage space, available memory and battery power, decoding a video sequence and re-encoding it are costly operations that take a long time and consume a lot of battery power. Many of the latest communication devices, such as mobile phones, communicators and PDAs, are equipped with video cameras, offering users the capability to shoot video clips and send them over wireless networks. It is advantageous and desirable to allow users of those communication devices to generate quality video at their terminals. The spatial domain video editing operation is not suitable in wireless cellular environments.
As mentioned above, most video effects are performed in the spatial domain in the prior art. In the case of video blending (transitional effects for fading, etc.) between two or more sequences, for instance, video clips are first decompressed and then the effects are performed according to the following equation:

{tilde over (V)}(x,y,t)=α1V1(x,y,t)+α2V2(x,y,t)  (1)

where {tilde over (V)}(x,y,t) is the edited sequence obtained from the original sequences V1(x,y,t) and V2(x,y,t), and α1 and α2 are two weighting parameters chosen according to the desired effect. Equation (1) is applied in the spatial domain to the various color components of the video sequence, depending on the desired effect.
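Equation (1) can be sketched in a few lines of numpy; this is an illustrative toy example (the function name and the flat test frames are not from the source), not a production blending routine:

```python
import numpy as np

def blend_frames(v1, v2, alpha1, alpha2):
    """Spatial-domain blend per Equation (1): V~ = a1*V1 + a2*V2.

    v1, v2: decoded frames as uint8 arrays of identical shape.
    alpha1, alpha2: scalar weights chosen for the desired effect
    (e.g. alpha1 = 1 - s, alpha2 = s for a cross-fade at step s).
    """
    out = alpha1 * v1.astype(np.float64) + alpha2 * v2.astype(np.float64)
    # Clip back to the valid pixel range before re-encoding.
    return np.clip(out, 0, 255).astype(np.uint8)

# Cross-fade midpoint between a dark and a bright frame.
f1 = np.full((4, 4), 40, dtype=np.uint8)
f2 = np.full((4, 4), 200, dtype=np.uint8)
mid = blend_frames(f1, f2, 0.5, 0.5)
```

For a fade transition, α1 and α2 would be swept over time, e.g. α1 = 1 − t/T and α2 = t/T over the transition length T.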
Finally, the resulting edited image sequence is re-encoded. The major disadvantage of this approach is that it is significantly computationally intensive, especially in the encoding part: the typical complexity ratio between generic encoders and decoders is approximately four. Moreover, using this conventional spatial-domain editing approach, all of the video frames coming right after the transition effect in the second sequence must be re-encoded.
Furthermore, editing operations are often repeated several times by users before the desired result is achieved. This repetition adds to the complexity of the editing operations and requires more processing power. It is therefore important to develop efficient techniques that minimize the decoding and encoding operations by performing such editing effects in the compressed domain.
In order to perform efficiently, video compression techniques exploit spatial redundancy in the frames forming the video. First, the frame data is transformed to another domain, such as the Discrete Cosine Transform (DCT) domain, to decorrelate it. The transformed data is then quantized and entropy coded.
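The transform-and-quantize step described above can be sketched as follows, assuming an orthonormal 8×8 DCT-II and a single scalar quantizer; the helper names and the flat test block are illustrative assumptions, not from the source:

```python
import numpy as np

def dct_matrix(n=8):
    # Orthonormal DCT-II basis matrix C, so that X = C @ x @ C.T.
    k = np.arange(n).reshape(-1, 1)   # frequency index
    i = np.arange(n).reshape(1, -1)   # sample index
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)        # DC row normalization
    return c

def encode_block(block, qp):
    """Transform an 8x8 pixel block to the DCT domain, then scalar-quantize."""
    c = dct_matrix(block.shape[0])
    coeffs = c @ block @ c.T                   # decorrelating transform
    return np.round(coeffs / qp).astype(int)   # scalar quantization

# A flat block decorrelates into a single DC coefficient.
block = np.full((8, 8), 128.0)
q = encode_block(block, qp=8)
```

The quantized coefficients would then be entropy coded; a spatially smooth block concentrates its energy in few coefficients, which is what makes the subsequent entropy coding effective.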
In addition, the compression techniques exploit the temporal correlation between frames: when coding a frame, utilizing the previous, and sometimes the future, frame(s) offers a significant reduction in the amount of data to compress.
The information representing the changes in areas of a frame can be sufficient to represent a consecutive frame. This is called prediction, and the frames coded in this way are called predicted (P) frames or Inter frames. As the prediction cannot be 100% accurate (unless the changes are described for every pixel), a residual frame representing the errors is also used to compensate for the prediction procedure.
The prediction information is usually represented as vectors describing the displacement of objects in the frames. These vectors are called motion vectors. The procedure to estimate these vectors is called motion estimation. The usage of these vectors to retrieve frames is known as motion compensation.
Prediction is often applied to blocks within a frame. The block sizes vary for different algorithms (e.g. 8×8 or 16×16 pixels, or more generally 2^n×2^m pixels, with n and m being positive integers). Some blocks change so significantly between frames that it is better to send all the block data independently of any prior information, i.e. without prediction. These blocks are called Intra blocks.
In video sequences there are frames, which are fully coded in Intra mode. For example, the first frame of the sequence is usually fully coded in Intra mode, because it cannot be predicted from an earlier frame. Frames that are significantly different from previous ones, such as when there is a scene change, are usually also coded in Intra mode. The choice of the coding mode is made by the video encoder. FIGS. 1 and 2 illustrate a typical video encoder 410 and decoder 420 respectively.
The decoder 420 operates on a multiplexed video bit-stream (including video and audio), which is demultiplexed to obtain the compressed video frames. The compressed data comprises entropy-coded quantized prediction-error transform coefficients, coded motion vectors and macroblock type information. The decoded quantized transform coefficients c(x,y,t), where x,y are the coordinates of the coefficient and t stands for time, are inversely quantized to obtain transform coefficients d(x,y,t) according to the following relation:

d(x,y,t)=Q^−1(c(x,y,t))  (3)

where Q^−1 is the inverse quantization operation. In the case of scalar quantization, equation (3) becomes

d(x,y,t)=QP·c(x,y,t)  (4)

where QP is the quantization parameter. In the inverse transform block, the transform coefficients are subject to an inverse transform to obtain the prediction error Ec(x,y,t):

Ec(x,y,t)=T^−1(d(x,y,t))  (5)

where T^−1 is the inverse transform operation, which is the inverse DCT in many compression techniques.
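Equations (4) and (5) can be sketched as follows, assuming an orthonormal DCT so that the inverse transform is a matrix transpose; the helper names and the DC-only test input are illustrative, not from the source:

```python
import numpy as np

def dct_basis(n=8):
    # Orthonormal DCT-II basis C; the inverse transform is x = C.T @ X @ C.
    k = np.arange(n).reshape(-1, 1)
    i = np.arange(n).reshape(1, -1)
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)
    return c

def decode_block(c_coeffs, qp):
    """Equations (4)-(5): d = QP*c, then Ec = T^-1(d)."""
    d = qp * c_coeffs.astype(np.float64)   # inverse scalar quantization (4)
    cm = dct_basis(d.shape[0])
    return cm.T @ d @ cm                   # inverse DCT (5)

# A DC-only quantized block reconstructs to a flat prediction error.
c_in = np.zeros((8, 8), dtype=int)
c_in[0, 0] = 128
ec = decode_block(c_in, qp=8)
```

The reconstructed Ec is then either used directly (Intra blocks) or added to a motion-compensated prediction (Inter blocks), as described next.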
If the block of data is an intra-type macroblock, the pixels of the block are equal to Ec(x,y,t); in fact, as explained previously, there is no prediction:

R(x,y,t)=Ec(x,y,t)  (6)

If the block of data is an inter-type macroblock, the pixels of the block are reconstructed by finding the predicted pixel positions using the received motion vectors (Δx,Δy) on the reference frame R(x,y,t−1) retrieved from the frame memory. The obtained predicted frame is:

P(x,y,t)=R(x+Δx,y+Δy,t−1)  (7)

The reconstructed frame is:

R(x,y,t)=P(x,y,t)+Ec(x,y,t)  (8)
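Equations (7) and (8) can be sketched as follows for a single whole-frame motion vector; this toy example assumes the sampling convention of Equation (7) directly and uses wrap-around shifting for brevity (a real decoder pads or clips at frame edges), so the names and data are illustrative only:

```python
import numpy as np

def reconstruct_inter(ref, mv, residual):
    """Equations (7)-(8): P(x,y,t) = R(x+dx, y+dy, t-1); R = P + Ec.

    ref:      previous reconstructed frame R(.,.,t-1) from frame memory
    mv:       (dx, dy) motion vector
    residual: decoded prediction error Ec(x,y,t)
    """
    dx, dy = mv
    # P at (x, y) samples the reference at (x+dx, y+dy), i.e. a shift
    # of the reference by (-dx, -dy).  np.roll wraps around; boundary
    # handling is omitted in this sketch.
    p = np.roll(np.roll(ref, -dy, axis=0), -dx, axis=1)
    return p + residual          # motion compensation, Equation (8)

ref = np.arange(16.0).reshape(4, 4)   # toy reference frame
res = np.ones((4, 4))                 # toy residual frame
rec = reconstruct_inter(ref, (1, 0), res)
```

In an actual codec the motion vectors vary per macroblock, so the compensation is applied block by block rather than to the whole frame as here.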
In general, blending, transitional effects, logo insertion and frame superposition are editing operations which can be achieved by the following operation:
{tilde over (V)}(x,y,t)=Σ_{i=1}^{N} αi(x,y,t)Vi(x,y,t)  (9)

where {tilde over (V)}(x,y,t) is the edited sequence obtained from the N original sequences Vi(x,y,t), and t is the time index at which the effect takes place. The parameter αi(x,y,t) represents the modifications to be introduced on Vi(x,y,t) for all pixels (x,y) at the desired time t.
For the sake of simplicity, we consider the case when N=2, i.e., the editing is performed using two input sequences. Nevertheless, it is important to stress that all of the following editing discussion can be generalized to N arbitrary input frames producing one edited output frame.
For N=2, Equation (9) can be written as Equation (1):

{tilde over (V)}(x,y,t)=α1(x,y,t)V1(x,y,t)+α2(x,y,t)V2(x,y,t)
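With per-pixel weight maps αi(x,y,t), Equation (9) covers logo insertion and superposition as well as blending. A minimal numpy sketch, with illustrative names and toy data (not from the source):

```python
import numpy as np

def edit(frames, alphas):
    """Equation (9): V~(x,y,t) = sum_i alpha_i(x,y,t) * V_i(x,y,t).

    frames: list of N frames (2-D arrays at one time instant t)
    alphas: list of N per-pixel weight maps of the same shape
    """
    out = np.zeros_like(frames[0], dtype=np.float64)
    for a, v in zip(alphas, frames):
        out += a * v
    return out

# Logo insertion as a special case of (9): alpha2 = 1 inside the
# logo region and 0 elsewhere, with alpha1 = 1 - alpha2.
video = np.full((4, 4), 100.0)
logo = np.full((4, 4), 255.0)
a2 = np.zeros((4, 4))
a2[:2, :2] = 1.0                 # logo occupies the top-left corner
a1 = 1.0 - a2
mixed = edit([video, logo], [a1, a2])
```

Choosing spatially constant αi recovers the cross-fade of Equation (1), while intermediate values of α2 in the logo region would give a translucent overlay.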