Video files are composed of a plurality of still image frames, which are shown rapidly in succession as a video sequence (typically 15 to 30 frames per second) to create the impression of a moving image. Image frames typically comprise a plurality of stationary background objects defined by image information that remains substantially the same, and a few moving objects defined by image information that changes somewhat. In such a case, the image information comprised by the image frames to be shown in succession is typically very similar, i.e. consecutive image frames comprise much redundancy. In fact, the redundancy comprised by video files is divisible into spatial, temporal and spectral redundancy. Spatial redundancy represents the mutual correlation between adjacent image pixels; temporal redundancy represents the change in given image objects in subsequent frames; and spectral redundancy represents the correlation between different colour components within one image frame.
Several video coding methods utilize the above-described temporal redundancy of consecutive image frames. In this case, so-called motion-compensated temporal prediction is used, wherein the contents of some (typically most) image frames in a video sequence are predicted from the other frames in the sequence by tracking the changes in given objects or areas in the image frames between consecutive image frames. A video sequence comprises compressed image frames, whose image information is determined without using motion-compensated temporal prediction. Such frames are called INTRA or I frames. Similarly, motion-compensated image frames comprised by a video sequence and predicted from previous image frames are called INTER or P frames (Predicted). One I frame and possibly one or more previously coded P frames are used in the determination of the image information of P frames. If a frame is lost, frames depending thereon can no longer be correctly decoded.
Typically, an I frame initiates a video sequence defined as a Group of Pictures (GOP), in which the image information of the P frames can be defined using only the I frames comprised by said group of pictures GOP and previous P frames. The following I frame again initiates a new group of pictures GOP, and the image information of the frames comprised by it thus cannot be defined on the basis of the frames in a previous group of pictures GOP. Accordingly, groups of pictures GOP do not temporally overlap and each group of pictures can be independently decoded. In addition, many video compression methods use bi-directionally predicted B frames, which are placed between two anchor frames (I and P frame or two P frames) within a group of pictures GOP, and the image information of the B frame is predicted from both the previous anchor frame and the anchor frame following the B frame. B frames thus provide image information of a better quality than do P frames, but they are typically not used as anchor frames, and discarding them from the video sequence therefore does not cause any deterioration of the quality of subsequent pictures.
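The frame dependencies described above can be illustrated with a minimal sketch. It assumes a simplified model that is not taken from any particular standard: each P frame depends on the nearest preceding anchor frame, each B frame depends on the nearest anchor frame on either side of it, and the hypothetical function `decodable_frames` works on a single group of pictures given as a string of frame types.

```python
def decodable_frames(gop, lost):
    """Return indices of the frames in one GOP that can still be decoded.

    gop  -- frame types in display order, e.g. "IBBPBBP" (illustrative only)
    lost -- set of indices of frames that were lost or deliberately discarded
    """
    # First pass: an anchor (I or P) is usable only if it was received and,
    # for a P frame, if the anchor before it is usable as well.
    anchor_ok = {}
    prev_ok = False
    for i, ftype in enumerate(gop):
        if ftype == "I":
            prev_ok = i not in lost
            anchor_ok[i] = prev_ok
        elif ftype == "P":
            prev_ok = prev_ok and i not in lost
            anchor_ok[i] = prev_ok

    # Second pass: collect every received frame whose references are usable.
    result = []
    for i, ftype in enumerate(gop):
        if i in lost:
            continue
        if ftype in "IP":
            if anchor_ok[i]:
                result.append(i)
        else:  # B frame: needs the surrounding anchors on both sides
            before = [a for a in anchor_ok if a < i]
            after = [a for a in anchor_ok if a > i]
            if before and after and anchor_ok[max(before)] and anchor_ok[min(after)]:
                result.append(i)
    return result


gop = "IBBPBBP"
print(decodable_frames(gop, lost={1, 2}))  # dropping B frames harms nothing else
print(decodable_frames(gop, lost={3}))     # dropping a P frame breaks what follows
```

In this model, discarding the two B frames leaves every other frame decodable, whereas losing the first P frame leaves only the I frame decodable.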
Each image frame is divisible into macro blocks that comprise the colour components (e.g. Y, U, V) of all pixels from a rectangular image area. More precisely, a macro block is composed of three blocks, each block comprising colour values (e.g. Y, U or V) from one colour layer of the pixels from said image area. The spatial resolution of the blocks may be different from that of the macro block; for example, components U and V can be presented at only half the resolution compared with component Y. Macro blocks can also be used to form, for example, slices, which are groups of several macro blocks wherein the macro blocks are typically selected in the image scanning order. In fact, in video coding methods, temporal prediction is typically performed block-specifically or macro block-specifically, not image frame-specifically.
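The partitioning of a frame into macro blocks and slices can be sketched as follows. The macro block size of 16 by 16 luminance pixels and the halving of chrominance resolution in both directions are assumptions chosen for illustration (they are common in practice but are not stated above), and all function names are hypothetical.

```python
MB_SIZE = 16  # assumed macro block edge in luminance pixels


def macro_block_grid(width, height, mb_size=MB_SIZE):
    """Return (columns, rows) of macro blocks needed to cover the frame."""
    cols = (width + mb_size - 1) // mb_size   # round up at the right edge
    rows = (height + mb_size - 1) // mb_size  # round up at the bottom edge
    return cols, rows


def block_shapes(mb_size=MB_SIZE, chroma_divisor=2):
    """Return the (width, height) of the Y, U and V blocks of one macro block."""
    c = mb_size // chroma_divisor  # U and V at reduced resolution
    return {"Y": (mb_size, mb_size), "U": (c, c), "V": (c, c)}


def scan_order_slices(width, height, mbs_per_slice, mb_size=MB_SIZE):
    """Group macro block indices into slices following the image scanning order."""
    cols, rows = macro_block_grid(width, height, mb_size)
    total = cols * rows
    return [list(range(start, min(start + mbs_per_slice, total)))
            for start in range(0, total, mbs_per_slice)]


print(macro_block_grid(176, 144))            # a 176 x 144 frame
print(block_shapes())
print(len(scan_order_slices(176, 144, 11)))  # one slice per macro block row
```

With these assumptions a 176 x 144 frame is covered by 11 x 9 macro blocks, and choosing 11 macro blocks per slice yields one slice per macro block row.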
Many video materials, such as news, music videos and movie trailers, comprise rapid cuts between different image material scenes. Sometimes cuts between different scenes are abrupt, but often a scene transition is used, i.e. the transition from scene to scene takes place for instance by fading, wiping, tiling or rolling the image frames of a previous scene, and by bringing forth the image frames of a subsequent scene. As regards coding efficiency, the video coding of a scene transition is often a serious problem, since the image frames during a scene transition comprise information on the image frames of both the ending scene and the beginning scene.
A typical scene transition, fading, is performed by lowering the intensity or luminance of the image frames in a first scene gradually to zero and simultaneously raising the intensity of the image frames in a second scene gradually to its maximum value. Such a scene transition is called a cross-faded scene transition. A second typical scene transition, tiling, is performed by randomly or pseudo-randomly discarding square parts from the image frames of a first scene, and replacing the discarded parts with bits taken from the corresponding places in a second scene. Some typical scene transitions, such as roll, push, door etc., are accomplished by ‘fixing’ the first image frames on the surface of a virtual object (a paper sheet, a sliding door or an ordinary door) or some other arbitrary object, and turning this object or piece gradually away from sight, whereby information about the image frames of a second scene is copied to the emerging image areas. Many other transitions are known and used in several commercially available products, such as Avid Cinema™ (Avid Technology Inc.).
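The fading and tiling transitions described above can be sketched for small greyscale frames held as nested lists. This is only an illustration under stated assumptions: the linear intensity ramp, the rounding, the fixed random seed and the function names `cross_fade`, `tiling_order` and `tile_transition` are all hypothetical choices, not part of any coding standard or product.

```python
import random


def cross_fade(frame_a, frame_b, step, total_steps):
    """Blend two frames: frame_a fades out while frame_b fades in (linear ramp assumed)."""
    w = step / total_steps  # 0.0 at the start of the transition, 1.0 at the end
    return [[round((1 - w) * a + w * b) for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(frame_a, frame_b)]


def tiling_order(tiles_x, tiles_y, seed=0):
    """Return all tile coordinates in a reproducible pseudo-random order."""
    order = [(x, y) for y in range(tiles_y) for x in range(tiles_x)]
    random.Random(seed).shuffle(order)
    return order


def tile_transition(frame_a, frame_b, tile, order, n_replaced):
    """Replace the first n_replaced square tiles of frame_a with those of frame_b."""
    out = [row[:] for row in frame_a]
    for tx, ty in order[:n_replaced]:
        for y in range(ty * tile, min((ty + 1) * tile, len(out))):
            for x in range(tx * tile, min((tx + 1) * tile, len(out[0]))):
                out[y][x] = frame_b[y][x]
    return out


first = [[200, 100]]
second = [[0, 50]]
print(cross_fade(first, second, 2, 4))  # halfway through the fade

old = [[0] * 4 for _ in range(4)]
new = [[1] * 4 for _ in range(4)]
order = tiling_order(2, 2)
print(tile_transition(old, new, 2, order, 1))  # one 2 x 2 tile replaced
```

Calling either function for every step of the transition yields the sequence of intermediate frames that an encoder would then have to compress.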
Present video coding methods utilize several methods of coding scene transitions. For example, in the coding according to the ITU-T (International Telecommunication Union, Telecommunication Standardization Sector) H.263 standard, the above-described B frames are usable for presenting image frames during a scene transition. In this case, one image frame from a first (ending) scene and one image frame from a second (beginning) scene are selected as anchor frames. The image information of the B frames inserted between these during the scene transition is defined from these anchor frames by temporal prediction such that the pixel values of the predicted image blocks are calculated as average values of the pixel values of the motion-compensated prediction blocks of the anchor frames.
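The averaging of motion-compensated prediction blocks can be sketched as below. The sketch assumes integer motion vectors that stay inside the frame and a round-half-up integer average; the exact rounding rule and the helper names `fetch_block` and `bidirectional_prediction` are assumptions made for illustration and are not taken from the H.263 text.

```python
def fetch_block(frame, x, y, size, mv):
    """Copy a size x size block from frame, displaced from (x, y) by motion vector mv."""
    dx, dy = mv
    return [[frame[y + dy + j][x + dx + i] for i in range(size)]
            for j in range(size)]


def bidirectional_prediction(prev_anchor, next_anchor, x, y, size, mv_fwd, mv_bwd):
    """Predict a B-frame block as the average of two motion-compensated blocks."""
    fwd = fetch_block(prev_anchor, x, y, size, mv_fwd)  # from the previous anchor
    bwd = fetch_block(next_anchor, x, y, size, mv_bwd)  # from the following anchor
    # Integer average with rounding half up (assumed rounding rule).
    return [[(a + b + 1) // 2 for a, b in zip(row_f, row_b)]
            for row_f, row_b in zip(fwd, bwd)]


prev_anchor = [[r * 4 + c for c in range(4)] for r in range(4)]  # values 0..15
next_anchor = [[10] * 4 for _ in range(4)]
print(bidirectional_prediction(prev_anchor, next_anchor, 1, 1, 2, (-1, -1), (0, 0)))
```

Because both anchors always receive equal weight, every B frame between the same pair of anchors is predicted in the same way regardless of its position in the transition.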
As regards coding efficiency, such a solution is, however, disadvantageous particularly if coding the scene transition requires that several B frames be inserted between the anchor frames. In fact, the coding has been improved in the ITU-T H.26L standard such that the image information of the B frames inserted between the anchor frames during the scene transition is defined from these anchor frames by temporal prediction such that the pixel values of the B image frames are calculated as weighted average values of the pixel values of the anchor frames based on the temporal distance of each B frame from both anchor frames. This improves the coding efficiency of scene transitions made by fading, in particular, and also the quality of the predicted B frames.
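The benefit of weighting by temporal distance can be seen in a small sketch. It assumes evenly spaced B frames and linear weights, so that the nearer anchor frame receives the larger weight; the precise formula of the H.26L draft is not reproduced here, motion compensation is omitted, and the function names are hypothetical.

```python
def temporal_weights(b_index, num_b_frames):
    """Return (weight of previous anchor, weight of following anchor).

    b_index      -- position of the B frame between the anchors, 1..num_b_frames
    num_b_frames -- number of B frames inserted between the two anchors
    """
    span = num_b_frames + 1        # temporal distance between the two anchors
    w_next = b_index / span        # grows as the B frame nears the next anchor
    return 1.0 - w_next, w_next


def weighted_prediction(prev_block, next_block, b_index, num_b_frames):
    """Predict a B-frame block as a temporally weighted average of two anchor blocks."""
    w_prev, w_next = temporal_weights(b_index, num_b_frames)
    return [[round(w_prev * a + w_next * b) for a, b in zip(row_p, row_n)]
            for row_p, row_n in zip(prev_block, next_block)]


# A fade from intensity 200 to 0 with three B frames between the anchors.
for k in (1, 2, 3):
    print(k, weighted_prediction([[200]], [[0]], k, 3))
```

For this fade the weighted predictions follow the intensity ramp across the three B frames, whereas a plain average would predict the same midpoint value for all of them and leave a larger prediction error to be coded.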
Generally speaking, it is feasible that a computer-generated image is made of layers, i.e. image objects. Each of these image objects is definable by three types of information: the texture of the image object, its shape and transparency, and the layering order (depth) relative to the background of the image and other image objects. For example, MPEG-4 video coding uses some of these information types and the parameter values defined for them in coding scene transitions.
Shape and transparency are often defined using an alpha plane, which measures non-transparency, i.e. opacity, and whose value is usually defined separately for each image object, possibly excluding the background, which is usually defined as opaque. It can be defined that the alpha plane value of an opaque image object, such as the background, is 1.0, whereas the alpha plane value of a fully transparent image object is 0.0. Intermediate values define how strongly a given image object is visible in the image relative to the background and other at least partly superposed image objects that have a higher depth value relative to said image object.
Layering image objects on top of each other according to their shape, transparency and depth position is called scene composition. In practice, this is based on the use of weighted average values. The image object closest to the background, i.e. positioned the deepest, is first positioned on top of the background, and a combined image is created from these. The pixel values of the composite image are determined as an average value weighted by the alpha plane values of the background image and said image object. The alpha plane value of the combined image is then defined as 1.0, and it then becomes the background image for the following image object. The process continues until all image objects are combined with the image.
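The composition procedure described above can be sketched as follows for small greyscale images held as nested lists. The sketch assumes that a larger depth value means farther from the viewer and that each image object carries a per-pixel alpha plane of the same size as the image; the tuple layout and the name `compose_scene` are hypothetical.

```python
def compose_scene(background, objects):
    """Layer image objects on an opaque background, deepest object first.

    background -- rows of pixel values; its alpha plane value is taken as 1.0
    objects    -- list of (depth, pixels, alpha_plane) tuples, where a larger
                  depth means farther from the viewer (assumed convention)
    """
    image = [row[:] for row in background]
    for _depth, pixels, alpha in sorted(objects, key=lambda o: o[0], reverse=True):
        # Weighted average of the object and the image composed so far. The
        # result is treated as fully opaque and becomes the new background.
        image = [[a * p + (1.0 - a) * b for p, a, b in zip(p_row, a_row, b_row)]
                 for p_row, a_row, b_row in zip(pixels, alpha, image)]
    return image


background = [[100.0, 100.0]]
near = (1, [[0.0, 50.0]], [[1.0, 0.5]])     # closest to the viewer
far = (2, [[200.0, 200.0]], [[0.5, 0.0]])   # closest to the background
print(compose_scene(background, [near, far]))
```

In the usage example the far object is blended onto the background first; the near object then fully covers the first pixel and is mixed in equal parts with the intermediate image at the second pixel.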
The above-described process for coding a scene transition is used for instance in MPEG-4 video coding such that image frames in a beginning scene are typically selected as background images, whose opacity has a full value, and the opacity of image frames in an ending scene, the frames being ‘image objects’ to be positioned on top of the background, is reduced during the scene transition. When the opacity, i.e. alpha plane value, of the image frames of the ending scene reaches zero, only the image frame of the beginning scene is visible in the final image frame.
However, prior art scene transition coding involves several problems. The use of weighted anchor frame average values in the prediction of B frames does not work well in situations wherein the duration of the scene transition is long and the images include much motion, which considerably lowers the compression efficiency of coding based on temporal prediction. If the B pictures used in the scene transition are used for traffic shaping for instance in a streaming server, the image rate of the transmitted sequence temporarily decreases during the scene transition, which is usually observed as image jerks.
A problem in the method used in MPEG-4 video coding is the complexity of coding a scene transition. In MPEG-4 video coding, scene composition always takes place by means of a system controlling the video coding and decoding, since an individual MPEG-4 video sequence cannot contain the information required for composing a scene from two or more video sequences. Consequently, composing a scene transition requires control-level support for the actual process and simultaneous transfer of two or more video sequences, which typically requires a wider bandwidth, at least temporarily.