Video files are composed of a plurality of still image frames, which are shown in rapid succession as a video sequence (typically 15 to 30 frames per second) to create the impression of a moving image. Image frames typically comprise a plurality of stationary background objects, defined by image information that remains substantially the same, and a few moving objects, defined by image information that changes somewhat. In such a case, the image information comprised by the image frames to be shown in succession is typically very similar, i.e. consecutive image frames comprise much redundancy. More particularly, the redundancy comprised by video files can be divided into spatial, temporal and spectral redundancy. Spatial redundancy represents the mutual correlation between adjacent image pixels; temporal redundancy represents the change in given image objects in subsequent frames; and spectral redundancy represents the correlation between different colour components within one image frame.
Several video coding methods utilize the above-described temporal redundancy of consecutive image frames. In this case, so-called motion-compensated temporal prediction is used, wherein the contents of some (typically most) image frames in a video sequence are predicted from the other frames in the sequence by tracking the changes in given objects or areas between consecutive image frames. A video sequence also comprises compressed image frames whose image information is determined without using motion-compensated temporal prediction. Such frames are called INTRA or I frames. Similarly, motion-compensated image frames comprised by a video sequence and predicted from previous image frames are called INTER or P (Predicted) frames. Typically, at least one I frame and possibly one or more previously coded P frames are used in the determination of the image information of a P frame. If a frame is lost, the frames depending on it can no longer be correctly decoded.
For example, JVT is a video coding standard that utilizes motion-compensated temporal prediction. JVT is the current project of the Joint Video Team (JVT) of the ISO/IEC Moving Picture Experts Group (MPEG) and the ITU-T (International Telecommunication Union, Telecommunication Standardization Sector) Video Coding Experts Group (VCEG). It is derived from H.26L, a project of the ITU-T VCEG.
In JVT/H.26L, images are coded using luminance and two colour difference (chrominance) components (Y, CB and CR). The chrominance components are each sampled at half resolution along both co-ordinate axes compared to the luminance component.
Each coded image, as well as the corresponding coded bit stream, is arranged in a hierarchical structure with four layers being, from top to bottom, a picture layer, a picture segment layer, a macroblock (MB) layer and a block layer. The picture segment layer can be either a group of blocks layer or a slice layer.
Data for each slice consists of a slice header followed by data for macroblocks (MBs). The slices define regions within a coded image. Each region is a number of MBs in a normal scanning order. There are no prediction dependencies across slice boundaries within the same coded image. However, temporal prediction can generally cross slice boundaries. Slices can be decoded independently from the rest of the image data. Consequently, slices improve error resilience in packet-lossy networks.
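The slice arrangement described above can be illustrated with a minimal sketch. It assumes fixed-size slices formed from consecutive macroblocks in scanning order; the function names and the fixed slice size are hypothetical and are not taken from JVT/H.26L. The sketch considers only dependencies within one coded image and ignores temporal prediction.

```python
def split_into_slices(mb_count, mbs_per_slice):
    """Partition macroblock indices 0..mb_count-1 into runs in scanning order."""
    if mbs_per_slice < 1:
        raise ValueError("mbs_per_slice must be at least 1")
    return [
        range(start, min(start + mbs_per_slice, mb_count))
        for start in range(0, mb_count, mbs_per_slice)
    ]


def decodable_macroblocks(slices, lost_slices):
    """Return the macroblocks of all slices that were received.

    Because no prediction crosses slice boundaries within the same image,
    losing one slice does not prevent the other slices from being parsed.
    """
    return [
        mb
        for index, mb_range in enumerate(slices)
        if index not in lost_slices
        for mb in mb_range
    ]


# A picture of 99 macroblocks split into slices of 11 macroblocks each.
slices = split_into_slices(99, 11)
received = decodable_macroblocks(slices, lost_slices={3})
print(len(slices), len(received))  # prints "9 88"
```

In this sketch, the loss of one packet carrying slice 3 removes only the 11 macroblocks of that slice from the coded image.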
Each slice is divided into MBs. An MB relates to 16×16 pixels of luminance data and the spatially corresponding 8×8 pixels of chrominance data.
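As a worked example of the sampling structure and the macroblock size given above, the following sketch computes the plane sizes and macroblock counts of a picture. The function name and the returned field names are hypothetical, and the sketch assumes picture dimensions that are multiples of 16.

```python
def macroblock_layout(width, height, mb_size=16):
    """Return plane sizes and macroblock counts for a picture whose
    chrominance planes are sampled at half resolution along both axes."""
    if width % mb_size or height % mb_size:
        raise ValueError("picture dimensions must be multiples of the macroblock size")
    chroma_mb_size = mb_size // 2            # 8x8 chrominance samples per MB
    mbs_per_row = width // mb_size
    mb_rows = height // mb_size
    return {
        "luma_plane": (width, height),
        "chroma_plane": (width // 2, height // 2),   # applies to both CB and CR
        "mbs_per_row": mbs_per_row,
        "mb_rows": mb_rows,
        "mb_count": mbs_per_row * mb_rows,
        # 16x16 luminance samples plus two 8x8 chrominance blocks
        "samples_per_mb": mb_size * mb_size + 2 * chroma_mb_size * chroma_mb_size,
    }


layout = macroblock_layout(176, 144)  # a 176x144 (QCIF) picture
print(layout["chroma_plane"], layout["mb_count"], layout["samples_per_mb"])
# prints "(88, 72) 99 384"
```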
In the JVT/H.26L, a Video Coding Layer (VCL), which provides the core high-compression representation of the video picture content, and a Network Adaptation Layer (NAL), which packages that representation for delivery over a particular type of network, have been conceptually separated. The JVT/H.26L video coder is based on block-based motion-compensated hybrid transform coding. As with prior standards, only the decoding process is precisely specified to enable interoperability, while the processes for capturing, pre-processing, encoding, post-processing, and rendering are all left out of scope to allow flexibility in implementations. However, JVT/H.26L contains a number of new features that enable it to achieve a significant improvement in coding efficiency relative to prior standard designs.
JVT/H.26L is capable of utilizing a recently developed method called reference picture selection. Reference picture selection is a coding technique in which the reference picture for motion compensation can be selected from among multiple pictures stored in the reference picture buffer. In JVT/H.26L, the reference picture can be selected separately for each macroblock. Reference picture selection can be used to improve compression efficiency and error resilience.
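Per-macroblock reference picture selection can be sketched as follows. The selection criterion is an assumption made for illustration: the encoder's choice is not specified here, and the sketch simply picks the buffered picture whose co-located block has the smallest sum of absolute differences (SAD), without any motion search. Blocks are represented as flat lists of sample values, and all names are hypothetical.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(a - b) for a, b in zip(block_a, block_b))


def select_reference(block, candidate_blocks):
    """Return (index, cost) of the candidate with the smallest SAD.

    candidate_blocks holds the co-located block of each picture in the
    reference picture buffer (assumed criterion, zero motion).
    """
    costs = [sad(block, candidate) for candidate in candidate_blocks]
    best = min(range(len(costs)), key=costs.__getitem__)
    return best, costs[best]


current = [10, 10, 10, 10]
reference_buffer = [
    [0, 0, 0, 0],      # picture 0: poor match
    [10, 11, 10, 9],   # picture 1: close match
    [50, 50, 50, 50],  # picture 2: poor match
]
print(select_reference(current, reference_buffer))  # prints "(1, 2)"
```

Because the choice is made per macroblock, different macroblocks of the same frame may refer to different buffered pictures.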
Because of the motion compensation technique used in video coding, random access points have to be encoded in the video sequence to allow scanning of the video from an arbitrary point. Depending on the application used to scan the video sequence, a desirable time span between random access points in a video stream is on the order of 0.5 to 10 seconds. Coding an intra frame has been the conventional solution for coding random access points. However, as the above-mentioned reference picture selection technique allows frames prior to an intra frame to be referenced, an intra frame as such is not a sufficient condition for a random access point. Furthermore, encoding frequent intra frames in the video sequence requires more codec processing capacity and consumes more bandwidth.
Gradual decoder refresh refers to “dirty” random access, where previously coded but possibly non-received data is referred to and correct picture content is recovered gradually in more than one coded picture. In general, the gradual recovering of picture content provided by the gradual decoder refresh random access method is considered a desirable feature in JVT/H.26L video coding. The basic idea of the gradual decoder refresh is to encode a part of the macroblocks of the frames as intra-coded. When the decoder starts decoding at a random point, reference frames for motion compensation are unknown to the decoder, and they are initialised to mid-level grey, for example. The decoder can reconstruct intra-coded macroblocks, but inter-coded macroblocks referring to unknown areas in the motion compensation process cannot be reconstructed correctly. As the cumulative number of intra-coded macroblocks increases gradually frame by frame, a complete reconstructed picture may finally be obtained. However, this implementation involves several problems.
Due to reference picture selection, the motion compensation process may refer to a macroblock in the reference frame that resides outside the region of reliably decodable intra-coded macroblocks.
In the JVT/H.26L, loop filtering is applied across each 4×4 block boundary to fade out abrupt borderlines. Thus, reliable areas may be affected by incorrectly reconstructed pixels in neighbouring macroblocks.
In the motion compensation process, referred non-integer pixel positions are interpolated from pixel values using multi-tap filter(s). In the current JVT codec design, half-pixel positions are interpolated using a six-tap filter. Thus, incorrectly reconstructed pixels may be used to interpolate a referred non-integer pixel position residing inside but close to the border of the reliably decodable area.
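The interpolation problem can be made concrete with a one-dimensional sketch. The tap values (1, -5, 20, 20, -5, 1) with division by 32 are an assumption: they are the half-sample luminance taps of the later H.264 standard and are not given in the text above. The row contents and function names are hypothetical.

```python
TAPS = (1, -5, 20, 20, -5, 1)  # assumed six-tap filter, gain 32


def half_pel(row, i):
    """Interpolate the half-pixel position between row[i] and row[i + 1]."""
    if i - 2 < 0 or i + 3 >= len(row):
        raise IndexError("six-tap window does not fit inside the row")
    window = row[i - 2:i + 4]
    acc = sum(tap * sample for tap, sample in zip(TAPS, window))
    return min(255, max(0, (acc + 16) >> 5))  # round, divide by 32, clip


def uses_unreliable_sample(reliable, i):
    """True if any of the six samples feeding position i + 1/2 is unreliable."""
    return not all(reliable[i - 2:i + 4])


# Samples 0..7 belong to the reliable area (true value 100); samples 8..15
# are unknown to the decoder and were initialised to mid-level grey (128).
row = [100] * 8 + [128] * 8
reliable = [True] * 8 + [False] * 8

for i in (4, 5, 6):
    print(i, half_pel(row, i), uses_unreliable_sample(reliable, i))
# prints "4 100 False", "5 101 True", "6 97 True"
```

Positions 5 + 1/2 and 6 + 1/2 lie inside the reliable area, yet their interpolated values differ from the correct value of 100 because the filter window reaches into the unknown samples.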
When the decoder starts decoding the frames, it assumes all intra-coded macroblocks to be reliable. However, all the aforementioned processes have the effect that the grey image information of the neighbouring macroblocks intermingles with the reliably decodable image information of the intra-coded macroblocks. This causes an error that propagates spatio-temporally as the decoding progresses from one frame to the next.
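For comparison, the idealised behaviour that gradual decoder refresh aims at can be sketched as a reliability map that grows frame by frame. The refresh pattern is an assumption chosen for illustration: whole columns of macroblocks are intra-coded from left to right, and macroblocks in already refreshed columns are assumed to refer only to reliable data, which is exactly the condition the problems listed above violate. All names are hypothetical.

```python
def gradual_refresh(mbs_per_row, mb_rows, intra_columns_per_frame=1):
    """Simulate a left-to-right intra column sweep.

    Returns the number of frames needed for a fully reliable picture and
    the count of reliable macroblocks after each frame.
    """
    if intra_columns_per_frame < 1:
        raise ValueError("at least one column must be refreshed per frame")
    reliable = [[False] * mbs_per_row for _ in range(mb_rows)]
    history = []
    next_column = 0
    while next_column < mbs_per_row:
        last = min(next_column + intra_columns_per_frame, mbs_per_row)
        for column in range(next_column, last):
            for row in range(mb_rows):
                reliable[row][column] = True  # intra-coded, decodable alone
        next_column = last
        history.append(sum(sum(row) for row in reliable))
    return len(history), history


frames, history = gradual_refresh(11, 9)  # 11x9 macroblocks (176x144 picture)
print(frames, history[0], history[-1])    # prints "11 9 99"
```

Under these assumptions the picture becomes fully reliable after 11 frames. In practice, the leakage caused by reference picture selection, loop filtering and fractional-pixel interpolation means that a macroblock marked reliable in such a map may still contain erroneous pixels.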
A further problem in the process of gradual decoder refresh relates to poor coding efficiency when indicating the macroblocks belonging to the initial region, as well as the shape and the growth rate of the region. This information needs to be indicated to the decoder, which always causes some overhead bits to be included in the bitstream of the video sequence. The amount of overhead bits typically increases significantly if all the above-mentioned constraints are signalled separately. Accordingly, there is a need for a more efficient method for indicating to the decoder the pattern in which the region evolves.