A video sequence consists of a series of still pictures or frames. Video compression methods are based on reducing the redundant and perceptually irrelevant parts of video sequences. The redundancy in video sequences can be categorised into spectral, spatial and temporal redundancy. Spectral redundancy refers to the similarity between the different colour components of the same picture, while spatial redundancy results from the similarity between neighbouring pixels in a picture. Temporal redundancy exists because objects appearing in a previous image are also likely to appear in the current image. Compression can be achieved by taking advantage of this temporal redundancy and predicting the current picture from another picture, termed an anchor or reference picture. In practice this is achieved by generating motion compensation data that describes the motion between the current picture and the previous picture.
Video compression methods typically differentiate between pictures that utilise temporal redundancy reduction and those that do not. Compressed pictures that do not utilise temporal redundancy reduction methods are usually called INTRA (or I) frames or pictures. Temporally predicted images are usually forward-predicted from a picture occurring before the current picture and are called INTER or P-frames. In the case of INTER frames, the motion-compensated prediction is rarely precise enough on its own, and therefore a spatially compressed prediction error frame is associated with each INTER frame. INTER pictures may contain INTRA-coded areas.
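The relationship between a motion-compensated prediction and its prediction error can be illustrated with a minimal sketch. The function names, the tiny 4x4 picture and the single motion vector below are assumptions chosen for illustration only; real codecs operate on macroblocks and transform-code the prediction error rather than storing it directly.

```python
def predict_block(reference, top, left, size, mv):
    """Copy a size x size block from the reference picture, displaced by mv = (dy, dx)."""
    dy, dx = mv
    return [[reference[top + dy + r][left + dx + c] for c in range(size)]
            for r in range(size)]


def prediction_error(current_block, predicted_block):
    """Sample-wise difference between the current block and its prediction."""
    return [[cur - pred for cur, pred in zip(cur_row, pred_row)]
            for cur_row, pred_row in zip(current_block, predicted_block)]


def reconstruct(predicted_block, error_block):
    """Decoder side: add the prediction error back onto the prediction."""
    return [[pred + err for pred, err in zip(pred_row, err_row)]
            for pred_row, err_row in zip(predicted_block, error_block)]


# Hypothetical 4x4 reference picture holding the sample values 0..15.
reference = [[row * 4 + col for col in range(4)] for row in range(4)]

# Hypothetical 2x2 block of the current picture located at (top=1, left=1).
current_block = [[4, 5], [8, 10]]

# The content has moved one sample to the right, so the prediction is
# fetched one sample to the left in the reference picture.
predicted = predict_block(reference, top=1, left=1, size=2, mv=(0, -1))
error = prediction_error(current_block, predicted)
restored = reconstruct(predicted, error)
```

In this example the prediction matches the current block except for one sample, so the prediction error block is almost entirely zero and is therefore cheap to compress.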
Many video compression schemes also use temporally bi-directionally predicted frames, which are commonly referred to as B-pictures or B-frames. B-pictures are inserted between anchor picture pairs of I- and/or P-frames and are predicted from either one or both of the anchor pictures. B-pictures normally yield increased compression compared with forward-predicted INTER-coded P-pictures. B-pictures are not used as anchor pictures, i.e. other pictures are not predicted from them. Therefore, they can be discarded (intentionally or unintentionally) without impacting the picture quality of future pictures. Whilst B-pictures may improve compression performance compared with P-pictures, their generation requires greater computational complexity and memory usage, and they introduce additional delays. This may not be a problem for non-real time applications such as video streaming but may cause problems in real-time applications such as video-conferencing.
Thus, as explained above, a compressed video clip typically consists of a sequence of pictures, which can be roughly categorised into temporally independent INTRA pictures, temporally differentially coded INTER pictures and (possibly) bi-directionally predicted B-pictures. Since the compression efficiency of INTRA-coded pictures is normally lower than that of INTER-coded pictures, INTRA pictures are used sparingly, especially in low bit-rate applications. However, because INTRA-coded pictures can be decoded independently of any other picture in the video sequence, each INTRA picture represents an entry (or random access) point into the encoded video sequence, i.e. a point from which decoding can be started. Thus, it is advantageous to include a certain number of INTRA-coded pictures in an encoded video sequence, for example at regular intervals, in order to allow random access into the sequence. Furthermore, a typical video sequence includes a number of scenes or shots. As the picture contents may be significantly different from one scene to another, it is also advantageous to encode the first picture of each new scene in INTRA format. In this way, even if no other INTRA-coded frames are included in the encoded sequence, at least the first frame in each scene provides a random access point. Each independently decodable series of pictures within an encoded video sequence, starting with an INTRA-coded frame (constituting a random access point) and ending at the frame immediately preceding the next INTRA-coded frame, is commonly referred to as a Group of Pictures or GOP for short.
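The GOP definition given above can be expressed as a short sketch that partitions a sequence of frame types, listed in coding order, at each INTRA-coded frame. The function names and the example sequence are illustrative assumptions and are not taken from any standard.

```python
def split_into_gops(frame_types):
    """Partition frame types (in coding order) into groups that each start at an I-frame.

    If the sequence does not begin with an I-frame, the frames before the
    first I-frame are collected in an initial group that is not a true GOP.
    """
    gops = []
    for frame_type in frame_types:
        if frame_type == "I" or not gops:
            gops.append([])
        gops[-1].append(frame_type)
    return gops


def random_access_points(frame_types):
    """Indices of the INTRA-coded frames, i.e. positions where decoding can start."""
    return [index for index, frame_type in enumerate(frame_types) if frame_type == "I"]


# Hypothetical coded sequence with three INTRA-coded frames.
sequence = list("IPPPIPPIP")
gops = split_into_gops(sequence)
entry_points = random_access_points(sequence)
```

Each group returned by `split_into_gops` begins at one of the indices returned by `random_access_points`, which reflects the statement above that every INTRA-coded frame opens a new GOP.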
Some random access operations are generated by the end-user (e.g. a viewer of the video sequence), for example as the result of the user seeking a new position in a streamed video file. In this case, the decoder is likely to get an indication of a user-generated random access operation and can act accordingly. However, in some situations, random access operations are not controlled by the end-user. For example, a spliced or edited stream may contain “cuts” in the coded stream with characteristics similar to random access operations performed by a user. However, in this latter case the decoder may not receive any indication that such a cut has occurred and may not be able to decode subsequent pictures in the sequence correctly. It is therefore important for a video decoder to be provided with a reliable method for detecting random access operations or cuts in an encoded video stream.
Modern video coding standards define a syntax for a self-sufficient video bit-stream. The most popular standards at the time of writing are International Telecommunication Union ITU-T Recommendation H.263, “Video coding for low bit rate communication”, February 1998; International Organization for Standardization/International Electrotechnical Commission ISO/IEC 14496-2, “Generic Coding of Audio-Visual Objects. Part 2: Visual”, 1999 (known as MPEG-4); and ITU-T Recommendation H.262 (ISO/IEC 13818-2) (known as MPEG-2). These standards define a hierarchy for bit-streams and correspondingly for image sequences and images. Development of further video coding standards is still ongoing. In particular, standardisation efforts in the development of a long-term successor for H.263, known as ITU-T H.264|ISO/IEC MPEG-4 part 10, are now being conducted jointly under the auspices of a standardisation body known as the Joint Video Team (JVT) of ISO/IEC MPEG (Moving Picture Experts Group) and ITU-T VCEG (Video Coding Experts Group). Some particular aspects of these standards and, in particular, those features of the H.264 video coding standard relevant to the present invention are described below.
FIG. 1 illustrates a conventional coded picture sequence comprising INTRA-coded I-pictures, INTER-coded P-pictures and bi-directionally coded B-pictures arranged in a pattern having the form I B B P . . . etc. Boxes indicate frames in presentation order, arrows indicate motion compensation, the letters in the boxes indicate frame types and the values in the boxes are frame numbers (as specified according to the H.264 video coding standard), indicating the coding/decoding order of the frames.
The term “leading frame” or “leading picture” is used to describe any frame or picture that cannot be decoded correctly after accessing the previous I-frame randomly and whose presentation time is before the I-frame's presentation time (the B-frames B17 in FIG. 1 are examples of leading frames). In this description, the term “open decoder refresh” (ODR) picture is used to denote a randomly accessible frame with leading pictures.
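The definition of a leading frame can be made concrete with a small sketch. The frame model (a name, a presentation time and a tuple of reference names) and the example sequence are assumptions made for illustration; the sequence is hypothetical and is not a reproduction of FIG. 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Frame:
    name: str
    presentation_time: int
    refs: tuple = ()  # names of the frames this frame is predicted from


def leading_frames(decoding_order, access_index):
    """Names of the leading frames when decoding starts at decoding_order[access_index].

    A frame is reported as leading if it follows the accessed I-frame in
    decoding order, depends (directly or indirectly) on a frame that was
    never decoded, and is presented before the accessed I-frame.
    """
    entry = decoding_order[access_index]
    decodable = {entry.name}
    leading = []
    for frame in decoding_order[access_index + 1:]:
        if all(ref in decodable for ref in frame.refs):
            decodable.add(frame.name)
        elif frame.presentation_time < entry.presentation_time:
            leading.append(frame.name)
    return leading


# Hypothetical sequence in decoding order; presentation order is given by the times.
sequence = [
    Frame("P_prev", 0),
    Frame("I", 3),
    Frame("B1", 1, ("P_prev", "I")),
    Frame("B2", 2, ("P_prev", "I")),
    Frame("P_next", 6, ("I",)),
    Frame("B3", 4, ("I", "P_next")),
    Frame("B4", 5, ("I", "P_next")),
]

# Random access to the I-frame: P_prev has not been received or decoded.
leading = leading_frames(sequence, access_index=1)
```

Here the two B-frames presented before the I-frame depend on the unavailable frame `P_prev` and are therefore leading frames, whereas the frames presented after the I-frame remain decodable.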
Coded frame patterns similar to that shown in FIG. 1 are common and thus it is desirable to make random access to ODR pictures as easy as possible.
A number of alternatives already exist for accessing ODR pictures. A typical solution is simply to discard any leading B-pictures. This is the approach typically adopted in video coding standards that do not allow reference picture selection and decoupling of decoding and presentation order, where an I-picture is always a random access point.
Another solution to the problem is to consider all non-stored frames immediately following an I-frame (in coding/decoding order) as leading frames. While this approach works in the simple case depicted in FIG. 1, it lacks the property of handling stored leading frames. An example of a coding scheme in which there is a stored leading frame before a randomly accessible I-frame is shown in FIG. 2. The simple implicit identification of leading frames, just described, does not work correctly in this example.
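The implicit identification rule just described, and the way it breaks down when a leading frame is itself stored, can be sketched as follows. The representation of a frame as a (name, stored) pair and both example sequences are assumptions made for illustration; they are hypothetical and do not reproduce FIG. 1 or FIG. 2.

```python
def implicit_leading_frames(frames_after_intra):
    """Apply the implicit rule to the frames that follow an I-frame in decoding order.

    frames_after_intra is a list of (name, stored) pairs. Every non-stored
    frame up to the first stored frame is treated as a leading frame.
    """
    leading = []
    for name, stored in frames_after_intra:
        if stored:
            break
        leading.append(name)
    return leading


# Simple case: two non-stored leading B-frames, then the next stored P-frame.
simple_case = [("B_a", False), ("B_b", False), ("P_next", True)]

# Hypothetical case with a stored leading frame: B_s is a leading frame that
# is also kept as a reference for the leading frame B_c.
stored_case = [("B_s", True), ("B_c", False), ("P_next", True)]
actual_leading_in_stored_case = ["B_s", "B_c"]

found_simple = implicit_leading_frames(simple_case)
found_stored = implicit_leading_frames(stored_case)
```

In the simple case the rule finds both leading frames. In the second case it stops at the stored frame `B_s` and reports no leading frames at all, although both `B_s` and `B_c` are leading under the assumptions stated in the comments.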
A further straightforward idea is to consider all B-pictures occurring after an I-frame (in coding/decoding order) as leading pictures. However, leading pictures may not always be B pictures. For example, the scientific article by Miska M. Hannuksela, entitled: “Simple Packet Loss Recovery Method for Video Streaming”, Proceedings of Packet Video Workshop 2001, Kyongju, South Korea, Apr. 30-May 1, 2001 and ITU-T SG16/Q15 document Q15-K38 propose an INTRA-frame postponement method for improved error resiliency in video coding, the adoption of which renders this simple method for the identification of leading frames unworkable. FIG. 3 shows an example of an INTRA frame postponed by one stored frame interval. Consequently, there is one P-frame (P17) preceding the INTRA frame in presentation order.
JVT document JVT-B063 proposes that a frame can be associated with an initialization delay (provided in the video bit-stream as Supplemental Enhancement Information, or SEI) that indicates how long it takes for all subsequent frames in presentation order to be completely correct in content after starting decoding from a particular frame. This initialization delay SEI information may be used when accessing ODR pictures. However, there are three disadvantages associated with this approach. Firstly, the decoder process for handling SEI messages is non-normative, i.e. it is not a mandatory part of the H.264 standard and therefore does not have to be supported by all decoders implemented according to H.264. Thus, there could be a standard-compliant SEI-unaware decoder that accesses a standard-compliant stream randomly but fails to decode it due to absent reference frames for leading pictures. Secondly, the decoder may decode some data, such as stored leading frames, unnecessarily as it does not know that they are not useful for the refresh operation. Thirdly, the decoder operation for referring to missing frame numbers becomes more complicated. Consequently, this approach is not preferred as a solution to the random accessing of ODR pictures.
The H.264 video coding standard (as specified in the JVT committee draft) includes the concepts of “instantaneous decoder refresh” and “independent GOP”. The term instantaneous decoder refresh refers to a “clean” random access method, where no data prior to an INTRA frame is referred to in the decoding process. An independent GOP is a group of pictures that can be decoded independently from previous or later pictures. An “Instantaneous Decoder Refresh” (IDR) picture signals the start of a new independent GOP. Thus, according to H.264, an IDR picture can be used as a random access point. (For further details, see document JVT-B041 which analyzes the requirements for instantaneous decoder refresh, and JVT-C083 which proposes the syntax, semantics, and standard text for the feature.)
Another concept proposed for inclusion in the H.264 video coding standard is that of “gradual decoder refresh” (GDR). This refers to a form of so-called “dirty” random access, where previously coded but possibly non-received data is referred to and the correct picture content is recovered gradually over more than one coded picture. GDR allows random access capabilities using any type of frame. A signaling mechanism for GDR was first proposed in JVT document JVT-B063 (and then in the JVT output document JVT-B109). JVT-B063 concluded that there are basically two fundamental alternatives to initialize the GDR decoding process, “best-effort decoding” and “assured decoding”. In best-effort decoding all unavailable frames are initialized to mid-level gray and decoding of all frames is started but they are considered completely correct in content only after certain indicated conditions are fulfilled. In “assured decoding” the decoder starts decoding from an I-frame and then waits before attempting to decode any more non-I frames to ensure that the remaining frames contain no references to unavailable data. The best-effort alternative was preferred in JVT-B063.
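The initialization step of best-effort decoding can be sketched as follows. The sketch assumes 8-bit samples, for which mid-level gray is taken to be 128, and models the reference buffer as a dictionary keyed by frame number; the function names and the buffer model are illustrative assumptions and are not taken from JVT-B063.

```python
MID_GRAY = 128  # assumed mid-level value for 8-bit samples


def gray_frame(width, height):
    """Create a frame in which every sample is set to mid-level gray."""
    return [[MID_GRAY] * width for _ in range(height)]


def fill_missing_references(reference_buffer, needed_frame_numbers, width, height):
    """Insert a gray frame for every needed reference that was never received."""
    for frame_number in needed_frame_numbers:
        if frame_number not in reference_buffer:
            reference_buffer[frame_number] = gray_frame(width, height)
    return reference_buffer


# Hypothetical situation after random access: frame 5 was received and decoded,
# frame 4 was not, yet the next picture refers to both of them.
buffer = {5: [[10] * 4 for _ in range(2)]}
fill_missing_references(buffer, needed_frame_numbers=[4, 5], width=4, height=2)
```

After this step decoding can proceed for every frame, although pictures predicted from the gray placeholders are not correct in content until the indicated conditions are fulfilled.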
Issues relating to the coding of gradual decoder refresh were studied in JVT document JVT-C074. This document concluded that GDR was impossible to realize using the version of the JVT H.264 codec valid at that time and proposed that a method known as the “isolated region technique” (IREG) should be used for GDR coding.
The isolated region technique was proposed in JVT document JVT-C072. An isolated region is a solid area of macroblocks, defining the shape of the border across which loop filtering should be turned off and to which spatial in-picture prediction is limited. Temporal prediction outside isolated regions in reference frames should be disallowed. The shape of an isolated region may evolve during a number of consecutive coded pictures. The group of pictures (GOP), within which the shape of an isolated region depends on the shape of the corresponding isolated region in a previous picture and which includes the picture containing the initial isolated region coded without temporal prediction, is referred to as a “group of pictures with evolutionary isolated regions” (IREG GOP). The corresponding period (in terms of coded reference frames) is called the “period of evolutionary isolated regions” or “IREG period”.
As mentioned above, IREG provides an elegant solution for enabling GDR functionality and can also be used to provide error resiliency and recovery (see JVT document JVT-C073), region-of-interest coding and prioritization, picture-in-picture functionality, and coding of masked video scene transitions (see document JVT-C075). Gradual random access based on IREG enables media channel switching for receivers and bit-stream switching for a server, and further allows newcomers easy access in multicast streaming applications.
The improved error resiliency property and the gradual decoder refresh property of isolated regions are applicable at the same time. Thus, when an encoder uses isolated regions to achieve gradual decoder refresh, it gets improved error resiliency “for free” without additional bit-rate or complexity cost, and vice versa.
A further concept included in the H.264 video coding standard is that of “flexible macroblock order” (FMO). FMO was first proposed in JVT contribution JVT-C089, and was then included in the JVT committee draft of the H.264 standard. By partitioning pictures into slice groups, FMO allows the coding of macroblocks in an order other than the typical raster scan order. The key application enabled by this mechanism is the implementation of error resilience methods such as scattered slices (see JVT document JVT-C090) and slice interleaving (as proposed in JVT document JVT-C091). Due to its flexibility, other applications of flexible macroblock order are also possible. JVT document JVT-D095 proposes a few enhancements to FMO.
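The effect of partitioning a picture into slice groups on the macroblock coding order can be illustrated with a minimal sketch. The checkerboard map with two slice groups is a hypothetical example of a scattered arrangement and is not the map syntax defined in the H.264 drafts; the function names are likewise assumptions.

```python
def checkerboard_map(mb_cols, mb_rows):
    """Assign each macroblock, in raster order, to slice group 0 or 1 in a checkerboard."""
    return [(x + y) % 2 for y in range(mb_rows) for x in range(mb_cols)]


def coding_order(slice_group_map):
    """Macroblock addresses in coding order: one slice group after another,
    with raster order preserved inside each group."""
    groups = sorted(set(slice_group_map))
    return [address
            for group in groups
            for address, assigned in enumerate(slice_group_map)
            if assigned == group]


# Hypothetical picture that is 4 macroblocks wide and 2 macroblocks high.
group_map = checkerboard_map(mb_cols=4, mb_rows=2)
order = coding_order(group_map)
```

With this map, every macroblock of one slice group is surrounded by macroblocks of the other group, so if one slice group is lost, each missing macroblock still has received neighbours from which it can be concealed.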
Turning off the loop filter at slice boundaries was proposed in document JVT-C117 to improve error resilience and to support perfect GDR. This loop filter limitation has two additional advantages: firstly, it provides a good solution to the parallel processing problem inherent in the FMO technique and, secondly, it is necessary to enable correct decoding of out-of-order slices in time.