A typical video stream comprises a sequence of pictures, often referred to as frames. The frames comprise pixels arranged into a rectangular form. In existing video coding standards, such as H.261, H.262, H.263, H.264 and MPEG-4, three main types of pictures are defined: Intra frames (I-frames), Predictive frames (P-frames) and Bi-directional frames (B-frames). Each picture type exploits a different type of redundancy in a sequence of images and consequently results in a different level of compression efficiency and, as explained in the following, provides different functionality within the encoded video sequence. An intra frame is a frame of video data that is coded by exploiting only the spatial correlation of the pixels within the frame itself, without using any information from past or future frames.
Intra frames are used as the basis for decoding/decompression of other frames and provide access points to the coded sequence where decoding can begin.
A predictive frame is a frame that is encoded/compressed using motion compensated prediction from a so-called reference frame, i.e. one or more previous/subsequent Intra frames or Predictive frames available in an encoder or in a decoder. A bi-directional frame is a frame that is encoded/compressed by prediction from a previous Intra frame or Predictive frame and/or a subsequent Intra frame or Predictive frame.
Since adjacent frames in a typical video sequence are highly correlated, higher compression can be achieved when using Bi-directional or Predictive frames instead of Intra frames. On the other hand, when temporal predictive coding is employed within the coded video stream, B-frames and/or P-frames cannot be decoded without correctly decoding all the previous and/or subsequent reference frames which were used in coding the Bi-directional and Predictive frames. In situations in which the reference frame(s) used in the encoder and the respective reference frame(s) in the decoder are not identical, either due to errors during transmission or due to some intentional action on the transmitting side, the subsequent frames that make use of prediction from such a reference frame cannot be reconstructed on the decoding side to yield a decoded frame identical to that originally encoded on the encoding side. This mismatch is not confined to a single frame but propagates further in time due to the use of motion compensated coding.
FIGS. 1A-1C illustrate the types of encoded/compressed video frames used in a typical video encoding/decoding system. For example, prior to encoding, the pictures of the video sequence are represented by three matrices of multiple-bit numbers, one representing the luminance (brightness) of the image pixels, and the other two each representing a respective one of two chrominance (color) components. FIG. 1A depicts the way in which an Intra frame 200 is encoded using only image information present in the frame itself. FIG. 1B illustrates construction of a Predictive frame 210. Arrow 205a represents the use of motion compensated prediction to create the P-frame 210. FIG. 1C depicts construction of Bi-directional frames 220. B-frames are usually inserted between I-frames or P-frames. FIG. 2 represents a group of pictures in display order and illustrates how B-frames are inserted between I- and P-frames, as well as showing the direction in which motion compensation information flows. In FIGS. 1B, 1C and 2, arrows 205a depict forward motion compensation prediction information necessary to reconstruct P-frames 210, whereas arrows 215a and 215b depict motion compensation information used in reconstructing B-frames 220 in the forward direction (215a) and the backward direction (215b). In other words, the arrows 205a and 215a show the flow of information when predictive frames are predicted from frames that are earlier in display order than the frame being reconstructed, and arrows 215b show the flow of information when predictive frames are predicted from frames that are later in display order than the frame being reconstructed.
In motion compensated prediction, the similarity between successive frames in a video sequence is utilized to improve coding efficiency. More specifically, so-called motion vectors are used to describe the way in which pixels or regions of pixels move between successive frames of the sequence. The motion vectors provide offset values and error data that refer to a past or a future frame of video data having decoded pixel values that may be used with the error data to compress/encode or decompress/decode a given frame of video data.
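The reconstruction step described above can be illustrated with a minimal sketch. It assumes a frame stored as a list of rows of 8-bit samples, a whole-pixel motion vector given as a (dy, dx) offset, and error data given as a block of signed integers; the function name and data layout are hypothetical and are not taken from any standard.

```python
def reconstruct_block(reference, top, left, motion_vector, residual):
    """Rebuild one block from a reference frame, a motion vector and error data.

    reference      -- decoded reference frame, a list of rows of ints
    top, left      -- position of the block in the frame being decoded
    motion_vector  -- (dy, dx) whole-pixel offset into the reference frame
    residual       -- error data for the block, a list of rows of signed ints
    """
    dy, dx = motion_vector
    block = []
    for r, residual_row in enumerate(residual):
        row = []
        for c, error in enumerate(residual_row):
            predicted = reference[top + dy + r][left + dx + c]
            # Clip to the 8-bit sample range assumed for this sketch.
            row.append(max(0, min(255, predicted + error)))
        block.append(row)
    return block


reference = [
    [10, 20, 30, 40],
    [50, 60, 70, 80],
    [90, 100, 110, 120],
    [130, 140, 150, 160],
]
# A 2x2 block at (0, 0), predicted from the region one row down and one column right.
decoded = reconstruct_block(reference, 0, 0, (1, 1), [[1, -1], [0, 2]])
print(decoded)  # [[61, 69], [100, 112]]
```

Only the offset and the error data need to be transmitted for the block; the pixel values themselves come from the reference frame, which is why the decoder must hold the same reference frame as the encoder.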
Decoding/decompressing a P-frame requires the availability of the previous I- or P-reference frame; decoding a B-frame additionally requires the availability of the subsequent I- or P-reference frame. For example, if an encoded/compressed data stream has the following frame sequence or display order:
I1B2B3P4B5P6B7P8B9B10P11 . . . Pn−3Bn−2Pn−1In,
the corresponding decoding order is:
I1P4B2B3P6B5P8B7P11B9B10 . . . Pn−1Bn−2In.
The decoding order differs from the display order because the B-frames require future I- or P-frames for their decoding. FIG. 2 displays the beginning of the above frame sequence and can be referred to in order to understand the dependencies of the frames, as described earlier. P-frames require the previous I- or P-reference frame be available. For example, P4 requires I1 to be decoded. Similarly, frame P6 requires that P4 be available in order to decode/decompress frame P6. B-frames, such as frame B3, require a past and/or a future I- or P-reference frame, such as P4 and I1 in order to be decoded. B-frames are frames between I- or P-frames during encoding.
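The reordering between display order and decoding order can be sketched as follows. The sketch assumes that frames are identified by labels whose first letter gives the frame type and that every B-frame depends on the next I- or P-frame in display order; the function name is hypothetical.

```python
def display_to_decoding_order(display_order):
    """Move each I- or P-frame ahead of the B-frames that precede it in display order."""
    decoding_order = []
    pending_b = []  # B-frames waiting for their future reference frame
    for frame in display_order:
        if frame.startswith("B"):
            pending_b.append(frame)
        else:
            # The reference frame is decoded first, then the B-frames that need it.
            decoding_order.append(frame)
            decoding_order.extend(pending_b)
            pending_b = []
    # Trailing B-frames with no later reference frame are emitted unchanged.
    decoding_order.extend(pending_b)
    return decoding_order


display = ["I1", "B2", "B3", "P4", "B5", "P6", "B7", "P8", "B9", "B10", "P11"]
decoding = display_to_decoding_order(display)
print(" ".join(decoding))  # I1 P4 B2 B3 P6 B5 P8 B7 P11 B9 B10
```

The output reproduces the beginning of the decoding order listed above.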
Video streaming has emerged as an important application in the fixed Internet. It is further anticipated that video streaming will also be important in the future of 3G wireless networks. In streaming applications the transmitting server starts transmitting a pre-encoded video bit stream via a transmission network to a receiver upon a request from the receiver. The receiver plays the video stream back while receiving it. The best-effort nature of present networks causes variations in the effective bandwidth available to a user due to the changing network conditions. To accommodate these variations, the transmitting server can scale the bit rate of the compressed video. In the case of a conversational service characterized by real-time encoding and point-to-point delivery, this can be achieved by adjusting the source encoding parameters on the fly. Such adjustable parameters can be, for example, a quantisation parameter, or a frame rate. The adjustment is advantageously based on feedback from the transmission network. In typical streaming scenarios when a previously encoded video bit stream is to be transmitted to the receiver, the above solution cannot be applied.
One solution to achieve bandwidth scalability in case of pre-encoded sequences is to produce multiple and independent streams having different bit-rates and quality. The transmitting server then dynamically switches between the streams to accommodate variations in the available bandwidth. The following example illustrates this principle. Let us assume that multiple bit streams are generated independently with different encoding parameters, such as quantisation parameter, corresponding to the same video sequence. Let {P1,n−1, P1,n, P1,n+1} and {P2,n−1, P2,n, P2,n+1} denote the sequence of decoded frames from bit streams 1 and 2, respectively. Since the encoding parameters are different for the two bit streams, frames reconstructed from them at the same time instant, for example, frames P1,n−1 and P2,n−1, are not identical. If it is now assumed that the server initially sends encoded frames from bit stream 1 up to time n after which it starts sending encoded frames from bit stream 2, the decoder receives frames {P1,n−2, P1,n−1, P2,n, P2,n+1, P2,n+2}. In this case P2,n cannot be correctly decoded since its reference frame P2,n−1 is not received. On the other hand, the frame P1,n−1, which is received instead of P2,n−1, is not identical to P2,n−1.
Therefore, switching between bit streams at arbitrary locations leads to visual artefacts due to the mismatch between the reference frames used for motion compensated prediction in the different sequences. These visual artefacts are not confined to the frame at the switching point between bit streams, but propagate in time due to the continued motion compensated coding in the remaining part of the video sequence.
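The propagation of the mismatch can be reproduced with a toy model. The sketch below is an illustration under strong simplifying assumptions: each frame is reduced to a single sample value, prediction is simply the previous reconstructed value, and the two bit streams differ only in a uniform quantiser step. All names and values are hypothetical.

```python
def encode(source, step):
    """Predictively encode samples; return the quantised levels and the reconstructions."""
    levels, recon = [], []
    reference = 0  # the first sample is predicted from zero, acting like an intra frame
    for sample in source:
        level = (sample - reference + step // 2) // step  # quantise the prediction error
        reference += level * step                         # encoder tracks the decoder state
        levels.append(level)
        recon.append(reference)
    return levels, recon


def decode(levels, step, reference=0):
    """Decode quantised levels, starting from the given reference value."""
    recon = []
    for level in levels:
        reference += level * step
        recon.append(reference)
    return recon


source = [100, 104, 109, 115, 118, 121]
levels1, recon1 = encode(source, step=2)   # bit stream 1: fine quantiser
levels2, recon2 = encode(source, step=8)   # bit stream 2: coarse quantiser

n = 3  # the server switches from bit stream 1 to bit stream 2 at frame n
before = decode(levels1[:n], step=2)
after = decode(levels2[n:], step=8, reference=before[-1])  # wrong reference frame

drift = [got - want for got, want in zip(after, recon2[n:])]
print(drift)  # [-2, -2, -2]
```

The decoder starts bit stream 2 from the reconstruction of bit stream 1, so every frame after the switch differs from what the encoder of bit stream 2 intended, and the error does not decay in this model.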
A video streaming/delivery system inevitably suffers from video quality degradation due to transmission errors. The transmission errors can be roughly classified into random bit errors and erasure errors (packet loss). Many error control and concealment techniques try to avoid this problem by forward error concealment, post-processing and interactive error concealment. The predictive video coding mechanism has low tolerance to packet loss, since the error caused by a missing block will propagate and thus create objectionable visual distortion. Intra macroblock insertion, which is based on forward error concealment, can stop the error propagation by introducing a self-contained intra macroblock and concealing the erroneous block. The problem with the introduced intra macroblock is that the coding of such a macroblock increases the amount of information in the bit stream, thus reducing coding efficiency, and that it is not scalable.
A good error resilience tool is important when retransmission of lost packets is not possible. One such tool is the Adaptive Intra Refresh (AIR) system described in the MPEG-4 standard (Worral, “Motion Adaptive Intra Refresh for MPEG-4”, Electronics Letters November 2000). Worral describes inserting intra macroblocks at later and later positions in succeeding frames as part of a motion-adaptive scheme. Deciding when to insert the macroblocks (when bandwidth is available for that frame) is shown to benefit from identifying image areas with high motion. Worral notes that his approach is backward-compatible with the standard (does not require a standard change). The encoder moves down the frame encoding intra macroblocks until the preset number of macroblocks has been encoded. For the next frame the encoder starts in the same position, and begins encoding intra macroblocks.
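The frame-by-frame progression of the refresh position can be sketched as a simple cyclic schedule. This is a simplified illustration only: it assumes that macroblocks are numbered in raster order, that each frame resumes at the position where the previous frame stopped, and it ignores the motion-adaptive selection described by Worral. The function name and parameters are hypothetical.

```python
def intra_refresh_schedule(total_macroblocks, per_frame, num_frames):
    """List, for each frame, the macroblock indices that are coded as intra."""
    schedule = []
    position = 0  # where the encoder resumes in the next frame
    for _ in range(num_frames):
        refreshed = [(position + i) % total_macroblocks for i in range(per_frame)]
        position = (position + per_frame) % total_macroblocks
        schedule.append(refreshed)
    return schedule


schedule = intra_refresh_schedule(total_macroblocks=6, per_frame=2, num_frames=4)
for frame_index, macroblocks in enumerate(schedule):
    print(frame_index, macroblocks)
```

With six macroblocks and two refreshed per frame, every macroblock is intra coded once every three frames, and the number of intra macroblocks per frame is fixed in advance regardless of the channel conditions.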
The purpose of inserting intra macroblocks is to minimize, and ultimately stop, the propagation of artefacts caused by an erroneous macroblock. Another alternative is the Random Intra Refresh (RIR) used in the JM61e H.264 reference software, where intra macroblocks are randomly inserted. However, once an intra macroblock has been inserted it cannot be replaced by a predicted block, which in general is much smaller in size. In other words, the coding efficiency is fixed for systems based on the Adaptive Intra Refresh or the Random Intra Refresh. For a wireless connection the packet loss rate varies over time, and schemes such as AIR cannot take the packet loss rate into account to optimize performance. In other words, the error protection of AIR is non-scalable. In good connection conditions the quality is not optimal because of the inserted intra blocks.
It is important for a video streaming server to be able to adapt to different connection conditions and different network types, such as wired and wireless networks. A bitstream switching scheme in which multiple bitstreams are used provides a low complexity way for a server to adapt to varying connection conditions without re-encoding the video content, which would require high computation power. However, switching from one bitstream to another produces a pixel drift problem if the switching takes place at a predicted frame. Since the reference frame is taken from another bitstream, the mismatch propagates and thus degrades the video quality.
The problem with bitstream switching is that the switching point must be an intra frame (key frame); otherwise a pixel mismatch which degrades the video quality will persist until the next intra frame. During a video streaming session it is desirable that the switching can take place at any frame. However, it is not easy to implement such a system without a significant reduction in coding efficiency.
Regular intra frames can be used to provide switching points. However, the more frequent the intra frames, the more bits are required, which lowers the video quality. One scheme provides an extra bitstream consisting entirely of intra frames at a certain period of, say, one second; during switching, an intra frame from this bitstream is used, which minimizes the prediction error. Another simple technique is just to switch at any frame, which in general suffers quite significantly from pixel drift.
Correct (mismatch-free) switching between video streams can be enabled by forming a special type of compressed video frame and inserting frames of the special type into video bit streams at locations where switching from one bit stream to another is to be allowed. The patent application WO02054776 describes switching frames which enable the system to perform the switching from one bit stream to another without the need to insert Intra frames into the bit stream at the switching locations. The special type of compressed video frame will be referred to generally as an S-frame (Switching). More specifically, S-frames may be classified as SP-frames (Switching Predictive), which are formed at the decoder using motion compensated prediction from already decoded frames using motion vector information, and SI-frames (Switching Intra), which are formed at the decoder using spatial (intra) prediction from already decoded neighbouring pixels within the frame being decoded. In general, an S-frame is formed on a block-by-block basis and may comprise both inter-coded (SP) blocks and intra-coded (SI) blocks.
The special type of frame allows switching between bit streams to occur not only at the locations of I-frames but also at the locations of the SP-frames. The coding efficiency of an SP-frame is much better than the coding efficiency of a typical I-frame, so that less bandwidth is needed to transmit bit streams having SP-frames in locations where I-frames would otherwise be used. The switching from one bit stream to another can be performed at locations in which an SP-frame is placed in the encoded bit stream.
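The effect of the additional switching locations on a server can be sketched as a search for the next frame at which a switch is allowed. The sketch assumes that the server knows the type of every frame in the target bit stream as a list of labels; the function name, the labels and the example sequence are hypothetical.

```python
def next_switch_point(frame_types, request_index, switchable=("I", "SP")):
    """Return the index of the first frame at or after request_index where switching is allowed."""
    for index in range(request_index, len(frame_types)):
        if frame_types[index] in switchable:
            return index
    return None  # no switching location remains in the sequence


frames = ["I", "P", "P", "SP", "P", "P", "I", "P"]

with_sp = next_switch_point(frames, request_index=1)
intra_only = next_switch_point(frames, request_index=1, switchable=("I",))
print(with_sp, intra_only)  # 3 6
```

In this example a switch requested at frame 1 can be served at frame 3 when SP-frames are available, whereas a scheme limited to I-frames has to wait until frame 6.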