Given the ever-increasing proliferation of digitized video, there is a strong desire to have the same type of controls as one has with devices that play analog video. These controls include Play, Stop, Pause, Fast Forward, and Reverse Play. Our focus here is on the Fast Forward and Reverse Play modes.
Implementing Fast Forward and Reverse Play is trivial for uncompressed digital video, as well as compressed video employing Intra-coding methods (I-Frames) only. In both of these cases, Fast Forward can simply be implemented by skipping frames: e.g. a 3× Fast Forward speed can be achieved simply by decoding and displaying every third frame, but at the full frame rate of the video stream. Similarly, Fast Reverse play at 2× speed can be implemented by decoding and displaying every other frame, but in the reverse direction.
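The frame-skipping rule for uncompressed or intra-only video can be sketched as follows. This is a minimal illustration, not an implementation from any particular player; the function name `trick_play_indices` and its parameters are hypothetical, and the sketch assumes every frame is independently decodable.

```python
def trick_play_indices(num_frames, speed, reverse=False):
    """Return the indices of the frames to decode and display, in display order.

    Assumption: every frame is independently decodable (raw or intra-only
    video), so any frame may be skipped freely. The selected frames are then
    shown at the full frame rate of the stream.
    """
    if speed < 1:
        raise ValueError("speed must be a positive integer")
    if reverse:
        # Walk backwards from the last frame, keeping every `speed`-th frame.
        return list(range(num_frames - 1, -1, -speed))
    # Walk forwards from the first frame, keeping every `speed`-th frame.
    return list(range(0, num_frames, speed))


# 3x Fast Forward over a 10-frame clip keeps every third frame.
forward_3x = trick_play_indices(10, 3)
# 2x Fast Reverse keeps every other frame, last frame first.
reverse_2x = trick_play_indices(10, 2, reverse=True)
```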
Raw digital video, however, requires enormous amounts of storage (and bandwidth, in the case of transmission), particularly at high resolution. Even Intra-coded video streams consume far more storage and bandwidth than is available in most applications. As a result, digital video is almost always stored and transmitted using a combination of intra-frame and inter-frame predictive encoding techniques. Inter-frame prediction methods, however, greatly complicate the implementation of Fast Forward and Reverse Play mechanisms.
The major video coding standards developed over the past 20 years (H.261, H.263, MPEG-1, MPEG-2 and MPEG-4) are all based on the same basic principles. Each frame of video can be encoded as one of three types: Intra-coded (I) frames, Predicted (P) frames and Bi-directionally predicted (B) frames. The I-frames achieve compression by reducing spatial redundancy. The P-frames are predicted from a preceding I- or P-frame, as shown in FIG. 1. Using motion estimation techniques, each 16×16 MacroBlock (MB) in a P-frame is matched to the closest MB of the frame from which it is to be predicted. The difference between the two MBs is then computed and encoded, along with the motion vectors. As such, both temporal and spatial redundancy are reduced. B-frames are coded similarly to P-frames, except that they are predicted from both past and future I- or P-frames (see FIG. 1).
I-frames are much larger than P or B frames, but they have the advantage of being decodable independent of other frames. P and B frames achieve higher compression ratios, but they depend on the availability of other frames in order to be decoded.
This interdependence between frames has a serious implication for Fast Forward mechanisms based on frame skipping: if a frame is skipped, then the frames predicted from it cannot be decoded, nor can the frames predicted from those, and so on until the next I-frame is reached. Many implementations of Fast Forward work on the principle of transmitting only the I-frames. I-frames, however, are often few and far between (since they consume too many bits), so this technique yields only a very crude and coarse Fast Forward effect.
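The effect of this dependency chain can be illustrated with a small sketch. It assumes a simplified stream containing only I- and P-frames, in which each P-frame references the frame immediately before it; the function name `decodable_after_skipping` and the string encoding of frame types are hypothetical conventions chosen for the example.

```python
def decodable_after_skipping(frame_types, kept):
    """Return the indices of the kept frames that can actually be decoded.

    frame_types: string such as "IPPPIPPP" (assumption: I- and P-frames only,
                 each P-frame predicted from the immediately preceding frame).
    kept:        set of frame indices delivered to the decoder.
    """
    decodable = []
    previous_decoded = False  # was the frame just before this one decoded?
    for index, frame_type in enumerate(frame_types):
        if index in kept and (frame_type == "I" or previous_decoded):
            decodable.append(index)
            previous_decoded = True
        else:
            # A skipped or undecodable frame breaks the chain until the next I-frame.
            previous_decoded = False
    return decodable


# Naive 2x skipping over two 4-frame GOPs: only the I-frames survive.
survivors = decodable_after_skipping("IPPPIPPP", {0, 2, 4, 6})
```

Under these assumptions, skipping every other frame leaves only frames 0 and 4 decodable, which is the same result as transmitting the I-frames alone.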
A brute force method to produce a Fast Forward effect is to decode and display a video clip faster than its natural frame rate. If a clip is encoded at 30 fps but is transmitted, decoded and displayed at 60 fps, the user will see the clip at twice the natural speed, resulting in a 2× Fast Forward effect. The disadvantages of this technique are twofold: to run the clip at rates that are significantly higher than the standard 30 fps a powerful processor is required, particularly for high resolution images. Moreover, in the case of streaming video, the bandwidth consumption would increase in proportion to the Fast Forward speed: running a 500 kbps clip at four times its natural frame rate would require 2000 kbps of bandwidth. In short, this scheme is not scalable.
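The arithmetic behind this scaling problem can be written out as a short sketch. The function name `brute_force_fast_forward` is hypothetical, and the sketch assumes the bitrate grows linearly with the playback rate, as in the example above.

```python
def brute_force_fast_forward(native_fps, bitrate_kbps, speed):
    """Return (playback_fps, required_kbps) for brute-force Fast Forward.

    Assumption: every frame is still transmitted and decoded, so both the
    decode rate and the bandwidth grow in proportion to the speed factor.
    """
    return native_fps * speed, bitrate_kbps * speed


# A 30 fps, 500 kbps clip played at 4x its natural frame rate.
playback_fps, required_kbps = brute_force_fast_forward(30, 500, 4)
```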
In T.-G. Kwon, Y. Choi and S. Lee, “Disk Placement for Arbitrary-Rate Playback in an Interactive Video Server”, Multimedia Systems Journal, Vol. 5, No. 4, pp. 271-281, 1997 and M.-S. Chen, D. Kandlur, P. Yu, “Support for Fully Interactive Playout in a Disk-Array-Based Video Server”, Proceedings of ACM Multimedia '94, pp. 391-398, San Francisco, Calif., October 1994, the authors divide the video clip into independently decodable segments (typically a Group of Pictures or GOP). Fast Forward is then implemented by sampling the segments: 3× Fast Forward, for instance, is achieved by sending every 3rd segment. While, on average, only one third of the frames are transmitted and displayed, this scheme results in a non-uniform ‘poor man's’ Fast Forward effect: if a segment is one second long, then the viewer will see a one second clip at normal speed, followed by a jump of two seconds in the video clip.
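The non-uniformity of segment sampling is easy to see in a sketch. The code below is an illustration only and is not taken from the cited papers; the function name `segment_sampled_frames` is hypothetical, and it assumes fixed-length segments that are each independently decodable.

```python
def segment_sampled_frames(num_frames, segment_len, speed):
    """Return the frame indices shown when only every `speed`-th segment is sent.

    Assumption: the clip is split into fixed-length, independently decodable
    segments (for example, GOPs) of `segment_len` frames each.
    """
    shown = []
    for start in range(0, num_frames, segment_len * speed):
        # Every frame of a selected segment is played at normal speed...
        shown.extend(range(start, min(start + segment_len, num_frames)))
        # ...and the next (speed - 1) segments are skipped entirely.
    return shown


# 3x Fast Forward over 12 frames with 2-frame segments.
shown = segment_sampled_frames(12, 2, 3)
```

With these inputs the viewer sees frames 0, 1, 6 and 7: a run of consecutive frames followed by a jump, rather than the evenly spaced frames 0, 3, 6 and 9 that uniform frame skipping would give.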
Perhaps the main advantage of B-frames is that they can be skipped without affecting the decoding of other frames since B-frames are not used as reference frames (except in the case of H.264, as detailed below). Thus, B-frames can be used to achieve temporal scalability, which in turn allows for Fast Forward through frame skipping. For instance, if every other frame in a digital video clip is encoded as a B-frame, then 2× Fast Forward can be achieved by dropping the B-frames, and decoding and playing back the remaining frames at the natural frame rate of the video clip.
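Dropping B-frames can be sketched as a simple filter over the frame sequence. The function name `drop_b_frames` and the string encoding of frame types are hypothetical, and the sketch assumes that B-frames are never used as reference frames (so it does not cover the H.264 case mentioned above).

```python
def drop_b_frames(frame_types):
    """Return the indices of the frames that remain after dropping all B-frames.

    Assumption: B-frames are not reference frames, so removing them does not
    affect the decoding of the remaining I- and P-frames.
    """
    return [index for index, frame_type in enumerate(frame_types) if frame_type != "B"]


# Every other frame is a B-frame, so dropping them yields 2x Fast Forward.
kept = drop_b_frames("IBPBPBPB")
```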
In order to achieve a broad range of Fast Forward speeds, however, more and more B-frames have to be used. To achieve both 2× and 4× Fast Forward, 3 out of 4 frames have to be B-frames. To achieve 2×, 4× and 8× speeds, 7 out of 8 frames have to be encoded as B-frames, and so on. As a larger and larger percentage of frames is encoded as B-frames, however, the reference frames become few and far between. This large temporal distance between the reference frames and the B-frames that use them as references results in a large drop in coding efficiency.
The use of B-frames to achieve a Fast Forward effect has the following deficiencies. Encoder complexity is increased because the motion estimation process is doubled. Compression efficiency is reduced due to the larger temporal distances between encoded frames and their reference frames. Many encoders employ profiles (such as MPEG-4 SP and H.264 Baseline) that do not even allow the use of B-frames. Encoder latency is increased as more and more B-frames are used in a GOP. This may not matter for off-line encoding, but for cases where the video stream is both viewed live and recorded for archival purposes (such as a videoconferencing session), latency becomes an issue.
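The relationship between the highest supported speed and the share of B-frames can be sketched as follows. The function names `b_frame_fraction` and `reference_spacing_pattern` are hypothetical, and the sketch assumes speeds that are powers of two and a repeating pattern of one reference frame followed by B-frames.

```python
from fractions import Fraction


def b_frame_fraction(max_speed):
    """Fraction of frames that must be B-frames to support speeds up to max_speed.

    Assumption: the top speed is reached by keeping only one reference frame
    out of every `max_speed` frames and dropping the B-frames in between.
    """
    return Fraction(max_speed - 1, max_speed)


def reference_spacing_pattern(max_speed):
    """Repeating frame pattern: one reference frame (R), then max_speed - 1 B-frames."""
    return "R" + "B" * (max_speed - 1)


# Supporting 2x and 4x requires 3/4 B-frames; adding 8x requires 7/8.
fraction_4x = b_frame_fraction(4)
fraction_8x = b_frame_fraction(8)
pattern_8x = reference_spacing_pattern(8)
```

The pattern for 8× places consecutive reference frames eight frames apart, which is the growing temporal distance that the passage above identifies as the cause of reduced coding efficiency.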
Finally, as described in A. Srivastava, A. Kumar and A. Singru, “Design and Analysis of a video-on-demand Server”, Multimedia Systems Journal, Vol. 5, No. 4, pp. 238-254, 1997, Fast Forward operation can be achieved by storing multiple versions of the same video clip, each encoded at a different frame rate. When Fast Forward operation is desired, the video server (or the client application, in the case of local files) can switch to a stream encoded at a lower frame rate but transmit it (or, in the case of a player, decode and display it) at the full frame rate. In H.264/MPEG-4 AVC, the latest video encoding standard, new frame types called SI and SP frames have been introduced. SI and SP frames are ‘switching’ frames which enable seamless switching between two different encoded bitstreams, including multiple versions of the same video. This is described in M. Karczewicz and R. Kurceren, “The SP- and SI-Frames Design for H.264/AVC”, IEEE Transactions on Circuits and Systems for Video Technology, Vol. 13, No. 7, pp. 637-644, July 2003.
Having multiple versions of a stream, however, introduces its own problems, such as increased storage requirements, the need to encode the video multiple times (not practical for live encoding), and increased complexity at the VOD server due to the need to switch between multiple streams, including on-the-fly generation of the SI and SP frames.
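The selection step in a multiple-version scheme can be sketched as below. This is an illustration only and is not taken from the cited references; the function name `pick_version` and the representation of stored versions as a list of encoded frame rates are hypothetical.

```python
def pick_version(stored_fps, native_fps, speed):
    """Return the encoded frame rate of the stored version to use for `speed`x playback.

    Assumption: a version encoded at native_fps / speed, when transmitted and
    displayed at native_fps, appears `speed` times faster than normal.
    """
    target_fps = native_fps / speed
    for fps in stored_fps:
        if abs(fps - target_fps) < 1e-9:
            return fps
    raise LookupError(f"no stored version encoded at {target_fps} fps")


# Versions stored at 30, 15 and 7.5 fps support 1x, 2x and 4x playback.
version_for_4x = pick_version([30, 15, 7.5], native_fps=30, speed=4)
```

Each additional speed requires another stored version, so the list of versions grows with the number of speeds offered.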