Currently, most video content available over computer networks such as the Internet must be pre-loaded in a process which can take many minutes over typical modem connections, after which the video quality and duration can still be quite disappointing. In some contexts video streaming is possible, where the video is decompressed and rendered in real time as it is being received; however, this is limited to compressed bit-rates lower than the capacity of the relevant network connection. The most obvious way of addressing these problems would be to compress and store the video content at a variety of different bit-rates, so that individual clients could browse the material at the bit-rate, and attendant quality, most appropriate to their needs and patience. Approaches of this type, however, do not represent effective solutions to the video browsing problem. To see this, suppose that the video is compressed at bit-rates of R, 2R, 3R, 4R and 5R. Storage must then be found on the video server for all of these separate compressed bit-streams, which is clearly wasteful. More importantly, if the quality associated with a low bit-rate version of the video is found to be insufficient, a complete new version must be downloaded at a higher bit-rate; this new bit-stream necessarily takes longer to download, which generally rules out any possibility of video streaming.
To enable real solutions to the remote video browsing problem, scalable compression techniques are required. Scalable compression refers to the generation of a bit-stream which contains embedded subsets, each of which represents an efficient compression of the original video with successively higher quality. Returning to the simple example above, a scalable compressed video bit-stream might contain embedded subsets with bit-rates of R, 2R, 3R, 4R and 5R, with quality comparable to that of non-scalable bit-streams at the same bit-rates. Because these subsets are all embedded within one another, however, the storage required on the video server is identical to that required for the highest available bit-rate. More importantly, if the quality associated with a low bit-rate version of the video is found to be insufficient, only the incremental contribution required to achieve the next higher level of quality must be retrieved from the server. In a particular application, a version at rate R might be streamed directly to the client in real time; if the quality is insufficient, the next rate-R increment could be streamed to the client and added to the previous, cached bit-stream to recover a higher quality rendition in real time. This process could continue indefinitely without sacrificing the ability to display the incrementally improving video content in real time as it is being received from the server.
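The embedded-subset property can be illustrated with a deliberately simplified sketch (not the syntax of any particular standard): bit-plane coding of integer coefficients, where any prefix of the stream decodes to a valid, coarser reconstruction, and each additional plane refines the previous one.

```python
def bitplane_encode(coeffs, num_planes):
    # Emit bits from the most significant plane down; the result is an
    # embedded stream whose prefixes are themselves valid coarse encodings.
    stream = []
    for p in range(num_planes - 1, -1, -1):
        stream.extend((c >> p) & 1 for c in coeffs)
    return stream

def bitplane_decode(stream, n_coeffs, num_planes):
    # Any prefix may be decoded; planes not yet received are taken as zero,
    # so a longer prefix strictly refines a shorter one.
    coeffs = [0] * n_coeffs
    for i, bit in enumerate(stream):
        plane = num_planes - 1 - i // n_coeffs
        coeffs[i % n_coeffs] |= bit << plane
    return coeffs

x = [13, 7, 2, 9]                         # toy "transform coefficients"
bits = bitplane_encode(x, 4)
coarse = bitplane_decode(bits[:8], 4, 4)  # first two planes: [12, 4, 0, 8]
exact = bitplane_decode(bits, 4, 4)       # full stream:      [13, 7, 2, 9]
```

The server stores only the full stream `bits`; serving a lower quality simply means truncating it, which is the storage advantage described above.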
A major problem, however, is that highly efficient scalable video compression algorithms have not existed, either in practice or in the academic literature. Efficient scalable image compression algorithms have existed for some time, of which the most well known examples are the so-called embedded zero-tree algorithms initially proposed by J. Shapiro, “An embedded hierarchical image coder using zerotrees of wavelet coefficients,” Data Compression Conference (Snowbird, Utah), pp. 214-223, 1993, and later enhanced by A. Said and W. Pearlman, “A new, fast and efficient image codec based on set partitioning in hierarchical trees,” IEEE Trans. Circuits and Systems for Video Technology, vol. 6, pp. 243-250, June 1996. In fact, many of the algorithms advanced for scalable video compression are essentially scalable image compression schemes, applied independently to the successive frames of the video sequence; see S. McCanne, M. Vetterli and V. Jacobson, “Low-complexity video coding for receiver-driven layered multicast,” IEEE Journal on Selected Areas in Communications, vol. 15, pp. 983-1001, August 1997. In order to compete with the efficiency of non-scalable techniques, however, it is essential that inter-frame redundancy be exploited in a manner which is sensitive to scene and camera motion.
Motion Compensated Prediction (MCP) is by far the most popular approach to exploit inter-frame redundancy for video compression. FIG. 1 illustrates the salient features of MCP compression, upon which key standards such as MPEG-1, MPEG-2, MPEG-4 and H.263, all rely. Rather than compressing each frame of the video sequence separately, the spatial transform, quantization and entropy coding elements common to image compression algorithms are applied to the difference between the current frame and a prediction of the frame formed by applying a motion compensation algorithm to the pixels in the previous frame, as reconstructed by the decompressor.
FIG. 1 is a schematic block diagram of a prior art arrangement for compressing video using a motion compensated prediction (MCP) feedback loop. It will be appreciated that the blocks shown in the diagram are implemented by appropriate hardware and/or software.
Block 1 is a frame delay and motion compensator. This stores the decoded version of a previous frame, using motion information (usually explicitly transmitted with the compressed data stream) to form a prediction of the current frame. The subtractor 2 subtracts the motion compensated predictor, produced by block 1, from the current video frame. The spatial transform block 3 decomposes the prediction residual frame produced by block 2 into separate components for coding. The separate components usually correspond to different spatial frequencies and are less correlated with each other than are the original samples of the prediction residual frame.
The quantisation block 4 approximates each of the transform coefficients by one of a number of representative values, identified by labels (usually integers) which are readily coded. This step is a precursor to coding.
Block 5 is an entropy coder which produces a bit-stream which efficiently represents the quantisation labels produced by block 4, and which can be transmitted over a network.
The inverse quantisation block 6 uses the labels produced by block 4 to reconstruct representative values for each of the transform coefficients which were quantised by block 4.
The inverse transform block 7 is the inverse of the spatial transform operator.
Block 8 adds the decoded prediction residual recovered from block 7 to the predictor itself, thereby recovering a copy of the decoded video frame, identical to that which should be available at a decompressor.
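The feedback structure of blocks 1 through 8 can be sketched as follows. This is a deliberately minimal one-dimensional model: "motion compensation" is reduced to a fixed cyclic shift of the previously decoded frame, and the spatial transform, quantisation and entropy coding of blocks 3 to 5 are collapsed into a single uniform scalar quantiser. The essential point it illustrates is that the encoder must track the *decoded* frames, exactly as the decompressor reconstructs them, so that encoder and decoder predictions never drift apart.

```python
def quantize(x, step):
    # Block 4: map a residual sample to an integer label.
    return round(x / step)

def dequantize(label, step):
    # Block 6: reconstruct the representative value from the label.
    return label * step

def mcp_encode(frames, step, shift=1):
    # Toy MCP loop: predict each frame from the previously *decoded*
    # frame (blocks 1-2), quantize the residual (blocks 3-5, collapsed
    # here to a scalar quantiser), then update the decoded frame exactly
    # as the decompressor would (blocks 6-8).
    decoded_prev = [0.0] * len(frames[0])
    labels_out = []
    for frame in frames:
        pred = decoded_prev[shift:] + decoded_prev[:shift]      # block 1
        residual = [f - p for f, p in zip(frame, pred)]         # block 2
        labels = [quantize(r, step) for r in residual]          # blocks 3-5
        recon = [dequantize(l, step) for l in labels]           # blocks 6-7
        decoded_prev = [p + r for p, r in zip(pred, recon)]     # block 8
        labels_out.append(labels)
    return labels_out

def mcp_decode(labels_seq, n, step, shift=1):
    # The decompressor mirrors the encoder's feedback loop exactly.
    decoded_prev = [0.0] * n
    out = []
    for labels in labels_seq:
        pred = decoded_prev[shift:] + decoded_prev[:shift]
        recon = [dequantize(l, step) for l in labels]
        decoded_prev = [p + r for p, r in zip(pred, recon)]
        out.append(decoded_prev)
    return out

frames = [[1.0, 2.0, 3.0, 4.0], [2.0, 3.0, 4.0, 5.0]]
labels = mcp_encode(frames, step=0.5)
decoded = mcp_decode(labels, n=4, step=0.5)  # matches the encoder's loop
```

Because both loops apply identical operations to identical labels, the prediction reference is the same on both sides; it is precisely this requirement that breaks down in a scalable setting, as discussed next.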
MCP relies upon predictive feedback. It requires knowledge of the pixel values which would be reconstructed by the decompressor for previous frames. In a scalable setting, this knowledge is unavailable, because the pixel values reconstructed by the decompressor depend upon the particular subset of the embedded bit-stream which is actually received and decompressed. This problem has been primarily responsible for hampering the development of efficient, highly scalable video compressors. MPEG-2 allows some small degree of scalability, but the useful extent of this capability is limited to two or three different quality subsets, and even then with significant loss in efficiency. Although MPEG-4 claims to be highly scalable, this claim refers to so-called “object scalability,” which is limited by the number of “objects” in the scene, depends upon appropriate segmentation algorithms, and is unable to provide smooth increments in video quality as a function of the bandwidth available to a remote client. Otherwise, MPEG-4 is constrained by the fundamental inappropriateness of MCP, upon which it relies to exploit inter-frame redundancy.
Two notable scalable video compression algorithms which have been proposed are those of J. Ohm, “Three dimensional sub-band coding with motion compensation,” IEEE Trans. Image Processing, vol. 3, pp. 559-571, September 1994 and D. Taubman and A. Zakhor “Multi-rate 3-D sub-band coding of video,” IEEE Trans. Image Processing, vol. 3, pp. 572-588, September 1994. In both cases, the idea is to use three-dimensional separable sub-band transforms without any predictive feedback, after first temporally shifting the video frames, or parts thereof, so as to improve the alignment of spatial features prior to application of the 3-D transform. Although these schemes work well for simple global translation, their performance suffers substantially when scene motion is more complex.
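The feedback-free character of these 3-D sub-band schemes can be seen in a single temporal decomposition stage. The sketch below uses the simplest possible temporal filter, a Haar average/difference pair, and omits the frame-alignment step that Ohm and Taubman-Zakhor apply beforehand; it shows only that each frame pair maps to temporal low- and high-pass sub-bands, invertibly and without any dependence on previously decoded output.

```python
def temporal_haar(frame_a, frame_b):
    # One stage of a temporal sub-band split (Haar): the low-pass band
    # averages the two frames, the high-pass band holds their difference.
    # No predictive feedback is involved.
    low = [(a + b) / 2.0 for a, b in zip(frame_a, frame_b)]
    high = [(a - b) / 2.0 for a, b in zip(frame_a, frame_b)]
    return low, high

def temporal_haar_inverse(low, high):
    # Perfect reconstruction: a = low + high, b = low - high.
    frame_a = [l + h for l, h in zip(low, high)]
    frame_b = [l - h for l, h in zip(low, high)]
    return frame_a, frame_b
```

In the schemes cited above, the low-pass band is recursively decomposed again in time, and each resulting band is then decomposed spatially; when scene motion is a simple global translation, pre-shifting the frames keeps features aligned and the high-pass band small, but complex motion defeats this alignment, which is the performance limitation noted above.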