Digital video consumes large amounts of storage and transmission capacity. A typical raw digital video sequence includes 15 or 30 frames per second. Each frame can include tens or hundreds of thousands of pixels (also called pels), where each pixel represents a tiny element of the picture. In raw form, a computer commonly represents a pixel as a set of three samples totaling 24 bits. For instance, a pixel may include an eight-bit luminance sample (also called a luma sample, as the terms “luminance” and “luma” are used interchangeably herein) that defines the grayscale component of the pixel and two eight-bit chrominance samples (also called chroma samples, as the terms “chrominance” and “chroma” are used interchangeably herein) that define the color component of the pixel. Thus, the number of bits per second, or bit rate, of a typical raw digital video sequence may be 5 million bits per second or more.
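The bit rate figure above follows from multiplying pixels per frame, bits per pixel, and frames per second. The following is a minimal sketch of that arithmetic. The function name raw_bit_rate and the 176×144 and 352×288 frame sizes are assumptions chosen for illustration, not values fixed by the description above.

```python
def raw_bit_rate(width, height, frames_per_second, bits_per_pixel=24):
    """Return the raw (uncompressed) bit rate in bits per second.

    Assumes every pixel carries the same number of bits (24 by default:
    one 8-bit luma sample plus two 8-bit chroma samples).
    """
    pixels_per_frame = width * height
    return pixels_per_frame * bits_per_pixel * frames_per_second


# Hypothetical example: a 176x144 frame (25,344 pixels) at 15 frames per second.
small_rate = raw_bit_rate(176, 144, 15)

# Hypothetical example: a 352x288 frame (101,376 pixels) at 30 frames per second.
large_rate = raw_bit_rate(352, 288, 30)

print(small_rate, large_rate)
```

Even the smaller of these assumed configurations exceeds 9 million bits per second, which is consistent with the statement that a raw sequence may require 5 million bits per second or more.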
Many computers and computer networks lack the resources to process raw digital video. For this reason, engineers use compression (also called coding or encoding) to reduce the bit rate of digital video. Compression decreases the cost of storing and transmitting video by converting the video into a lower bit rate form. Decompression (also called decoding) reconstructs a version of the original video from the compressed form. A “codec” is an encoder/decoder system. Compression can be lossless, in which case the quality of the video does not suffer, but decreases in bit rate are limited by the inherent amount of variability (sometimes called entropy) of the video data. Alternatively, compression can be lossy, in which case the quality of the video suffers, but achievable decreases in bit rate are more dramatic. Lossy compression is often used in conjunction with lossless compression: the lossy compression establishes an approximation of information, and the lossless compression is applied to represent the approximation.
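The pairing of lossy and lossless compression can be illustrated with a small sketch. It assumes uniform quantization as the lossy stage and the general-purpose zlib (DEFLATE) coder as the lossless stage. Both are stand-ins chosen for brevity, and the names encode_samples and decode_samples are hypothetical. Real video codecs use more elaborate transforms and entropy coders.

```python
import zlib


def encode_samples(samples, step):
    """Lossy stage followed by lossless stage, for 8-bit samples (0..255)."""
    # Lossy stage (assumed): round each sample to the nearest multiple of
    # `step` and keep only the index of that multiple.
    indices = bytes((s + step // 2) // step for s in samples)
    # Lossless stage (assumed): zlib represents the indices exactly.
    return zlib.compress(indices)


def decode_samples(payload, step):
    """Invert the lossless stage exactly, then approximate the samples."""
    indices = zlib.decompress(payload)
    # Clamp so that reconstructed values stay within the 8-bit range.
    return [min(255, i * step) for i in indices]


samples = list(range(256))
payload = encode_samples(samples, step=16)
approx = decode_samples(payload, step=16)
worst_error = max(abs(a - b) for a, b in zip(samples, approx))
print(worst_error)
```

In this sketch the only loss comes from the quantization step. The zlib stage reproduces the quantized indices bit for bit, so a larger step trades quality for fewer distinct values to represent.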
As a general rule in video compression, quality varies directly in relation to bit rate. For a given video sequence, if the sequence is encoded at higher quality, the bit rate for the sequence will be higher, and if the sequence is encoded at lower quality, the bit rate for the sequence will be lower. Various factors can affect the bit rate and quality of a raw video sequence, including temporal resolution (e.g., 7.5, 15, 30, or 60 video frames per second), spatial resolution (e.g., 176×144 (QCIF), 352×288 (CIF), or 704×576 (4CIF) pixels per video frame), and sample resolution (e.g., 8, 16, or 24 bits per pixel). Quality and bit rate may be changed by moving temporal, spatial, and/or sample resolution up or down.
Quality and bit rate also depend on the amount of distortion introduced by simplification or removal of information content during lossy compression. This affects, for example, the amount of blurriness, blockiness, graininess, etc. in the video when reconstructed. Stated differently, lossy compression decreases the quality of the sequence so as to allow the encoder to achieve lower bit rates.
As another general rule, quality and bit rate depend on the complexity of a video sequence in terms of detail and motion. For some fixed quality level, a complex sequence typically requires more bits to encode than a simple sequence. Conversely, when encoded at some fixed bit rate, the complex sequence typically has lower quality than the simple sequence.
In some scenarios, encoding video at a single bit rate/quality level is all that is required. For example, if video is being encoded for playback with a single type of device, or if video is being encoded for playback in a point-to-point videoconference over a telephone line, it may be desirable to simply encode the video at a single bit rate/quality level. In many other scenarios, however, encoding video at multiple bit rates and quality levels is desirable. For example, when streaming video over the Internet, a video server often has to provide video to devices with different capabilities and/or deliver video over various kinds of network environments with different speed and reliability characteristics.
One way to address diverse network and playback requirements is to encode the same video sequence at multiple bit rates and quality levels. Storing and transmitting multiple independent compressed video bit streams, however, can be inefficient. As an alternative, sub-band or wavelet video encoding provides a way to encode a video sequence at multiple resolutions in a single, scalable compressed video bit stream. With sub-band or wavelet encoding, a video sequence is decomposed into different temporal and spatial sub-bands.
As a simple example, a video sequence is split into a low resolution temporal sub-band (roughly corresponding to a lower frame rate version of the sequence) and a high resolution temporal sub-band (which can be combined with the low resolution temporal sub-band to reconstruct the original frame rate sequence). Information for an individual video frame may similarly be split into a low resolution spatial sub-band and multiple higher resolution spatial sub-bands. Temporal and spatial decomposition may be used together. Either type of decomposition may be repeated, for example, such that a low resolution sub-band is further decomposed. By selecting particular sub-bands for transmission or decoding at different resolutions, temporal and spatial scalability can be implemented.
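The temporal split described above can be sketched with a simple two-frame filter. The example assumes a Haar-like integer lifting scheme, which is one common choice and is not named in the description above. It also assumes each frame is a flat list of integer samples and that the sequence has an even number of frames. The function names temporal_split and temporal_merge are hypothetical.

```python
def temporal_split(frames):
    """Split frames into low and high temporal sub-bands (Haar-like lifting).

    Each pair of consecutive frames yields one low-band frame (roughly their
    average) and one high-band frame (their difference).
    """
    low, high = [], []
    for even, odd in zip(frames[0::2], frames[1::2]):
        diff = [b - a for a, b in zip(even, odd)]
        avg = [a + d // 2 for a, d in zip(even, diff)]
        high.append(diff)
        low.append(avg)
    return low, high


def temporal_merge(low, high):
    """Combine both sub-bands to reconstruct the original frame rate."""
    frames = []
    for avg, diff in zip(low, high):
        even = [s - d // 2 for s, d in zip(avg, diff)]
        odd = [a + d for a, d in zip(even, diff)]
        frames.extend([even, odd])
    return frames


# Four tiny two-sample "frames" used purely for illustration.
frames = [[10, 20], [12, 18], [30, 40], [31, 45]]
low, high = temporal_split(frames)
restored = temporal_merge(low, high)
print(low, high, restored)
```

A decoder that uses only the low sub-band plays back at half the frame rate, while a decoder that also receives the high sub-band recovers every frame. Calling temporal_split again on the low sub-band corresponds to the repeated decomposition mentioned above.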
In addition, information for an individual sub-band may be represented as a bit plane with multiple layers of bit resolution. Fidelity to the original encoded information can be selectively reduced (along with bit rate) by transmitting some, but not all, of the bits for the sub-band. Alternatively, fidelity can be selectively reduced (along with processing requirements) by decoding fewer than all of the bits for the sub-band.
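Bit fidelity scalability of this kind can be sketched as follows. The example assumes non-negative integer coefficients and orders the planes from most significant to least significant. Sign handling and entropy coding of the planes are omitted, and the names to_bit_planes and from_bit_planes are hypothetical.

```python
def to_bit_planes(coeffs, num_planes):
    """Return bit planes for non-negative coefficients, most significant first."""
    return [
        [(c >> shift) & 1 for c in coeffs]
        for shift in range(num_planes - 1, -1, -1)
    ]


def from_bit_planes(planes, num_planes):
    """Rebuild coefficients from the leading planes; missing planes count as zero."""
    if not planes:
        return []
    coeffs = [0] * len(planes[0])
    for index, plane in enumerate(planes):
        shift = num_planes - 1 - index
        coeffs = [c | (bit << shift) for c, bit in zip(coeffs, plane)]
    return coeffs


coeffs = [13, 6, 255, 0]
planes = to_bit_planes(coeffs, num_planes=8)

full = from_bit_planes(planes, num_planes=8)         # all eight planes
reduced = from_bit_planes(planes[:6], num_planes=8)  # top six planes only
print(full, reduced)
```

Dropping the two least significant planes leaves each coefficient within 3 of its original value in this sketch. Transmitting or decoding fewer planes lowers the bit rate or the processing load at the cost of a coarser approximation.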
Although scalable video coding and decoding techniques facilitate various spatial, temporal, and bit fidelity scalabilities of a compressed bit stream, there are several shortcomings to existing scalable video coding and decoding techniques.
Existing scalable video coding and decoding techniques typically do not provide performance that is competitive with non-scalable techniques at low bit rates. While scalable video coding and decoding techniques perform well at higher bit rates and qualities, they use too many bits at low bit rates compared to non-scalable video coding and decoding techniques.
Moreover, many existing hardware and software tools were designed according to specific non-scalable video coding and decoding techniques. Users of such tools may be reluctant to invest in new scalable video coding and decoding techniques and tools that are incompatible with existing tools. Likewise, content providers may be reluctant to produce encoded content that is incompatible with the prevailing installed base of video decoding tools.
Sometimes, a decoder plays back video at a spatial resolution lower than the original spatial resolution. This might occur, for example, if a decoder device has only a small screen or if higher spatial resolution information is dropped by a network. Decoding at the lower spatial resolution is problematic, however, when temporal decomposition occurs at the original spatial resolution during encoding. Existing scalable video decoding techniques fail to adequately address this decoding scenario.
Finally, existing scalable video coding and decoding techniques fail to account for the perceptibility of distortion in certain decisions during encoding and decoding. Specifically, existing scalable video coding techniques introduce an excessive amount of perceptible distortion in low resolution temporal sub-bands in some kinds of temporal decomposition.
Given the critical importance of compression and decompression to digital video, it is not surprising that scalable video coding and decoding are richly developed fields. Whatever the benefits of previous scalable video coding and decoding techniques, however, they do not have the advantages of the following techniques and tools.