With the increased popularity of DVDs, music delivery over the Internet, and digital cameras, digital media have become commonplace. Engineers use a variety of techniques to process digital audio, video, and images efficiently while still maintaining quality. To understand these techniques, it helps to understand how the audio, video, and image information is represented and processed in a computer.
I. Representation of Media Information in a Computer
A computer processes media information as a series of numbers representing that information. For example, a single number may represent the brightness, or the intensity of a color component such as red, green or blue, for each elementary small region of a picture, so that the digital representation of the picture consists of one or more arrays of such numbers. Each such number may be referred to as a sample. For a color image, it is conventional to use more than one sample to represent the color of each elemental region, and typically three samples are used. The set of these samples for an elemental region may be referred to as a pixel, where the word “pixel” is a contraction of “picture element.” For example, one pixel may consist of three samples that represent the intensities of red, green and blue light necessary to represent the elemental region. Such a pixel type is referred to as an RGB pixel. Several factors affect the quality of media information, including sample depth, resolution, and frame rate (for video).
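As a minimal sketch of this representation, the following Python fragment stores a tiny picture as an array of RGB pixels and then separates it into one array of samples per color component. The 2 × 2 picture and the helper name `split_into_planes` are hypothetical and chosen only for illustration.

```python
# A hypothetical 2x2 RGB picture: each pixel is a tuple of three
# 8-bit samples (red, green, blue).
picture = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (255, 255, 255)],
]


def split_into_planes(picture):
    """Return three sample arrays (red, green, blue), one per color component."""
    return tuple(
        [[pixel[component] for pixel in row] for row in picture]
        for component in range(3)
    )


red, green, blue = split_into_planes(picture)
print(red)    # samples of the red component for every pixel
print(green)  # samples of the green component for every pixel
print(blue)   # samples of the blue component for every pixel
```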
Sample depth is a property normally measured in bits that indicates the range of numbers that can be used to represent a sample. When more values are possible for the sample, quality can be higher because the number can capture more subtle variations in intensity and/or a greater range of values. Resolution generally refers to the number of samples over some duration of time (for audio) or space (for images or individual video pictures). Images with higher spatial resolution tend to look crisper than lower-resolution images and contain more discernible useful details. Frame rate is a common term for temporal resolution for video. Video with a higher frame rate tends to mimic the smooth motion of natural objects better than video with a lower frame rate, and can similarly be considered to contain more detail in the temporal dimension. For all of these factors, the tradeoff for high quality is the cost of storing and transmitting the information in terms of the bit rate necessary to represent the sample depth, resolution and frame rate, as Table 1 shows.
TABLE 1. Bit rates for different quality levels of raw video

| Bits per pixel (sample depth times samples per pixel) | Resolution (in pixels, width × height) | Frame rate (in frames per second) | Bit rate (in millions of bits per second) |
| --- | --- | --- | --- |
| 8 (value 0-255, monochrome) | 160 × 120 | 7.5 | 1.2 |
| 24 (value 0-255, RGB) | 320 × 240 | 15 | 27.6 |
| 24 (value 0-255, RGB) | 640 × 480 | 30 | 221.2 |
| 24 (value 0-255, RGB) | 1280 × 720 | 60 | 1327.1 |
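The bit rates in Table 1 follow from multiplying bits per pixel by the number of pixels per picture and by the frame rate. The sketch below recomputes them; the function name `raw_bit_rate_mbps` and its parameter names are illustrative assumptions, not part of any codec.

```python
def raw_bit_rate_mbps(bits_per_sample, samples_per_pixel, width, height,
                      frames_per_second):
    """Return the raw (uncompressed) video bit rate in millions of bits per second."""
    bits_per_pixel = bits_per_sample * samples_per_pixel
    bits_per_picture = bits_per_pixel * width * height
    return bits_per_picture * frames_per_second / 1_000_000


# (bits per sample, samples per pixel, width, height, frames per second)
table_rows = [
    (8, 1, 160, 120, 7.5),   # monochrome
    (8, 3, 320, 240, 15),    # RGB
    (8, 3, 640, 480, 30),    # RGB
    (8, 3, 1280, 720, 60),   # RGB
]

for row in table_rows:
    print(row, round(raw_bit_rate_mbps(*row), 1))
```

Rounded to one decimal place, the four results are 1.2, 27.6, 221.2 and 1327.1 million bits per second, matching the table.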
Despite the high bit rate necessary for storing and sending high quality video (such as HDTV), companies and consumers increasingly depend on computers to create, distribute, and play back high quality content. For this reason, engineers use compression (also called source coding or source encoding) to reduce the bit rate of digital media. Compression decreases the cost of storing and transmitting the information by converting the information into a lower bit rate form. Compression can be lossless, in which case the quality of the video does not suffer, but the achievable decrease in bit rate is limited by the complexity of the video. Alternatively, compression can be lossy, in which case the quality of the video suffers, but the decrease in bit rate can be more dramatic. Decompression (also called decoding) reconstructs a version of the original information from the compressed form. A “codec” is an encoder/decoder system.
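The difference between lossless and lossy compression can be illustrated with a short sketch. It assumes Python's standard `zlib` module as the lossless coder and, for the lossy path, a simple uniform quantizer that discards the low-order bits of each sample before coding. The test signal and the function names are hypothetical, and real video codecs use far more elaborate transforms and entropy coders.

```python
import zlib


def make_samples(count=4096):
    """Hypothetical 8-bit test signal: a slow ramp plus a small repeating variation."""
    return bytes((i // 16 + (i * 37) % 7) % 256 for i in range(count))


def lossless_roundtrip(samples):
    """Compress and decompress without discarding any information."""
    compressed = zlib.compress(samples)
    return compressed, zlib.decompress(compressed)


def lossy_roundtrip(samples, dropped_bits=3):
    """Quantize each sample (discarding low-order bits), then compress losslessly."""
    step = 1 << dropped_bits
    quantized = bytes(s // step for s in samples)
    compressed = zlib.compress(quantized)
    # Reconstruct each sample at the middle of its quantization interval.
    restored = bytes(min(255, q * step + step // 2)
                     for q in zlib.decompress(compressed))
    return compressed, restored


samples = make_samples()
lossless_bytes, lossless_restored = lossless_roundtrip(samples)
lossy_bytes, lossy_restored = lossy_roundtrip(samples)

print("raw size:", len(samples))
print("lossless size:", len(lossless_bytes), "exact:", lossless_restored == samples)
print("lossy size:", len(lossy_bytes), "exact:", lossy_restored == samples)
print("max lossy error:", max(abs(a - b) for a, b in zip(samples, lossy_restored)))
```

The lossless path reproduces the samples exactly, while the lossy path reconstructs only an approximation whose error is bounded by half the quantization step.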
In general, video compression techniques include “intra” compression and “inter” or predictive compression. For video pictures, intra compression techniques compress individual pictures. Inter compression techniques compress pictures with reference to preceding and/or following pictures.
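As a simplified illustration of the distinction, the sketch below treats each picture as a one-dimensional list of samples. The first picture stands alone, as an intra-coded picture would, and the second is represented only by its difference (residual) from the first. The function names are hypothetical, and the sketch assumes zero motion; practical inter compression also uses motion estimation and compensation, which are omitted here.

```python
def encode_inter(current, reference):
    """Predict the current picture from the reference and keep only the residual."""
    return [c - r for c, r in zip(current, reference)]


def decode_inter(residual, reference):
    """Rebuild the current picture by adding the residual to the reference."""
    return [r + d for r, d in zip(reference, residual)]


# Intra-coded picture: represented without reference to any other picture.
frame0 = [10, 10, 50, 50, 10, 10, 10, 10]
# Next picture: the bright region has moved one sample to the right.
frame1 = [10, 10, 10, 50, 50, 10, 10, 10]

residual = encode_inter(frame1, frame0)
print(residual)                                   # [0, 0, -40, 0, 40, 0, 0, 0]
print(decode_inter(residual, frame0) == frame1)   # True
```

Most residual values are zero because the two pictures are largely identical, which is what makes the predicted picture cheaper to represent than coding it independently.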
II. Multi-resolution Video and Spatial Scalability
Standard video encoders experience a dramatic degradation in performance when the target bit rate falls below a certain threshold. Quantization and other lossy processing stages introduce distortion. At low bit rates, high frequency information may be heavily distorted or completely lost. As a result, significant artifacts can arise and cause a substantial drop in the quality of the reconstructed video. Although available bit rates increase as transmission and processing technology improves, maintaining high visual quality at constrained bit rates remains a primary goal of video codec design. Existing codecs use several methods to improve visual quality at constrained bit rates.
Multi-resolution coding allows encoding of video at different spatial resolutions. Reduced resolution video can be encoded at a substantially lower bit rate, at the expense of lost information. For example, a prior video encoder can downsample (using a downsampling filter) full-resolution video and encode it at a reduced resolution in the vertical and/or horizontal directions. Reducing the resolution in each direction by half halves the width and the height of the encoded picture, so the picture contains one quarter as many samples. The encoder signals the reduced resolution coding to a decoder. The decoder receives information indicating reduced-resolution encoding and ascertains from the received information how the reduced-resolution video should be upsampled (using an upsampling filter) to increase the picture size before display. However, the information that was lost when the encoder downsampled and encoded the video pictures is still missing from the upsampled pictures.
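The loss of information through downsampling and upsampling can be seen in a small one-dimensional sketch. It assumes a two-tap averaging filter for downsampling and sample repetition for upsampling; both filters and the function names are illustrative choices, not the filters of any particular codec.

```python
def downsample_by_2(row):
    """Average each pair of neighboring samples and keep one value per pair."""
    return [(row[i] + row[i + 1]) // 2 for i in range(0, len(row) - 1, 2)]


def upsample_by_2(row):
    """Repeat each sample twice (nearest-neighbor interpolation)."""
    out = []
    for sample in row:
        out.extend([sample, sample])
    return out


# The first half has fine detail (alternating values); the second half is flat.
original = [0, 255, 0, 255, 100, 100, 100, 100]

reduced = downsample_by_2(original)
restored = upsample_by_2(reduced)

print(reduced)    # [127, 127, 100, 100]
print(restored)   # [127, 127, 127, 127, 100, 100, 100, 100]
```

The flat region survives the round trip unchanged, but the alternating high frequency detail is replaced by a uniform average and cannot be recovered by upsampling alone.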
Spatially scalable video uses a multi-layer approach, allowing an encoder to reduce spatial resolution (and thus bit rate) in a base layer while retaining higher resolution information from the source video in one or more enhancement layers. For example, a base layer intra picture can be coded at a reduced resolution, while an accompanying enhancement layer intra picture can be coded at a higher resolution. Similarly, base layer predicted pictures can be accompanied by enhancement layer predicted pictures. A decoder can choose (based on bit rate constraints and/or other criteria) to decode only base layer pictures at the lower resolution to obtain lower resolution reconstructed pictures, or to decode base layer and enhancement layer pictures to obtain higher resolution reconstructed pictures. When the base layer is encoded at a lower resolution than the displayed picture (that is, when it is downsampled), the encoded picture is smaller than the displayed picture. The decoder performs calculations to resize the reconstructed picture and uses upsampling filters to produce interpolated sample values at appropriate positions in the reconstructed picture. However, previous codecs that use spatially scalable video have suffered from inflexible upsampling filters and inaccurate or expensive (in terms of computation time or bit rate) picture resizing techniques.
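A minimal one-dimensional sketch of the two-layer idea follows. It assumes a base layer produced by pairwise averaging, an upsampling filter that repeats each sample, and an enhancement layer that stores the difference between the source and the upsampled base layer. All names are hypothetical, and quantization of either layer is omitted, so the two-layer reconstruction here is exact, which would not generally hold in a lossy codec.

```python
def downsample_by_2(row):
    """Average each pair of neighboring samples and keep one value per pair."""
    return [(row[i] + row[i + 1]) // 2 for i in range(0, len(row) - 1, 2)]


def upsample_by_2(row):
    """Repeat each sample twice (nearest-neighbor interpolation)."""
    out = []
    for sample in row:
        out.extend([sample, sample])
    return out


def encode_layers(row):
    """Return (base layer, enhancement layer) for one row of samples."""
    base = downsample_by_2(row)
    prediction = upsample_by_2(base)
    enhancement = [s - p for s, p in zip(row, prediction)]
    return base, enhancement


def decode(base, enhancement=None):
    """Decode the base layer alone, or the base layer plus the enhancement layer."""
    prediction = upsample_by_2(base)
    if enhancement is None:
        return prediction            # lower quality, resized for display
    return [p + e for p, e in zip(prediction, enhancement)]


source = [0, 255, 0, 255, 100, 100, 100, 100]
base, enhancement = encode_layers(source)

print(decode(base))                          # base layer only
print(decode(base, enhancement) == source)   # True
```

A decoder with a constrained bit rate can stop after the base layer, while a decoder that also receives the enhancement layer recovers the detail that downsampling removed.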
Given the critical importance of video compression and decompression to digital video, it is not surprising that video compression and decompression are richly developed fields. Whatever the benefits of previous video compression and decompression techniques, however, they do not have the advantages of the following techniques and tools.