1. Brief Introduction
As more communication requires video, such as real-time streaming of video, video conferencing, digital television, interactive television and Internet-based communications such as hypertext transport of World Wide Web (WWW) content, more efficient ways of utilizing existing bandwidth are needed. This is because the typical bandwidth allocated to a particular transmission mode (e.g., broadcast, cable, telephone lines, etc.) is much less than the bandwidth typically required for a video stream. Thus, if such modes are to carry video, compression is needed. Compression is also needed where the video is stored, so that storage capacity is efficiently used. The advent of multi-media capabilities on most computer systems has taxed traditional storage devices, such as hard drives, to their limits.
Compression allows digitized video sequences to be represented efficiently, allowing more video to be transmitted in a given amount of time over a given channel, or more video to be stored in a given storage medium. Compression does this by reducing the bitstream, or video information flow, of the video sequences at a transmitter (which can be placing the bitstream into a channel or storing into a storage medium) while retaining enough information that a decoder or receiver at the other end of the channel or reading the storage medium can reconstruct the video in a manner adequate for the specific application, such as television, videoconferencing, etc.
Video is typically represented by a sequence of images, called “frames” or “video frames” that, when played in sequence, present the video. As used herein, a video stream might refer to a video and audio stream, where the audio is included with the video. However, for simplicity, just the video compression is often described.
As the terms are used herein, an image is data derived from a multi-dimensional signal. The signal might be originated or generated either naturally or artificially. This multi-dimensional signal (where the dimension could be one, two, three, or more) may be represented as an array of pixel color values such that pixels placed in an array and colored according to each pixel's color value would represent the image. Each pixel has a location and can be thought of as being a point at that location or as a shape that fills the area around the pixel such that any point within the image is considered to be “in” a pixel's area or considered to be part of the pixel. The image itself might be a multidimensional pixel array on a display, on a printed page, an array stored in memory, or a data signal being transmitted and representing the image. The multidimensional pixel array can be a two-dimensional array for a two-dimensional image, a three-dimensional array for a three-dimensional image, or some other number of dimensions.
The image can be an image of a physical space or plane or an image of a simulated and/or computer-generated space or plane. In the computer graphic arts, a common image is a two-dimensional view of a computer-generated three-dimensional space (such as a geometric model of objects and light sources in a three-space). An image can be a single image or one of a plurality of images that, when arranged in a suitable time order, form a moving image, herein referred to as a video sequence.
Pixel color values can be selected from any number of pixel color spaces. One color space in common use is known as the YUV color space, wherein a pixel color value is described by the triple (Y, U, V), where the Y component refers to a grayscale intensity or luminance, and U and V refer to two chrominance components. The YUV color space is commonly seen in television applications. Another common color space is referred to as the RGB color space, wherein R, G and B refer to the Red, Green and Blue color components, respectively. The RGB color space is commonly seen in computer graphics representations, along with CMYK (cyan, magenta, yellow, and black), which is often used with computer printers.
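By way of illustration only, the relationship between the RGB and YUV color spaces might be sketched as follows. The weights shown (0.299, 0.587, 0.114 for luminance, and the scale factors 0.492 and 0.877 for the chrominance components) are the commonly cited television constants and are an assumption here, as no particular constants are specified above. The function names are hypothetical.

```python
def rgb_to_yuv(r, g, b):
    """Convert RGB components in [0, 1] to a (Y, U, V) triple.

    Assumed constants: commonly cited television luminance weights
    and chrominance scale factors; other variants exist.
    """
    y = 0.299 * r + 0.587 * g + 0.114 * b   # luminance (grayscale intensity)
    u = 0.492 * (b - y)                     # first chrominance component
    v = 0.877 * (r - y)                     # second chrominance component
    return y, u, v


def yuv_to_rgb(y, u, v):
    """Invert rgb_to_yuv under the same assumed constants."""
    r = y + v / 0.877
    b = y + u / 0.492
    g = (y - 0.299 * r - 0.114 * b) / 0.587
    return r, g, b
```

Under these assumptions, a neutral gray or white pixel has chrominance components of approximately zero, so all of its information is carried by the Y component.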
Video compression is possible because an uncompressed video sequence contains redundancies and some of the video signal can be discarded without greatly affecting the resulting video. For example, each frame of a video sequence representing a stationary scene would be nearly identical to other frames in the video sequence. Most video compression routines attempt to remove the superfluous information so that the related image frames can be represented in terms of previous image frame(s), thus eliminating the need to transmit an entire image for each video frame. Alternatively, routines such as motion JPEG code each video frame separately and ignore temporal redundancy.
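The idea of representing a frame in terms of a previous frame might be sketched, in its simplest form, as a list of changed pixels. This is a deliberately naive illustration and not a method used by any particular codec; the frames are assumed to be equal-sized lists of rows of integer pixel values, and the function names are hypothetical.

```python
def frame_delta(prev, curr):
    """List (row, col, new_value) for every pixel that differs from prev."""
    return [
        (r, c, curr[r][c])
        for r in range(len(curr))
        for c in range(len(curr[r]))
        if curr[r][c] != prev[r][c]
    ]


def apply_delta(prev, delta):
    """Rebuild the current frame from the previous frame plus the delta."""
    out = [row[:] for row in prev]  # copy so the previous frame is untouched
    for r, c, value in delta:
        out[r][c] = value
    return out
```

For a stationary scene the delta is empty or nearly so, and only that short list, rather than the entire frame, would need to be transmitted.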
2. Known Compression Techniques
There have been numerous attempts at adequately compressing video imagery. These methods generally fall into the following two categories: 1) spatial redundancy reduction, and 2) temporal redundancy reduction.
2.1. Spatial Redundancy Reduction
Spatial redundancy reduction takes advantage of the correlation among neighboring pixels in order to derive a more efficient representation of the important information in an image frame. These methods are more appropriately termed still-image compression routines, as they generally address each frame in isolation, i.e., independent of other frames in the sequence. Because of this, they do not attempt to reduce temporal, or frame-to-frame, redundancy. Common still-image compression schemes include JPEG, wavelets, and fractals.
2.1.1. JPEG/DCT Based Image Compression
One of the first commonly used methods of still-image compression was the discrete cosine transform (“DCT”) compression system, which is at the heart of JPEG. The DCT operates by representing each digital image frame as a series of cosine waves or frequencies and quantizing coefficients of the cosine series. The higher frequency coefficients are quantized more harshly than those of the lower frequencies. The result of the quantization is a large number of zero coefficients, which can be encoded very efficiently. However, JPEG and similar compression schemes do not address the crucial issue of temporal redundancy.
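The transform-and-quantize step described above might be sketched in one dimension as follows. This is a minimal illustration, not the JPEG algorithm itself: JPEG applies a two-dimensional transform to 8 by 8 blocks and uses standardized quantization tables, whereas the step sizes below (a hypothetical `base_step` multiplied by the frequency index plus one) are an assumption chosen only to show coarser quantization at higher frequencies.

```python
import math


def dct_1d(x):
    """Orthonormal DCT-II of a sequence of samples."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out


def idct_1d(c):
    """Inverse of dct_1d (orthonormal DCT-III)."""
    n = len(c)
    out = []
    for i in range(n):
        s = c[0] * math.sqrt(1.0 / n)
        s += sum(
            c[k] * math.sqrt(2.0 / n) * math.cos(math.pi * (i + 0.5) * k / n)
            for k in range(1, n)
        )
        out.append(s)
    return out


def quantize(coeffs, base_step=4.0):
    """Quantize coefficient k with step base_step * (k + 1) (assumed rule)."""
    return [round(c / (base_step * (k + 1))) for k, c in enumerate(coeffs)]


def dequantize(levels, base_step=4.0):
    """Approximate the coefficients from their quantized levels."""
    return [q * base_step * (k + 1) for k, q in enumerate(levels)]
```

For smooth input, most of the energy falls in the low-frequency coefficients, so the quantized higher-frequency coefficients tend to be zero. For example, a constant run of eight samples yields a single nonzero level followed by seven zeros.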
2.1.2. Wavelets
As a slight improvement to the DCT compression scheme, the wavelet transformation compression scheme was devised. This system is similar to the DCT, differing mainly in that an image frame is represented as a series of wavelets, or windowed oscillations, instead of as a series of cosine waves.
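As one concrete illustration, a single level of the Haar wavelet transform, generally regarded as the simplest wavelet, might be sketched as follows. The choice of the Haar wavelet is an assumption made for brevity, since no particular wavelet is specified above. The input is assumed to have even length, and the function names are hypothetical.

```python
def haar_step(x):
    """One level of the Haar transform: pairwise averages and differences."""
    averages = [(x[i] + x[i + 1]) / 2 for i in range(0, len(x), 2)]
    details = [(x[i] - x[i + 1]) / 2 for i in range(0, len(x), 2)]
    return averages, details


def haar_inverse_step(averages, details):
    """Reconstruct the original samples from averages and details."""
    out = []
    for a, d in zip(averages, details):
        out.extend([a + d, a - d])
    return out
```

The averages form a half-resolution version of the signal, to which the same step can be applied again, while the details are small wherever neighboring samples are similar and can therefore be quantized coarsely.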
2.1.3. Fractals
Another technique is known as fractal compression. The goal of fractal compression is to take an image and determine a single function, or a set of functions, which fully describe(s) the image frame. A fractal is an object that is self-similar at different scales or resolutions, i.e., no matter what resolution one looks at, the object remains the same. In theory, where fractals allow simple equations to describe complex images, very high compression ratios should be achievable.
Unfortunately, fractal compression is not a viable method of general compression. The high compression ratios are only achievable for specially constructed images, and only with considerable help from a person guiding the compression process. In addition, fractal compression is very computationally intensive.
2.2. Temporal and Spatial Redundancy Reduction
Adequate motion video compression requires reduction of both temporal and spatial redundancies. Temporal redundancy can be reduced by replacing all or part of the bits representing the image of a frame with one or more references to other frames or portions of a frame. This allows a small number of bits to represent a larger number of bits. Block matching is the basis for most currently used effective means of temporal redundancy removal.
In block matching, an image frame is subdivided into uniform size blocks (more generally, into polygons), and each block is tracked from one frame to another and represented by a motion vector, instead of having the block re-coded and placed into the bitstream for a second time. Examples of compression routines that use block matching include MPEG and variants thereof.
MPEG encodes the first frame in a sequence of related frames in its entirety as a so-called intra-frame, or I-frame. An I-frame is a type of key frame, meaning an image frame that is completely self-contained and not described in relation to any other image frame. To create an I-frame, MPEG performs a still-image compression on the frame, including dividing the frame into 16 pixel by 16 pixel square blocks. Other (so-called “predicted”) frames are encoded with respect to the I-frame by predicting corresponding blocks of the other frame in relation to that of the I-frame. That is, MPEG attempts to find each block of an I-frame within the other frame. For each block that still exists in the other frame, MPEG transmits the motion vector, or movement, of the block along with block identifying information. However, as a block moves from frame to frame, it may change slightly. The difference relative to the I-frame is known as residue. Additionally, as blocks move, previously hidden areas may become visible for the first time. These previously hidden areas are also known as residue. That is, the collective remaining information after the block motion is sent is known as the residue, which is coded using JPEG and included in the bitstream to complete the image frame.
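The block tracking and residue computation described above might be sketched as an exhaustive search that minimizes the sum of absolute differences (SAD). Both the SAD cost and the exhaustive search window are assumptions made for illustration; the description above does not specify a matching criterion, and practical encoders use a variety of search strategies. Frames are assumed to be equal-sized lists of rows of integer pixel values, and all function names are hypothetical.

```python
def get_block(frame, top, left, size):
    """Extract a size-by-size block whose upper-left corner is (top, left)."""
    return [row[left:left + size] for row in frame[top:top + size]]


def sad(a, b):
    """Sum of absolute differences between two equal-sized blocks."""
    return sum(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))


def track_block(ref, cur, top, left, size, search):
    """Find where the block at (top, left) in ref best matches in cur.

    Returns (dy, dx, cost): the motion vector and its SAD cost, searching
    displacements of up to +/- search pixels in each direction.
    """
    source = get_block(ref, top, left, size)
    height, width = len(cur), len(cur[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            t, l = top + dy, left + dx
            if t < 0 or l < 0 or t + size > height or l + size > width:
                continue  # candidate block would fall outside the frame
            cost = sad(source, get_block(cur, t, l, size))
            if best is None or cost < best[0]:
                best = (cost, dy, dx)
    return best[1], best[2], best[0]


def block_residue(ref, cur, top, left, size, dy, dx):
    """Difference between the moved block in cur and its prediction from ref."""
    predicted = get_block(ref, top, left, size)
    actual = get_block(cur, top + dy, left + dx, size)
    return [[a - p for a, p in zip(ra, rp)] for ra, rp in zip(actual, predicted)]
```

When a block moves without changing, the best match has zero cost and the residue for that block is all zeros, so only the motion vector and block identifying information need be sent. Any change in the block, or any newly revealed area, appears as nonzero residue that must be coded separately.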
Subsequent frames are predicted with respect to either the blocks of the I-frame or a preceding predicted frame. In addition, the prediction can be bi-directional, i.e., with reference to both preceding and subsequent I-frames or predicted frames. The prediction process continues until a new key frame is inserted, at which point a new I-frame is encoded and the process repeats itself.
Although state of the art, block matching is highly inefficient and fails to take advantage of the known general physical characteristics or other information inherent in the images. The block method is both arbitrary and crude, as the blocks do not have any relationship with real objects in the image. A given block may comprise a part of an object, a whole object, or even multiple dissimilar objects with unrelated motion. In addition, neighboring objects will often have similar motion. However, since blocks do not correspond to real objects, block-based systems cannot use this information to further reduce the bitstream.
Yet another major limitation of block-based matching arises because the residue it creates is generally noisy and patchy. Thus, block-based residues do not lend themselves to good compression via standard image compression schemes such as DCT, wavelets, or fractals.
2.3. Alternatives
It is well recognized that the state of the art needs improvement, specifically in that the block-based method is extremely inefficient and does not produce an optimally compressed bitstream for motion video information. To that end, the very latest compression schemes, such as MPEG-4, allow for the inclusion of limited structural information, if available, of selected items within the frames rather than merely using arbitrary-sized blocks. While some compression gains are achieved, the associated overhead information is substantially increased because, in addition to the motion and residue information, these schemes require that structural or shape information for each object in a frame also be sent to the receiver.
Additionally, as mentioned above, the current compression methods treat the residue as just another image frame to be compressed by JPEG using a fixed compression.