For many reasons, video data (i.e., data representative of a sequence of video image frames) often requires compression. The compression may be needed to comply with bandwidth constraints, storage constraints, or other constraints.
As an example of a bandwidth constraint, a viewer might want to receive a video stream over an Internet connection having limited bandwidth at some point between the video source and the viewing device. For instance, where the connection to the viewing device has less bandwidth than the video requires (such as a 380 kilobit per second DSL line trying to download a 4 megabit per second DVD-quality movie), the video data would need to be compressed if it is to arrive at the receiver in a timely manner. The same is true where the allotted bandwidth must be shared among many devices (such as a broadband channel used for many simultaneous video-on-demand sessions) or among many applications (such as e-mail, file downloads and web access).
Applications for compressed video over limited bandwidth include, for example, video streaming over the Internet, video conferencing, and digital interactive television. Satellite broadcasting and digital terrestrial television broadcasting are also examples of how bandwidth limitations can be dealt with using video compression. For instance, halving the bandwidth used by each channel allows twice as many channels to be broadcast on a satellite television network. Alternatively, carrying the same number of channels in half the bandwidth may reduce the cost of these systems considerably.
Storage for video data may also be constrained. For example, a video sequence may need to be stored on a hard disk where the storage space required for uncompressed video is greater than the size of the available storage on the hard disk. Examples of devices requiring video storage include video-on-demand servers, satellite video sources, personal video recorders (“PVRs”, often referred to as “digital VCRs”), and personal computers. Other digital storage media can be used for video storage, such as DVDs, CDs and the like.
Compression allows video to be represented with fewer bits or symbols than the corresponding uncompressed video. It should be understood that a video sequence can include audio as well as video information, but herein compression is often discussed with reference to manipulation of just the video portion of such information. When video (or any other data) is compressed, it can be transmitted using less bandwidth and/or less channel time and it can be stored using less storage capacity. Consequently, much effort has gone into compression methods that achieve high compression ratios with good results.
A compression ratio is the ratio of the size (in bits, symbols, etc.) of uncompressed data to the corresponding compressed data. One constraint on getting higher and higher compression ratios is that the uncompressed data must be recoverable from the compressed data in a decompression process. When the uncompressed data need only be recovered approximately, which is often the case with video, higher compression ratios are possible. Compression where the data can only be recovered approximately is referred to as “lossy” compression, as opposed to perfectly recoverable, or “lossless,” compression. Unless expressly mentioned, compression as used herein can refer to either lossy or lossless compression and is usually dictated by the application.
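The definition of a compression ratio can be illustrated with a minimal sketch. The example below assumes Python's standard zlib module as a stand-in lossless compressor; the helper name compression_ratio and the sample data are hypothetical and are not drawn from any particular video system.

```python
import zlib


def compression_ratio(uncompressed: bytes, compressed: bytes) -> float:
    """Size of the uncompressed data divided by the size of the compressed data."""
    return len(uncompressed) / len(compressed)


# Hypothetical sample: 10,000 identical byte values (highly redundant data).
raw = bytes([128]) * 10_000
packed = zlib.compress(raw)

# Lossless compression: the original data is recovered exactly.
recovered = zlib.decompress(packed)
ratio = compression_ratio(raw, packed)
```

Because zlib is lossless, recovered is identical to raw. A lossy scheme would relax that requirement in exchange for a higher ratio.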
A compression system typically includes an encoder, a decoder and a channel for transmitting data between the two. In the case of a transmission system, the encoder encodes uncompressed data and transmits compressed data over the channel to the decoder, which then decompresses the received compressed data to recover the uncompressed data, either exactly (lossless) or approximately (lossy). Presumably, the channel has a limited available bandwidth requiring compression to handle the volume of data, but a limited channel is not required for compression to be used. In the case of a storage system, the encoder encodes uncompressed data and stores the compressed data in storage. When the data is needed (or at other times), the decoder recovers the uncompressed data (exactly or approximately) from the compressed data in storage. In either case, it should be understood that for compression to work, the encoder must convey via the compressed data enough information to allow the decoder to, at least approximately, reconstruct the original data.
A video sequence is often represented by a set of frames wherein each frame is an image and has a time element. The video sequence can be viewed by displaying each frame at the time indicated by its time element. For example, the first frame of a video sequence might be given a time element of 00:00:00:00 and the next frame given a time element of 00:00:00:01, where for example the rightmost two digits in the time element represent increments of 1/30th of a second (the other pairs of digits may represent hours, minutes, and seconds). Where the video sequence is a digitized, two-dimensional sequence, each frame can be represented by a set of pixels, where each pixel is represented by a pixel color value and a location in a (virtual or otherwise) two-dimensional array of pixels. Thus, an uncompressed video sequence can be fully represented by a collection of data structures for frames, with a data structure for a frame comprising pixel color values for each pixel in the frame. In a typical application, a pixel color value might be represented by 24 bits of data, a frame represented by a 1024×768 array of pixels, and one second of video represented by 30 frames. In that application, 24×1024×768×30=566,231,040 bits (or approximately 71 megabytes) are used to represent one second of video. Clearly, when video sequences of significant length are desired, compression is useful and often necessary.
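The arithmetic above can be checked with a short sketch. The function name uncompressed_bits_per_second is hypothetical, and a megabyte is assumed here to mean 1,000,000 bytes.

```python
def uncompressed_bits_per_second(bits_per_pixel: int, width: int,
                                 height: int, frames_per_second: int) -> int:
    """Bits needed to hold one second of uncompressed video."""
    return bits_per_pixel * width * height * frames_per_second


# Parameters taken from the example in the text.
bits = uncompressed_bits_per_second(24, 1024, 768, 30)

# Convert bits to bytes, then to megabytes (assuming 10**6 bytes each).
megabytes = bits / 8 / 1_000_000
```

At this rate a one-minute sequence would occupy roughly 60 times as much storage, which illustrates why compression is often necessary.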
Most video compression schemes attempt to remove redundant information from the video data. Video sequences will often have temporal redundancy and spatial redundancy. Temporal redundancy occurs when the scenery (e.g., the pixel color values) is the same or similar from frame to frame. Spatial redundancy occurs when the pixel color values repeat (or are similar) within a frame. Most video signals contain a substantial amount of redundant information. For example, in a television news broadcast, only parts of the head of the speaker change significantly from frame to frame and most objects in the background remain stationary. If the scene is two seconds long, the sequence may well contain sixty repetitions of the representations of stationary portions of the scene.
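One rough way to gauge temporal redundancy is to count how many pixels stay the same between consecutive frames. The sketch below is illustrative only: it assumes frames are lists of rows of single-channel integer values, and the name unchanged_fraction and its tolerance parameter are hypothetical.

```python
def unchanged_fraction(previous, current, tolerance=0):
    """Fraction of pixels whose value differs by at most `tolerance`
    between two frames of identical dimensions."""
    total = 0
    same = 0
    for prev_row, cur_row in zip(previous, current):
        for prev_value, cur_value in zip(prev_row, cur_row):
            total += 1
            if abs(prev_value - cur_value) <= tolerance:
                same += 1
    return same / total


# Hypothetical 4x4 frames: a static background with one changed pixel.
frame_a = [[10] * 4 for _ in range(4)]
frame_b = [row[:] for row in frame_a]
frame_b[1][2] = 200

fraction = unchanged_fraction(frame_a, frame_b)
```

Here 15 of the 16 pixels are unchanged, so most of the second frame repeats information already present in the first.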
In addition to eliminating redundancy, some video compression schemes also seek to eliminate superfluous information, such as information that is present in the uncompressed video but which can be eliminated without altering the video sequence enough to impair its visual quality. For example, some high spatial frequency effects can be eliminated from many video sequences, allowing for greater compression ratios, without substantially reducing the quality of the video sequence.
Spatial redundancy can be analyzed and reduced on a frame by frame basis (i.e., without needing to take into account other frames) using what is often referred to as “still-image compression,” since the processes used to compress single still images can be used. Examples of existing still-image compression include the Joint Photographic Experts Group (JPEG) standard, wavelet compression and fractal compression. Quite often, reduction of spatial redundancy alone is not sufficient to get to desirable compression ratios. Additionally, features that are lost in the compression of some frames may appear in other frames resulting in flickering as features appear and disappear as each frame is displayed.
A common approach to reduction of temporal redundancy is to include a still image compression of a reference frame in the compressed data, followed by information for one or more subsequent frames conveying the differences between each subsequent frame and the reference frame. The reference frame is said to be “intra-coded” while subsequent frames are said to be “predicted.” Intra-coded frames are often called “I-frames,” while predicted frames are commonly referred to as “P-frames.” Periodically, or according to some rule, a new reference frame is generated and used as the comparison for later subsequent frames. In some cases, the time element for the reference frame is always earlier than the time element for subsequent frames that reference the reference frame, but in other cases, a subsequent frame can reference frames before or after the subsequent frame. Of course, where the subsequent frame references a frame that comes after, recovery of the subsequent frame might be delayed until the later frame is recovered. Furthermore, subsequent frames may not reference an intra-coded frame directly but may instead reference previous or subsequent predicted frames.
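The reference-plus-differences approach can be sketched in simplified form. In the example below the reference frame is kept as raw pixel values rather than being still-image compressed, and each predicted frame is reduced to a list of changed pixels; both simplifications, and the names encode_predicted and decode_predicted, are assumptions made for illustration.

```python
def encode_predicted(reference, frame):
    """Return (row, column, value) for every pixel that differs from the reference."""
    return [(r, c, value)
            for r, row in enumerate(frame)
            for c, value in enumerate(row)
            if value != reference[r][c]]


def decode_predicted(reference, differences):
    """Rebuild a predicted frame from the reference frame and its differences."""
    frame = [row[:] for row in reference]  # start from a copy of the reference
    for r, c, value in differences:
        frame[r][c] = value
    return frame


# Hypothetical 2x3 frames: only one pixel changes after the reference frame.
i_frame = [[5, 5, 5], [5, 5, 5]]
next_frame = [[5, 5, 5], [5, 9, 5]]

p_frame = encode_predicted(i_frame, next_frame)
rebuilt = decode_predicted(i_frame, p_frame)
```

The predicted frame is conveyed by a single difference entry instead of six pixel values, and the decoder recovers it exactly from the reference frame.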
One approach to representing a predicted frame with fewer bits or symbols is block matching, a form of temporal redundancy reduction in which blocks of pixels in the predicted frame are compared with blocks of pixels in the referenced frame(s). The compressed predicted frame is then represented by indications of matching blocks rather than by pixel color values for each pixel in the predicted frame. With block matching, the predicted frame is subdivided into blocks (more generally, into polygons), and each block is tracked between the predicted frame and the referenced frame(s) and represented by a motion vector. When more than one referenced frame is used and the referenced frame cannot be identified by context, the predicted frame might be represented by both a motion vector and an indication of the applicable referenced frame for each constituent block. A motion vector for a block in an N-dimensional video frame typically has N components, one along each coordinate axis, where each component represents the offset of the block between a referenced frame and the predicted frame. A motion vector can, however, be any other suitable form of representation, whether or not it falls within the mathematical definition of a vector.
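Block matching can be sketched as an exhaustive search over a small window. The example below assumes a sum of absolute differences (SAD) as the matching criterion and single-channel frames stored as lists of rows; these choices, along with the names sad and find_motion_vector, are illustrative assumptions and are not prescribed by the text. The motion vector is expressed here as the (row, column) offset from the block's position in the predicted frame to the best-matching position in the referenced frame.

```python
def sad(reference, current, ref_top, ref_left, cur_top, cur_left, size):
    """Sum of absolute differences between two size-by-size blocks."""
    total = 0
    for dy in range(size):
        for dx in range(size):
            total += abs(reference[ref_top + dy][ref_left + dx]
                         - current[cur_top + dy][cur_left + dx])
    return total


def find_motion_vector(reference, current, top, left, size, search):
    """Search offsets within +/- `search` pixels; return (vector, cost)."""
    height, width = len(reference), len(reference[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            r_top, r_left = top + dy, left + dx
            # Skip candidate blocks that fall outside the referenced frame.
            if (r_top < 0 or r_left < 0
                    or r_top + size > height or r_left + size > width):
                continue
            cost = sad(reference, current, r_top, r_left, top, left, size)
            if best is None or cost < best[0]:
                best = (cost, (dy, dx))
    return best[1], best[0]


# Hypothetical 6x6 frames: a 2x2 patch moves from (1, 1) to (2, 3).
patch = [[50, 60], [70, 80]]
ref = [[0] * 6 for _ in range(6)]
cur = [[0] * 6 for _ in range(6)]
for y in range(2):
    for x in range(2):
        ref[1 + y][1 + x] = patch[y][x]
        cur[2 + y][3 + x] = patch[y][x]

vector, cost = find_motion_vector(ref, cur, top=2, left=3, size=2, search=2)
```

In this example the search finds an exact match one row up and two columns to the left in the referenced frame, so the block can be conveyed by that offset alone.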
The MPEG standards, created by the Moving Picture Experts Group, and their variants are examples of compression routines that use block matching. An MPEG encoder encodes the first frame in its input sequence in its entirety as an intra-frame, or I-frame, using still-image compression. The intra-frame might be compressed by having the frame divided into 16 pixel by 16 pixel blocks and having each of those blocks encoded. A predicted frame is then encoded by indicating matching blocks, where a block in the predicted frame matches a block in the intra-frame and motion vectors are associated with those blocks.
In most cases, a predicted frame cannot be reconstructed just from knowledge of referenced frames, block matches and motion vectors. A coarse approximation of the predicted frame might be reconstructible by starting with a blank image and copying each matching block from a referenced frame, shifting the relative position of the block according to the associated motion vector. However, gaps will remain where pixels of the predicted frame did not match any block in the reference frame(s) and differences might still exist where the blocks did not match exactly. Gaps are to be expected, such as where the scene captured in the video sequence is of a first object passing in front of a second object. If the second object is occluded in the referenced frame but not in the predicted frame, then there will be no matching information in the referenced frame that would allow for reconstruction of the predicted frame pixels that are associated with the second object.
One way to handle such problems is to run the block-matching process, determine what is left out and encode that as “residue”. For example, a predicted frame can be encoded as a set of block elements, where each block element represents a block from a referenced frame and an associated motion vector, and a residue correcting the pixels of the predicted frame that are not represented (or are not represented correctly enough) by the block information. In MPEG encoding, the residue is encoded using a block-based discrete cosine transform similar to that used in JPEG.
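The relationship between block matches and residue can be sketched as follows. The example assumes single-channel frames stored as lists of rows, a fill value of zero for unmatched regions, and an unencoded residue held as raw per-pixel differences; the function names and sample data are hypothetical. Each motion vector is the (row, column) offset from the block's position in the predicted frame to its source in the referenced frame.

```python
def predict_from_blocks(reference, matches, height, width, size, fill=0):
    """Build a coarse prediction by copying matched blocks from the reference.

    `matches` maps the (top, left) corner of a block in the predicted frame
    to a motion vector (dy, dx) pointing at its source in the reference.
    Pixels not covered by any match keep the `fill` value.
    """
    prediction = [[fill] * width for _ in range(height)]
    for (top, left), (dy, dx) in matches.items():
        for y in range(size):
            for x in range(size):
                prediction[top + y][left + x] = reference[top + dy + y][left + dx + x]
    return prediction


def compute_residue(frame, prediction):
    """Per-pixel correction that turns the prediction into the actual frame."""
    return [[f - p for f, p in zip(f_row, p_row)]
            for f_row, p_row in zip(frame, prediction)]


def apply_residue(prediction, residue):
    """Decoder side: add the residue back onto the prediction."""
    return [[p + r for p, r in zip(p_row, r_row)]
            for p_row, r_row in zip(prediction, residue)]


# Hypothetical 4x4 frames with 2x2 blocks; the block at (2, 0) has no match.
reference = [[1, 2, 0, 0], [3, 4, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
frame = [[0, 0, 1, 2], [0, 0, 3, 4], [0, 0, 0, 0], [0, 9, 0, 0]]
matches = {(0, 2): (0, -2), (0, 0): (2, 0), (2, 2): (0, 0)}

prediction = predict_from_blocks(reference, matches, 4, 4, size=2)
residue = compute_residue(frame, prediction)
rebuilt = apply_residue(prediction, residue)
nonzero = sum(1 for row in residue for value in row if value != 0)
```

Only one residue entry is nonzero here, corresponding to the pixel that the block matches could not supply. A practical encoder would compress the residue rather than store it pixel by pixel.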
Block matching is suboptimal in that it fails to take advantage of known physical characteristics or other information inherent in the images. The block method is both arbitrary and inexact, as the blocks generally do not have any relationship with real objects in the scene represented by the image. For example, a given block may comprise a part of an object, a whole object, or even multiple dissimilar objects with unrelated motion. Additional inefficiencies occur because the resultant residues for block-based matching are generally noisy and patchy, making them difficult to compress.
Segmentation followed by segment matching often provides better compression ratios than block matching because segments can be encoded more tightly than arbitrary blocks and segment matching leaves less of a residue. As used herein, a “segment” refers to a representation (or designation) of a set of pixels of an image, and a region of the image might also be referred to as a segment. Typically, the pixels within a given segment have color values that are within a narrow range of variation, while pixel values vary more widely across segment boundaries. Thus, dividing an image into segments of variable sizes and shapes allows for truer representations of image objects and thus eliminates many of the inefficiencies associated with block-based compression.
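The notion of a segment as a set of pixels with similar color values can be illustrated with a simple region-growing sketch. This is a generic flood-fill approach chosen for illustration and is not the segmentation method of any referenced application; the name segment_image, the threshold parameter, and the single-channel sample image are assumptions.

```python
from collections import deque


def segment_image(image, threshold):
    """Label 4-connected regions whose neighboring pixels differ by at most
    `threshold`. Returns (labels, segment_count).

    Note: similarity is tested between adjacent pixels, so values may drift
    gradually within one segment.
    """
    height, width = len(image), len(image[0])
    labels = [[-1] * width for _ in range(height)]
    next_label = 0
    for start_y in range(height):
        for start_x in range(width):
            if labels[start_y][start_x] != -1:
                continue  # pixel already belongs to a segment
            labels[start_y][start_x] = next_label
            queue = deque([(start_y, start_x)])
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and labels[ny][nx] == -1
                            and abs(image[ny][nx] - image[y][x]) <= threshold):
                        labels[ny][nx] = next_label
                        queue.append((ny, nx))
            next_label += 1
    return labels, next_label


# Hypothetical 3x4 image with three visually distinct regions.
image = [[10, 11, 200, 201],
         [10, 12, 199, 200],
         [10, 11, 50, 50]]

labels, count = segment_image(image, threshold=5)
```

The resulting segments follow the boundaries where pixel values change sharply, rather than a fixed grid, which is the property that distinguishes segments from blocks.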
Previous patent applications in this general area of technology include U.S. patent application Ser. No. 09/550,705, filed Apr. 17, 2000, and entitled “Method and Apparatus for Efficient Video Processing” (hereinafter referred to as “Prakash I”) and U.S. patent application Ser. No. 09/591,438, filed Jun. 9, 2000, and entitled “Method and Apparatus for Digital Image Segmentation” (hereinafter referred to as “Prakash II”). Prakash I and Prakash II discuss an encoding process including the segmentation of an image frame into such image components. As part of the encoding process, motion vectors are calculated that represent displacements of segments from one image frame to a subsequent image frame. These motion vectors are then included in the compressed data so that a decoder can use the information to reconstruct the second image frame.
Segmentation information need not be included in the compressed data if the decoder can extract the segmentation information from other data. For example, the decoder can extract segmentation information by segmenting an I-frame (or a predicted frame that the decoder has already reconstructed). Preferably, the encoder uses the same segmentation process as the decoder. For a further discussion, see Prakash I and Prakash II. With segmentation and segment matching, a predicted frame can be represented by a set of segment matches, wherein each segment match references a segment of a referenced frame and a motion vector indicating the offset of the segment between the referenced frame and the predicted frame. While segmentation followed by segment matching provides considerable compression ratios, some redundancy may remain. Attention should be paid to the faithfulness of the segmentation in representing real image objects in order to realize better compression ratios.
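A decoder-side use of segment matches can be sketched as follows. The example assumes that the decoder has already derived a label map for the referenced frame using the same segmentation process as the encoder, that each segment match is a (segment label, motion vector) pair, and that matches are drawn in the order given so that later segments overwrite earlier ones. The function name apply_segment_matches and the sample data are hypothetical. Each motion vector is the (row, column) offset of the segment from the referenced frame to the predicted frame.

```python
def apply_segment_matches(reference, labels, matches, fill=None):
    """Move each matched segment of the reference by its motion vector.

    `labels` assigns a segment label to every reference pixel.
    `matches` is an ordered list of (segment_label, (dy, dx)) pairs.
    Pixels not covered by any moved segment keep the `fill` value.
    """
    height, width = len(reference), len(reference[0])
    predicted = [[fill] * width for _ in range(height)]
    for segment_label, (dy, dx) in matches:
        for y in range(height):
            for x in range(width):
                if labels[y][x] != segment_label:
                    continue
                ny, nx = y + dy, x + dx
                if 0 <= ny < height and 0 <= nx < width:
                    predicted[ny][nx] = reference[y][x]
    return predicted


# Hypothetical 3x4 frame: segment 1 (value 7) moves two columns to the right
# while the background (segment 0) stays in place.
reference = [[0, 0, 0, 0], [7, 7, 0, 0], [0, 0, 0, 0]]
labels = [[0, 0, 0, 0], [1, 1, 0, 0], [0, 0, 0, 0]]
matches = [(0, (0, 0)), (1, (0, 2))]

predicted = apply_segment_matches(reference, labels, matches)
```

The two pixels left as None are the area uncovered by the moving segment. No segment of the referenced frame supplies them, so they would have to be conveyed by other means such as residue.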