1. Field of the Invention
The present invention relates generally to image and video processing and more particularly to the use of keyframes during video encoding and decoding.
2. Description of the Background Art
For a variety of reasons, video data (i.e., data representative of a sequence of video image frames) often requires compression. The compression may be needed to comply with bandwidth constraints, storage constraints, or other constraints.
As an example of a bandwidth constraint, a viewer might want to receive a video stream over an Internet connection having limited bandwidth at some point between the video source and the viewing device. If the video data is to be received at the receiver in a timely manner, the video data would need to be compressed in any of several situations: where the connection to the viewing device has less bandwidth than is required for uncompressed video (such as a 380 kilobit per second DSL line trying to download a 4 megabit per second DVD-quality movie); where the allotted bandwidth must be shared among many devices (such as a broadband channel used for many simultaneous video-on-demand sessions); or where the allotted bandwidth must be shared among many applications (such as e-mail, file downloads and web access).
Applications for compressed video over limited bandwidth include video streaming over the Internet, video conferencing, and digital interactive television. Satellite broadcasting and digital terrestrial television broadcasting are also examples of how bandwidth limitations can be dealt with using video compression. For example, using half the bandwidth allows one to double the number of channels broadcast on a satellite television network. Alternatively, using half the bandwidth may reduce the cost of these systems considerably.
Storage for video data might also be constrained. For example, a video sequence might need to be stored on a hard disk where the storage space required for uncompressed video is greater than the size of the available storage on the hard disk. Examples of devices requiring video storage include video-on-demand servers, satellite video sources, personal video recorders (“PVRs”, often referred to as “digital VCRs”), and personal computers. Other digital storage media can be used for video storage, such as DVDs, CDs and the like.
Compression allows video to be represented with fewer bits or symbols than the corresponding uncompressed video. It should be understood that a video sequence can include audio as well as video information, but herein compression is often discussed with reference to manipulation of just the video portion of such information. When video (or any other data) is compressed, it can be transmitted using less bandwidth and/or less channel time and it can be stored using less storage capacity. Consequently, much effort has gone into compression methods that achieve high compression ratios while preserving acceptable quality. A compression ratio is the ratio of the size (in bits, symbols, etc.) of uncompressed data to the size of the corresponding compressed data. Compression where the data can only be recovered approximately is referred to as “lossy” compression, as opposed to perfectly recoverable, or “lossless,” compression.
A compression system typically includes an encoder, a decoder and a channel for transmitting data between the two. In the case of a transmission system, the encoder encodes uncompressed data and transmits compressed data over the channel to the decoder, which then decompresses the received compressed data to recover the uncompressed data, either exactly (lossless) or approximately (lossy). Presumably, the channel has a limited available bandwidth requiring compression to handle the volume of data, but a limited channel is not required for compression to be used. In the case of a storage system, the encoder encodes uncompressed data and stores the compressed data in storage. When the data is needed (or at other times), the decoder recovers the uncompressed data (exactly or approximately) from the compressed data in storage. In either case, it should be understood that for compression to work, the encoder must convey via the compressed data enough information to allow the decoder to, at least approximately, reconstruct the original data.
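By way of illustration only, the encoder, decoder and compression ratio described above can be sketched as follows. The sketch assumes a general-purpose lossless coder (the DEFLATE coder in Python's standard `zlib` module) standing in for a video codec; the function names `encode`, `decode` and `compression_ratio` are hypothetical and are not taken from any particular system.

```python
import zlib


def encode(data: bytes) -> bytes:
    """Encoder side: produce compressed data to send over a channel or store.

    Assumption: zlib (DEFLATE) stands in for a real video encoder.
    """
    return zlib.compress(data, 9)


def decode(payload: bytes) -> bytes:
    """Decoder side: recover the uncompressed data exactly (lossless)."""
    return zlib.decompress(payload)


def compression_ratio(uncompressed: bytes, compressed: bytes) -> float:
    """Size of the uncompressed data divided by size of the compressed data."""
    return len(uncompressed) / len(compressed)


if __name__ == "__main__":
    # Highly repetitive data, loosely analogous to redundant video content.
    original = bytes([10, 20, 30, 40]) * 1000
    payload = encode(original)
    print("lossless round trip:", decode(payload) == original)
    print("compression ratio: %.1f" % compression_ratio(original, payload))
```

A lossy system would differ only in that `decode(encode(x))` would be permitted to return an approximation of `x` rather than `x` itself.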
A video sequence is often represented by a set of frames wherein each frame is an image and has a time element. The video sequence can be viewed by displaying each frame at the time indicated by its time element. For example, the first frame of a video sequence might be given a time element of 00:00:00:00 and the next frame given a time element of 00:00:00:01, where for example the rightmost two digits in the time element represent increments of 1/30th of a second (and the other pairs of digits may represent hours, minutes, and seconds). Where the video sequence is a digitized, two-dimensional sequence, each frame can be represented by a set of pixels, where each pixel is represented by a pixel color value and a location in a (virtual or otherwise) two-dimensional array of pixels. Thus, an uncompressed video sequence can be fully represented by a collection of data structures for frames, with a data structure for a frame comprising pixel color values for each pixel in the frame. In a typical application, a pixel color value might be represented by 24 bits of data, a frame represented by a 1024×768 array of pixels, and one second of video represented by 30 frames. In that application, 24×1024×768×30=566,231,040 bits (or approximately 71 megabytes) are used to represent one second of video. Clearly, when video sequences of significant length are desired, compression is useful and often necessary.
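The arithmetic above, together with the time element format used in the example, can be reproduced with a short sketch. The function names `raw_video_bits` and `timecode` are hypothetical, and the time element layout (hours, minutes, seconds, and a frame count in increments of 1/30th of a second) is assumed from the example given above.

```python
def raw_video_bits(bits_per_pixel, width, height, frames_per_second, seconds=1):
    """Number of bits needed to store uncompressed video of the given duration."""
    return bits_per_pixel * width * height * frames_per_second * seconds


def timecode(frame_index, frames_per_second=30):
    """Format a frame index as HH:MM:SS:FF, where FF counts frames in a second."""
    total_seconds, frames = divmod(frame_index, frames_per_second)
    total_minutes, seconds = divmod(total_seconds, 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}:{frames:02d}"


if __name__ == "__main__":
    bits = raw_video_bits(24, 1024, 768, 30)
    print(bits)                     # 566231040
    print(round(bits / 8 / 1e6))    # 71 (megabytes, using 10**6 bytes)
    print(timecode(0))              # 00:00:00:00
    print(timecode(1))              # 00:00:00:01
```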
Most video compression schemes attempt to remove redundant information from the video data. Video sequences will often have temporal redundancy and spatial redundancy. Temporal redundancy occurs when the scenery (e.g., the pixel color values) is the same or similar from frame to frame. Spatial redundancy occurs when the pixel color values repeat (or are similar) within a frame. Most video signals contain a substantial amount of redundant information. For example, in a television news broadcast, only parts of the head of the speaker change significantly from frame to frame and most objects in the background remain stationary. If the scene is two seconds long, the sequence may well contain sixty repetitions of the representations of stationary portions of the scene.
In addition to eliminating redundancy, some video compression schemes also seek to eliminate superfluous information, such as information that is present in the uncompressed video but which can be eliminated without altering the video sequence enough to impair its visual quality. For example, some high spatial frequency effects can be eliminated from many video sequences, allowing for greater compression ratios, without substantially reducing the quality of the video sequence.
Spatial redundancy can be analyzed and reduced on a frame-by-frame basis (i.e., without needing to take into account other frames) using what is often referred to as “still-image compression,” since the processes used to compress single still images can be used. Examples of existing still-image compression include the Joint Photographic Experts Group (JPEG) standard, wavelet compression, and fractal compression. Quite often, reduction of spatial redundancy alone is not sufficient to get to desirable compression ratios for video. Additionally, features that are lost in the compression of some frames may appear in other frames, resulting in flickering as features appear and disappear as each frame is displayed.
A common approach to reduction of temporal redundancy is to include a still image compression of a reference frame in the compressed data, followed by information for one or more subsequent frames conveying the differences between each subsequent frame and the reference frame. The reference frame is said to be “intra-coded” while subsequent frames are said to be “predicted.” Intra-coded frames are often called “I-frames” or “keyframes,” while predicted frames are sometimes referred to as “P-frames.” Periodically, or according to some rule, a new keyframe is generated and used as the comparison for later subsequent frames. In some cases, subsequent predicted frames may not reference a keyframe directly but may instead reference previous predicted frames. Additionally, some predicted frames may reference P-frames or I-frames that occur either previously or subsequently in the sequence. Such bi-directionally predicted frames are commonly referred to as “B-frames” to distinguish them from “P-frames,” which are predicted from one direction only.
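The idea of conveying only the differences between a subsequent frame and a reference frame can be sketched in its simplest form as follows. This is an illustrative simplification, not the method of any particular standard: frames are assumed to be small two-dimensional lists of integer pixel values, the reference frame is assumed to be available to the decoder unchanged, and the function names are hypothetical.

```python
def encode_predicted(reference, frame):
    """Represent `frame` as a list of (row, column, value) for changed pixels only."""
    return [
        (r, c, value)
        for r, row in enumerate(frame)
        for c, value in enumerate(row)
        if value != reference[r][c]
    ]


def decode_predicted(reference, changes):
    """Rebuild the predicted frame from the reference frame plus the changes."""
    frame = [row[:] for row in reference]  # copy so the reference is untouched
    for r, c, value in changes:
        frame[r][c] = value
    return frame


if __name__ == "__main__":
    keyframe = [[5, 5, 5], [5, 5, 5], [5, 5, 5]]
    next_frame = [[5, 5, 5], [5, 9, 5], [5, 5, 5]]  # one pixel changed
    changes = encode_predicted(keyframe, next_frame)
    print(changes)  # [(1, 1, 9)]
    print(decode_predicted(keyframe, changes) == next_frame)  # True
```

When most of the scene is stationary, the list of changes is much shorter than the frame itself, which is the temporal redundancy being exploited.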
One approach to representing a predicted frame with fewer bits or symbols is block matching, a form of temporal redundancy reduction in which blocks of pixels in the predicted frame are compared with blocks of pixels in the referenced frame(s) and the compressed predicted frame is represented by indications of matching blocks rather than pixel color values for each pixel in the predicted frame. With block matching, the predicted frame is subdivided into blocks (more generally, into polygons), and each block is tracked between the predicted frame and the referenced frame(s) and represented by a motion vector. When more than one referenced frame is used and the referenced frame cannot be identified by context, the predicted frame might be represented by both a motion vector and an indication of the applicable referenced frame for each constituent block. A motion vector for a block in an N-dimensional video frame typically has N components, one along each coordinate axis, where each component represents the offset of the block between a referenced frame and the predicted frame along that axis. A motion vector can, however, be any other suitable form of representation, whether or not it falls within the mathematical definition of a vector.
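A minimal sketch of block matching for a single block in a two-dimensional frame follows. Several choices here are assumptions made for illustration and are not specified above: the match criterion is the sum of absolute differences (SAD), the search is exhaustive over a small square window, frames are two-dimensional lists of integer pixel values, and the motion vector `(dx, dy)` is defined as the displacement from the block's position in the referenced frame to its position in the predicted frame.

```python
def sad(reference, frame, ref_top, ref_left, top, left, size):
    """Sum of absolute differences between a reference block and a frame block."""
    total = 0
    for dy in range(size):
        for dx in range(size):
            total += abs(reference[ref_top + dy][ref_left + dx]
                         - frame[top + dy][left + dx])
    return total


def find_motion_vector(reference, frame, top, left, size, search_range):
    """Return ((dx, dy), cost) for the best match of the block at (top, left).

    The block in `frame` is assumed to have moved by (dx, dy) pixels from its
    position in `reference`, so the candidate reference block sits at
    (top - dy, left - dx).
    """
    height, width = len(reference), len(reference[0])
    best_cost, best_vector = None, None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            ref_top, ref_left = top - dy, left - dx
            # Skip candidates that fall partly outside the reference frame.
            if not (0 <= ref_top <= height - size and 0 <= ref_left <= width - size):
                continue
            cost = sad(reference, frame, ref_top, ref_left, top, left, size)
            if best_cost is None or cost < best_cost:
                best_cost, best_vector = cost, (dx, dy)
    return best_vector, best_cost


if __name__ == "__main__":
    reference = [[0] * 8 for _ in range(8)]
    frame = [[0] * 8 for _ in range(8)]
    patch = [[10, 20], [30, 40]]
    for r in range(2):
        for c in range(2):
            reference[2 + r][1 + c] = patch[r][c]  # patch at row 2, column 1
            frame[3 + r][4 + c] = patch[r][c]      # moved to row 3, column 4
    print(find_motion_vector(reference, frame, top=3, left=4, size=2, search_range=3))
    # ((3, 1), 0): three pixels right, one pixel down, exact match
```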
The MPEG standards, created by the Moving Picture Experts Group, and their variants are examples of compression routines that use block matching. An MPEG encoder encodes the first frame in its input sequence in its entirety as an intra-frame, or I-frame, using still-image compression. The intra-frame might be compressed by having the frame divided into 16-pixel by 16-pixel blocks and having each of those blocks encoded. A predicted frame is then encoded by indicating matching blocks, where a block in the predicted frame matches a block in the intra-frame and motion vectors are associated with those blocks.
In most cases, a predicted frame cannot be reconstructed just from knowledge of the referenced frame(s), block matches and motion vectors. A coarse approximation of the predicted frame might be reconstructed by starting with a blank image and copying each matching block from a referenced frame, shifting the relative position of each block according to the associated motion vector. However, gaps will remain where pixels of the predicted frame did not match any block in the reference frame(s) and differences might still exist where the blocks did not match exactly. Gaps are to be expected, such as where the scene captured in the video sequence is of a first object passing in front of a second object. If the second object is occluded in the referenced frame but not in the predicted frame, then there will be no matching information in the referenced frame that would allow for reconstruction of the predicted frame pixels that are associated with the second object.
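The coarse reconstruction described above, and the gaps it leaves behind, can be sketched as follows. The data layout is an assumption made for illustration: each match is a tuple `(top, left, (dx, dy))` giving the position of a block in the predicted frame and its displacement from the referenced frame, blocks are square, and unfilled pixels of the blank image are marked with `None`. The function names are hypothetical.

```python
def reconstruct_from_blocks(reference, matches, height, width, block_size):
    """Copy each matched block from `reference` into a blank predicted frame."""
    frame = [[None] * width for _ in range(height)]  # blank image, all gaps
    for top, left, (dx, dy) in matches:
        for r in range(block_size):
            for c in range(block_size):
                # The block moved by (dx, dy), so its source lies at
                # (top - dy, left - dx) in the referenced frame.
                frame[top + r][left + c] = reference[top - dy + r][left - dx + c]
    return frame


def gap_pixels(frame):
    """List the (row, column) positions that no matching block covered."""
    return [(r, c)
            for r, row in enumerate(frame)
            for c, value in enumerate(row)
            if value is None]


if __name__ == "__main__":
    reference = [[r * 4 + c for c in range(4)] for r in range(4)]
    # Two 2x2 blocks are matched; the bottom half of the frame is unmatched.
    matches = [(0, 0, (0, 0)), (0, 2, (2, 0))]
    approx = reconstruct_from_blocks(reference, matches, 4, 4, 2)
    print(approx[0], approx[1])     # [0, 1, 0, 1] [4, 5, 4, 5]
    print(len(gap_pixels(approx)))  # 8
```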
One way to handle such problems is to run the block-matching process, determine what is left out and encode that as “residue”. For example, a predicted frame can be encoded as a set of block elements, where each block element represents a block from a referenced frame and an associated motion vector, and a residue correcting the pixels of the predicted frame that are not represented (or are not represented correctly enough) by the block information. In MPEG encoding, the residue is encoded using a transform-based still-image technique similar to that used in JPEG.
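As a sketch of the residue concept only, the residue can be taken as the pixel-wise difference between the actual predicted frame and the approximation obtained from block matching. The assumptions here are that frames are two-dimensional lists of integers, that gap pixels in the approximation are marked with `None` and treated as a prediction of zero, and that the residue is left uncompressed (a real encoder would go on to compress it). The function names are hypothetical.

```python
def compute_residue(actual, approximation):
    """Encoder side: residue = actual frame minus block-matched approximation."""
    return [
        [a - (p if p is not None else 0) for a, p in zip(actual_row, approx_row)]
        for actual_row, approx_row in zip(actual, approximation)
    ]


def apply_residue(approximation, residue):
    """Decoder side: add the residue back onto the approximation."""
    return [
        [(p if p is not None else 0) + d for p, d in zip(approx_row, residue_row)]
        for approx_row, residue_row in zip(approximation, residue)
    ]


if __name__ == "__main__":
    actual = [[10, 20], [30, 40]]
    approximation = [[10, 22], [None, 40]]  # one inexact pixel, one gap
    residue = compute_residue(actual, approximation)
    print(residue)  # [[0, -2], [30, 0]]
    print(apply_residue(approximation, residue) == actual)  # True
```

The better the block matches, the closer the residue is to all zeros and the fewer bits it should require once compressed.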
Block matching is suboptimal in that it fails to take advantage of known physical characteristics or other information inherent in the images. The block method is both arbitrary and inexact, as the blocks generally do not have any relationship with real objects in the scene represented by the image. For example, a given block may comprise a part of an object, a whole object, or even multiple dissimilar objects with unrelated motion. Additional inefficiencies occur because the resultant residues for block-based matching are generally noisy and patchy, making them difficult to compress.
Segmentation followed by segment matching often provides better compression ratios than block matching because segments can be encoded more tightly than arbitrary blocks and segment matching leaves less of a residue. As used herein, a “segment” refers to a representation (or designation) of a set of pixels of an image, and a region of the image might also be referred to as a segment. Typically, a “segment” refers to a representation (or designation) of a set of pixels of an image where the pixels within a given segment have color values that are within a narrow range of variation and where pixels typically have wider variations across segment boundaries. Thus, dividing an image into segments of variable sizes and shapes allows for truer representations of image objects and thus eliminates many of the inefficiencies associated with block-based compression.
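To make the notion of a segment concrete, the following sketch groups pixels into segments whose values stay within a narrow range. It is offered only as an illustration and is not the segmentation method of any referenced application: it assumes a single-channel image held as a two-dimensional list of integers, uses simple 4-connected region growing from a seed pixel, and takes a hypothetical `tolerance` parameter as the allowed deviation from the seed value.

```python
from collections import deque


def segment_image(image, tolerance):
    """Label each pixel with a segment index; return (labels, segment_count).

    A segment is grown from the first unlabeled pixel in scan order (the seed)
    and absorbs 4-connected neighbors whose value is within `tolerance` of the
    seed value.
    """
    height, width = len(image), len(image[0])
    labels = [[-1] * width for _ in range(height)]
    next_label = 0
    for r0 in range(height):
        for c0 in range(width):
            if labels[r0][c0] != -1:
                continue  # already part of an earlier segment
            seed = image[r0][c0]
            labels[r0][c0] = next_label
            queue = deque([(r0, c0)])
            while queue:
                r, c = queue.popleft()
                for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                    if (0 <= nr < height and 0 <= nc < width
                            and labels[nr][nc] == -1
                            and abs(image[nr][nc] - seed) <= tolerance):
                        labels[nr][nc] = next_label
                        queue.append((nr, nc))
            next_label += 1
    return labels, next_label


if __name__ == "__main__":
    image = [[10, 11, 200],
             [10, 12, 201],
             [90, 90, 199]]
    labels, count = segment_image(image, tolerance=5)
    print(labels)  # [[0, 0, 1], [0, 0, 1], [2, 2, 1]]
    print(count)   # 3
```

Because the procedure is deterministic, running it on the same image always yields the same segments, with shapes that follow the image content rather than a fixed block grid.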
Another patent application in the same general technology area is U.S. patent application Ser. No. 09/550,705, filed Apr. 17, 2000 and titled “Method and Apparatus for Efficient Video Processing” (hereinafter “Prakash I”). Prakash I discusses a method for compressing a video sequence using segmentation. As part of the encoding process, motion vectors are calculated that represent displacements of segments from one image frame to a subsequent image frame. These motion vectors are then included in the compressed data so that a decoder can use the information to reconstruct the second image frame. Segmentation information need not be included in the compressed data if the decoder can extract the segmentation information from other data. For example, the decoder can extract segmentation information by segmenting a keyframe (or another predicted frame that the decoder has already reconstructed). Preferably, the encoder uses the same segmentation process as the decoder. For a further discussion, please refer to Prakash I. With segmentation and segment matching, a predicted frame can be represented by a set of segment matches, wherein each segment match references a segment of a referenced frame and a motion vector indicating the offset of the segment between the referenced frame and the predicted frame.
In both block-based and segment-based compression strategies, keyframes are used as reference points for subsequent predicted frames. A typical arrangement of I-frames, P-frames, and B-frames, as for instance may appear in an MPEG-encoded video sequence, is I1, B1, B2, P1, B3, B4, P2, B5, B6, P3, B7, B8, P4, B9, B10, I2, and so on. In this arrangement, I1 is used to predict P1, P1 is used to predict P2, and so on, and the B-frames lying in between are predicted bi-directionally from the nearest I- or P-frames. Because each B-frame depends on a later I- or P-frame, that later frame must be decoded first, and this sequence must actually be decompressed in the order I1, P1, B1, B2, P2, B3, B4, P3, B5, B6, P4, B7, B8, I2, B9, B10, and so on. A set of consecutive frames that are predicted relative to a single keyframe is commonly referred to as a group of pictures (GOP).
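The reordering from display order to decompression order in the example above can be sketched as follows. The sketch assumes that frames are identified by strings whose first letter gives the frame type, and that every B-frame depends on the nearest I- or P-frame on either side, as in the example; the function name `decode_order` is hypothetical.

```python
def decode_order(display_order):
    """Reorder frames so every B-frame follows both frames it is predicted from."""
    result = []
    pending_b = []  # B-frames waiting for their following I- or P-frame
    for frame in display_order:
        if frame.startswith("B"):
            pending_b.append(frame)
        else:
            # An I- or P-frame must be decoded before the B-frames that
            # precede it in display order.
            result.append(frame)
            result.extend(pending_b)
            pending_b = []
    # Assumption: trailing B-frames with no later reference are kept at the end.
    result.extend(pending_b)
    return result


if __name__ == "__main__":
    display = ["I1", "B1", "B2", "P1", "B3", "B4", "P2", "B5", "B6",
               "P3", "B7", "B8", "P4", "B9", "B10", "I2"]
    print(", ".join(decode_order(display)))
    # I1, P1, B1, B2, P2, B3, B4, P3, B5, B6, P4, B7, B8, I2, B9, B10
```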