Video images have become an increasingly important part of communications in general. The ability to transmit still images and, in particular, live moving images nearly instantaneously has greatly enhanced global communications.
In particular, videoconferencing systems have become an increasingly important business communication tool. These systems facilitate meetings between persons or groups of persons situated remotely from each other, thus eliminating or substantially reducing the need for expensive and time-consuming business travel. Since videoconference participants are able to see facial expressions and gestures of remote participants, richer and more natural communication is engendered. In addition, videoconferencing allows sharing of visual information, such as photographs, charts, and figures, and may be integrated with personal computer applications to produce sophisticated multimedia presentations.
To provide cost-effective video communication, the bandwidth required to convey video must be limited. The typical bandwidth used for videoconferencing lies in the range of 128 to 1920 kilobits per second (Kbps). As available videoconferencing systems attempt to cope with bandwidth limitations, they exhibit problems that include slow frame rates, which result in a non-lifelike picture having an erratic, jerky motion; use of small video frames or limited spatial resolution of a transmitted video frame; and a reduction in the signal-to-noise ratio of individual video frames. Conventionally, if solutions such as reduced video frame size or limited spatial resolution are not employed, higher bandwidths are required.
At 768 Kbps, digital videoconferencing, using state-of-the-art video encoding methods, produces a picture that may be likened to a scene from analog television. Typically, for most viewers, 24 frames per second (fps) are required to make video frames look fluid and give the impression that motion is continuous. As the frame rate is reduced below 24 fps, an erratic motion results. In addition, there is always a tradeoff between the required video frame size and the available network capacity. Therefore, lower bandwidth requires a lower frame rate and/or reduced video frame size.
A standard video format used in videoconferencing, defined by resolution, is Common Intermediate Format (CIF). The primary CIF format is also known as Full CIF or FCIF. The International Telecommunication Union (ITU), based in Geneva, Switzerland (www.itu.ch), has established this communications standard. Additional standards with resolutions higher and lower than CIF have also been established. Resolution and bit rate requirements for various formats are shown in Table I below. The bit rates (in megabits per second, Mbps) shown are for uncompressed color frames where 12 bits per pixel is assumed.
TABLE I
Resolution and bit rates for various CIF formats

CIF Format                 Resolution (in pixels)   Bit Rate at 30 fps (Mbps)
SQCIF (Sub Quarter CIF)    128 × 96                 4.424
QCIF (Quarter CIF)         176 × 144                9.124
CIF (Full CIF, FCIF)       352 × 288                36.50
4CIF (4 × CIF)             704 × 576                146.0
16CIF (16 × CIF)           1408 × 1152              583.9
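The bit rates in Table I follow from multiplying the frame width, the frame height, 12 bits per pixel, and 30 frames per second. The following Python fragment is a minimal sketch of that arithmetic. The function name and the format dictionary are illustrative only, and 1 Mbps is assumed to mean 10^6 bits per second, which is consistent with the tabulated values.

```python
# Uncompressed bit rate for the CIF family, assuming 12 bits per pixel
# and 30 frames per second as stated for Table I.
CIF_FORMATS = {
    "SQCIF": (128, 96),
    "QCIF": (176, 144),
    "CIF": (352, 288),
    "4CIF": (704, 576),
    "16CIF": (1408, 1152),
}


def uncompressed_mbps(width, height, bits_per_pixel=12, fps=30):
    """Return the raw bit rate in Mbps (assumed here to be 10**6 bits/s)."""
    return width * height * bits_per_pixel * fps / 1_000_000


for name, (w, h) in CIF_FORMATS.items():
    print(f"{name:6s} {w:4d} x {h:<4d} {uncompressed_mbps(w, h):8.3f} Mbps")
```

For example, a CIF frame gives 352 × 288 × 12 × 30 = 36,495,360 bits per second, or about 36.50 Mbps.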
Video compression is a means of encoding digital video to take up less storage space and reduce required transmission bandwidth. Compression/decompression (CODEC) schemes are frequently used to compress video frames to reduce required transmission bit rates. Overall, CODEC hardware or software compresses digital video into a smaller binary format than required by the original (i.e., uncompressed) digital video format.
H.263 is a document which describes a common contemporary CODEC scheme, requiring a bandwidth from 64 to 1920 Kbps. H.263 is an ITU standard for compressing video and is generically known as a lossy compression method. Lossy coding assumes that some information can be discarded, which results in a controlled degradation of the decoded signal. The lossy coding method is designed to degrade gradually as a progressively lower bit rate is available for transmission. Thus, the use of lossy compression methods results in a loss of some of the original image information during the compression stage and, hence, the lost original image information becomes unrecoverable. For example, a solid blue background in a video scene can be compressed significantly with little degradation in apparent quality. However, other frames containing sparse amounts of continuous or repeating image portions often cannot be compressed significantly without a noticeable loss in image quality.
Many video compression standards, including MPEG, MPEG-2, MPEG-4, H.261, and H.263, utilize a block-based Discrete Cosine Transform (DCT) operation on data blocks, 8×8 samples in size. A set of coefficients for each block is generated through the use of a two-dimensional DCT operation. Such coefficients relate to a spatial frequency content of the data block. Subsequently, the 64 DCT coefficients (one for each sample) in a block are quantized. For H.263, one quantizer step size is applied to every DCT coefficient in a data block and is part of the information that must be transmitted to an H.263 decoder. The quantization process is defined as a division of each DCT coefficient by the quantization step size followed by rounding to the nearest integer. An encoder applies variable uniform quantization to DCT coefficients to reduce the number of bits required to represent them. Compression may be performed on each of the pixels represented by a two-by-two array of blocks containing luminance samples and two blocks of chrominance samples. This array of six blocks is commonly referred to as a macroblock. The four luminance and two chrominance data blocks in a macroblock combine to represent a 16×16 pixel array.
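The transform and quantization steps described above may be illustrated by the following Python sketch, offered only as a simplified example. It computes an orthonormal two-dimensional DCT of one 8×8 block directly from the definition and then divides every coefficient by a single step size and rounds. The function names, the choice of normalization, and the step size of 16 are assumptions made for illustration, and the sketch does not reproduce the exact quantizer rules of H.263 or of any other standard.

```python
import math

N = 8  # block size used by the block-based DCT


def dct_2d(block):
    """Orthonormal 2-D DCT-II of an N x N block (direct, unoptimized form)."""
    def scale(k):
        return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)

    coeffs = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            total = 0.0
            for x in range(N):
                for y in range(N):
                    total += (block[x][y]
                              * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                              * math.cos((2 * y + 1) * v * math.pi / (2 * N)))
            coeffs[u][v] = scale(u) * scale(v) * total
    return coeffs


def quantize(coeffs, step):
    """Divide each coefficient by one step size and round to an integer.

    Python's round() resolves exact .5 ties to the even integer.
    """
    return [[int(round(value / step)) for value in row] for row in coeffs]


# A flat block has only a DC coefficient, so 63 of the 64 indices are zero.
flat_block = [[100] * N for _ in range(N)]
indices = quantize(dct_2d(flat_block), step=16)
print(indices[0][0], sum(abs(v) for row in indices for v in row))
```

The flat block in this example corresponds to the solid background case mentioned earlier: nearly all of its quantization indices are zero, which is why such content compresses well.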
In an H.263 encoder, variable uniform quantization is applied by means of the quantization parameter that provides quantization step sizes that map the values of DCT coefficients to a smaller set of values called quantization indices. In the H.263 decoder, DCT coefficient recovery is performed, roughly speaking, by multiplying the recovered quantization indices by the inverse quantization step size. The decoder then calculates an inverse DCT using the recovered coefficients.
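The mapping of coefficients to quantization indices and their approximate recovery at the decoder may be sketched as follows. This is a simplified uniform quantizer rather than the exact H.263 reconstruction rule; the function names, the sample coefficient values, and the step size are hypothetical, and the inverse DCT step is omitted.

```python
def quantize(coeffs, step):
    """Encoder side: map each coefficient to an integer quantization index."""
    return [int(round(c / step)) for c in coeffs]


def dequantize(indices, step):
    """Decoder side: recover approximate coefficients from the indices."""
    return [i * step for i in indices]


coeffs = [803.2, -47.9, 12.4, 3.1, -0.6]  # illustrative DCT coefficients
step = 16                                 # illustrative step size

indices = quantize(coeffs, step)
recovered = dequantize(indices, step)

print(indices)    # [50, -3, 1, 0, 0]
print(recovered)  # [800, -48, 16, 0, 0]
```

In this simplified model, each recovered coefficient differs from the original by at most half the step size, which illustrates why the information discarded by quantization cannot be restored at the decoder.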
Although the DCT and other methods have proven somewhat effective in utilizing spatial redundancy to limit the bit rate required to represent an image, there remains a need to improve video quality in a computationally-effective way. Video sequences tend to contain a large amount of temporal redundancy; in other words, areas of the current image are very likely to be similar to areas of a subsequent image. In any video compression method, motion estimation takes advantage of the temporal redundancy to reduce the required bit rate. Motion estimation is commonly performed between a current image frame and a previous image, which serves as the reference image frame. The motion estimation method typically operates on an integer pixel grid, comparing a macroblock of the current frame against a larger search space that contains the co-located macroblock of the previous frame. A portion of the search area may be sampled to reduce the computational complexity of comparisons. A vector is generated to estimate temporal differences between where a macroblock appears in the current image and where the best representation appears in the reference image search area. The generated vector is a motion vector.
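By way of illustration only, an exhaustive integer-pixel block-matching search may be sketched in Python as follows. The sum of absolute differences (SAD) is assumed here as the matching criterion, and the function names, block size, search range, and synthetic test frames are all hypothetical. The motion vector is taken to point from the block position in the current frame to the best-matching position in the reference frame; other sign conventions are also in use.

```python
def sad(cur, ref, cx, cy, rx, ry, size):
    """Sum of absolute differences between the block of cur at (cx, cy)
    and the block of ref at (rx, ry)."""
    total = 0
    for dy in range(size):
        for dx in range(size):
            total += abs(cur[cy + dy][cx + dx] - ref[ry + dy][rx + dx])
    return total


def full_search(cur, ref, cx, cy, size=16, search_range=7):
    """Test every integer offset in the search window; return (mv, cost)."""
    height, width = len(ref), len(ref[0])
    best_mv = (0, 0)
    best_cost = sad(cur, ref, cx, cy, cx, cy, size)  # co-located block
    for mvy in range(-search_range, search_range + 1):
        for mvx in range(-search_range, search_range + 1):
            rx, ry = cx + mvx, cy + mvy
            # Skip candidates that fall outside the reference frame.
            if rx < 0 or ry < 0 or rx + size > width or ry + size > height:
                continue
            cost = sad(cur, ref, cx, cy, rx, ry, size)
            if cost < best_cost:
                best_mv, best_cost = (mvx, mvy), cost
    return best_mv, best_cost


# Synthetic frames: the current frame is the reference frame shifted so that
# content at (x, y) in the current frame comes from (x + 3, y - 2).
W = H = 32
ref = [[(3 * x * x + 5 * y * y + 7 * x * y) % 251 for x in range(W)]
       for y in range(H)]
cur = [[ref[(y - 2) % H][(x + 3) % W] for x in range(W)] for y in range(H)]

mv, cost = full_search(cur, ref, cx=12, cy=12, size=8, search_range=4)
print(mv, cost)  # (3, -2) 0
```

With a search range of R pixels, this exhaustive form evaluates up to (2R + 1) × (2R + 1) candidate positions per block, each requiring one comparison per pixel of the block, which is the computational burden that sampling a portion of the search area is intended to reduce.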
Contemporary video motion estimation methods have a trade-off between accuracy and computational cost (i.e., computation power and memory requirements). If a search algorithm requires a large number of comparisons to cover the search area, a great deal of computational power and time is required, which can reduce the overall frame rate and thereby produce a jerky or erratic picture. If a small search area or small comparison set is used, a resulting picture may suffer from blocking defects. Consequently, there is a need for a computationally-efficient system and method for producing and evaluating motion vectors in a video frame.