1. Field of the Invention
The present invention relates to a method and apparatus for compression of multimedia data. More specifically, the present invention relates to a method and apparatus for predictive compression of video frames.
2. Description of the Related Art
The creation of pictures or images has been a human activity since the beginning of humanity. However, until recent history viewing of an image required the viewer to be physically present at the image. This was geographically cumbersome. Photography, both still and motion, broke this geographic constraint by allowing pictures to be captured and transported independent of the physical images they represented. Television enhanced transmission of images, by sending images, recorded or live, to any geographic location capable of receiving a radio signal. But, for the most part, viewers of television can only view images that are scheduled for transmission, rather than selecting images at will.
With the development of computers, and more specifically computers that are linked across a network, images stored on one computer may be demanded by a viewer, and almost instantaneously provided to the viewer's computer over the computer network. One computer network that is increasingly being used is the Internet, the well-known international computer network that links various military, government, education, nonprofit, industrial and financial institutions, commercial enterprises, and individuals.
Images are typically of two types: 1) single pictures; or 2) moving pictures. Single pictures include photographs, computer art, faxes and web pages. Moving pictures typically include a number of single images or frames organized into a particular sequence. Within a computer network, images are captured and stored on one computer, and then transmitted over the network to another computer for viewing. An example of this is provided in FIG. 1, to which reference is now made.
FIG. 1 illustrates a computer system 100 that includes a server 102 connected to a number of mass storage devices 104. The mass storage devices 104 are used to store a number of video frames 120. The video frames 120 could be still images, or could be combined into sequences to create moving pictures, as described above. The sequences reside on the mass storage devices 104, and upon request, may be transmitted by the server 102 to other computers 108 via a network 106. In addition, the video frames 120 may be transferred to remote computers, such as the computer 112, via a network 116, using a router 110 and/or a modem 114. One skilled in the art should appreciate that the network 116 could be a dedicated connection, or a dial-up connection, and could utilize any of a number of network protocols such as TCP/IP or Client/Server configurations.
In operation, a user sitting at any of the computers 108, 112 would request video frames 120 from the server 102, and the server would retrieve the video frames 120 from the mass storage devices 104, and transmit the frames 120 over the network 106. Upon receipt of the video frames 120, the computers 108, 112 would display the images for the requester.
It should be appreciated that the computers 108, 112 may be positioned physically close to the server 102, or may be thousands of miles away. The computers 108, 112 may be connected to the server 102 via a direct LAN connection such as Ethernet or Token Ring, or may utilize plain old telephone service (POTS), ISDN or ADSL, depending on the availability of each of these services, their cost, and the performance required by the end user. As is typical of computer equipment and services, higher performance means more cost.
In most cases, the amount of data required to represent a video frame, or more specifically a sequence of video frames 120, is significant. For example, a color image or frame is typically represented by a matrix of individual dots or pixels, each having a particular color defined by a combination of red, green and blue intensities (RGB). To create a palette of 16 million colors (i.e., true color), each of the RGB intensities is represented by an 8-bit value. So, for each pixel, 24 bits are required to define the pixel's color. A typical computer monitor has a resolution of 1024 pixels (across) by 768 pixels (down). So, creating a full screen image for a computer requires 1024×768×24 bits = 18,874,368 bits, or 2,359,296 bytes of data to be stored. And that is just for one image.
If a moving picture is to be displayed, a sequence of images is grouped and displayed one after another at a rate of approximately 30 frames per second. Thus, a 1 second, true color, full screen movie could require roughly 70 megabytes of data storage (30 frames of 2,359,296 bytes each). With present technology, even very expensive storage systems and high speed networks would be overwhelmed if alternatives were not provided. Moreover, as the resolution and the frame rate requirements of a video increase, the amount of data that is necessary to describe the video also increases.
One approach to reducing the amount of data required to represent images or moving pictures is to simply reduce the size of frames that are transmitted and displayed. One popular frame size is 320 pixels in width and 240 pixels in height, or 320×240. Thus, a true color frame of this size requires 320×240×24 = 1,843,200 bits, or about 230 kilobytes of data. This is significantly less (about 1/10th) than what is required for a full screen image. However, as frames are combined into moving pictures, the amount of data that must be transmitted is still significant.
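The frame-size arithmetic above can be checked with a short sketch. The function name raw_frame_bits is hypothetical and is introduced only for illustration; the sketch assumes an uncompressed frame at 24 bits per pixel, as in the text.

```python
def raw_frame_bits(width, height, bits_per_pixel=24):
    """Return the uncompressed size of one frame, in bits."""
    return width * height * bits_per_pixel


# Full screen frame and reduced-size frame from the discussion above.
full_screen_bits = raw_frame_bits(1024, 768)
small_frame_bits = raw_frame_bits(320, 240)

# Convert to bytes (8 bits per byte).
full_screen_bytes = full_screen_bits // 8
small_frame_bytes = small_frame_bits // 8

# Fraction of the full screen size taken by the smaller frame.
size_fraction = small_frame_bytes / full_screen_bytes
```

Under these assumptions the smaller frame occupies slightly less than one tenth of the storage of the full screen frame.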
Another approach to reducing the amount of storage space required for video frames involves compressing the data. The extent to which data is compressed is typically measured in terms of a compression ratio or a bit rate. The compression ratio is generally the number of bits of an input value divided by the number of bits in the representation of that input value in compressed code. Higher compression ratios are preferred over lower compression ratios. The bit rate is the number of bits per second of compressed data required to properly represent a corresponding input value.
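The two measures just defined may be illustrated as follows. This is a minimal sketch; the function names and the sample figures (a 320×240 true color frame compressed to one tenth of its size and played at 30 frames per second) are assumptions chosen for illustration only.

```python
def compression_ratio(input_bits, compressed_bits):
    """Bits in the input divided by bits in its compressed representation."""
    return input_bits / compressed_bits


def bit_rate(compressed_bits_per_frame, frames_per_second):
    """Bits per second of compressed data needed to deliver the sequence."""
    return compressed_bits_per_frame * frames_per_second


# Hypothetical example: 1,843,200 raw bits compressed to 184,320 bits.
ratio = compression_ratio(1843200, 184320)
rate = bit_rate(184320, 30)
```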
There are three basic methods involved in any data compression scheme: 1) transformation, 2) reduced precision (quantization), and 3) minimization of number of bits (encoding). Each of these methods may be used independently, or may be combined with the other methods to obtain optimum compression. Although the number of scheme combinations is large, typically compression is accomplished by a sequential process of transformation, precision reduction, and coding. Coding is always the final stage of the process, but there are sometimes several transformation and precision reduction iterations. This process is summarized in FIG. 2, to which attention is now directed.
In FIG. 2, a block 202 is shown to illustrate the step of transformation, a block 204 is shown to illustrate the step of quantization, and a block 206 is shown to illustrate the step of coding. The transformation block 202 transforms a data set into another equivalent data set that is in some way smaller than the original. Some transformations reduce the number of data items in a set. Other transformations reduce the numerical size of data items, which allows them to be represented with fewer binary digits.
To reduce the number of data items in a set, methods are used that remove redundant information within the set. Examples of such methods include Run-Length-Encoding (RLE) and LZW encoding. RLE is a pattern-recognition scheme that searches for the repetition of identical data values in a list. The data set can be compressed by replacing the repetitive sequence with a single data value and a length value. Compression ratios obtainable from RLE encoding schemes vary depending on the type of data to be encoded, but generally range from 2:1 up to 5:1. LZW encoding replaces repeated sequences within a data set with particular codes that are smaller than the data they represent. Codebooks are used during encoding and decoding to transform the data set back and forth from raw data to encoded data. Compression ratios obtainable from LZW encoding of video images range from 2:1 to 9:1.
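By way of illustration, run-length encoding as described above may be sketched as follows. The function names and the (value, length) pair layout are assumptions made for this example and are not taken from any particular RLE file format.

```python
def rle_encode(values):
    """Replace each run of identical values with a (value, length) pair."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([value, 1])   # start a new run
    return [(value, length) for value, length in runs]


def rle_decode(runs):
    """Expand (value, length) pairs back into the original list."""
    values = []
    for value, length in runs:
        values.extend([value] * length)
    return values


encoded = rle_encode([7, 7, 7, 7, 2, 2, 9])
```

Data with long runs compresses well under this scheme, while data with no repeated neighbors grows, since every value then carries its own length.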
Transformations that reduce the size of individual data items within a data set include differencing. Differencing is a scheme that attempts to reduce the size of individual data values within a data set by storing the difference between pixel values, rather than the actual data values for each pixel. In many cases the difference value is much smaller in magnitude than the original data value, and thus requires a smaller data space for storage.
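Differencing may be sketched as follows, assuming for illustration that each pixel is differenced against its immediate predecessor in a one-dimensional list and that the first pixel is stored unchanged. The function names are hypothetical.

```python
def difference_encode(pixels):
    """Store the first pixel, then each pixel minus its predecessor."""
    if not pixels:
        return []
    deltas = [pixels[0]]
    for previous, current in zip(pixels, pixels[1:]):
        deltas.append(current - previous)
    return deltas


def difference_decode(deltas):
    """Rebuild the pixel values by accumulating the stored differences."""
    pixels = []
    running_value = 0
    for delta in deltas:
        running_value += delta
        pixels.append(running_value)
    return pixels


deltas = difference_encode([100, 102, 101, 105])
```

In this example every value after the first has a magnitude of 4 or less, so it can be represented with far fewer bits than the original 8-bit intensities.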
Other transformation schemes exist to transform a set of data values from one system of measurement into another, where the properties of the new data set facilitate the data's compression. One such scheme called colorspace conversion transforms the RGB pixel values into luminance Y, and chrominance Cb and Cr values. This is referred to as RGB/YUV conversion. Less important values, such as the Cr component may be ignored without significantly affecting the image perceived by a viewer.
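A colorspace conversion of this kind may be sketched as follows. The text above does not specify conversion coefficients; this sketch assumes the full-range ITU-R BT.601 coefficients used by JFIF, and the function name is hypothetical.

```python
def rgb_to_ycbcr(red, green, blue):
    """Convert 8-bit RGB intensities to luminance Y and chrominance Cb, Cr.

    Assumption: full-range BT.601 (JFIF) coefficients, with the
    chrominance values centered on 128.
    """
    y = 0.299 * red + 0.587 * green + 0.114 * blue
    cb = 128.0 - 0.168736 * red - 0.331264 * green + 0.5 * blue
    cr = 128.0 + 0.5 * red - 0.418688 * green - 0.081312 * blue
    return y, cb, cr


# A neutral gray pixel carries no color information, so both
# chrominance values sit at the center value.
gray_y, gray_cb, gray_cr = rgb_to_ycbcr(128, 128, 128)
```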
Another scheme that transforms a set of data values from one system of measurement into another is the Discrete-Cosine-Transform. The DCT transforms a block of original data that typically represents color intensity (YUV) into a new set of values that represent cosine frequencies over the original block of data. Lower frequencies are stored in an upper left portion of the data block with higher frequencies stored in the rest of the block. If higher frequency components are ignored, an entire block of data may be represented by just a few data values in a block.
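The transform described above may be sketched as a direct, unoptimized two-dimensional DCT. This sketch assumes the orthonormal DCT-II form and a square block; the function name dct_2d is hypothetical, and practical encoders use much faster factored implementations.

```python
import math


def dct_2d(block):
    """Compute the orthonormal 2-D DCT-II of a square block of values."""
    n = len(block)

    def scale(k):
        # Normalization factor that makes the transform orthonormal.
        return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)

    coeffs = [[0.0] * n for _ in range(n)]
    for u in range(n):          # vertical frequency index
        for v in range(n):      # horizontal frequency index
            total = 0.0
            for y in range(n):
                for x in range(n):
                    total += (block[y][x]
                              * math.cos((2 * y + 1) * u * math.pi / (2 * n))
                              * math.cos((2 * x + 1) * v * math.pi / (2 * n)))
            coeffs[u][v] = scale(u) * scale(v) * total
    return coeffs


# A perfectly flat block has energy only in the upper left coefficient.
flat_coeffs = dct_2d([[10] * 4 for _ in range(4)])
```

For the flat block, only the lowest frequency coefficient in the upper left position is nonzero, so the whole block can be represented by a single value.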
It should be appreciated that each of the schemes described above is well known in the art, and that the schemes may be combined, for a particular frame of data, to achieve maximum compression. However, each of these schemes is applied to a single video frame, independently of other video frames. This is called intra-frame compression. For full motion video, including multicast video, teleconferencing, and interactive video, compressing each video frame separately is not sufficient, because of the large number of frames in even a short video sequence. Further compression may be achieved by taking advantage of the similarities between frames. In many instances, the difference between one frame and the next is small because of the short time interval between frames. Schemes that exploit these similarities are referred to as inter-frame compression.
One simple scheme stores only the pixels that actually change from one frame of the video sequence to the next. Stated more technically, the scheme is to store only the pixels that produce a nonzero difference when subtracted from their corresponding pixels in a previous frame. Thus, rather than having to transmit all of the pixel values in a video block, only those pixels that have changed need to be transmitted.
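This scheme may be sketched as follows, assuming for illustration that each frame is a flat list of pixel values and that a changed pixel is recorded as an (index, new value) pair. The function names and the record layout are hypothetical.

```python
def changed_pixels(previous_frame, current_frame):
    """List (index, value) for every pixel with a nonzero frame difference."""
    changes = []
    for index, (old, new) in enumerate(zip(previous_frame, current_frame)):
        if new - old != 0:
            changes.append((index, new))
    return changes


def apply_changes(previous_frame, changes):
    """Rebuild the current frame from the previous frame plus the changes."""
    frame = list(previous_frame)
    for index, value in changes:
        frame[index] = value
    return frame


changes = changed_pixels([10, 10, 10, 10], [10, 12, 10, 9])
```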
Another approach to video compression is to calculate the differences between corresponding pixels in consecutive frames and then encode the differences instead of the original values. This is called motion compensation. But, in motion pictures, pixel values often shift their spatial location from one frame to the next. To locate shifted pixels, a number of pixel values are grouped together to form a block. Then, a block within a present frame is compared to blocks in a previous frame to determine an offset such that all of the pixel differences are minimized. This is called motion estimation. An offset is typically represented as a pair of numbers that specify a shift in the horizontal and vertical directions. This is referred to as a motion vector. If a motion vector can be determined for a particular block, that block may be encoded simply by supplying the motion vector, rather than by encoding the entire block.
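Motion estimation as described above may be sketched as an exhaustive block search. The sketch assumes frames stored as lists of rows, the sum of absolute differences as the measure being minimized, and a small square search window; these choices and all function names are assumptions for illustration, not a description of any particular encoder.

```python
def block_difference(current, previous, bx, by, dx, dy, size):
    """Sum of absolute pixel differences for one candidate offset."""
    total = 0
    for y in range(size):
        for x in range(size):
            total += abs(current[by + y][bx + x]
                         - previous[by + dy + y][bx + dx + x])
    return total


def estimate_motion(current, previous, bx, by, size, search):
    """Return ((dx, dy), cost) for the best matching block in `previous`."""
    height, width = len(previous), len(previous[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            # Skip candidate blocks that fall outside the previous frame.
            if by + dy < 0 or by + dy + size > height:
                continue
            if bx + dx < 0 or bx + dx + size > width:
                continue
            cost = block_difference(current, previous, bx, by, dx, dy, size)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return (best[1], best[2]), best[0]
```

The returned pair (dx, dy) plays the role of the motion vector: it gives the horizontal and vertical shift from the block's position in the present frame to its best match in the previous frame.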
With each of the above transformation schemes, reduced precision may be used to further compress data, as shown by block 204. As was mentioned above, one of the chrominance values, Cr, could be ignored without significantly affecting the quality of the image. In addition, after performing a DCT transform, higher frequency components can be ignored. Furthermore, by calculating differences between pixel values, and ignoring minor differences, further compression may be achieved. This illustrates the repetition between the transformation block 202 and quantization block 204 of FIG. 2.
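Reduced precision may be illustrated with a uniform quantizer, which divides each value by a step size and keeps only the rounded result. The uniform step and the function names are assumptions for this sketch; the text above does not prescribe a particular quantizer.

```python
def quantize(values, step):
    """Map each value to the nearest integer multiple of the step size."""
    return [round(value / step) for value in values]


def dequantize(levels, step):
    """Recover approximate values; the discarded precision is not restored."""
    return [level * step for level in levels]


levels = quantize([40.0, 3.1, -0.4, 12.7], 4)
approximate = dequantize(levels, 4)
```

Small values collapse to zero and the remaining values fit in fewer bits, at the cost of an error of up to half a step in each reconstructed value.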
The third block shown in FIG. 2 is the Code block 206. This block encodes a data set to minimize the number of bits required per data item. The coding process assigns a unique code value to data items in a set. One coding scheme that is used in compressing video frames is Huffman coding. Huffman codes assign a variable-length code to each possible data item, such that the values that occur most often in the data set have smaller length codes while the values that occur less frequently have longer-length codes. Huffman coding creates a tree structure where the leaf nodes are the original probabilities associated with each data value from the data set. Each branch in the tree is labeled with a one or a zero. The Huffman code assigned to each original data value is the set of labels along a path from the root node to the associated leaf node.
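The tree construction just described may be sketched as follows. This sketch uses occurrence counts in place of probabilities and labels left branches with zero and right branches with one; those choices, the tuple layout of the tree, and the function name are assumptions for illustration.

```python
import heapq
from collections import Counter


def huffman_codes(data):
    """Build a Huffman code table {symbol: bit string} for the given data."""
    counts = Counter(data)
    if not counts:
        return {}
    if len(counts) == 1:
        # A single distinct symbol still needs a one-bit code.
        return {next(iter(counts)): "0"}

    # Heap entries are (weight, tie_breaker, tree). The unique tie breaker
    # keeps the heap from ever comparing two trees directly.
    heap = [(weight, order, ("leaf", symbol))
            for order, (symbol, weight) in enumerate(counts.items())]
    heapq.heapify(heap)
    next_order = len(heap)

    # Repeatedly merge the two least frequent subtrees.
    while len(heap) > 1:
        weight_a, _, tree_a = heapq.heappop(heap)
        weight_b, _, tree_b = heapq.heappop(heap)
        merged = ("node", tree_a, tree_b)
        heapq.heappush(heap, (weight_a + weight_b, next_order, merged))
        next_order += 1

    # Read each code as the branch labels from the root to the leaf.
    codes = {}

    def walk(tree, prefix):
        if tree[0] == "leaf":
            codes[tree[1]] = prefix
        else:
            walk(tree[1], prefix + "0")
            walk(tree[2], prefix + "1")

    walk(heap[0][2], "")
    return codes


codes = huffman_codes("aaaaaaaabbbc")
```

In this example the most frequent symbol receives a one-bit code and the two rarer symbols receive two-bit codes.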
The above provides a general overview of a number of different compression schemes for compressing video frames prior to transmitting the frames over a network to a remote computer. It should be appreciated that specific implementation of any of these schemes, or more accurately, a combination of particular ones of these schemes, requires significant preprocessing (encoding) of the video frames prior to transmission, as well as post processing (decoding) of the frames.
As the complexity that is associated with compression and decompression increases, the efficiency with which video frames may be encoded and decoded drops. Stated another way, higher compression ratios require more processing, and take longer to encode/decode than do lower compression ratios. However, higher compression ratios allow more data to be delivered over a network in less time. Therefore, a tradeoff is generally made between obtaining a particular compression ratio, and obtaining a satisfactory bit rate of transfer. If a high compression ratio takes too long to decode, viewed images will appear choppy or disjunct. If an inadequate bit rate is obtained, a viewer will be kept waiting for the image, or the image will replay in slow motion.
What is needed is an apparatus and method that improves the efficiency of encoding/decoding video frames while maintaining a desired bit rate for a given resolution. More specifically, what is needed is an apparatus and method that incorporates several forms of motion estimation, and selects the best form for each block of data to be encoded.
Accordingly, it is a feature of the present invention to provide a method to encode a video frame that is transmitted over a communications medium. The method includes: 1) obtaining a video frame; 2) separating the frame into blocks; 3) encoding a plurality of blocks using inter compression; 4) encoding the plurality of blocks using predictive intra compression; and 5) selecting the better block compression between the inter and predictive intra compression; wherein the steps of encoding the plurality of blocks are performed on a block by block basis, to provide optimum compression of the video frame for a given bit rate.
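The block by block selection recited above may be sketched, under heavy simplification, as follows. This passage does not define the inter or predictive intra encoders, so the sketch substitutes stand-ins: the inter candidate is the difference from the co-located block of the previous frame, the predictive intra candidate is the difference from the left neighbor within the block, and the sum of absolute residuals serves as a proxy for coded size. All names and these stand-ins are assumptions and should not be read as the claimed encoders.

```python
def split_into_blocks(frame, size):
    """Separate a frame (list of rows) into size-by-size blocks, row-major."""
    blocks = []
    for by in range(0, len(frame), size):
        for bx in range(0, len(frame[0]), size):
            blocks.append([row[bx:bx + size] for row in frame[by:by + size]])
    return blocks


def inter_residual(block, reference_block):
    """Stand-in inter coding: difference from the previous frame's block."""
    return [cur - ref
            for cur_row, ref_row in zip(block, reference_block)
            for cur, ref in zip(cur_row, ref_row)]


def intra_residual(block):
    """Stand-in predictive intra coding: difference from the left neighbor."""
    residual = []
    for row in block:
        predicted = 0
        for value in row:
            residual.append(value - predicted)
            predicted = value
    return residual


def residual_cost(residual):
    """Proxy for coded size: the sum of absolute residual values."""
    return sum(abs(value) for value in residual)


def choose_modes(frame, previous_frame, size):
    """Pick the cheaper of the two candidate encodings for every block."""
    modes = []
    current_blocks = split_into_blocks(frame, size)
    reference_blocks = split_into_blocks(previous_frame, size)
    for block, reference in zip(current_blocks, reference_blocks):
        inter_cost = residual_cost(inter_residual(block, reference))
        intra_cost = residual_cost(intra_residual(block))
        modes.append("inter" if inter_cost <= intra_cost else "intra")
    return modes
```

With these stand-ins, a block that is unchanged from the previous frame selects the inter candidate, while a block whose content has changed completely but is internally smooth selects the intra candidate.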