The present invention relates to the compression of digital data, and more particularly to a method and apparatus for interpolating pixels in a motion compensated digital video system.
Television signals are conventionally transmitted in analog form according to various standards adopted by particular countries. For example, the United States has adopted the standards of the National Television System Committee ("NTSC"). Most European countries have adopted either PAL (Phase Alternating Line) or SECAM standards.
Digital transmission of television signals can deliver video and audio services of much higher quality than analog techniques. Digital transmission schemes are particularly advantageous for signals that are broadcast via a cable television network or by satellite to cable television affiliates and/or directly to home satellite television receivers. It is expected that digital television transmitter and receiver systems will replace existing analog systems just as digital compact discs have largely replaced analog phonograph records in the audio industry.
A substantial amount of digital data must be transmitted in any digital television system. This is particularly true where high definition television ("HDTV") is provided. In a digital television system, a subscriber receives the digital data stream via a receiver/descrambler that provides video, audio, and data to the subscriber. In order to most efficiently use the available radio frequency spectrum, it is advantageous to compress the digital television signals to minimize the amount of data that must be transmitted.
The video portion of a television signal comprises a sequence of video "frames" that together provide a moving picture. In digital television systems, each line of a video frame is defined by a sequence of digital data values referred to as "pixels". A large amount of data is required to define each video frame of a television signal. For example, approximately 7.4 megabits of data are required to provide one video frame at NTSC resolution. This assumes a 640 pixel by 480 line display is used with 8 bits of intensity value for each of the primary colors red, green, and blue (640 × 480 × 24 = 7,372,800 bits). High definition television requires substantially more data to provide each video frame. In order to manage this amount of data, particularly for HDTV applications, the data must be compressed.
Video compression techniques enable the efficient transmission of digital video signals over conventional communication channels. Such techniques use compression algorithms that take advantage of the correlation among adjacent pixels in order to derive a more efficient representation of the important information in a video signal. The most powerful compression systems not only take advantage of spatial correlation, but can also utilize similarities among adjacent frames to further compact the data. In such systems, differential encoding is usually used to transmit only the difference between an actual frame and a prediction of the actual frame. The prediction is based on information derived from a previous frame of the same video sequence. Substantial efficiency can be achieved due to the significant amount of frame-to-frame redundancy that is typical in television program signals.
An example of a video compression system using motion compensation is described in Ninomiya and Ohtsuka, "A Motion-Compensated Interframe Coding System for Television Pictures," IEEE Transactions on Communications, Vol. COM-30, No. 1, January 1982. The motion estimation algorithm described therein is of the block-matching type. In this case, a motion vector is determined for each block in the current frame of an image by identifying a block in the previous frame which most closely resembles the particular block. The entire current frame can then be reconstructed at a decoder by sending the difference between the corresponding block pairs, together with the motion vectors that are required to identify the corresponding pairs. Often, the amount of transmitted data is further reduced by compressing both the displaced block differences and the motion vector signals. Block matching motion estimation algorithms are particularly effective when combined with block-based spatial compression techniques such as the discrete cosine transform (DCT).
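By way of illustration only, the exchange between encoder and decoder described above may be sketched as follows. The sketch is not taken from the cited reference; the function names (get_block, encode_block, decode_block), the (row, column) ordering of the motion vector, and the sample frames are assumptions made for the example, and compression of the differences and motion vectors is omitted.

```python
def get_block(frame, top, left, size):
    """Return the size x size block whose upper left-hand corner is (top, left)."""
    return [row[left:left + size] for row in frame[top:top + size]]


def encode_block(current, previous, top, left, size, motion_vector):
    """Encoder side: return the displaced block difference for one block."""
    dy, dx = motion_vector  # assumed order: (rows, columns)
    target = get_block(current, top, left, size)
    reference = get_block(previous, top + dy, left + dx, size)
    return [[t - r for t, r in zip(t_row, r_row)]
            for t_row, r_row in zip(target, reference)]


def decode_block(previous, top, left, size, motion_vector, difference):
    """Decoder side: rebuild the block from the previous frame, vector, and difference."""
    dy, dx = motion_vector
    reference = get_block(previous, top + dy, left + dx, size)
    return [[r + d for r, d in zip(r_row, d_row)]
            for r_row, d_row in zip(reference, difference)]


# Hypothetical frames: the current frame is the previous frame moved one pixel right.
previous = [[10, 20, 30, 40],
            [50, 60, 70, 80],
            [90, 100, 110, 120],
            [130, 140, 150, 160]]
current = [[10, 10, 20, 30],
           [50, 50, 60, 70],
           [90, 90, 100, 110],
           [130, 130, 140, 150]]

vector = (0, -1)  # the matching block lies one pixel to the left in the previous frame
difference = encode_block(current, previous, 1, 1, 2, vector)
rebuilt = decode_block(previous, 1, 1, 2, vector, difference)
```

In this example the match is exact, so the difference block is all zeros and only the motion vector carries information.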
Other examples of motion compensation systems can be found in U.S. Pat. No. 4,802,006 to Iinuma, et al., entitled "Signal Processing Unit for Producing a Selected One of Signals Predictive of Original Signals," U.S. Pat. No. 4,816,906 to Kummerfeldt, et al., entitled "Method for Motion-Compensated Frame-to-Frame Prediction Coding," U.S. Pat. No. 4,827,340 to Pirsch, entitled "Video-Signal DPCM Coder with Adaptive Prediction," U.S. Pat. No. 4,897,720 to Wu, et al., entitled "Circuit Implementation of Block Matching Algorithm," and European patent publication no. 0 237 989 to Takenaka, et al., entitled "Differential Coding Apparatus Having an Optimum Predicted Value Determining Circuit."
In such prior art systems, a search area in a previous frame is typically searched by placing a block of pixels from the current frame at the upper left-hand corner of the search area and calculating the error (e.g., mean square or mean absolute) with respect to the overlapped pixels in the search area. The block from the current frame is then moved pixel by pixel to the right-hand boundary of the search area. At each step the error with respect to the overlapped pixels of the search area is calculated. The block of the current frame is then moved down one row of pixels in the search area and again moved pixel by pixel from the left-hand boundary of the search area to the right-hand boundary. This process continues until an error function is calculated for all possible block positions in the search area.
When the prediction of a current frame block from a previous frame block is good, i.e., the prediction frame bears a close resemblance to the frame to be transmitted, only a small amount of residual error remains for transmission. This leads to a high compression efficiency. If a bad prediction is made, then the residual error may be so large that the compression efficiency is adversely affected. Thus, an accurate prediction of the frame-to-frame movement in a video sequence is essential in achieving a high compression ratio.
For a typical video sequence, the scene may contain many objects that move independently at various speeds and directions. In order to ease hardware implementation and limit the amount of information needed to represent each movement, a frame of video is often segmented into rectangular blocks as noted above. One then assumes that only the blocks are moving with independent speeds and directions. In order to reduce system complexity and increase speed, the area which is searched for the best match between a current frame block and a previous frame block may be limited to the neighborhood around the target block. This limitation in the search area is usually acceptable because the movement of an object in most typical video sequences is seldom fast enough to create a large displacement from one frame to the next. With a limited search area, it is possible to efficiently perform an exhaustive search to find the best match. Once the best match is found, the prediction frame is constructed by assembling all the best matching blocks together. To implement this in hardware, the previous frame is stored in a random access memory and the prediction frame is generated block by block from the memory by reading one pixel at a time using the proper displacement vector for that block.
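The block by block generation of the prediction frame described above may be sketched as follows. This is an illustrative software model of the memory read-out and not a hardware design; the function name build_prediction_frame, the dictionary of displacement vectors keyed by block position, and the sample data are assumptions made for the example. The sketch further assumes that the frame dimensions are multiples of the block size and that no displacement vector points outside the previous frame.

```python
def build_prediction_frame(previous, vectors, block_size):
    """Assemble a prediction frame by reading one displaced pixel at a time."""
    rows = len(previous)
    cols = len(previous[0])
    prediction = [[0] * cols for _ in range(rows)]
    for block_top in range(0, rows, block_size):
        for block_left in range(0, cols, block_size):
            # One displacement vector (rows, columns) per block.
            dy, dx = vectors[(block_top, block_left)]
            for r in range(block_top, block_top + block_size):
                for c in range(block_left, block_left + block_size):
                    # One read from the previous frame memory per output pixel.
                    prediction[r][c] = previous[r + dy][c + dx]
    return prediction


# Hypothetical previous frame and per-block displacement vectors.
previous = [[10, 20, 30, 40],
            [50, 60, 70, 80],
            [90, 100, 110, 120],
            [130, 140, 150, 160]]
vectors = {
    (0, 0): (0, 0),    # no movement
    (0, 2): (0, -1),   # best match lies one pixel to the left
    (2, 0): (0, 0),    # no movement
    (2, 2): (-1, 0),   # best match lies one pixel up
}

prediction = build_prediction_frame(previous, vectors, 2)
```

With integer displacements, exactly one memory read is needed for each pixel of the prediction frame.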
This method produces a good prediction frame when the objects in a video sequence are displaced both vertically and horizontally by an integer number of pixels. However, for a typical video sequence, the object movements are not usually an integral number of pixels in distance. For those cases where the displacement falls between two pixels, a better prediction frame can be generated by using values that are interpolated from adjacent pixels. If one considers only the midpoints between pixels, there are three possible modes of interpolation, i.e., horizontal, vertical and diagonal. Horizontal interpolation consists of taking the average of two horizontally adjacent pixels. Vertical interpolation is generated by computing the average between two vertically adjacent pixels. Diagonal interpolation requires the averaging of four neighboring pixels.
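The three midpoint interpolation modes may be sketched as follows, for illustration only. The function name interpolate and the mode strings are assumptions made for the example. The averages are left as floating point values because the text above does not specify how results are rounded.

```python
def interpolate(frame, row, col, mode):
    """Interpolate a midpoint value whose upper left neighbor is frame[row][col]."""
    a = frame[row][col]
    if mode == "horizontal":
        # Average of two horizontally adjacent pixels.
        return (a + frame[row][col + 1]) / 2
    if mode == "vertical":
        # Average of two vertically adjacent pixels.
        return (a + frame[row + 1][col]) / 2
    if mode == "diagonal":
        # Average of the four neighboring pixels.
        return (a + frame[row][col + 1]
                + frame[row + 1][col] + frame[row + 1][col + 1]) / 4
    raise ValueError("unknown interpolation mode: " + mode)


# Hypothetical 2 x 2 neighborhood.
frame = [[10, 20],
         [30, 40]]

horizontal = interpolate(frame, 0, 0, "horizontal")  # (10 + 20) / 2
vertical = interpolate(frame, 0, 0, "vertical")      # (10 + 30) / 2
diagonal = interpolate(frame, 0, 0, "diagonal")      # (10 + 20 + 30 + 40) / 4
```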
One method of implementing interpolation in a prediction scheme is to store the previous frame in a random access memory and to perform the interpolation on a pixel by pixel basis. For diagonal interpolation, it is necessary to access four pixels from memory in order to compute one interpolated value. For horizontal and vertical interpolation, two pixels are needed to generate one interpolated pixel. Thus, for any of the three interpolation modes, the number of memory access cycles needed to obtain the pixels necessary for interpolation exceeds the number of interpolated pixels that are generated. This fact reduces the throughput of the system to a point where it may not be practical.
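The memory access penalty of pixel by pixel interpolation can be made concrete with a small model, offered as a sketch only. The class CountingMemory, the function interpolate_block, and the 4 by 4 block size are assumptions made for the example; the model merely counts one access cycle for every pixel read from the previous frame memory.

```python
class CountingMemory:
    """Model of the previous frame memory that counts every pixel read."""

    def __init__(self, frame):
        self.frame = frame
        self.reads = 0

    def read(self, row, col):
        self.reads += 1
        return self.frame[row][col]


# Neighbor offsets (row, column) read for each interpolation mode.
OFFSETS = {
    "horizontal": [(0, 0), (0, 1)],
    "vertical": [(0, 0), (1, 0)],
    "diagonal": [(0, 0), (0, 1), (1, 0), (1, 1)],
}


def interpolate_block(memory, top, left, size, mode):
    """Interpolate a size x size block, reading every neighbor from memory."""
    offsets = OFFSETS[mode]
    block = []
    for r in range(top, top + size):
        row = []
        for c in range(left, left + size):
            total = sum(memory.read(r + dr, c + dc) for dr, dc in offsets)
            row.append(total / len(offsets))
        block.append(row)
    return block


# Hypothetical 5 x 5 previous frame; a 4 x 4 block yields 16 interpolated pixels.
frame = [[r * 5 + c for c in range(5)] for r in range(5)]

diagonal_memory = CountingMemory(frame)
interpolate_block(diagonal_memory, 0, 0, 4, "diagonal")

horizontal_memory = CountingMemory(frame)
interpolate_block(horizontal_memory, 0, 0, 4, "horizontal")
```

For the 16 interpolated pixels of this example, the model counts 64 reads in the diagonal mode and 32 reads in the horizontal mode, i.e., four and two access cycles per generated pixel, respectively.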
Another method for implementing interpolation is to precompute the interpolated values for the entire search window based on each of the three interpolation modes. The precomputed values are stored in three separate memories. When the block displacement vector uses any one of the three interpolative modes, the prediction block is generated by accessing the appropriate memory. The requirement for three additional memories in such a scheme is obviously disadvantageous in view of the hardware complexity and cost.
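The precomputation approach may be sketched as follows, for illustration only. The function name precompute_planes and the sample search window are assumptions made for the example; each returned plane stands in for one of the three additional memories.

```python
def precompute_planes(window):
    """Precompute one plane of interpolated values per interpolation mode."""
    rows = len(window)
    cols = len(window[0])
    horizontal = [[(window[r][c] + window[r][c + 1]) / 2
                   for c in range(cols - 1)]
                  for r in range(rows)]
    vertical = [[(window[r][c] + window[r + 1][c]) / 2
                 for c in range(cols)]
                for r in range(rows - 1)]
    diagonal = [[(window[r][c] + window[r][c + 1]
                  + window[r + 1][c] + window[r + 1][c + 1]) / 4
                 for c in range(cols - 1)]
                for r in range(rows - 1)]
    # Three separate stores, one per mode, in addition to the window itself.
    return {"horizontal": horizontal, "vertical": vertical, "diagonal": diagonal}


# Hypothetical 3 x 3 search window.
window = [[0, 2, 4],
          [6, 8, 10],
          [12, 14, 16]]

planes = precompute_planes(window)
```

Once the planes exist, a prediction block in any mode can be read out with one access per pixel, but at the cost of storing three extra planes that are each nearly as large as the search window.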
It would be advantageous to provide a new method for generating interpolated pixels that is efficient, relatively easy to implement, and cost effective. Such a scheme should not require the use of additional memories. Also, the number of memory access cycles should be limited to no more than the number of interpolated pixels generated. The present invention provides a method and apparatus enjoying these and other advantages.