This invention relates to the processing of video, and in particular to a technique for compression of video data in which a foreground image is separated from a background image for estimation of motion vectors.
Compressed video technology is growing in importance and utility. Analog or digital "NTSC-like" video transmissions require bit rates on the order of 100 megabits per second. Compression technology today can reduce the required bit rate to less than 5 megabits per second. This is typically achieved using digital signal processing or VLSI integrated circuits.
Depending upon the ultimate bit rate and quality of the desired image, different types and levels of compression can be employed. Generally, the compression removes different types of redundancy from the video image being compressed. In doing so, the image is broken into groups of pixels, typically blocks on the order of 16 pixels by 16 pixels. By comparing different blocks and transmitting only information relating to the differences between the blocks, significant reductions in bit rate are achieved.
In addition, because some information within a block is imperceptible to the viewer, vector quantization or discrete cosine transforms can be used to remove bits corresponding to imperceptible or unimportant details. This further reduces the required bit rate, but may introduce certain degradation in the resulting image quality. A third technique for reducing the bit rate, and the primary focus here, relies on the fact that stationary images, or moving objects, do not necessarily require retransmission of every detail. Motion compression techniques can be used to eliminate redundancies between frames. This is typically achieved by identifying a block of pixels which is considered to have "moved" between two frames. Transmission of only the motion vector information, in place of all of the pixel data, then effectively transmits the new location of the block for reconstruction at the decompressor.
In many video scenes, an object moves against an essentially unchanging background. In such circumstances, most of the background data can remain the same for frame after frame of the video data, with the foreground object being shifted and revised as needed. One such example is videoconferencing in which the overall room or setting for the videoconference remains essentially unchanged. In the foreground, however, individuals may be speaking or gesturing. For such applications it is desirable to perform motion estimation. In motion estimation a vector is determined which relates the content of one video frame to the content of another video frame. For example, the vector might indicate the direction of motion of a portion of the contents of the earlier video frame. Use of such motion estimation enables video recording to use fewer bits. This is because the background portion of the scene can be characterized as having the same or almost the same data as the preceding frame, while the object in the foreground can be characterized as being essentially the same as an earlier frame, but moved to a new location.
FIG. 1 illustrates the motion estimation process. In FIG. 1b a current frame (picture) is shown, while FIG. 1a shows a reference frame. It is desired to characterize the content of the current picture as being the same as the content of the reference picture, but with a changing portion of the current picture, designated the "block" in the reference picture, together with a motion vector (u, v). The location of the block is usually given by the coordinates of its upper left corner, together with some information about its size.
One computationally intensive approach for determining the motion vector is to search the entire frame for the best fit. Using such a procedure, every possible location for the block is evaluated, and the resulting motion vector computed. The motion vector chosen is the one that results in the best match between the estimated image and the current image. Such an approach, however, is inordinately expensive computationally, and is essentially impractical for ordinary use.
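The exhaustive approach can be sketched as follows. This is an illustrative sketch only, not an implementation taken from this disclosure. It assumes frames are lists of rows of integer pixel values, uses the sum of absolute differences (SAD) as the match criterion, and limits the search to a square window rather than the entire frame. The names `block_sad`, `full_search`, and `search_range` are hypothetical.

```python
def block_sad(cur, ref, bx, by, u, v, n):
    """SAD between the n x n block of `cur` at (bx, by) and the block of
    `ref` displaced from that position by the candidate vector (u, v)."""
    return sum(
        abs(cur[by + j][bx + i] - ref[by + v + j][bx + u + i])
        for j in range(n)
        for i in range(n)
    )


def full_search(cur, ref, bx, by, n, search_range):
    """Try every in-frame displacement within +/- search_range pixels and
    return (u, v, sad) for the lowest-SAD candidate."""
    height, width = len(ref), len(ref[0])
    best = None
    for v in range(-search_range, search_range + 1):
        for u in range(-search_range, search_range + 1):
            x, y = bx + u, by + v
            # Skip candidates whose block would fall outside the reference frame.
            if x < 0 or y < 0 or x + n > width or y + n > height:
                continue
            cost = block_sad(cur, ref, bx, by, u, v, n)
            if best is None or cost < best[2]:
                best = (u, v, cost)
    return best
```

Even with the window limited to +/- 7 pixels, this sketch evaluates up to 225 candidate positions per block, each costing n x n pixel comparisons, which illustrates why the exhaustive search is expensive.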
There are, however, various fast searching methods for motion estimation. These methods significantly reduce the computational cost of searching, but impose limitations. The essence of these approaches is to reduce the number of block search operations. These approaches can be divided into two groups: global search and step by step search. Each of these techniques is individually well known.
In global search approaches for determining the motion vector for a reference block, the system tries to find the best matching block in a frame of video information by moving around the frame at many widespread points and comparing blocks at those locations with blocks in the reference frame. The system tries to match a minimal area first, then refines the search in that area. An example is a three-step search. The system first searches to find a minimal point (point of least difference), then searches blocks that are two pixels away from the minimal point. Finally, the system searches the blocks that are next to the new minimal point. The particular values, of course, can be adjusted for different applications. The average number of operations in this type of global search is on the order of 40. In this method, every possible motion vector in the search area is checked and compared. The motion vector with the lowest sum of absolute differences (SAD) between the two compared image blocks is selected and coded. The result is that a high compression ratio is achieved.
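A three-step search of the kind described above might be sketched as follows. The step sizes of 4, 2, and 1 pixels, the use of eight neighbors around the current best point at each stage, and the names `block_sad` and `three_step_search` are assumptions made for illustration; the text above states only that the second stage looks two pixels away and the third stage looks at adjacent blocks.

```python
def block_sad(cur, ref, bx, by, u, v, n):
    """SAD between the n x n block of `cur` at (bx, by) and the block of
    `ref` displaced from that position by the candidate vector (u, v)."""
    return sum(
        abs(cur[by + j][bx + i] - ref[by + v + j][bx + u + i])
        for j in range(n)
        for i in range(n)
    )


def three_step_search(cur, ref, bx, by, n, steps=(4, 2, 1)):
    """Coarse-to-fine search: at each step size, test the eight points
    around the current best vector and keep the lowest-SAD candidate."""
    height, width = len(ref), len(ref[0])
    best_u, best_v = 0, 0
    best_cost = block_sad(cur, ref, bx, by, 0, 0, n)
    for step in steps:
        center_u, center_v = best_u, best_v
        for dv in (-step, 0, step):
            for du in (-step, 0, step):
                if du == 0 and dv == 0:
                    continue  # the center was already evaluated
                u, v = center_u + du, center_v + dv
                x, y = bx + u, by + v
                # Skip candidates that fall outside the reference frame.
                if x < 0 or y < 0 or x + n > width or y + n > height:
                    continue
                cost = block_sad(cur, ref, bx, by, u, v, n)
                if cost < best_cost:
                    best_u, best_v, best_cost = u, v, cost
    return best_u, best_v, best_cost
```

Because the number of stages is fixed, the number of block comparisons in this sketch does not depend on where the match lies, which is consistent with the predictable search time noted for global approaches. A coarse-to-fine search of this kind can settle on a local minimum rather than the true best match.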
The advantage of such an approach is its ability to quickly approach the minimal area. For fast moving video images, this is important because the matching block may be a relatively long distance, for example, 10 pixels, away from the previous point. The global approach also makes searching time more predictable, because the global search always performs the same number of operations, even if the match is found on the first try.
A second fast search technique is the step by step search. In many types of video, for example, a videoconference environment, the background does not move, and the speaker does not move dramatically. If the encoder has enough computational resources and encodes at a sufficient rate, for example, more than 10 frames per second, the matching block likely will be found two or three pixels away. Step by step searches from the center thus may provide better results than a global search. One typical example of a step by step search is the diamond search. It begins searching from the center of the window, compares four neighbors (up, down, left, and right), and then selects the best match as the new center. The searching continues until the center does not change further.
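The diamond search just described might be sketched as follows. The SAD match criterion, the frame representation (lists of rows of integers), the iteration cap `max_iters`, and the function names are assumptions for illustration only.

```python
def block_sad(cur, ref, bx, by, u, v, n):
    """SAD between the n x n block of `cur` at (bx, by) and the block of
    `ref` displaced from that position by the candidate vector (u, v)."""
    return sum(
        abs(cur[by + j][bx + i] - ref[by + v + j][bx + u + i])
        for j in range(n)
        for i in range(n)
    )


def diamond_search(cur, ref, bx, by, n, max_iters=64):
    """Start at zero displacement, compare the up, down, left, and right
    neighbors, move to the best one, and stop when the center stays put."""
    height, width = len(ref), len(ref[0])
    u, v = 0, 0
    best_cost = block_sad(cur, ref, bx, by, 0, 0, n)
    for _ in range(max_iters):  # hypothetical safety cap on the walk length
        next_u, next_v = u, v
        for du, dv in ((0, -1), (0, 1), (-1, 0), (1, 0)):
            cu, cv = u + du, v + dv
            x, y = bx + cu, by + cv
            # Skip neighbors that fall outside the reference frame.
            if x < 0 or y < 0 or x + n > width or y + n > height:
                continue
            cost = block_sad(cur, ref, bx, by, cu, cv, n)
            if cost < best_cost:
                best_cost, next_u, next_v = cost, cu, cv
        if (next_u, next_v) == (u, v):
            break  # no neighbor improved on the center
        u, v = next_u, next_v
    return u, v, best_cost
```

In this sketch a stationary block terminates after a single round of four comparisons, while a block that has moved k pixels needs at least k rounds, which reflects the trade-off between small and large displacements.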
In a videoconference environment, objects usually move very little from frame to frame. Typically, if the frame rate on the encoder is faster than 10 frames/second, most movement will be less than four pixels on a CIF image. The step by step search method yields better results under such conditions than many other fast searching methods. It is also the best method for processing a background image block because such a block will not move during videoconferencing. Unfortunately, there are significant limitations on this method. If the optimal match is far away from the center, the number of processing steps increases rapidly, raising the processing time dramatically.
Accordingly, what is needed is an improved technique for searching and comparing video frames to provide better data compression on a faster basis.
This invention speeds compression by reducing the number of times it is necessary for the system to perform comparisons and estimate motion vectors. The new system separates the foreground image from the background image in a low frame rate environment. This separation enables use of an encoder, discussed below, with different algorithms for processing the foreground and background image. By doing so, the system reduces the time needed for motion estimation. This improves compression speed.
In particular, in a preferred embodiment of the system having video compression, a method is provided for estimating a motion vector representative of the difference between a reference frame and a current frame. Each of the reference frame and the current frame includes background and foreground data. The operation is carried out by selecting a block of the current frame for analysis, then determining whether the block selected is a foreground block or a background block.
Following that determination, if the block is a foreground block, a comparison of the block with predetermined search points in the reference frame is performed. If the block is a background block, then a comparison of the block with predetermined search points in only a portion of the reference frame is performed. The resulting data is provided to the system for use in compression.
Typically the step of determining whether the block selected is a foreground block or a background block is performed by defining a boundary around a preselected number of search points in the reference frame, and then determining a comparison value for each of the search points inside the boundary and a comparison value for each of the search points near the boundary. If the comparison indicates a closer match for any of the search points near the boundary compared to those inside the boundary, then the block is considered to be a foreground block. Conversely, if the comparison indicates a closer match for any of the search points inside the boundary compared to those near the boundary, then the block is considered to be a background block. Once the block type is determined, the comparison of foreground blocks is performed using a global search, typically a three-step search. The comparison of background blocks is performed using a step by step search, typically a diamond search.
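One possible reading of the foreground/background determination is sketched below. Several details are assumptions not fixed by the text above: the boundary is taken as a square of half-width `radius` around the block position, the points inside the boundary are the center and its four immediate neighbors, the points near the boundary are the eight corner and edge-midpoint positions of the square, the comparison value is the SAD, and a tie is treated as background. All function and parameter names are hypothetical.

```python
def block_sad(cur, ref, bx, by, u, v, n):
    """SAD between the n x n block of `cur` at (bx, by) and the block of
    `ref` displaced from that position by the candidate vector (u, v)."""
    return sum(
        abs(cur[by + j][bx + i] - ref[by + v + j][bx + u + i])
        for j in range(n)
        for i in range(n)
    )


def classify_block(cur, ref, bx, by, n, radius=2):
    """Return "foreground" if some point on the boundary matches better
    than every point inside it, otherwise "background"."""
    height, width = len(ref), len(ref[0])

    def best_cost(offsets):
        # Lowest SAD over the offsets that stay inside the reference frame.
        costs = [
            block_sad(cur, ref, bx, by, u, v, n)
            for u, v in offsets
            if 0 <= bx + u and bx + u + n <= width
            and 0 <= by + v and by + v + n <= height
        ]
        return min(costs) if costs else None

    # Assumed search points inside the boundary: center plus four neighbors.
    inside = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]
    # Assumed search points on the boundary: corners and edge midpoints.
    boundary = [
        (du * radius, dv * radius)
        for dv in (-1, 0, 1)
        for du in (-1, 0, 1)
        if (du, dv) != (0, 0)
    ]

    inside_best = best_cost(inside)
    boundary_best = best_cost(boundary)
    if boundary_best is not None and boundary_best < inside_best:
        return "foreground"  # candidate for a global (three-step) search
    return "background"      # candidate for a step by step (diamond) search
```

Under these assumptions the determination costs at most 13 block comparisons, after which an encoder could route the block to whichever search routine suits its type.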