A full motion video sequence usually contains a significant amount of frame-to-frame redundancy. For video phone or teleconferencing applications, motion in the entire scene is usually low and successive frames are highly correlated. In such a case, interframe coding techniques can reduce the information redundancy in video sequences and achieve high data compression.
The block matching motion compensation algorithm is widely used in many video codecs to remove interframe redundancy. The basic idea in the block matching algorithm is to divide the current frame in the video sequence into blocks, and for each block to search for a best match position in a search window of the previous video frame.
The best match of the current block in the search window is determined by finding the minimum value of an error function,

$$E(m_i, m_j) = \sum_{i}\sum_{j} \left| a(i,j) - b(i+m_i,\, j+m_j) \right|^{q},$$

where the sums run over all pixels of the current block, the a(i,j)'s are pixel values of the current block of the current frame, the b(i,j)'s are the pixel values of the search window of the previous frame, i is a vertical coordinate index, j is a horizontal coordinate index, and (m_i, m_j) represents a candidate displacement vector. The power factor q usually equals 1 or 2.
The displacement vector (i.e., the value of (m_i, m_j)) which minimizes the error function E is called the motion vector.
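The error function described above can be illustrated with a minimal sketch. The function name `block_error` and the list-of-lists pixel layout are assumptions made for illustration only; the displacement is measured from the upper lefthand corner of the search window, so both components are non-negative.

```python
def block_error(cur, win, mi, mj, q=1):
    """Error between the current block and the search window at displacement (mi, mj).

    cur: N x N list of pixel values a(i, j) of the current block
    win: list of rows of pixel values b(i, j) of the search window
    q:   power factor, 1 (absolute difference) or 2 (squared difference)
    """
    n = len(cur)
    return sum(
        abs(cur[i][j] - win[i + mi][j + mj]) ** q
        for i in range(n)
        for j in range(n)
    )
```

The displacement for which this value is smallest corresponds to the motion vector.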
The current block of the current video frame may be coded as follows. The difference between the current block and the best match block in the search window of the previous frame is obtained. This difference is then compressed using the Discrete Cosine Transform, quantization of the transform coefficients, and variable length coding. Given the motion vector information, the receiver can then reconstruct the current block using the available previous frame and the compressed block difference. The better the motion compensated prediction, the more efficiently the block difference can be compressed. Thus, the use of motion compensation substantially reduces the number of bits used to code a frame.
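The prediction and reconstruction steps may be sketched as follows. This is a simplified illustration only: the transform, quantization, and variable length coding stages are omitted, so the block difference is passed through losslessly, and all function names are hypothetical.

```python
def best_match_block(prev, top, left, n):
    """Extract the N x N best match block from the previous frame."""
    return [row[left:left + n] for row in prev[top:top + n]]


def block_difference(cur_block, pred):
    """Difference between the current block and its prediction (encoder side)."""
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(cur_block, pred)]


def reconstruct(pred, diff):
    """Rebuild the current block from the prediction and the difference (receiver side)."""
    return [[b + d for b, d in zip(rb, rd)] for rb, rd in zip(pred, diff)]
```

When the prediction is good, the entries of the block difference are small, which is what makes the subsequent compression efficient.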
There are several ways in which the best match position of a current block of a current video frame can be found in a search window of a previous video frame. One method is the full search method in which the error function for every possible displacement of the current block in the search window is evaluated.
Currently, there are two types of architectures which can be used to implement a full search block matching algorithm. One architecture is a two-dimensional systolic array (see, e.g., IEEE ICASSP, pp. 1687-1690, 1989; IEEE Trans. on CAS, Vol. 36, No. 10, October, 1989, pp. 1309-1316; and IEEE Trans. on CAS, Vol. 36, No. 10, October 1989, pp. 1301-1308.) The two-dimensional systolic array architecture has a very high computation power and can be applied to video signals of high resolution or with a high sampling rate. The disadvantage of this architecture is that when implemented in a chip, the chip size is large and such chips are difficult to produce with reasonable yields given the current state of the art semiconductor processing technology.
The other type of architecture is the one-dimensional systolic array. (See, e.g., IEEE Trans. on CAS, Vol. 36, No. 10, pp. 1317-1325, Oct. 1989 and U.S. Pat. No. 4,897,720). This architecture has the advantage of a reasonable chip size.
The operation of the prior art one-dimensional systolic array is explained in greater detail below.
FIG. 1 shows a current video frame i and an immediately previous video frame i-1. Illustratively, the current video frame is coded by dividing the current frame i into N×N blocks. One such N×N block (i.e., the current block) of frame i is shown in FIG. 1. A search window for the previous video frame i-1 is also shown in FIG. 1. The search window comprises the pixels of the previous frame which correspond to the current block (shown as shaded in FIG. 1) and additional pixels in each direction. Thus, the search window has dimensions (p+p+1+N)^2, i.e., 2p+1+N pixels on each side. Illustratively, N=2p+2. Typically, N=16 and p=7.
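The dimensions given above can be checked with a short calculation. The helper names are hypothetical and are introduced only for this illustration.

```python
def search_window_side(n, p):
    """Side length of the search window in pixels: p + p + 1 + N."""
    return p + p + 1 + n


def positions_per_row(n, p):
    """Number of horizontal positions the N-wide block can occupy in the window."""
    return search_window_side(n, p) - n + 1


n, p = 16, 7
side = search_window_side(n, p)
per_row = positions_per_row(n, p)
total = per_row * per_row
```

For N=16 and p=7 the window is 31 pixels on a side, and the current block can occupy 2p+2=16 positions in each row, or 256 positions in all.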
In accordance with the full search technique, the search window of the previous frame is searched by placing the current block at the upper lefthand corner of the search window and calculating the error function with respect to the overlapped pixels of the search window. The current block is then moved pixel by pixel to the righthand boundary of the search window. At each step, the error function with respect to the overlapped pixels of the search window is calculated. The current block is then moved down one row of pixels in the search window, and the current block is then moved pixel by pixel from the left boundary of the search window to the righthand boundary. At each step the error function between the current block and the overlapped pixels of the search window is calculated. The current block is then moved down another row in the search window and again moved pixel by pixel from the left boundary to the righthand boundary. This process is continued until an error function is computed for all possible positions of the current block in the search window. (Hence the name, full search block matching algorithm.) The error functions for all positions (i.e., all possible values of (m_i, m_j)) are compared to find the best match block in the search window, i.e., the position of the current block in the search window with the minimum error.
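The full search procedure described above may be sketched in software as follows. This is an illustrative, unoptimized sketch; the function name `full_search` and the return format are assumptions, and positions are indexed from the upper lefthand corner of the search window.

```python
def full_search(cur, win, q=1):
    """Evaluate the error function at every position and return the best one.

    Returns (minimum_error, mi, mj), where (mi, mj) is the position of the
    current block in the search window with the minimum error.
    """
    n = len(cur)
    positions = len(win) - n + 1       # positions per row and per column
    best = None
    for mi in range(positions):        # move down one row of pixels at a time
        for mj in range(positions):    # move left to right, pixel by pixel
            err = sum(
                abs(cur[i][j] - win[mi + i][mj + j]) ** q
                for i in range(n)
                for j in range(n)
            )
            if best is None or err < best[0]:
                best = (err, mi, mj)
    return best
```

Every position requires N×N difference operations, which is why dedicated parallel hardware is attractive for this algorithm.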
A prior art circuit module in the form of a one-dimensional systolic array for performing the full search block matching algorithm is shown in FIG. 2. The module 10 of FIG. 2 is a variation of the circuit disclosed in U.S. Pat. No. 4,897,720. The module 10 of FIG. 2 is efficient because it can compute in parallel the error functions for one row of positions of the current block in the search window.
The module 10 of FIG. 2 comprises three pixel inputs. An input C receives the pixels of the current block. An input P receives the pixels from the lefthand side of the search window (see FIG. 1) and an input P' receives the pixels from the righthand side of the search window. The sequence in which the pixels arrive at the inputs is described below.
The module 10 comprises a plurality of processing elements labeled PE-0, PE-1, . . . , PE-15. The number of processing elements is equal to 2p+2. Each processing element calculates the error function of one position of the current block in the search window. The processing elements take advantage of the fact that the error function calculations of adjacent positions of the current block in the search window utilize a significant number of pixel values in common. Thus, the processing elements can calculate the error functions for one row of positions of the current block in the search window in parallel.
The module 10 of FIG. 2 operates cyclically. FIG. 3 is a table which indicates the pixel inputs which are present at the C,P, and P' inputs for each cycle to calculate in parallel the error functions for one row of positions of the current block in the search window.
The pixels a(i,j) from the current block and b(i,j) from the search window arrive at the inputs C, P and P' in the order shown in FIG. 3.
In FIG. 3, the upper lefthand pixel of the current block is designated a(0,0). The pixel b(0,0) designates the upper lefthand pixel of the search window. In FIG. 3, the pixels shown are the pixels needed to calculate the error functions for the topmost row of positions of the current block in the search window.
FIG. 3 shows that the pixel values a(i,j) arrive at the input C, one per cycle in cycles 0, 1, 2, . . . , in raster scan order. The pixels propagate serially down the chain of flipflops 12.
The module 10 comprises a set of multiplexers 14. Each of the multiplexers 14 has an output connected to one of the processing elements PE-0, PE-1, . . . , PE-15. Each multiplexer has two inputs which are the inputs P and P'. In each cycle, one of the two multiplexer inputs is connected to the associated processing element. The module 10 also includes a chain of flipflops 16. There is one flipflop 16 associated with each multiplexer 14. The state of the associated flipflop 16 during a cycle determines whether the corresponding multiplexer input P' or P is connected to the associated processing element. In FIG. 3, a line 100 is drawn within each group of sixteen cycles. At the start of each group of sixteen cycles the states of the flipflops 16 are reset via line 17. Then a signal propagates down the chain of flipflops 16 so that one additional flipflop has its state set in each succeeding cycle of the group of sixteen cycles. When the states of the flipflops 16 are established in this manner, the pixel b(i,j) present at the input P is broadcast to all processing elements to the left of the line 100 and the pixel b(i,j) present at the input P' is broadcast to all processing elements to the right of the line 100.
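The multiplexer selection pattern may be sketched as follows. FIG. 3 is not reproduced here, so the exact position of the line 100 is an assumption inferred from the text: in the c-th cycle of each group of sixteen cycles, processing elements PE-0 through PE-c are taken to receive the input P and the remaining processing elements the input P'. The function name is hypothetical.

```python
def mux_sources(cycle, n=16):
    """Return, for each processing element, which input its multiplexer selects.

    Assumption: the select signal advances by one processing element per cycle
    and is reset at the start of every group of n cycles.
    """
    c = cycle % n
    return ["P" if k <= c else "P'" for k in range(n)]
```

Under this assumption all sixteen processing elements receive the input P in the last cycle of each group, and only PE-0 does so in the first.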
FIG. 3 also indicates the calculation performed by each processing element in each cycle. In each cycle, the difference between a particular pixel value a(i,j) and a particular pixel value b(i,j), received from the P or P' input depending on the state of the associated multiplexer, is obtained. The absolute value or square of the difference is obtained (depending on whether the power factor q=1 or q=2 is used in the error function formula) and the result is accumulated with results obtained in previous cycles until a complete error function is obtained. As indicated above, each processing element computes the error function for one position of the current block in the search window and the error functions for a whole row of positions are calculated in parallel.
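The accumulation performed by the processing elements for one row of positions may be modelled cycle by cycle as sketched below. The schedule is an assumption inferred from the description rather than taken from FIG. 3: the pixel a(i,j) is taken to reach PE-k at cycle N·i+j+k, the input P to carry b(r,c) and the input P' to carry b(r-1,N+c) at cycle N·r+c, and the search window is taken to be 2N-1 pixels wide (i.e., N=2p+2). The function name and arguments are hypothetical.

```python
def systolic_row(cur, win, row, q=1):
    """Simulate N processing elements computing one row of error functions.

    cur: N x N current block; win: (2N-1) x (2N-1) search window;
    row: vertical position of the current block in the search window.
    Returns a list whose k-th entry is the error accumulated by PE-k.
    """
    n = len(cur)
    acc = [0] * n
    for t in range(n * n + n - 1):          # last PE finishes at cycle N*N + N - 2
        r, c = divmod(t, n)
        for k in range(n):
            s = t - k                       # raster index of the a-pixel now at PE-k
            if not 0 <= s < n * n:
                continue                    # PE-k is idle in this cycle
            i, j = divmod(s, n)
            if k <= c:
                b = win[row + r][c]         # broadcast from input P
            else:
                b = win[row + r - 1][n + c] # broadcast from input P'
            acc[k] += abs(cur[i][j] - b) ** q
    return acc
```

In this model PE-k accumulates the error function for the horizontal position k, so one call yields the error functions for a whole row of positions.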
At cycle 255, the processing element PE-0 completes the calculation of its error function. The uppermost one of the tristate buffers 19 is enabled by an enable signal so that the error function can be transmitted to a comparator 30. The remaining processing elements PE-1, PE-2, . . . , PE-15 complete the calculation of their error functions in the succeeding fifteen cycles. The enable signal propagates down the chain of flipflops 20, so that in each of the succeeding cycles, a tristate buffer 19 is enabled and the corresponding error function is transmitted to the comparator 30. After all the error functions for all the positions of the current block in the search window have been determined, the comparator 30 outputs an indication of the position which is the best match, i.e., the position with the minimum error function.
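The output timing and the comparison may be sketched as follows. The function names are hypothetical, and the comparator is modelled simply as a running minimum over the error functions it receives; the handling of ties (the earliest position is kept) is an assumption.

```python
def completion_cycle(k, n=16):
    """Cycle at which PE-k completes its error function for the first row of positions."""
    return n * n - 1 + k


def comparator(results):
    """Running minimum over (error, position) pairs in their order of arrival."""
    best = None
    for err, pos in results:
        if best is None or err < best[0]:
            best = (err, pos)
    return best
```

With N=16, PE-0 completes at cycle 255 and PE-15 at cycle 270, so the comparator receives one error function per cycle.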
With respect to the data flow of FIG. 3, it should be noted that the input of the pixels a(i,j) necessary to compute the error functions of the topmost row of positions of the search window is completed at cycle 255. Thus, starting at cycle 256 the sequence of pixels a(i,j) repeats, as these pixels are again needed in the computation of the second row of positions of the current block in the search window. Similarly, the pixels b(i,j) needed for the computation of the second row are inputted at the input P starting at cycle 256. The pixels b(i,j) needed for the computation of the error functions of the second row have their vertical coordinate index increased by one in comparison to the pixels used in the computation of the error functions of the first row of positions. The pixels b(i,j) inputted at the input P' for the computation of the error functions for the second row of positions start at cycle 272 (not shown in FIG. 3). It should be noted that the calculations to the left of the lowermost line 100 of FIG. 3 belong to the computation of the error functions for the second row of positions.
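The continuous input sequence across successive rows of positions may be sketched as follows. As FIG. 3 is not reproduced here, the schedule is an assumption inferred from the cycle numbers given above (the input P starting the second row at cycle 256 and the input P' at cycle 272); the function name is hypothetical, and the end of the last row of positions is not modelled.

```python
def input_schedule(t, n=16):
    """Return the pixel indices present at inputs (C, P, P') in cycle t.

    Each index is a (vertical, horizontal) pair; None means no pixel is needed.
    """
    row, tr = divmod(t, n * n)         # row of positions served by inputs C and P
    r, c = divmod(tr, n)
    a_idx = (r, c)                     # a(i,j) repeats every n*n cycles
    p_idx = (row + r, c)               # lefthand side of the search window
    pp_idx = None
    if t >= n and c < n - 1:           # input P' lags input P by n cycles
        row2, tr2 = divmod(t - n, n * n)
        pp_idx = (row2 + tr2 // n, n + c)  # righthand side of the search window
    return a_idx, p_idx, pp_idx
```

In this model the input P' is still delivering pixels for the first row of positions during cycles 256 through 270, while the inputs C and P have already begun the second row.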
The above-described operation of the module 10 of FIG. 2 is repeated for each row of positions of the current block in the search window. In this manner the one-dimensional systolic array which forms the module 10 of FIG. 2 carries out a full search block matching algorithm.
As indicated above, in the module 10 of FIG. 2 pixel values of the current block are transmitted serially from one processing element to the next in the one-dimensional systolic array, while in each cycle, search window pixels are broadcast from the P and P' inputs to a particular number of processing elements depending on the state of the flipflops 16. In an alternative one-dimensional systolic array, pixel values from the current block may be broadcast to the processing elements, while search window pixels are transmitted serially from one processing element to the next.
It is an object of the present invention to combine a plurality of one-dimensional systolic array modules to increase the computation power. (For these interconnections, the flipflops 120 and 200 are added at the bottom of chains 12 and 20. These extra flipflops are not needed in the conventional one-dimensional systolic array modules, but rather permit a plurality of such modules to be connected in tandem.)
In particular, it is an object of the present invention to combine a plurality of one-dimensional systolic array modules to permit the error functions of a plurality of rows of positions of the current block in the search window to be calculated in parallel. It is also an object of the invention to combine a plurality of one-dimensional systolic array modules to increase the size of the search area.