1. Field of the Invention
The invention relates in general to a motion estimation apparatus for video image processing. In particular, the invention relates to an apparatus for implementing a block matching scheme for the motion estimation algorithm for video image processing.
2. Technical Background
In the application of video image processing technologies in areas such as high definition television (HDTV), video telephones and video conferencing, video signal compression is one of the key factors in system performance and efficiency. A high compression ratio for a video signal translates directly into good performance and high signal processing efficiency of the system. In order to obtain a high compression ratio for video signals, so that digital video data can be processed in the system at lower bit rates, an efficient encoding system and efficient hardware must be used. Typically, an efficient encoding scheme implemented by an encoding system would combine several techniques including, for example, motion compensation, discrete cosine transform, visual characteristics quantization, Huffman coding, etc.
Motion compensation for video signal processing is a technique by which the video image signals are manipulated in the time domain, based on the statistical characteristics of video signals. In principle, if the image blocks of consecutive video image frames taken at very short time intervals are analyzed, it is frequently found that each analyzed image block differs only slightly in its video characteristics from one frame to the next. This characteristic of the video image, which is the primary difference from the characteristics of still images, defines the underlying principle for many video image compression schemes. The motion compensation technique used thus has a significant influence on the compression ratio of video image compression and encoding systems.
Motion estimation is the basis for motion compensation techniques. Successful implementation of a motion compensation technique relies on the precision, speed and efficiency of the algorithm that implements the technique. Among the various processes developed for implementing motion estimation, block matching is relatively simple and is the easiest to implement in actual hardware, and as such has been widely utilized in this area. Several block matching algorithms are presently known for implementing motion estimation in video image processing systems, including the full search algorithm, the three-step search algorithm, the cross-search algorithm, the orthogonal search algorithm, etc.
Fast block matching algorithms, as represented by the three-step search algorithm, employ multiple procedural steps to achieve block image matching. Not all possible image blocks are compared, and therefore computational operations are reduced in number. However, any two consecutive procedural steps must still be performed in sequence, which reduces the possibility of parallel implementation. The hardware logic employed for implementing such fast block matching algorithms is therefore required to support extremely high throughput, along with the other requirements of low latency and programmability. Computation logic employing a tree architecture becomes the ideal solution for implementing these algorithms.
However, conventional computation logic configurations featuring a tree architecture still require a large number of processing elements, and the time delays in the stages of the pipeline are significant enough to limit the clock frequency of the processing elements. To examine the reason, a conventional four-channel tree architecture is taken as an example and briefly described below with reference to the accompanying drawings.
Block matching algorithms make use of the mean values of the absolute error function as the basis for measuring the degree to which matching is achieved. The image block featuring the minimum mean absolute error is the one that matches. Mean absolute error represents the average value obtained by summing all the absolute values of the differences between the respective values of corresponding pixels in the compared and the original image blocks, and then dividing by the total number of processed pixels. Thus, the hardware architecture utilized to implement these block matching algorithms must at least be capable of handling arithmetic operations including subtraction, obtaining an absolute value, summation, and determining a minimum value.
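The matching criterion just described can be sketched in a few lines of Python. This is an illustrative sketch only, not the hardware of the drawings: the function names `mean_absolute_error` and `best_match`, the flat-list representation of an image block, and the sample pixel values are all assumptions introduced here.

```python
def mean_absolute_error(original, candidate):
    """Mean of |x - y| over corresponding pixels of two equal-sized blocks."""
    if len(original) != len(candidate) or not original:
        raise ValueError("blocks must be non-empty and the same size")
    # Subtraction, absolute value and summation, then division by pixel count.
    total = sum(abs(x - y) for x, y in zip(original, candidate))
    return total / len(original)


def best_match(original, candidates):
    """Return (index, mae) of the candidate block with the minimum MAE."""
    errors = [mean_absolute_error(original, c) for c in candidates]
    best = min(range(len(errors)), key=errors.__getitem__)
    return best, errors[best]


# Hypothetical four-pixel blocks, chosen only for illustration.
original = [10, 20, 30, 40]
candidates = [[12, 18, 33, 41], [10, 21, 30, 40], [50, 60, 70, 80]]
index, mae = best_match(original, candidates)
print(index, mae)  # prints: 1 0.25
```

Because every compared block has the same number of pixels, the division does not change which block is the minimum, so the comparison could equally be made on the undivided sum of absolute errors.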
For example, FIG. 1 schematically shows the hardware logic block diagram of a conventional four-channel tree architecture, which can be implemented by computer software or circuit arrangement. In the drawing, it is first assumed that both the original and the compared image blocks each have a pixel dimension of four, represented by pixel data X1, X2, X3 and X4 and Y1, Y2, Y3 and Y4. It is further assumed that each of the pixels has n bits of characteristic data. The X and Y pixel data are expressed as
$$X = \{x_{n-1}, x_{n-2}, \ldots, x_0\}$$
and
$$Y = \{y_{n-1}, y_{n-2}, \ldots, y_0\}$$
respectively, wherein $x_i$ and $y_i$ are pixel data bits for X and Y pixel data, respectively, and are all positive numbers.
The four-channel tree architecture illustrated in the drawing has a total of five computation stages, divided into four portions. Each of the computation stages requires one clock cycle of processing before it can send out its output. As a result, in such a pipelined processing architecture, a total of five clock cycles will be required to conclude one complete computation. As shown in the drawings, the first computation stage is the D computation stage identified by reference numeral 100, which includes four D computation members 105. Each of the D computation members 105 is independent and is responsible for computing the absolute value |X-Y|. With four such D computation members 105, all four pairs of pixel data in the original and the compared image blocks can be processed to subtract one member of each pair from the other and provide the absolute value of the difference.
The two stages in the five-stage processing pipeline immediately downstream from the first D computation stage 100 are formed by a summation section 110 that includes first and second computation stages 112 and 114. The first computation stage 112 includes two A adders 118, while there is only one A adder 119 in the second computation stage 114. Each of the A adders 118 and 119 is capable of adding its two inputs. Therefore, the summation section 110 may be used to add together all the absolute values generated by the four respective D computation members 105, as shown in the drawing, where one adder 118 adds the inputs received from two of the D computation members 105, the other adder 118 adds the inputs from the other two, and the adder 119 subsequently adds the outputs of the two adders 118.
The fourth stage in the five-stage processing pipeline, immediately following the summation section 110, is the accumulator stage 120. The accumulator stage 120 includes at least an independent A adder 125, capable of adding the output of the third stage 114 into its current accumulated value. When a divided image block contains more than four pixels that require processing, this basic configuration can be applied repeatedly to provide the subsequent processing. In other words, the image block to be analyzed can be divided into a number of sub-units each including four pixels and subjected to processing as described. With proper implementation of the procedure, larger image blocks can be processed, but these obviously require a longer time to complete.
The last of the five computation stages is the minimum evaluation stage 130. As shown in the drawing, this stage includes a minimum value evaluator element 135 that is capable of comparing and identifying the minimum of two values that have been provided thereto. One of the compared values is the value generated by the accumulator stage 120, which is also the summation of the absolute errors of the currently compared image block. The other compared value is the recorded minimum value obtained in the previous comparisons of the summations of the absolute errors for the compared and original image blocks. After all the possible image blocks are compared, the location of the compared image block having the minimum value may be obtained, together with its shift with respect to its corresponding original image block. The shift can be utilized as the motion vector 140, as generated by the last stage 130 of the five-stage pipelined processing architecture of FIG. 1.
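The data flow through the five stages of FIG. 1 can be summarized by a functional model in Python. This is a sketch under stated assumptions rather than a description of the actual circuit: it is not cycle-accurate and ignores the pipeline registers, and the function names, the dictionary keyed by shift, and the sample pixel values are hypothetical. Comments refer to the reference numerals used above.

```python
def d_stage(xs, ys):
    """Stage 100: four D members 105, each producing |X - Y|."""
    return [abs(x - y) for x, y in zip(xs, ys)]


def summation_section(d):
    """Section 110: two adders 118 (stage 112), then one adder 119 (stage 114)."""
    s0 = d[0] + d[1]  # first adder 118
    s1 = d[2] + d[3]  # second adder 118
    return s0 + s1    # adder 119


def block_error(original, candidate):
    """Accumulator stage 120: adder 125 sums the four-pixel sub-units."""
    if len(original) != len(candidate) or len(original) % 4 != 0:
        raise ValueError("blocks must match in size, in multiples of four pixels")
    acc = 0
    for i in range(0, len(original), 4):
        acc += summation_section(d_stage(original[i:i + 4], candidate[i:i + 4]))
    return acc


def motion_vector(original, candidates):
    """Stage 130: keep the running minimum (evaluator 135) and its shift."""
    best_shift, best_error = None, None
    for shift, block in candidates.items():
        error = block_error(original, block)
        if best_error is None or error < best_error:
            best_shift, best_error = shift, error
    return best_shift, best_error


# Hypothetical eight-pixel blocks, keyed by an assumed (dx, dy) shift.
original = [1, 2, 3, 4, 5, 6, 7, 8]
candidates = {
    (0, 0): [1, 2, 3, 4, 5, 6, 7, 9],
    (1, 0): [1, 2, 3, 4, 5, 6, 7, 8],
    (0, 1): [0, 0, 0, 0, 0, 0, 0, 0],
}
print(motion_vector(original, candidates))  # prints: ((1, 0), 0)
```

In this model an eight-pixel block passes through the accumulator twice, which mirrors the observation that larger image blocks take proportionally longer to process.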
Assume Z=X-Y. Since both X and Y are positive numbers, Z, comprising n bits, can be expressed as
$$Z = \{z_{n-1}, z_{n-2}, \ldots, z_0\}.$$
Then, a scheme for calculating the numerical binary value of |X-Y| can be implemented by the following procedural steps:
a. Obtain the 2's complement of Y. Because Y is the subtrahend, only its 2's complement is calculated. To obtain the 2's complement of a binary number, 1 is added to its 1's complement. In other words, all the bits of the number are inverted and then 1 is added to the result, as persons skilled in this art are well aware.
b. Utilize an adder to add the X value to the 2's complement of the Y value, to obtain Z. And,
c. If $z_{n-1}$ (the most significant bit, or MSB, of Z) has a value of 1, this means that Z, obtained by subtracting Y from X, is a negative number, which is an indication of the condition Y>X. In this case, the 2's complement of Z must be taken to obtain the value of |Z|, that is, |X-Y|. On the other hand, when $z_{n-1}$ has a value of 0, Y is not greater than X, so that it is not necessary to take an absolute value, since the value of Z=X-Y is already not negative.
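Steps a to c can be traced on n-bit words with the following Python sketch. It is offered as an illustration only. The function name `abs_diff_twos_complement` and the default word width are hypothetical, and the sketch assumes that X and Y are non-negative values smaller than 2^(n-1), so that the most significant bit of Z can serve as the sign bit as step c requires.

```python
def abs_diff_twos_complement(x, y, n=8):
    """Compute |x - y| on n-bit words using steps a, b and c."""
    mask = (1 << n) - 1
    limit = 1 << (n - 1)
    # Assumption: operands are positive n-bit 2's-complement values (MSB = 0).
    if not (0 <= x < limit and 0 <= y < limit):
        raise ValueError("operands must satisfy 0 <= value < 2**(n-1)")

    # Step a: 2's complement of Y = 1's complement of Y, plus 1.
    y_comp = ((~y & mask) + 1) & mask

    # Step b: add X to the 2's complement of Y to obtain Z (n bits kept).
    z = (x + y_comp) & mask

    # Step c: if the MSB of Z is 1, Z is negative, so take its 2's complement.
    sign = (z >> (n - 1)) & 1
    if sign:
        z = ((~z & mask) + 1) & mask
    return z


print(abs_diff_twos_complement(5, 9))  # prints: 4
print(abs_diff_twos_complement(9, 5))  # prints: 4
```

For X=5 and Y=9 with n=8, step a gives 247, step b gives 252 (MSB set), and step c inverts and increments to give 4.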
FIG. 2 is a schematic diagram of the D computation member utilized in the conventional computation logic of FIG. 1. As shown in the drawings, the value Y is first applied to one input of a two-input exclusive-OR gate 210, which has its other input tied to the fixed logical value of 1. This is equivalent to obtaining the 1's complement of the value Y at the output 215 of the gate 210. An adder 220 is then utilized to add this 1's complement of Y to the other value X. The carry-in input CI of the adder 220 is also set to the fixed logical value of 1. Thus, 1 is added to the 1's complement of the value Y during addition, resulting in the addition of the value X to the 2's complement of the value Y. As a result, the value X-Y is obtained at the output 225 of the adder 220, with the carry-out 226 issued as the sign bit $z_{n-1}$.
Next, the value of X-Y, that is, the output 225 of the adder 220, and the sign bit 226 thereof, are provided to the second exclusive-OR gate 230. The arrangement is such that the value X-Y is applied to one of the two inputs of the exclusive-OR gate 230, while the sign bit 226, that is, $z_{n-1}$, is applied to the other input thereof. If the sign bit 226 is a logical 1, the bits of the value 225 of X-Y are each inverted, yielding the 1's complement thereof. If, on the other hand, the sign bit 226 has a logical value of 0, the bits of the value 225 of X-Y each remain uninverted. The output 235 of the exclusive-OR gate 230 is then provided to a second adder 240, where the sign bit $z_{n-1}$ is added into the total value as the carry-in to obtain |Z|, that is, |X-Y|, which is then stored in a register 250 for the needs of the pipelined processing described above. Note, however, that one of the two inputs of the adder 240 is tied directly to a fixed logical value of 0. Thus, if the sign bit 226 is a logical 1, indicating a negative result of the first addition, the output of the adder 240 will be the 2's complement of this result, that is, the absolute value of the result.
This conventional tree-architecture hardware configuration has at least the following drawbacks. First, each D computation member 105 in the first D computation stage 100 incurs the delay of two adders arranged in series. This places a restriction on the operating clock rate of the entire system. Second, since the D computation stage 100 includes two sets of adders, the entire hardware configuration requires a total of three times the number of adders in one set. In the four-channel example, that is eight adders in the D computation stage 100 plus the four A adders 118, 118, 119 and 125, or twelve in all. Adders, as persons skilled in the art should all be aware, considerably increase the complexity of the hardware.
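The adder count behind the second drawback can be tallied with a short Python sketch. The function name `adder_count` is hypothetical, and the extension from the four-channel example to an arbitrary channel count is an assumption made here for illustration: it presumes a power-of-two number of channels reduced by a binary adder tree, which the text above describes only for four channels.

```python
def adder_count(channels):
    """Count the adders in a tree architecture with the given channel count."""
    if channels < 2 or channels & (channels - 1):
        raise ValueError("channels is assumed to be a power of two, at least 2")
    d_adders = 2 * channels     # adders 220 and 240 in each D member 105
    tree_adders = channels - 1  # summation section 110 (adders 118 and 119)
    accumulator_adders = 1      # adder 125 in the accumulator stage 120
    return d_adders + tree_adders + accumulator_adders


print(adder_count(4))  # prints: 12
```

Under these assumptions the total is always three times the channel count, and two thirds of the adders sit in the D computation stage.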