1. Field of the Invention
The invention relates in general to a processing hardware configuration for manipulating video image data for compression. In particular, the invention relates to a processing hardware configuration for efficiently implementing a fast search motion estimation algorithm in a semiconductor device that has reduced physical dimensions and complexity.
2. Description of the Related Art
Video signal compression is an important technique for successful video signal processing in video equipment such as high definition television, video telephone, and video conference systems. Because video imaging systems process vast amounts of digital data, using a data compression technique to keep the data bit rate in the signal processing flow extremely low becomes an important factor for smooth video signal processing. To achieve a low data bit rate in the video signal processing flow, in other words, to obtain a very high data compression ratio, good codec (coding-decoding) schemes and corresponding hardware systems are essential. Such codec systems typically implement schemes including motion compensation, discrete cosine transform, and quantization of the weight of visual characteristics, as well as Huffman coding, and others.
Essentially, motion compensation is a scheme which greatly affects the data compression ratio achieved in video image compression coding systems. This technique is used to manipulate the video signal data compression based on the specific time domain statistical characteristics of the subject video image signals. Unlike situations where successive still images are processed, a video image is characterized by the fact that two successive video images frequently have relatively few differences when corresponding constituent blocks in successive video image frames are compared. This is an advantageous characteristic of video images that allows implementation of a motion compensation scheme to achieve a high data compression ratio.
Motion estimation provides the basis for a motion compensation technique. Good motion estimation results are determined by the precision, speed, and efficiency achieved by the motion estimation scheme. Various algorithms are available for implementing motion estimation. Block matching is an algorithm that is one of the easiest to implement from a hardware perspective, because of its simple steps and rules for implementation. Commonly used block matching algorithms include, for example, full search algorithms (FSA), three-step search algorithms (TSSA), two-dimensional logarithmic search algorithms (TDLSA), cross search algorithms (CSA), orthogonal search algorithms (OSA), and hierarchical search algorithms (HSA), among others.
Algorithms that implement the video image block matching operation in a sequence of multiple procedural steps, of which TSSA is representative, involve a greatly reduced amount of computation, since not all blocks that may have been displayed are compared. However, these algorithms require that each subsequent process step be performed in a defined sequence, a requirement that is not well suited to parallel processing. As a result, very high throughput, low latency, and programmability are factors which must be considered if implementation of this category of motion estimation algorithms is to be successful.
Tree-architecture is an ideal hardware configuration for implementing such a motion estimation algorithm. However, conventional tree architectural configurations require the use of a large number of process elements for substantial implementation. High latencies are thus produced in the process pipelines, constraining the processing clock rate. For the purpose of outlining the characteristics of the invention, an example of such a conventional tree-architecture using four channels is briefly examined below, with reference to the accompanying drawings.
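The sequential dependence described above can be illustrated with a minimal Python sketch of a three-step search in its commonly described form. The details here are assumptions for illustration and are not taken from this document: a search range of plus or minus 7 pixels, step sizes of 4, 2, and 1, nine candidate positions per step, and a hypothetical `cost` callable that returns the matching error for a displacement.

```python
def three_step_search(cost, first_step=4):
    """Return (best displacement, number of positions evaluated).

    cost: hypothetical callable (dx, dy) -> matching error for that displacement.
    first_step: assumed initial step size; 4 covers a range of +/-7 pixels.
    """
    evaluated = {}

    def evaluate(vector):
        # Cache results so a position shared by two steps is only computed once.
        if vector not in evaluated:
            evaluated[vector] = cost(*vector)
        return evaluated[vector]

    center = (0, 0)
    step = first_step
    while step >= 1:
        # Nine candidates: the current center and its eight neighbors at distance `step`.
        candidates = [
            (center[0] + i * step, center[1] + j * step)
            for i in (-1, 0, 1)
            for j in (-1, 0, 1)
        ]
        # The next step cannot start until this minimum is known.
        center = min(candidates, key=evaluate)
        step //= 2
    return center, len(evaluated)


# Toy cost with a single minimum at displacement (6, -3).
vector, count = three_step_search(lambda dx, dy: abs(dx - 6) + abs(dy + 3))
print(vector, count)
```

Under these assumptions the search evaluates 25 positions, whereas an exhaustive search over the same range would evaluate 15 x 15 = 225. Each step, however, must wait for the minimum found by the previous one.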
Block matching algorithms employ a scheme for computing the mean absolute errors (MAE) in the compared video image blocks as the basis for the measurement of the level of image matching. Blocks with a minimum MAE are considered to be matched blocks. In practice, an MAE for a compared image block is determined by first summing all the absolute values of the characteristic value differences between all picture pixels in the original image block and picture pixels in the corresponding compared image block, and then dividing the summed absolute value by the total number of pixels in the processed image block. The characteristic value of the picture pixels to be differentiated among the compared image blocks is normally the intensity value of the displayed pixel. By definition, an original block of a video image is the one currently being processed, while its corresponding compared block is the same video image block after it undergoes image motion alterations. It is assumed that both the original block and the compared block consist of the same number of image pixels arranged in the same matrix. Thus, a hardware configuration for implementing such a block matching scheme involving the computation of the MAE must at least include circuitry elements capable of performing addition, subtraction, absolute value operations, and determination of the minimum in a series of values.
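The MAE computation described above can be sketched in Python as follows. This is a minimal software illustration, not the hardware; the function names and the representation of a block as a list of rows of intensity values are assumptions made for the example.

```python
def mean_absolute_error(original, compared):
    """Return the MAE between two equally sized blocks of pixel intensities."""
    same_shape = len(original) == len(compared) and all(
        len(row_o) == len(row_c) for row_o, row_c in zip(original, compared)
    )
    if not same_shape:
        raise ValueError("blocks must have the same pixel matrix shape")
    # Sum the absolute intensity differences over all pixel positions.
    total = sum(
        abs(x - y)
        for row_o, row_c in zip(original, compared)
        for x, y in zip(row_o, row_c)
    )
    # Divide by the number of pixels in the block.
    count = sum(len(row) for row in original)
    return total / count


def best_match(original, candidates):
    """Return the index of the candidate block with the minimum MAE."""
    errors = [mean_absolute_error(original, block) for block in candidates]
    return min(range(len(errors)), key=errors.__getitem__)


original = [[10, 20], [30, 40]]
shifted = [[12, 18], [30, 44]]
print(mean_absolute_error(original, shifted))             # differences 2, 2, 0, 4
print(best_match(original, [[[0, 0], [0, 0]], shifted]))
```

The four operations used here (subtraction, absolute value, addition, and minimum selection) correspond to the circuit elements listed above.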
FIG. 1 schematically depicts a conventional four-channel tree-architecture for implementing the computation of an MAE in a block matching algorithm. It is assumed that both the original block and the compared block consist of four pixels. Each of the pixels in the original block and the compared block is represented numerically by display characteristic values, designated by the data values X1, X2, X3, and X4 and Y1, Y2, Y3, and Y4, respectively, fetched to the input end of the tree-architecture for processing, as shown in the drawing. Each of the pixel data values may be multi-bit data containing, for example, n bits. Thus, pixel data for the original and compared blocks may be expressed as:
$$X = \{x_{n-1}, x_{n-2}, \ldots, x_0\} \quad \text{and} \quad Y = \{y_{n-1}, y_{n-2}, \ldots, y_0\}$$
wherein the $x_{n-1}$ and $y_{n-1}$ bits are the sign bits for the pixel data of the original and compared blocks, respectively, and the X and Y data may thus all be positive values.
The four-channel tree-architecture shown in FIG. 1 consists of a total of five computational stages that are functionally organized into four sections. It is assumed that each of the five computational stages requires one clock cycle to implement a computational result for provision to the successive stage in the process pipeline. In other words, the tree-architecture of FIG. 1 takes at least five clock cycles to complete the block matching operation, utilizing the sets of input pixel data X and Y.
The first functional section in the tree-architecture structure is the absolute difference section 100, which consists of four parallel computational members 105. Each of the four computational members 105 is required to determine the value |X-Y|, or, specifically, to determine the absolute value of the difference between the X and Y pixel data values for each of the four pixels of the processed image block.
The second functional section subsequent to the absolute difference section 100 is the summation section 110, which includes two successive computational stages 112 and 114. Two parallel adder members 118 are arranged in the first computational stage 112, while one adder member 119 is in the second stage 114. Each of the adder members 118 in the stage 112 adds the outputs of two of the parallel computational members 105 in the first section 100. The outputs of the two adder members 118 in the computational stage 112 are then added together in the subsequent stage 114 by the adder member 119. Thus, the output of the summation section 110 is the summation of the four absolute values of the difference between the original and compared image blocks obtained in the absolute difference section 100.
The third functional section is an accumulation section 120 that consists of at least a single accumulating adder member 125. This is an independent adder member that adds the output of the up-stream functional section 110 to the value it already holds. When the video image blocks to be processed consist of a pixel matrix having more than four pixels, this accumulation section 120 can be controlled under proper resetting and output enabling schemes to process four pixels at a time. However, the depicted conventional tree-architecture, which can easily process four pixels in a pipeline, requires additional clock cycles when the processed image blocks contain more than a single group of four pixels.
Finally, in the last functional section, a minimum determining member 135 constitutes the minimum determining section 130, which determines the minimum value from among the outputs of the accumulation section 120. Essentially, each of the outputs received from the accumulating adder member 125 is compared to the current minimum value memory content of the minimum determining member 135, and the smaller value is stored as the minimum value in the memory.
Thus, after all the pixels in the processed image block have been processed by the tree-architecture circuitry of FIG. 1, a motion vector 140 may be obtained as the output of the architecture, which is representative of a measure of the relative image movement between the original and compared blocks of the video image.
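The data flow through the four sections of FIG. 1 can be modeled functionally in Python as follows. This is a behavioral sketch only: it ignores clocking and registers, the function names are hypothetical, and it compares accumulated sums rather than dividing by the pixel count, on the assumption that all compared blocks have the same size so the minimum falls on the same block either way.

```python
def tree_block_sum(xs, ys):
    """Model sections 100 to 120 for one original/compared block pair.

    xs, ys: flat lists of pixel values whose length is a multiple of four.
    """
    if len(xs) != len(ys) or len(xs) % 4 != 0:
        raise ValueError("blocks must be equal in size and a multiple of four pixels")
    accumulator = 0  # accumulating adder member 125, reset for each block
    for i in range(0, len(xs), 4):
        # Section 100: four parallel absolute difference members 105.
        d = [abs(x - y) for x, y in zip(xs[i:i + 4], ys[i:i + 4])]
        # Section 110, stage 112: two parallel adder members 118.
        s0, s1 = d[0] + d[1], d[2] + d[3]
        # Section 110, stage 114: adder member 119.
        s = s0 + s1
        # Section 120: add the group sum to the value already held.
        accumulator += s
    return accumulator


def minimum_section(original, candidates):
    """Model section 130: keep the smallest accumulated sum and its index."""
    best_index, best_sum = None, None
    for index, candidate in enumerate(candidates):
        total = tree_block_sum(original, candidate)
        if best_sum is None or total < best_sum:
            best_index, best_sum = index, total
    return best_index, best_sum


original = [1, 2, 3, 4, 5, 6, 7, 8]
candidates = [[0] * 8, [1, 2, 3, 4, 5, 6, 7, 9]]
print(minimum_section(original, candidates))
```

In this sketch the index of the winning candidate stands in for the motion vector 140; in a real search each candidate index would map to a displacement.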
FIG. 2 shows a schematic diagram of the computational member 105 for the absolute difference section 100 of the tree-architecture of FIG. 1. As shown in the drawing, the computational member receives video block image pixel data inputs X and Y from the original and compared blocks, respectively, for generation of the absolute value difference |X-Y| by the depicted circuitry. Assuming the notation Z=X-Y, |Z|=|X-Y| is therefore determined.
The Y data is provided to the input of an exclusive-OR (XOR) gate 210, the other input of which is tied to a constant logical "1". This is equivalent to obtaining the one's complement 215 of the Y data which is provided to an adder 220 for addition with the X data. Thus, the adder 220 turns out the value X-Y at its S output, while the carry-out bit 224 at the CO output of the adder 220 signifies the sign bit of this effective subtraction performed by the adder. Notice that while the data X and the one's complement of the data Y are added together by the adder 220 to obtain the X-Y value, a carry-in bit (CI) having a constant logical value of "1" is also added into this summation operation, in order to perform subtraction by addition of the two's complement of the data Y.
The summation result of the adder 220, i.e., the X-Y value generally identified by the reference numeral 225, is exclusive-OR-ed by the XOR gate 230, utilizing the inverted version 226 of the carry-out bit 224 of the adder 220 as the conditioning bit. An inverter 222 is used to provide this inverted version of the carry-out bit 224. This allows the one's complement of the value X-Y to be provided to an input of another adder 240 if the inverted carry-out bit 226 of the adder 220 is a logical high. On the other hand, if the inverted carry-out bit 226 is a logical low, the output result of the adder 220 can be directly provided to the B input of the adder 240. The other, A, input of the adder 240 is tied to a constant logical "0". The carry-in input CI of the adder 240 is also driven by the inverted version of the carry-out output of the adder 220.
Such a double-adder arrangement as depicted in FIG. 2 provides the absolute value of the difference between the input X and Y data, which is held in the register 250 for further processing. However, this circuitry has at least the following obvious disadvantages for practical application in the block matching scheme used in video image processing.
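The double-adder arrangement of FIG. 2 can be checked with a bit-level Python model. This is a sketch under stated assumptions: X and Y are treated as non-negative n-bit values, the register 250 is omitted, and the function name and default width of 8 bits are chosen only for the example.

```python
def abs_diff_two_adders(x, y, n=8):
    """Bit-level model of computational member 105 for n-bit non-negative inputs."""
    mask = (1 << n) - 1
    if not (0 <= x <= mask and 0 <= y <= mask):
        raise ValueError("inputs must fit in n bits")
    y_inverted = y ^ mask                  # XOR gate 210: one's complement 215 of Y
    total = x + y_inverted + 1             # adder 220 with carry-in CI = 1
    s = total & mask                       # S output 225: X - Y modulo 2**n
    carry_out = (total >> n) & 1           # CO output 224: 1 when X >= Y
    negative = carry_out ^ 1               # inverter 222: bit 226, 1 when X < Y
    conditioned = s ^ (mask if negative else 0)   # XOR gate 230
    return (0 + conditioned + negative) & mask    # adder 240: A = 0, CI = bit 226


print(abs_diff_two_adders(5, 9))   # X < Y: result is inverted and incremented
print(abs_diff_two_adders(9, 5))   # X >= Y: result passes through unchanged
```

Note that the second adder contributes only the increment supplied on its carry-in input, since its A input is tied to zero.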
First of all, since the absolute difference section 100 requires two cascaded stages of adders to obtain the absolute value result, the timing of the processing pipeline of the tree-architecture for computing the block matching MAE becomes tight. This directly translates into a constraint on the clock frequency that can be applied to the circuit utilizing this architecture.
Secondly, the total number of adders required in constructing the tree-architecture for implementing the computation of a block matching MAE is large and increases greatly with the number of pixels. The number of required process elements increases accordingly. This adds to the overall complexity of semiconductor fabrication.
Thirdly, as the channel number in the architecture is increased to simultaneously process more pixel data, the total number of processing stages in the process pipeline is also increased by one stage. This further increases the time latency in the tree-architecture.
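The growth in pipeline depth can be put in numbers with a short Python sketch. It rests on assumptions that go beyond the text: the channel count is a power of two, the summation section is a binary adder tree with one stage per halving, each stage takes one clock cycle, and a new group of pixels enters the pipeline every cycle. The function names are hypothetical.

```python
def pipeline_stages(channels):
    """Stages in a tree pipeline: absolute difference, summation tree, accumulate, minimum."""
    if channels < 2 or channels & (channels - 1):
        raise ValueError("channels must be a power of two, at least 2")
    summation_stages = channels.bit_length() - 1   # log2(channels) for a binary tree
    return 1 + summation_stages + 1 + 1


def cycles_for_block(pixels, channels):
    """Clock cycles to process one block, assuming one pixel group enters per cycle."""
    groups = -(-pixels // channels)                # ceiling division
    return pipeline_stages(channels) + groups - 1


print(pipeline_stages(4))         # the four-channel case of FIG. 1
print(pipeline_stages(8))         # doubling the channels adds one summation stage
print(cycles_for_block(256, 4))   # e.g. a 16 x 16 block on four channels
```

Under these assumptions the four-channel case reproduces the five stages of FIG. 1, and each doubling of the channel count adds one stage of latency.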
Further evaluation of the second adder 240 of the computational member 105 shown in FIG. 2 for the absolute difference section 100 in FIG. 1 reveals that the adder 240 is used merely to add the logic value at the B input to a constant nil "0" at the A input. As a matter of fact, only the sign bit 226 is added to the carry-in input CI of adder 240. On the other hand, careful examination of the block diagram of FIG. 1 shows that each of the adder members 118 in the summation section 110 of the tree-architecture of FIG. 1 utilized to sum the outputs of two corresponding computational members 105 does not use its carry-in inputs. The second stage adder member 119, which adds the two adder member 118 outputs together, also has a carry-in input that is left unused.
A tree-architecture based on the concept of making use of these idle adder inputs in the process pipeline, known as the hierarchical search algorithm (HSA), as mentioned above, has been developed by the present inventors and is illustrated in the block diagram shown in FIG. 3. Such an architecture is disclosed in U.S. patent application Ser. No. 08/666,987, filed Jun. 19, 1996, which disclosure is incorporated herein by reference. Similar to the architecture of FIG. 1, this is another four-channel tree-architecture design having a total of five processing stages, also similarly categorized into four functional sections. This HSA block image processing circuitry is capable of an improved, smooth pipeline operation. After the initial five clock cycles, a resultant motion vector 140 is generated once every clock cycle.
As shown in the block diagram of FIG. 3, the first functional section, the absolute difference section 300, consists of four parallel computational members 305. An example of an implementation of the computational member 305 is illustrated in the schematic diagram of FIG. 4. Based on the concept of HSA architecture, the computational member 305 can be considered to be a simplification of the member 105 of FIG. 2. This simplification is possible, as mentioned above, by making use of the idle adder inputs in the process pipeline. When compared with the computational member 105 of FIG. 2, it is noted that the member 305 of FIG. 4 includes only one adder, rather than two.
Notice that for the purpose of smooth pipeline operation, registers are assigned to each output of each computational member 305, of the adder members 318 and 319, and of the accumulating adder member 325. These registers are used to hold the intermediate data and pass the data to the corresponding subsequent stage of elements synchronously, so that smooth pipeline operation can be achieved.
In the second functional section, subsequent to the absolute difference section 300, namely, the summation section 310, two successive computational stages 312 and 314 are included. As in the case of the conventional four-channel tree-architecture shown in FIG. 1, two parallel adder members 318 are arranged in the first computational stage 312, and one adder member 319 is included in the second stage 314. Each of the adder members 318 in the first computational stage 312 adds the outputs of two of the parallel computational members 305 from the first section 300. The outputs of the two adder members 318 in the first computational stage 312 are then added together in the subsequent stage 314 by the adder member 319. Basically, this is a configurational arrangement equivalent to that of FIG. 1, except that the carry-in inputs of the adder members 318 and 319 are utilized for the summing manipulation required in the process for obtaining the MAE.
Next, the third functional section following the second is an accumulation section 320 that consists of at least a single accumulating adder member 325. This independent adder member 325 adds the output from the summation section 310 to the value it already holds. When the video image blocks to be processed consist of a pixel matrix of more than four pixels, the accumulation section 320 can be used to process four pixels at a time.
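The idea of reusing idle carry-in inputs can be illustrated with the following Python sketch. It is not a transcription of FIG. 3 or FIG. 4: the routing of each sign bit to a particular carry-in is an assumption, as is sending the fourth sign bit to the carry-in of the accumulating adder 325, and the function names are hypothetical. Each single-adder member outputs the conditionally inverted difference together with its sign bit, and the deferred increment is applied downstream.

```python
def member_305(x, y, n=8):
    """Single-adder member: return (conditionally inverted difference, sign bit)."""
    mask = (1 << n) - 1
    total = x + (y ^ mask) + 1                     # X - Y via two's complement addition
    negative = ((total >> n) & 1) ^ 1              # inverted carry-out: 1 when X < Y
    conditioned = (total & mask) ^ (mask if negative else 0)
    return conditioned, negative                   # the +1 correction is deferred


def adder(a, b, carry_in=0):
    """Adder with a carry-in input."""
    return a + b + carry_in


def four_channel_sum(xs, ys, n=8):
    """Sum of |X - Y| over four channels, applying sign bits through carry-ins."""
    if len(xs) != 4 or len(ys) != 4:
        raise ValueError("exactly four pixel pairs are expected")
    (v0, c0), (v1, c1), (v2, c2), (v3, c3) = (
        member_305(x, y, n) for x, y in zip(xs, ys)
    )
    s0 = adder(v0, v1, c0)     # adder member 318, assumed to take sign bit c0
    s1 = adder(v2, v3, c2)     # adder member 318, assumed to take sign bit c2
    s = adder(s0, s1, c1)      # adder member 319, assumed to take sign bit c1
    return adder(0, s, c3)     # accumulating adder 325 (assumed routing for c3)


print(four_channel_sum([5, 9, 7, 0], [9, 5, 7, 255]))
```

Because a negative difference leaves the member as |X-Y| minus one, adding each sign bit anywhere in the summation path restores the exact sum, which is why the second adder of FIG. 2 is not needed.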
In the last functional section, a minimum determining member 335 constitutes the minimum determining section 330. This determines the minimum value from among the outputs of the accumulation section 320. Each successive output of the accumulating adder member 325 received is compared to the current minimum value memory content of the minimum determining member 335, and the smaller value is stored as the minimum in the memory.
In the HSA architecture of FIG. 3, the computational member 305 used in the absolute difference section 300 can be one that is electronically simple compared to its counterpart in the architecture of FIG. 1. As mentioned above, only one adder, rather than two, is required to construct a computational member 305. Approximately one-third of the hardware process elements of the design of FIG. 1 can be eliminated using the HSA design.
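As a rough check on the one-third figure, the adders in each design can be counted for N channels. This count is an illustrative assumption rather than a figure taken from the drawings: it considers adders only (ignoring XOR gates, registers, and the minimum determining member) and assumes a binary summation tree with N - 1 adders and a single accumulating adder.

$$A_{\text{FIG. 1}}(N) = 2N + (N - 1) + 1 = 3N, \qquad A_{\text{FIG. 3}}(N) = N + (N - 1) + 1 = 2N$$

$$\frac{A_{\text{FIG. 1}}(N) - A_{\text{FIG. 3}}(N)}{A_{\text{FIG. 1}}(N)} = \frac{N}{3N} = \frac{1}{3}$$

For the four-channel case this gives 12 adders against 8, consistent with the approximate one-third reduction stated above.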
However, if the channel number is doubled, such HSA architecture still requires a tremendous increase in the total number of process stages in the pipeline. When the total channel number in the tree-architecture design is increased to a certain point due to practical applications in video image processing, the resulting time latency is increased to a level which greatly reduces the HSA pipeline processing efficiency.