There are numerous instances in which it is desirable to compare sets of data. Many of these involve recognition of a data set by comparing the set whose identity is unknown with a multitude of known data sets to locate a best match. Illustrative applications of this kind are the extraction of meaningful information from degraded communication channels and the recognition of characters and other objects presented on film, video or other display.
In other applications, identification per se may not be the focus but the object of the comparison is still to find the best match. Examples include the comparison of a fragment of a gene sequence with an entire gene sequence or the comparison of sequences of computer software. One specific application is in determining the relative movement of video data between the frames of a video signal. In this case, the data set of interest is a block of pixels in the video frame. The motion of such a block of pixels is quantified by a displacement vector, which indicates the best match for a block of a current frame with one of a plurality of blocks found within a search window defined in a previous frame.
In general, the error between a first set of data having X elements and a second set of data also having X elements in a search window can be represented mathematically by

$$E(\Delta x) = \sum_{x=1}^{X} \left| C(x) - P(x + \Delta x) \right|$$

where x is a position in the set of data, Δx is the relative displacement between the position of the first set of data and the position of the second set of data, C(x) is a measure of a parameter (or parameters) of interest at position x in the first set of data and P(x+Δx) is a measure of the parameter (or parameters) of interest at position x+Δx in the second set of data. Thus, the error is calculated by determining, for each element in the first set of data, the absolute value of the difference between the parameter of interest at that element and the parameter of interest at the corresponding element of the second set of data, and summing these absolute values for all the elements in the first set of data.
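The one-dimensional error measure described above can be illustrated with a minimal Python sketch. The function name `sad_1d` and the sample values are hypothetical, and positions are counted from zero here purely for programming convenience; the text itself does not fix an indexing convention.

```python
def sad_1d(current, previous, dx):
    """Sum of absolute differences between `current` and the part of
    `previous` that starts `dx` positions in (hypothetical helper)."""
    if dx < 0 or dx + len(current) > len(previous):
        raise ValueError("displacement places the data set outside the window")
    # |C(x) - P(x + dx)| accumulated over every element of the first set.
    return sum(abs(c - previous[x + dx]) for x, c in enumerate(current))


# Illustrative values only: a 4-element set compared against a 6-element window.
current = [3, 7, 2, 9]
previous = [1, 3, 8, 2, 9, 4]

# One error value per candidate displacement.
errors = {dx: sad_1d(current, previous, dx)
          for dx in range(len(previous) - len(current) + 1)}
print(errors)  # {0: 19, 1: 1, 2: 22}
```

In this example the smallest error occurs at a displacement of 1, which is where the first set most closely resembles the second.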
In like fashion, the error can be calculated between data sets organized in more than one dimension. For the case of motion estimation in a video display which involves two-dimensional arrays of data representative of signal intensity, the error between a current block having X×Y pixels and a previous block in the search window can be represented mathematically by

$$E(\Delta x, \Delta y) = \sum_{x=1}^{X} \sum_{y=1}^{Y} \left| C(x, y) - P(x + \Delta x,\, y + \Delta y) \right|$$

where (x,y) is a position in rectangular coordinates in the current block, (Δx,Δy) is the displacement between the position of the current block and the position of the previous block in the video frame, C(x,y) is the intensity of a pixel at coordinates (x,y) in the current block, and P(x+Δx, y+Δy) is the intensity of a pixel in the previous block at coordinates (x+Δx, y+Δy). Thus, the error is calculated by determining, for each pixel in the current block, the absolute value of the difference between the intensity at that pixel and the intensity at the corresponding pixel in the previous block in the search window, and summing these absolute values for all the pixels in the current block.
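The two-dimensional case can be sketched in the same way. In this illustrative Python fragment, blocks are stored as lists of rows, so a pixel at coordinates (x, y) is read as `block[y][x]`. The name `sad_2d`, the zero-based coordinates and the sample intensities are assumptions made for the example.

```python
def sad_2d(current, previous, dx, dy):
    """Sum of absolute intensity differences between the current block and
    the equally sized region of `previous` displaced by (dx, dy)."""
    rows, cols = len(current), len(current[0])
    if dx < 0 or dy < 0 or dy + rows > len(previous) or dx + cols > len(previous[0]):
        raise ValueError("displaced block falls outside the search window")
    # |C(x, y) - P(x + dx, y + dy)| accumulated over every pixel of the block.
    return sum(abs(current[y][x] - previous[y + dy][x + dx])
               for y in range(rows)
               for x in range(cols))


# Illustrative values only: a 2x2 current block and a 3x3 search window.
current = [[10, 20],
           [30, 40]]
previous = [[0,  0,  0],
            [0, 10, 21],
            [0, 30, 38]]

print(sad_2d(current, previous, 0, 0))  # 90
print(sad_2d(current, previous, 1, 1))  # 3
```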
To determine the best match, the intensities of the pixels of the current block are compared to the intensities of the corresponding pixels of the search blocks defined within the search window of the previous frame. The accumulated difference in pixel intensities between two blocks is referred to as an error value. The block in the search window which most closely matches the current block is the one having the minimal error value. This block is identified by the displacement vector.
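The search for the minimal error value can be sketched as an exhaustive comparison over every search block that fits in the window. This is a simple sequential illustration, not the multiprocessing architecture discussed in this document. The names `block_error` and `best_match`, the zero-based displacements and the rule of keeping the first block found when two errors tie are all assumptions of the example.

```python
def block_error(current, window, dx, dy):
    """Accumulated absolute intensity difference for one search block."""
    return sum(abs(current[y][x] - window[y + dy][x + dx])
               for y in range(len(current))
               for x in range(len(current[0])))


def best_match(current, window):
    """Return ((dx, dy), error) for the search block with the minimal error."""
    rows, cols = len(current), len(current[0])
    best_vector, best_error = None, None
    for dy in range(len(window) - rows + 1):
        for dx in range(len(window[0]) - cols + 1):
            error = block_error(current, window, dx, dy)
            # Keep the first block found with the smallest error so far.
            if best_error is None or error < best_error:
                best_vector, best_error = (dx, dy), error
    return best_vector, best_error


# Illustrative values only: a 2x2 current block and a 3x3 search window.
current = [[10, 20],
           [30, 40]]
window = [[0,  0,  0],
          [0, 10, 21],
          [0, 30, 38]]

print(best_match(current, window))  # ((1, 1), 3)
```

The returned pair plays the role of the displacement vector: it names the search block whose pixel intensities differ least from those of the current block.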
Motion estimation by block matching is a very computation-intensive process. For example, typical values for block size are 16 pixels × 16 pixels with approximately 357 blocks per frame. To compare such a block with any other block requires 256 comparisons, one for each pixel. A reasonable size search window for each block is an array of 16×16 blocks. Accordingly, for a frame rate of 30 frames per second, the number of comparisons to be made each second is 30 frames/second × 357 blocks/frame × 256 search blocks/block × 256 comparisons/search block = 701,890,560 comparisons/second. If each comparison takes 6 RISC-type instructions per pixel, the amount of processing required is 4,211 MIPS. This is roughly 100 times the processing power of high-performance DSP or RISC chips.
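The arithmetic behind these figures can be reproduced with a short Python calculation. The variable names are illustrative; the numeric values are the ones quoted in the paragraph above.

```python
# Figures quoted in the text.
frames_per_second = 30
blocks_per_frame = 357
search_blocks_per_block = 16 * 16        # 16x16 array of search blocks
comparisons_per_search_block = 16 * 16   # one comparison per pixel of a 16x16 block
instructions_per_comparison = 6          # RISC-type instructions per pixel comparison

comparisons_per_second = (frames_per_second
                          * blocks_per_frame
                          * search_blocks_per_block
                          * comparisons_per_search_block)

# MIPS = millions of instructions per second.
mips = comparisons_per_second * instructions_per_comparison / 1_000_000

print(comparisons_per_second)  # 701890560
print(int(mips))               # 4211
```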
Numerous other applications likewise need enormous amounts of processing power to perform similar types of comparisons. For example, to compare a fragment of a gene sequence of 100 nucleotides against the entire human genome, which is approximately 3,000,000,000 nucleotides in length, would require approximately 300,000,000,000 comparisons.
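The order of magnitude quoted for the gene-sequence case follows from placing the fragment at every possible position in the longer sequence and comparing it element by element. The sketch below assumes that kind of exhaustive, position-by-position comparison, which the text does not spell out. The names `comparison_count` and `best_alignment` and the short sample sequences are hypothetical.

```python
def comparison_count(sequence_length, fragment_length):
    """Element comparisons needed to try the fragment at every position."""
    positions = sequence_length - fragment_length + 1
    return positions * fragment_length


def best_alignment(fragment, sequence):
    """Return (offset, mismatches) for the position with the fewest mismatches."""
    best_offset, best_mismatches = None, None
    for offset in range(len(sequence) - len(fragment) + 1):
        mismatches = sum(1 for i, base in enumerate(fragment)
                         if base != sequence[offset + i])
        if best_mismatches is None or mismatches < best_mismatches:
            best_offset, best_mismatches = offset, mismatches
    return best_offset, best_mismatches


# Slightly under 100 x 3,000,000,000 because the fragment cannot overhang the end.
print(comparison_count(3_000_000_000, 100))  # 299999990100

# Illustrative sequences only.
print(best_alignment("GATT", "ACGATTAC"))    # (2, 0)
```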
To achieve the processing power required for matching large quantities of data, it is desirable to use a multiprocessing architecture. Several such systems have been described for matching the blocks of video signals.
For example, A. Arteri, et al., "A Versatile and Powerful chip for Real Time Motion Estimation" ICASSP - 89, vol. 4, pp. 2453-2456, describes a systolic architecture wherein the processors are organized as a two-dimensional array and each processor is associated with one possible match of a current block and a block within a search window. At each clock cycle, each processor receives the same current-block data and the search-window data for a different search block. All the processors complete the corresponding computations simultaneously, and transfer the results into a storage array. Then, the minimal error is determined. This architecture is not completely parallel, and it is not pipelined. Significant post-processing and pre-processing stages are necessary in order to provide all the processors with the appropriate data simultaneously and to determine the minimal error. There is a substantial memory requirement at the input and output stages, and each processor of the proposed system contains various input and output storage registers. Accordingly, this system fails to satisfy the requirements of a practical motion-estimation system.
V. Considine, et al., "Single Chip Motion Estimator for Video CODEC Applications," Third International Conference on Image Processing and its Applications, pp. 285-289 (July, 1989) relates to another dedicated VLSI multiprocessing architecture for motion estimation. This architecture also employs a two-dimensional array of identical processing units. The inputs to the array are connected to a search-window memory. The current-block pixels are loaded into the memory provided within the array and the search-window pixels are transferred to the array from the search-window memory. The processing units determine the differences between the current block and the search-window pixels. The differences are then summed by a summing tree provided at the outputs of the array. The chip illustrated in this reference contains two structures which include the search-window memory, the array, and the summing tree. This architecture does not provide truly pipelined processing. The processing begins only after the data representing the current block and the search-window pixels are stored in the array and in the search-window storage.
R. Dianysian, "Bit-Serial Architecture for Real Time Motion Compensation," Proceedings of the SPIE--The International Society for Optical Engineering, vol. 1001, pt. 2, pp. 900-907 (November, 1988) relates to an attempt to provide a bit-serial architecture, which is parallel and pipelined. This architecture is based on a two-dimensional grid of processors and distributed storage registers. The current-block data is loaded into the storage registers prior to the computational process. Then, the search-window data is shifted through the interconnections of the two-dimensional grid of processors. Since this system requires loading the current-block pixels into the processor array prior to the error computation, the processing is not parallel and pipelined at the transitions from one current block of the video signal to the next.
These prior-art systems fail to take full advantage of parallel and pipelined processing. To achieve the throughputs desired, it is necessary to minimize the pre-processing, post-processing, and storage of data. Ideally, the system should generate a continuous stream of error values in response to a continuous input of data. In addition, the desired multiprocessing architecture has to be flexible, so that various parameters of the system, for example, the block sizes, can be varied without additional design effort. Advantageously, the implementation of the system should also be based on a standard cell technology, rather than on custom VLSI.