Multi-dimensional data processing involves a series of operations on data units, such as blocks of data, which are processed in some pre-set order. For the purposes of describing prior techniques and exemplary embodiments of the invention, the set of operations performed on a single block can be treated as an atomic unit. Although parallelism within the atomic operation itself can be exploited to some degree, for optimal load balancing across multiple processors, the best way to achieve parallelism is to parallelize the atomic operations across multiple blocks. Each block of data is assumed to depend on a certain causal set of neighborhood blocks. If block A depends on block B, then the atomic operation on block A cannot proceed until the corresponding operation on block B is complete. Causality implies that, given the pre-set order for processing the blocks in the multi-dimensional data set, each block can only depend on blocks that precede it in the order of processing.
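By way of illustration only, the causality condition described above can be sketched as a simple check. The function name `is_causal`, the block labels, and the dictionary representation of dependencies are assumptions made for this example and do not appear in the original description.

```python
def is_causal(order, deps):
    """Return True if every block depends only on blocks that precede it.

    order: list of block labels in the pre-set processing order (assumed unique).
    deps:  dict mapping a block label to the set of blocks it depends on.
    """
    # Position of each block in the processing order.
    position = {block: i for i, block in enumerate(order)}
    # A dependency is causal when the depended-upon block comes earlier.
    return all(
        position[dep] < position[block]
        for block, block_deps in deps.items()
        for dep in block_deps
    )


if __name__ == "__main__":
    order = ["A", "B", "C"]
    print(is_causal(order, {"B": {"A"}, "C": {"A", "B"}}))  # causal dependencies
    print(is_causal(order, {"A": {"B"}}))                   # A depends on a later block
```

In this sketch, a dependency of block A on block B is accepted only when B appears before A in the processing order, which is one way of expressing the constraint stated above.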
In the case of image processing, each block of data can represent a rectangular group of pixels and the atomic operation could consist of: (a) some form of causal prediction based on neighborhood blocks, (b) forming a residual data block, and (c) applying a discrete cosine transform on the residual block. Similar situations exist in video where the blocks are three-dimensional and may represent pixels from multiple frames that can be grouped into a cube. In the case of protein folding techniques, the blocks may represent amino acids arranged using a popular lattice model like the hydrophobic-polar (H-P) model. Applications that require such processing may range from video and/or image compression to running queries on multi-dimensional objects in a database.
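The three-step atomic operation for the image case may be sketched as follows. This is a minimal illustration under stated assumptions: the predictor (the mean of the left neighbor block), the orthonormal DCT-II scaling, and the names `dct_1d`, `dct_2d`, and `atomic_operation` are all hypothetical choices for this example, since the description only calls for "some form of causal prediction."

```python
import math


def dct_1d(values):
    """Orthonormal DCT-II of a one-dimensional sequence."""
    n = len(values)
    out = []
    for k in range(n):
        total = sum(
            values[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n)
        )
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * total)
    return out


def dct_2d(block):
    """Separable 2-D DCT: transform each row, then each column."""
    rows = [dct_1d(row) for row in block]
    cols = [dct_1d(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]  # transpose back to row-major


def atomic_operation(block, left_neighbor=None):
    """(a) predict, (b) form the residual, (c) transform the residual."""
    # (a) Hypothetical causal predictor: mean of the left neighbor, or 0 if none.
    if left_neighbor is None:
        prediction = 0.0
    else:
        count = len(left_neighbor) * len(left_neighbor[0])
        prediction = sum(map(sum, left_neighbor)) / count
    # (b) Residual block: the part of the block the prediction did not explain.
    residual = [[pixel - prediction for pixel in row] for row in block]
    # (c) Discrete cosine transform of the residual block.
    return dct_2d(residual)
```

Because the prediction reads a neighboring block, the call for a given block cannot begin until the neighbor is available, which is the dependency that constrains parallel execution.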
With reference now to FIG. 1, in the one-dimensional case, it is trivial to show that no parallelization of the atomic operations is possible if each block 102 is dependent on the immediately previous block. This is because the dependencies on adjacent blocks form a “dependency chain” (illustrated by arrows 104), which prevents any attempt at parallelism. On the other hand, if each block 106 were to depend only on the block before the immediately previous block, as indicated by arrows 108, then two adjacent blocks could be processed simultaneously.
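The one-dimensional case can be illustrated with a short sketch that computes the earliest time step at which each block may be processed, assuming unit-time atomic operations and an unlimited number of processors. The function name `earliest_steps` and the `distance` parameter (1 for dependence on the immediately previous block, 2 for dependence on the block before it) are assumptions introduced for this example.

```python
def earliest_steps(num_blocks, distance):
    """Earliest time step for each block when block i depends on block i - distance."""
    steps = []
    for i in range(num_blocks):
        if i < distance:
            # No predecessor exists, so the block can start immediately.
            steps.append(0)
        else:
            # The block must wait one step after the block it depends on.
            steps.append(steps[i - distance] + 1)
    return steps


if __name__ == "__main__":
    print(earliest_steps(6, 1))  # [0, 1, 2, 3, 4, 5]: a strict chain, no parallelism
    print(earliest_steps(6, 2))  # [0, 0, 1, 1, 2, 2]: two adjacent blocks per step
```

With a distance of 1 every block occupies its own time step, whereas with a distance of 2 each time step contains two adjacent blocks that can be processed simultaneously.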
In the case of multi-dimensional data processing, there is more than one order in which blocks may be processed. A popular order is the raster scan order for images, in which each row of blocks in the image is processed from left to right and on completing the row the rows below it are processed one by one in a similar manner. FIG. 2 shows raster scanning of rectangular blocks 202 within a two-dimensional data set 204.
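For illustration, the raster scan order over a two-dimensional grid of blocks can be generated as follows. The function name `raster_order` and the (row, column) indexing convention are assumptions made for this sketch.

```python
def raster_order(num_rows, num_cols):
    """List block coordinates (row, col) in raster scan order.

    Each row is traversed from left to right, and rows are taken from top
    to bottom.
    """
    return [(row, col) for row in range(num_rows) for col in range(num_cols)]


if __name__ == "__main__":
    print(raster_order(2, 3))
    # [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2)]
```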
In the prior art, instruction-level parallelism is a well-researched topic that encompasses both on-line and off-line techniques for finding concurrently executable snippets of code. It finds use in optimizing code execution on VLIW (very long instruction word) and SIMD (Single Instruction Multiple Data) architectures. However, it does not solve the problem of optimizing computational performance or load balancing across a cluster of individual processors.
Another well-known method for increasing the speed of processing is to pipeline the sequence of operations applied to each individual data block. This method suffers from the problem that the speed of the pipeline is limited by the slowest processing step among the series of steps applied to each block.
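The pipeline limitation can be made concrete with a small timing sketch. It assumes that every block passes through the same stages in order, that each stage handles one block at a time, and that blocks may wait between stages without limit; the function name `pipeline_finish_times` and the stage durations used below are hypothetical.

```python
def pipeline_finish_times(stage_times, num_blocks):
    """Completion time of each block at the final stage of a linear pipeline."""
    finish_prev = [0] * len(stage_times)  # when each stage finished the previous block
    completions = []
    for _ in range(num_blocks):
        finish_cur = []
        ready = 0  # time at which this block leaves the preceding stage
        for stage, duration in enumerate(stage_times):
            # A stage starts once the block has arrived and the stage is free.
            start = max(ready, finish_prev[stage])
            ready = start + duration
            finish_cur.append(ready)
        finish_prev = finish_cur
        completions.append(ready)
    return completions


if __name__ == "__main__":
    print(pipeline_finish_times([1, 3, 2], 3))  # [6, 9, 12]
```

With stage durations of 1, 3, and 2 time units, successive blocks complete 3 units apart, which equals the duration of the slowest stage regardless of how fast the other stages are.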
U.S. Pat. No. 5,010,515 discloses a parallel graphics processor with workload distributing and dependency mechanisms and method for distributing workload. An interactive 3-dimensional computer graphics display system has an arbitrary number of parallel connected graphic arithmetic processors (GAPs) coupled to an applications processor through a display list management module and coupled to an image memory unit that generates video output. High level commands from the applications processor are distributed for substantially equal temporal processing among the GAPs by delivering the commands to that GAP which is most ready to receive the next command. Each GAP has a FIFO input memory. A plurality of priority levels are established related to GAP FIFO input emptiness. An additional priority scheme is established within each FIFO emptiness level using a daisy-chained grant signal. A command bus includes dedicated lines for control signals between the GAPs to signal the priority and to pass along the grant signal. Sequentiality of the output from the GAPs is maintained by codes included in command headers and monitored by special tag FIFO memories resident on each GAP, which maintain an entry for each sequential command executed by any of the GAPs. The entry indicates that the command is sequential, and whether it is being executed by the GAP upon which the tag FIFO resides. A GAP output controller signals to all other GAPs when it has reached the stage where the next command it will send is a sequential command.
U.S. Pat. No. 6,477,281 discloses an image processing system having multiple processors for performing parallel image data processing. An image processing apparatus is composed of an image memory comprising a plurality of memory elements, and a processor unit comprising a plurality of processor elements. By suitably engineering the arrangement of the image memory of memory elements, the arrangement of the processor unit of processor elements and the connections among the memory elements and processor elements, it is possible to realize, through a simple construction, high-speed image processing such as image density conversion, image color conversion, image masking, image spatial filtering, image enlargement, image reduction, image rotation, image compression, image extension and image color correction. Also provided are an image processing apparatus for a color image, composed of a plurality of image memories and a plurality of processor elements, and a generalized, multifunctional data parallel processing apparatus capable of processing a large volume of data at high speed.
U.S. Pat. No. 5,510,842 discloses a parallel architecture for a high definition television video decoder having multiple independent frame memories. A parallel decoder for an MPEG-2 encoded video signal includes a deformatter which separates the input bit stream into multiple portions, each representing a respectively different section of the HDTV image. The separate portions are processed in parallel by respective decoders. In order to perform motion compensated processing, each of the four decoders includes a memory that holds data representing the entire image. Each decoder provides its decoded output data to all of the decoders to maintain the data in the respective memories.
It would be desirable to overcome the limitations in previous approaches.