Parallel processing refers to the division of a data processing task among multiple computer processors in order to decrease the amount of time required to complete the task. Graphics processing units (“GPUs”) have multiple processors and can therefore be used to parallelize data processing, either alone or together with one or more central processing units (“CPUs”). A typical approach to GPU data parallelism is to construct a grid (or NDRange in OpenCL terminology) in which each thread is in charge of a single data element. This formulation is simple and is suitable for simple parallel problems in which data elements are processed independently, such as color transforms, pixel thresholding, image subtraction and so on. For more complicated problems that involve neighboring pixels and reduction (such as the block matching problem being discussed), this simple approach has several drawbacks. First, matching a block (or filtering) at two adjacent pixels creates overlapping data accesses, which increase latency because parallel threads have to wait for each other to access the same memory address. Second, the size of the block to be matched introduces another dimension for parallelization, and exploiting this dimension (rather than adhering only to the image dimensions) may be more effective.
Third, the reference image may be very large in real applications, for example tens of megapixels, a size that exceeds the maximum grid dimensions of any GPU. In order to cover the whole image, the program must contain mixed host code (on the CPUs) and device code (on the GPUs) to recursively or iteratively process all portions of the image. This introduces more complexity to the design and implementation of the parallel program.
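The first and third drawbacks can be illustrated with a minimal sketch. The Python code below is a CPU-side simulation only, not GPU code and not the method of this disclosure. The function names, the 8 x 8 image, the 3 x 3 block and the grid limit of 4096 are all hypothetical values chosen for illustration and do not describe any particular device. The first function counts how often each pixel is read when one thread per pixel matches a block anchored at its own pixel. The second function lists the separate launches that host code would have to issue to cover an image larger than the assumed grid limit.

```python
# Illustrative CPU-side simulation; all names and sizes are hypothetical.

def read_counts(width, height, block):
    """Count how often each pixel is read when one simulated thread per
    anchor pixel matches a block x block window starting at that pixel."""
    counts = [[0] * width for _ in range(height)]
    for ty in range(height - block + 1):        # one "thread" per anchor row
        for tx in range(width - block + 1):     # one "thread" per anchor column
            for dy in range(block):
                for dx in range(block):
                    counts[ty + dy][tx + dx] += 1
    return counts


def tile_launches(width, height, max_grid_x, max_grid_y):
    """Split an image into tiles no larger than an assumed grid limit.
    Each tuple (x0, y0, w, h) stands for one launch issued by the host."""
    launches = []
    for y0 in range(0, height, max_grid_y):
        for x0 in range(0, width, max_grid_x):
            w = min(max_grid_x, width - x0)
            h = min(max_grid_y, height - y0)
            launches.append((x0, y0, w, h))
    return launches


# Overlapping accesses: an interior pixel is read block * block times.
counts = read_counts(width=8, height=8, block=3)
print(max(max(row) for row in counts))   # prints 9

# A 10000 x 7000 image (70 megapixels) under an assumed 4096 x 4096 limit.
launches = tile_launches(width=10000, height=7000,
                         max_grid_x=4096, max_grid_y=4096)
print(len(launches))                     # prints 6
```

In this sketch, a corner pixel is read once while an interior pixel is read nine times, which is the access overlap described above. Under the assumed limit, the single large image requires six host-issued launches, which is the kind of mixed host and device control flow that adds complexity to the parallel program.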
It would therefore be advantageous to provide a new method for multiple-cache parallel computing to reduce the number of steps needed to execute a thread and provide a more efficient computing process.