An image processing function, or kernel, that implements a point operation on an image can be easily mapped to a SIMD processor and efficiently chained with other such kernels. This is because each result pixel depends on only one source pixel, so the order in which pixels are presented to each SIMD processing lane is unimportant.
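The order independence of point operations can be illustrated with a minimal sketch. It is only a software model, not an implementation from this disclosure: the lane count `LANES`, the helper `map_point_op`, and the two example operations `brighten` and `threshold` are hypothetical names chosen for illustration. The sketch chains two point operations and visits the pixels in two different orders.

```python
from typing import Callable, List

LANES = 4  # hypothetical SIMD width, chosen only for illustration


def map_point_op(pixels: List[int], op: Callable[[int], int],
                 order: List[int]) -> List[int]:
    """Apply a point operation, visiting pixel indices in batches of LANES."""
    out = [0] * len(pixels)
    for start in range(0, len(order), LANES):
        batch = order[start:start + LANES]  # one SIMD-width batch of indices
        for i in batch:                     # each lane handles one pixel
            out[i] = op(pixels[i])          # result depends only on pixels[i]
    return out


def brighten(p: int) -> int:
    """Example point operation: add an offset and saturate at 255."""
    return min(p + 40, 255)


def threshold(p: int) -> int:
    """Example point operation: binarize around 128."""
    return 255 if p >= 128 else 0


row = [10, 100, 120, 200, 250, 90, 88, 130]
forward = list(range(len(row)))
reverse = forward[::-1]

# Chain the two kernels, once with each traversal order.
a = map_point_op(map_point_op(row, brighten, forward), threshold, forward)
b = map_point_op(map_point_op(row, brighten, reverse), threshold, reverse)
```

Both traversal orders yield the same image, so the output of one point kernel can feed the next without rearranging any data between lanes.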
Many image processing kernel functions, however, require neighboring context to generate an output pixel value: to calculate the new value of a pixel, the kernel often reads the values of surrounding pixels. Many ways exist to map such kernels to a SIMD processor, each of which partitions the data differently among the SIMD processing lanes or traverses the data in a different order. The performance-optimized mapping usually depends on the underlying algorithm being implemented, which is partly why such a diversity of implementation strategies exists. Because of these differences, image processing kernels cannot be guaranteed to chain together easily without “glue logic” that transposes data between SIMD processing lanes or routes it through an extra global memory transfer. Such glue logic both reduces performance and lowers productivity.
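The mismatch between mappings can be shown with a small software model. This is a sketch under stated assumptions rather than any mapping described in this disclosure: the lane count `LANES`, the 3-tap kernel `box3`, and the two partitionings `blocked` and `interleaved` are hypothetical, and the row length is assumed to be a multiple of `LANES`. Both mappings compute the same pixel values but leave them in different lanes, so chaining them requires a conversion step.

```python
from typing import List

LANES = 4  # hypothetical SIMD width; row length is assumed to be a multiple of it


def box3(row: List[int], i: int) -> int:
    """3-tap sum with edge clamping: reads pixel i and both of its neighbors."""
    left = row[max(i - 1, 0)]
    right = row[min(i + 1, len(row) - 1)]
    return left + row[i] + right


def blocked(row: List[int]) -> List[List[int]]:
    """Mapping A: lane k owns one contiguous block of pixels."""
    per_lane = len(row) // LANES
    return [[box3(row, k * per_lane + j) for j in range(per_lane)]
            for k in range(LANES)]


def interleaved(row: List[int]) -> List[List[int]]:
    """Mapping B: lane k owns pixels k, k + LANES, k + 2 * LANES, ..."""
    return [[box3(row, i) for i in range(k, len(row), LANES)]
            for k in range(LANES)]


def interleaved_to_blocked(lanes: List[List[int]]) -> List[List[int]]:
    """Glue logic: move results between lanes to convert layout B to layout A."""
    n = sum(len(lane) for lane in lanes)
    flat = [lanes[i % LANES][i // LANES] for i in range(n)]  # back to image order
    per_lane = n // LANES
    return [flat[k * per_lane:(k + 1) * per_lane] for k in range(LANES)]


row = [1, 2, 3, 4, 5, 6, 7, 8]
by_block = blocked(row)       # per-lane results under mapping A
by_stride = interleaved(row)  # same values, but held by different lanes
```

A kernel written for the blocked layout cannot consume `by_stride` directly; the cross-lane shuffle in `interleaved_to_blocked` is the kind of extra step that costs performance when kernels with different mappings are chained.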
While existing solutions may work adequately for their intended applications, they are often inflexible in accommodating a large set of image processing algorithms, especially when little to no loss of performance is desired. Thus, improved mapping methods and apparatuses for image processing in SIMD processors are described herein.