It is common for graphics processing applications and image processing applications to define a processing pipeline implemented with several compute kernels. In this context each compute kernel is typically a program applied to an array of threads.
In the OpenCL™ language, a kernel is a function declared in a program. A work group is a collection of work items that executes on a single processor unit. The work items in a work group execute the same kernel, share memory resources, and are logically executed in parallel.
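The relationship between a kernel, work groups, and work items can be sketched in plain Python. This is an illustrative model only, not OpenCL host code: the names `run_kernel`, `scale_kernel`, and `local_mem` are hypothetical, and the loops run sequentially, whereas real work items are logically executed in parallel.

```python
# Plain-Python model of kernel / work-group / work-item indexing.
# All names here are hypothetical; this is not the OpenCL API.

def run_kernel(kernel, global_size, local_size, *args):
    """Invoke `kernel` once per work item, with work items grouped into work groups."""
    assert global_size % local_size == 0, "global size must be a multiple of local size"
    num_groups = global_size // local_size
    for group_id in range(num_groups):
        # Memory shared only by the work items of this work group (assumed model).
        local_mem = [0] * local_size
        for local_id in range(local_size):
            # Each work item has a unique position in the overall array of threads.
            global_id = group_id * local_size + local_id
            kernel(global_id, group_id, local_id, local_mem, *args)


def scale_kernel(global_id, group_id, local_id, local_mem, src, dst, factor):
    """Example kernel body: each work item processes exactly one element."""
    local_mem[local_id] = src[global_id]          # stage the element in shared memory
    dst[global_id] = local_mem[local_id] * factor  # write the scaled result


src = list(range(8))
dst = [0] * 8
run_kernel(scale_kernel, 8, 4, src, dst, 2)  # 8 work items in 2 work groups of 4
print(dst)  # [0, 2, 4, 6, 8, 10, 12, 14]
```

Every work item runs the same kernel function and differs only in its identifiers, which is what allows a single kernel to be applied across an array of threads.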
A pipeline may include a first stage of processing on an image, with the resulting image then fed into the next processing stage. It is common for applications to be written so that one compute kernel processes the whole data set before the next compute kernel is run. That is, a first kernel performs a first stage of processing on an image and a second kernel performs a second stage of processing on the resulting image. However, this normally requires accessing external memory because the data set sizes are typically too large to fit into on-chip caches or local memory. For a large data set, such as a typical camera image or decoded frame of video, this means reading all of the input data from, and writing all of the output data to, the external memory (e.g., external dynamic random access memory) for each kernel executed. This repeated reading and writing consumes power and has other disadvantages associated with data traffic to the external memory.
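The memory traffic described above can be illustrated with a minimal Python sketch. It assumes a hypothetical `ExternalMemory` class that stands in for off-chip memory and simply counts element accesses; the two per-pixel stages (`brighten`, `invert`) and the four-pixel image are invented for illustration and are not taken from any particular application.

```python
# Sketch of a two-kernel pipeline where each kernel processes the whole
# image before the next one runs. All names are hypothetical.

class ExternalMemory:
    """Stand-in for off-chip memory that counts element reads and writes."""

    def __init__(self):
        self.buffers = {}
        self.reads = 0
        self.writes = 0

    def store(self, name, data):
        self.buffers[name] = list(data)
        self.writes += len(data)

    def load(self, name):
        data = self.buffers[name]
        self.reads += len(data)
        return list(data)


def run_stage(mem, kernel, src_name, dst_name):
    """Run one kernel over the full data set: read everything, write everything."""
    pixels = mem.load(src_name)
    mem.store(dst_name, [kernel(p) for p in pixels])


def brighten(p):
    return min(p + 10, 255)


def invert(p):
    return 255 - p


mem = ExternalMemory()
mem.buffers["input"] = [0, 100, 250, 255]  # preloaded image, not counted as traffic

run_stage(mem, brighten, "input", "stage1")  # first kernel, whole image
run_stage(mem, invert, "stage1", "output")   # second kernel, whole image

print(mem.buffers["output"])  # [245, 145, 0, 0]
print(mem.reads, mem.writes)  # 8 8
```

With two kernels and a four-element image, the sketch performs eight element reads and eight element writes. The intermediate image is written out by the first kernel and read back by the second, so the traffic grows in proportion to both the image size and the number of kernels in the pipeline.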