The use of multi-core and multi-processor architectures to speed up computer systems, including printing systems, has become widespread in recent years. Multi-core and multi-processor architectures allow a number of computation threads belonging to one or more processes to execute in parallel across a number of cores or processors to reduce the overall execution time.
Modern GPUs (Graphics Processing Units) typically contain hundreds or thousands of simple processing cores, in what is also referred to as a many-core architecture. GPUs are suited to applications and algorithms that can be executed with a large degree of parallelism, requiring little or no communication among parallel threads and minimal branching within a thread's instruction stream. Extracting such a large degree of parallelism from applications originally designed for sequential execution is often difficult, but the potential benefits of GPU processing can be large.
Both multi-core and many-core architectures, which may include one or more CPUs and/or one or more GPUs, coexist in today's computer systems, ranging from supercomputers, workstations and home desktops to mobile devices. A computer system with different kinds of computational processors, such as a CPU and a GPU, is referred to as a “heterogeneous system”. A GPU can be implemented on a discrete card or integrated with the CPU on a chip. The ultimate goal of using CPUs and GPUs to run an application in parallel is to achieve maximum performance. However, only some tasks can be executed effectively on a GPU. Determining which tasks are better suited to a GPU than to a CPU can be difficult.
In a print rendering system, a high-level description of a page in the form of a page description language (e.g., PDF or XPS) is received and converted into pixels by a rendering engine for printing. The rendered pixels in RGB colours need to be converted into process colours (for example, CMYK) to be sent to a printer. The PDF interpretation is usually performed by a CPU due to the complexity of the objects described in the page. The rendering task may be executed entirely on the CPU, or partly on the CPU and partly on a GPU. The colour conversion task (e.g., for non-colorimetric conversions) might perform better on the GPU because converting each pixel colour requires only simple instructions and is independent of other pixels. This allows pixel-parallel processing of all pixels in a page or a portion of a page.
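The pixel independence described above can be illustrated with a minimal sketch. The formula below is a naive, non-colorimetric RGB to CMYK conversion chosen only for illustration; it is not taken from any particular rendering engine, and the function names `rgb_to_cmyk` and `convert_page` are hypothetical. Each output pixel depends only on its own input pixel, so on a GPU every call could in principle be assigned to a separate thread.

```python
def rgb_to_cmyk(r, g, b):
    """Naive, non-colorimetric RGB -> CMYK conversion of one 8-bit pixel.

    Illustrative assumption only: real print pipelines typically use
    device-specific colour profiles or lookup tables instead.
    """
    c, m, y = 255 - r, 255 - g, 255 - b
    k = min(c, m, y)           # black replaces the grey component shared by C, M, Y
    if k == 255:               # pure black: avoid dividing by zero below
        return (0, 0, 0, 255)
    scale = 255 - k
    return tuple((v - k) * 255 // scale for v in (c, m, y)) + (k,)


def convert_page(pixels):
    """Convert a flat list of (r, g, b) pixels.

    No pixel reads another pixel's data, so the loop iterations could
    run in any order or fully in parallel.
    """
    return [rgb_to_cmyk(*p) for p in pixels]


if __name__ == "__main__":
    print(convert_page([(255, 255, 255), (255, 0, 0), (0, 0, 0)]))
```

Because the conversion uses only a few arithmetic operations and one branch per pixel, it matches the profile of work that the preceding paragraphs describe as well suited to a many-core processor.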
The rendered pixels after colour conversion have continuous tone (referred to as “contone”) colour components, each of which is typically at least 8 bits per channel. A CMYK colour printer uses three different types of toner, namely cyan, magenta and yellow, to create colour images. An extra channel (black) can be used to print fully black pixels (e.g., for text) and to create very dark colours. A printing device might, however, have a colour depth of fewer than 8 bits per channel. Modern printing devices can realise more than 1 bit per channel by varying the size, shape or spacing of dots, thus representing pixels of 2 bits or 4 bits per channel. The process of converting the page pixels in contone colour components into process colour components suitable for printing is called halftoning.
Halftoning can consume significant computing resources because each of the four 8-bit components of every pixel needs to be converted into a smaller number of bits. Numerous halftoning methods use sequential algorithms on a CPU and aim to improve halftoning quality and speed, for example, dither patterns (or matrices), error diffusion and stochastic halftoning. Some known methods use GPUs to halftone the rendered pixels using a dither matrix method or an error diffusion method. In such methods, a page is sub-divided into tiles, each of which is dithered on a GPU. After a number of parallel threads are configured, a GPU kernel is launched to start parallel dithering of the pixels in a tile. While the GPU is dithering, the CPU is otherwise idle and merely waits for the dithering result.
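As a rough illustration of dither-matrix halftoning on a tile, the following sketch reduces one 8-bit channel to 1 bit per pixel. The 4x4 Bayer matrix, the threshold scaling and the function name `dither_tile` are assumptions made for this example; the known methods referred to above may use different matrices, tile sizes and output bit depths. Each output pixel depends only on its own input value and position, which is why the work maps naturally onto one GPU thread per pixel.

```python
# Standard 4x4 Bayer ordered-dither matrix (values 0..15).
BAYER_4X4 = [
    [0, 8, 2, 10],
    [12, 4, 14, 6],
    [3, 11, 1, 9],
    [15, 7, 13, 5],
]


def dither_tile(tile, x0=0, y0=0):
    """Halftone one tile of 8-bit values (list of rows) to 1 bit per pixel.

    x0, y0 give the tile's position on the page so that the dither
    pattern stays aligned across neighbouring tiles.
    """
    out = []
    for y, row in enumerate(tile):
        out_row = []
        for x, value in enumerate(row):
            # Scale the matrix entry to a threshold in the 0..255 range.
            threshold = BAYER_4X4[(y0 + y) % 4][(x0 + x) % 4] * 16 + 8
            out_row.append(1 if value > threshold else 0)
        out.append(out_row)
    return out


if __name__ == "__main__":
    mid_grey = [[128] * 4 for _ in range(4)]
    for row in dither_tile(mid_grey):
        print(row)
```

For a uniform mid-grey tile, half of the 16 output pixels are set, which approximates the input tone when viewed from a distance.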
Another known error diffusion method halftones one portion of the rendered page on a CPU and another portion on a GPU in parallel. In error diffusion, the quantisation residual of an output colour in a pixel is distributed to its neighbouring pixels that have not yet been processed. For example, a pixel can only be dithered after the pixels to its left, top-left, top and top-right have been dithered. Because of this data dependency, the pixels at the top-left corner of the image must be halftoned sequentially on the CPU. Once the pixels along the diagonal of the corner of the image are halftoned on the CPU, the pixels next to the diagonal can be dithered independently, i.e., they can be dithered in parallel. Some pixels can be processed by a CPU and others by a GPU in parallel. However, the selection between the CPU and the GPU is based only on the position of a pixel within the tile, which may not be efficient for some input data.
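The data dependency described above can be made concrete with a small scheduling sketch. It assumes the four-neighbour dependency stated in the text (left, top-left, top, top-right) and groups pixels by the index t = x + 2*y, a common wavefront ordering for this dependency pattern; this ordering and the function names `dependencies` and `wavefronts` are illustrative assumptions, not the scheme of the known method. All pixels that share the same index have no dependency on each other and could be dithered in parallel.

```python
def dependencies(x, y, width, height):
    """Pixels that must be dithered before pixel (x, y)."""
    candidates = [(x - 1, y), (x - 1, y - 1), (x, y - 1), (x + 1, y - 1)]
    return [(cx, cy) for cx, cy in candidates
            if 0 <= cx < width and 0 <= cy < height]


def wavefronts(width, height):
    """Group pixel coordinates (x, y) by wavefront index t = x + 2*y.

    Every dependency of a pixel has a strictly smaller index, so the
    groups can be processed one after another, with the pixels inside
    each group processed in parallel.
    """
    fronts = [[] for _ in range(width + 2 * (height - 1))]
    for y in range(height):
        for x in range(width):
            fronts[x + 2 * y].append((x, y))
    return fronts


if __name__ == "__main__":
    for t, front in enumerate(wavefronts(4, 3)):
        print(t, front)
```

The first few groups near the top-left corner contain only one or two pixels, so they offer almost no parallelism, while later groups are wider. This is consistent with the position-based split between sequential CPU processing and parallel processing described in the paragraph above.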