The use of multi-core and multi-processor architectures to achieve performance speedup in computer systems, including printing systems, has become widespread in recent years. Such architectures allow a number of computation threads belonging to one or more processes to execute in parallel across a number of cores or processors, thereby reducing the overall execution time.
One such multi-core architecture is a graphics processing unit (GPU). GPUs are commonly used to accelerate the rendering of 3D graphics for viewing on a display device. However, in recent years, GPU manufacturers have enabled the general-purpose programming of their GPUs. This concept is commonly known as general-purpose GPU (GPGPU). GPUs and other types of architectures, such as single-instruction multiple-data (SIMD) units and field programmable gate arrays (FPGAs), are also known as stream processors.
Modern GPUs typically contain hundreds of simple processing cores, also known as processing elements. They are therefore suited to applications and algorithms that can be executed with a large degree of parallelism, requiring little or no communication among parallel threads and minimal branching within a thread's instruction stream. Extracting such a large degree of parallelism from applications originally designed for sequential execution is often difficult, but the potential benefits of GPU processing can be large.
Print data is often pre-processed to form some kind of region-based intermediate representation. This may be done to compress the print data and use less memory, to transfer the print data across a network, or to enable printing in real time, for example. Due to the nature of region-based intermediate representations, it is often difficult to process them with many parallel threads, including on GPU architectures.
The main difficulty is that the intermediate representation must be processed in a sequential fashion. Some prior art methods store the intermediate representation as a sequence of rasterised non-overlapping edges, grouped into tiles (also referred to as fillmaps) of the page. For each tile, the edges are typically sorted by their start coordinate, first by start y-coordinate and then by start x-coordinate. For each edge, the x-coordinates of the edge's crossings with successive scan lines are stored in a sequence. To reduce memory usage, each x-coordinate is typically stored as the difference between that x-coordinate and the previous x-coordinate (an encoding method known in the art as “delta encoding”). To render an individual portion of a tile using such a representation, all edges must be searched sequentially, the delta-encoded x-coordinates decoded, and the x-coordinates sorted along each scan line to identify which edge activates the pixels in the portion being rendered. This process, repeated for every portion of the tile, entails a large amount of redundant processing (similar steps are repeated for every portion) and is therefore very inefficient. For this reason, such a representation is suitable for processing sequentially, but is not suitable for processing in parallel.
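The sequential dependency introduced by delta encoding can be illustrated with a minimal sketch. The data layout here (each edge as a start scan line plus a list of deltas) and the function names are assumptions made for illustration only and are not taken from any particular prior art format. Each edge is assumed to have at least one crossing.

```python
from itertools import accumulate

def delta_encode(xs):
    """Keep the first x-coordinate, then store successive differences."""
    return [xs[0]] + [b - a for a, b in zip(xs, xs[1:])]

def delta_decode(deltas):
    """Recover x-coordinates with a running sum.

    The k-th coordinate depends on every earlier delta, so a crossing
    cannot be read without decoding all crossings that precede it.
    """
    return list(accumulate(deltas))

def edge_activating_pixel(edges, px, py):
    """Return the index of the edge that activates pixel (px, py).

    `edges` is a hypothetical list of (start_y, deltas) pairs. The
    activating edge is taken to be the one whose crossing on scan line
    `py` is the largest x-coordinate not exceeding `px`.
    """
    best_index, best_x = None, None
    for index, (start_y, deltas) in enumerate(edges):
        row = py - start_y
        if row < 0 or row >= len(deltas):
            continue  # this edge does not cross scan line py
        # Decode the prefix of deltas up to and including this scan line.
        x = delta_decode(deltas[:row + 1])[-1]
        if x <= px and (best_x is None or x > best_x):
            best_index, best_x = index, x
    return best_index
```

If every pixel or portion of a tile calls `edge_activating_pixel` independently, each call repeats the search over all edges and the decoding of the same delta prefixes, which is the redundant processing described above.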
Other prior art methods store the representation as a sequence of bits, each bit corresponding to a pixel of the page. Typically, an “on” bit represents the transition from one object to the next along the corresponding scan line and between successive scan lines. To render any given object or pixel, all bits of the representation up to the given object or pixel must be processed. It is therefore inefficient to render individual objects or pixels of the page in parallel using such a representation.
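The prefix dependency of such a bit-based representation can be sketched as follows. This is a simplified model that assumes a single scan line in which each "on" bit marks the start of the next object. Transitions between scan lines are ignored, and the function names are hypothetical.

```python
from itertools import accumulate

def render_scan_line_sequentially(bits):
    """One pass over the bits: a running count of 'on' bits gives the
    object index of every pixel on the scan line."""
    return list(accumulate(bits))

def object_index_of_pixel(bits, pixel):
    """Object index of a single pixel, computed in isolation.

    Every bit up to and including `pixel` has to be examined, because
    the index is the number of transitions seen so far.
    """
    return sum(bits[:pixel + 1])
```

Under this model, a scan line of n pixels is rendered sequentially by examining n bits, whereas rendering each pixel independently examines n(n+1)/2 bits in total, since pixel k must re-read the k+1 bits that precede and include it.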
Other prior art methods store the representation as a sequence of non-overlapping regions bounded by vector-based edges. Each region is typically stored independently, with its bounding edges and other rendering information such as colour. This region-independence allows individual regions to be rendered in parallel, possibly using existing GPU-based pipelines such as OpenGL. However, when rendering regions in parallel, the number of processors that can be utilised simultaneously is limited to the number of regions. This limitation means that GPUs, which typically contain hundreds of processing cores, will not be fully utilised. Another approach to rendering such a representation in parallel is to render individual portions of the image in parallel. To render an individual portion, the regions must be searched and their edges evaluated to determine if a region contains the portion. As with other representations, rendering individual portions using this representation entails a large amount of redundant processing (similar steps are repeated for every portion), and is therefore not suitable for parallel execution.
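The per-portion search over a region-based representation can be sketched as below. The sketch assumes that each region is a simple polygon given as a list of vertices paired with a colour, and that a portion is a single sample point. It uses a standard even-odd ray-crossing test, which is chosen here for illustration and is not necessarily the edge evaluation used by any particular prior art method.

```python
def point_in_polygon(x, y, vertices):
    """Even-odd test: count the polygon edges crossed by a ray cast
    from (x, y) towards +x."""
    inside = False
    n = len(vertices)
    for i in range(n):
        x0, y0 = vertices[i]
        x1, y1 = vertices[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y0 > y) != (y1 > y):
            x_cross = x0 + (y - y0) * (x1 - x0) / (y1 - y0)
            if x < x_cross:
                inside = not inside
    return inside

def colour_of_portion(regions, x, y):
    """Search the regions in order and return the colour of the first
    region containing the sample point, or None if no region does.

    `regions` is a hypothetical list of (vertices, colour) pairs.
    """
    for vertices, colour in regions:
        if point_in_polygon(x, y, vertices):
            return colour
    return None
```

Calling `colour_of_portion` once per portion means that every portion repeats the same search through the region list and the same edge evaluations, so the total work grows with the number of portions multiplied by the number of region edges.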