Graphical images are conventionally displayed on display devices which include a plurality of picture elements, or pixels. One such device is illustrated in FIG. 1 of the accompanying drawings. The display device 1 is made up of a plurality (for example 640×480, 800×600, or up to 1600×1200) of pixels 3 which are used to make up the image displayed on the screen, as is well known. In order to display an image on the screen, the colour value of each pixel must be calculated for each frame of the image to be displayed. The pixel information is typically stored in a “frame buffer” of the display device. Calculation of pixel colour values is known as “shading” and is advantageously performed by a dedicated graphics processing system. The use of such a dedicated graphics processing system in combination with a host system enables the processing power of the host system to be more effectively utilised in processing applications software. The applications software typically determines the geometry of the images to be shown on the display device, and the graphics system takes that geometrical information and calculates from it the actual pixel values for display on the device 1.
Commonly, the graphics processing system receives information from the host application in the form of data regarding graphical primitives to be displayed. A graphical primitive is a basic graphical element which can be used to make up complex images. For example, a common graphical primitive is a triangle, and a number of triangles of different shapes and sizes can be used to make up a larger, more complex shape. The primitive data includes information regarding the extent, colour, texture and other attributes of the primitive. Any amount of primitive information can be used. For example, colour and extent information alone may be sufficient for the application or primitive concerned. Visual depth information, i.e. the relative position of the primitive, can also be included. In the following examples, primitives having a high visual depth value are considered to be closer to the viewer, i.e. more visible, than those primitives having a lower visual depth value. Such a convention is arbitrary, and could be replaced by any other suitable convention.
FIG. 2 illustrates in side view, and FIG. 3 illustrates in front view, the display of primitives P1, P2 and P3 on the pixels 3 of the display device 1. Primitive P1 is the rearmost primitive, having a visual depth P1d which is lower than the visual depths of the other primitives. Primitive P3 is the frontmost primitive. As can be seen, the primitives overlap one another, and so the graphics processing system must calculate, for each pixel, which of the primitives is displayed at that pixel.
In the following examples, three pixels 3a, 3b and 3c will be used to illustrate the graphical processing of the primitive data.
In a graphical processing system having a single processor, a primitive is analysed so that a pixel covered by the primitive can be identified. A “fragment” of the primitive data is determined for that pixel, and is then processed to determine the colour to be displayed for the pixel. When one fragment has been processed, a further fragment can be identified and processed.
The graphics processor receives fragment information which contains data indicating the colour, texture and blending characteristics of the primitive concerned at a particular pixel.
A “shading” process is then used to process the fragment information in order to determine the actual pixel data which is to be written to the frame buffer of the display device for display thereon. The shading process results in the determination of the colour of the pixel from the fragment information. This may involve a texture look-up operation to determine the texture characteristics to be displayed at the pixel. A texture look-up involves a memory access step to retrieve the texel, or texture element, for the pixel. For opaque primitives, the colour information is supplied to the frame buffer where it overwrites the current value to give a new value for display.
The frame buffer contents can be displayed immediately, or could be displayed at an arbitrary time in the future (for example using multiple frame buffers for the device), and any suitable scheme can be used for the display device concerned.
FIG. 4 shows the final frame buffer values for the primitive arrangements shown in FIGS. 2 and 3. Pixel 3a will display the properties of primitive P1, pixel 3b will display the properties of primitive P2, and pixel 3c will display the properties of primitive P3.
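The per-pixel visibility decision for opaque primitives can be sketched as follows. This is an illustrative sketch only: the numeric depth values and the coverage table are assumptions chosen to match the description of FIGS. 2 to 4, and the function name `visible_primitive` is hypothetical.

```python
# Hypothetical depth values; a higher value means closer to the viewer,
# following the convention adopted in the text.
depth = {"P1": 1, "P2": 2, "P3": 3}

# Assumed coverage of the three example pixels, inferred from the
# description of the figures rather than taken from them directly.
coverage = {
    "3a": ["P1"],
    "3b": ["P1", "P2"],
    "3c": ["P1", "P2", "P3"],
}

def visible_primitive(covering, depth):
    """Return the covering primitive with the highest visual depth value."""
    return max(covering, key=lambda name: depth[name])

# For opaque primitives the frame buffer ends up holding, for each pixel,
# the properties of the frontmost primitive covering that pixel.
frame_buffer = {
    pixel: visible_primitive(names, depth) for pixel, names in coverage.items()
}
print(frame_buffer)
```

Under these assumed values the result agrees with the text: pixel 3a shows P1, pixel 3b shows P2 and pixel 3c shows P3.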
A development of such a system uses a region-based processing scheme including a plurality of processors. As illustrated in FIG. 1, the pixels 3 of the display device 1 are grouped into a number of regions, for example region 5. The region size is usually defined by the number of processors in the multiple processor system. One particular processing architecture could be a single instruction multiple data (SIMD) processing architecture. In a region-based architecture, the primitives are sorted to determine which regions of the display include which primitives, and are then subjected to “rasterisation” to break the primitives into fragments. The fragment information is stored for each primitive until all of the primitives have been rasterised. Usually, only the most recently determined fragment information is retained for each pixel. A shading process is then used to determine the pixel data to be stored in the frame buffer for display. Such a scheme has the advantage that the number of times the shading process is used can be minimised, by shading multiple pixels at the same time (using one processor per pixel) and by waiting until a high proportion of pixels are ready to be shaded. Such a scheme is known as “deferred shading” because the shading process is carried out after the rasterisation process has been completed.
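The sorting of primitives into regions can be sketched as below. The text does not specify how the sort is performed, so this sketch assumes a simple bounding-box test, which is conservative (a primitive may be listed for a region that it does not actually touch). The 8×8 region size, the bounding boxes and the function names are all hypothetical.

```python
# Hypothetical region size in pixels; in the scheme described above it
# would follow from the number of processors (here 8 x 8 = 64).
REGION_W, REGION_H = 8, 8

def regions_touched(bbox):
    """List the (column, row) regions overlapped by an inclusive pixel
    bounding box given as (x_min, y_min, x_max, y_max)."""
    x0, y0, x1, y1 = bbox
    return [
        (rx, ry)
        for ry in range(y0 // REGION_H, y1 // REGION_H + 1)
        for rx in range(x0 // REGION_W, x1 // REGION_W + 1)
    ]

def bin_primitives(bboxes):
    """Map each region to the primitives whose bounding box overlaps it,
    preserving the order in which the primitives were supplied."""
    bins = {}
    for name, bbox in bboxes.items():
        for region in regions_touched(bbox):
            bins.setdefault(region, []).append(name)
    return bins

# Assumed bounding boxes for three primitives on a 16 x 16 pixel display.
bboxes = {"P1": (0, 0, 15, 7), "P2": (4, 4, 11, 11), "P3": (9, 9, 10, 10)}
bins = bin_primitives(bboxes)
for region in sorted(bins):
    print(region, bins[region])
```

Each region can then be rasterised using only the primitives listed for it.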
Such a scheme works well when all of the primitives are opaque since deferring the shading operation enables large memory accesses (i.e. texture look-ups) to be deferred and performed in parallel. The result for opaque primitives will be as shown in FIG. 4.
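The saving offered by deferred shading for opaque primitives can be illustrated by counting shading operations. This is a minimal sketch under stated assumptions: the depth values, the coverage lists and the `shade` stand-in are hypothetical, and the retained fragment for each pixel is assumed to be chosen by a depth comparison.

```python
def rasterise(primitives):
    """Yield one (pixel, primitive name, depth) fragment per covered pixel."""
    for name, depth, pixels in primitives:
        for pixel in pixels:
            yield pixel, name, depth

def shade(name):
    """Stand-in for the costly shading step (for example a texture look-up)."""
    return "colour(" + name + ")"

# Assumed primitives: (name, visual depth, covered pixels), rearmost first.
primitives = [
    ("P1", 1, ["3a", "3b", "3c"]),
    ("P2", 2, ["3b", "3c"]),
    ("P3", 3, ["3c"]),
]

# Immediate shading: every fragment that passes the depth test is shaded.
immediate_shades = 0
depth_buffer = {}
frame_immediate = {}
for pixel, name, depth in rasterise(primitives):
    if depth >= depth_buffer.get(pixel, float("-inf")):
        depth_buffer[pixel] = depth
        frame_immediate[pixel] = shade(name)
        immediate_shades += 1

# Deferred shading: retain one fragment per pixel during rasterisation,
# then shade each pixel once after all primitives have been rasterised.
pending = {}
for pixel, name, depth in rasterise(primitives):
    if pixel not in pending or depth > pending[pixel][1]:
        pending[pixel] = (name, depth)
frame_deferred = {pixel: shade(name) for pixel, (name, _) in pending.items()}
deferred_shades = len(pending)

print(immediate_shades, deferred_shades)
```

With these assumed inputs both approaches produce the same frame buffer, but the deferred version shades each pixel once, and in a multi-processor system those shading operations could be carried out in parallel.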
A technique which can be used to provide transparent or partly transparent primitives is known as “blending”. In a blending process, the current pixel data stored in the frame buffer is combined with newly calculated pixel data relating to a new primitive. The combination is performed in a manner defined by the blending algorithm in accordance with a so-called α-value, which indicates the amount of blending that is to be achieved; for example, an α-value of 0.5 indicates that the result of the blend is to be half existing colour and half new colour. Blending occurs after the shading process. In the single processor case, blending is performed immediately following the shading process for each pixel. The pixel data is blended in the order in which the associated primitives are output from the host system.
FIG. 5 illustrates the calculated frame buffer values for the primitives of FIGS. 2 and 3, where primitives P1 and P2 are blended, and P3 is opaque. Pixel 3a displays a blend of primitive P1 and the background, pixel 3b displays a blend of P1, P2 and the background, and pixel 3c displays only P3.
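The blending of the three example pixels can be sketched as follows. This is an illustration only: the text does not fix a particular blending algorithm, so the sketch assumes a simple linear blend in which the α-value weights the new colour, and the colours, α-values and coverage table are hypothetical. The primitives are assumed to be output by the host from rearmost to frontmost.

```python
def blend(existing, new, alpha):
    """Linear blend of two RGB colours; alpha weights the new colour
    (an assumed convention, with alpha = 1.0 meaning opaque)."""
    return tuple(alpha * n + (1.0 - alpha) * e for e, n in zip(existing, new))

BACKGROUND = (0.0, 0.0, 0.0)

# Assumed primitives in host output order: (name, RGB colour, alpha).
primitives = [
    ("P1", (1.0, 0.0, 0.0), 0.5),
    ("P2", (0.0, 1.0, 0.0), 0.5),
    ("P3", (0.0, 0.0, 1.0), 1.0),  # opaque
]

# Assumed coverage of the example pixels.
coverage = {"3a": {"P1"}, "3b": {"P1", "P2"}, "3c": {"P1", "P2", "P3"}}

def resolve(pixel):
    """Blend every covering primitive into the pixel, in host output order."""
    colour = BACKGROUND
    for name, prim_colour, alpha in primitives:
        if name in coverage[pixel]:
            colour = blend(colour, prim_colour, alpha)
    return colour

for pixel in ("3a", "3b", "3c"):
    print(pixel, resolve(pixel))
```

Under these assumptions pixel 3a holds a mix of P1 and the background, pixel 3b a mix of P1, P2 and the background, and pixel 3c only the colour of the opaque primitive P3.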
In the region based architecture, it is not practical to defer blending with the deferred shading process because of the requirement to store large amounts of data relating to all of the primitives occurring at a pixel regardless of whether those primitives are visible or not. This is necessary because a blended primitive can have an effect on the final values of the pixel. In such a case, the shading and blending processes must be carried out for a pixel as soon as a blended primitive is encountered. This results in low utilisation of a multi-processor design, since, on average, a single blended primitive is likely to cover only a small number of pixels and so the shading and blending processes must be carried out even though only a small number of the available processors have the required data. In addition, if shading and blending were to be performed for each primitive, many operations would be unnecessary due to overlapping primitives at a pixel.
Deferred shading for images including blended primitives has not been implemented for region based multiple processor graphics processing architectures, because of these problems.
It is therefore desirable to provide a graphics processing system which can defer blending and shading operations in order to provide higher performance and faster computation time.
Furthermore, conventional data processing techniques process data serially through different tasks. For example, see FIG. 38 of the accompanying drawings, which illustrates a conventional process in which data items (Data #1) are generated, for example as the result of a calculation or of a memory fetch operation, and are then processed by a first task (Task A). Task A results in new data (Data #2) for processing by a second task (Task B) to produce result data (R data). Conventionally, these tasks need to be repeated for each new data item to be processed.
In a single instruction multiple data (SIMD) architecture, a number of processing elements act to process respective data items according to a single instruction at any one time. Such processing is illustrated in FIG. 39 of the accompanying drawings, which shows processing by n elements.
With a single instruction stream it is necessary for all the n processing elements to perform the same tasks, although each processing element has its own data: this is SIMD. Every processing element generates a new item of data (Data#1 0 to Data#1 n). Each respective processing element then performs a respective Task A on its respective Data#1.
On completion of Task A by each of the processing elements, some percentage (between 0% and 100%) of the processing elements will have a respective valid data item on which to perform a respective Task B. Since all the processing elements must perform the same task at the same time, those without valid data are performing no useful work, and the set of processing elements, as a whole, is not working at full utilisation, i.e. at maximum efficiency.
As the fraction of processing elements producing valid data, as a result of Task A, as input data (Data#2) to Task B decreases, the efficiency of the whole array of processing elements also decreases. Furthermore, as the “cost” of Task B increases, i.e. the number of cycles* required to perform the task, the utilisation of the whole of the processing flow decreases. (*By way of example, Fixed Point Processing requires approx 10 cycles for a typical 4 byte integer and Floating Point Processing requires approx 100 cycles for a 4 byte floating point number.)
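The effect of both factors can be illustrated with a simple cycle-counting model. The model is an assumption made for illustration and is not taken from the text: every element is taken to be occupied for the full duration of both tasks, and only elements holding valid data are counted as doing useful work in Task B. The function name and the element counts are hypothetical; the cycle counts of 10 and 100 follow the approximate figures given above.

```python
from fractions import Fraction

def simd_utilisation(n_elements, n_valid, cost_a, cost_b):
    """Fraction of element-cycles spent on useful work over Task A then Task B.

    All n_elements do useful work in Task A. Only the n_valid elements
    holding valid Data#2 do useful work in Task B, yet every element is
    occupied for cost_a + cost_b cycles.
    """
    useful = n_elements * cost_a + n_valid * cost_b
    total = n_elements * (cost_a + cost_b)
    return Fraction(useful, total)

all_valid = simd_utilisation(8, 8, 10, 10)      # every element has valid data
few_valid = simd_utilisation(8, 2, 10, 10)      # fewer valid elements
costly_b = simd_utilisation(8, 2, 10, 100)      # same, with a costlier Task B

print(all_valid, few_valid, costly_b)
```

In this model, utilisation falls when fewer elements hold valid data, and falls further when Task B becomes more expensive relative to Task A.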
Clearly the flow through Tasks A and B can be extended with further tasks, i.e. Task C, Task D etc. The output data from Task B feeds into Task C, and clearly if Task B eliminates the data, Task C will suffer under-utilisation, and so on. Further tasks can be cascaded in this fashion, with utilisation rapidly decreasing through each step as data items are eliminated.
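The decline in utilisation through a cascade of tasks can be sketched as below. The figures are hypothetical: 16 processing elements are assumed, with half of the remaining valid data items eliminated by each task, and the overall figure additionally assumes that every task costs the same number of cycles.

```python
from fractions import Fraction

def cascade_utilisation(n_elements, valid_per_task):
    """Utilisation of each task in a cascade.

    valid_per_task[i] is the number of processing elements that still
    hold valid data when task i starts (hypothetical figures).
    """
    return [Fraction(valid, n_elements) for valid in valid_per_task]

# Assumed survivors at the start of Tasks A, B, C and D.
per_task = cascade_utilisation(16, [16, 8, 4, 2])

# Overall utilisation, assuming all four tasks have equal cost.
overall = sum(per_task) / len(per_task)

print(per_task, overall)
```

With these assumed figures the utilisation halves at each task, so the overall utilisation of the array is below one half even though the first task runs at full utilisation.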