In a basic computer system, a single, multi-function processor is typically used to implement all instructions provided by one or more computer programs. The usual way to improve the speed of such a processor is to increase the speed of the clock supplied thereto. However, as is well known in the art, there are physical and material limitations upon the clock speed at which any particular processor can be driven.
One way of ameliorating such a problem is pipelining, in which each program instruction is broken down into sequential steps required for execution. Such steps can include, for example, fetching an instruction from memory, decoding the instruction, fetching data required for the instruction from memory, executing the instruction using the retrieved data, and writing the result to memory. By implementing each of these steps in a separate processor module associated with the main processor, it is possible to significantly increase the throughput of instructions.
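The throughput gain described above can be illustrated with a minimal cycle-count sketch. It assumes an idealised pipeline in which every step takes exactly one clock cycle and no stalls occur; the function names and the default of five stages (matching the five example steps listed above) are illustrative only.

```python
def unpipelined_cycles(n_instructions, n_stages=5):
    """Each instruction passes through every stage before the next one starts."""
    return n_instructions * n_stages


def pipelined_cycles(n_instructions, n_stages=5):
    """The first instruction takes n_stages cycles to fill the pipeline;
    after that, one instruction completes on every cycle."""
    if n_instructions == 0:
        return 0
    return n_stages + n_instructions - 1


def speedup(n_instructions, n_stages=5):
    """Ratio of unpipelined to pipelined cycle counts."""
    return unpipelined_cycles(n_instructions, n_stages) / pipelined_cycles(
        n_instructions, n_stages
    )
```

Under these assumptions, 100 instructions need 500 cycles without pipelining and 104 with it, and the ratio approaches the number of stages as the instruction stream grows longer.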
One situation in which pipelining is particularly advantageous is the processing of large amounts of streamed data. An example of this is image processing, in which filtering operations frequently require the repeated application of a particular instruction or instructions to each pixel in an image.
There are also some disadvantages in pipeline processing. For example, the outcome of a conditional branch instruction will not be known until at least the execution step mentioned above. This means that fetching of the next instruction cannot commence until then, so a number of clock cycles are wasted. When overheads associated with implementing pipelining are taken into account, this architecture can be less efficient than a well-designed single processor running at a similar speed.
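The cost of conditional branches can be estimated with a rough sketch. It assumes a five-stage pipeline with one cycle per stage, no branch prediction, a branch outcome that becomes known at the end of the fourth (execution) stage, and at least one instruction following every branch. All names and default values are hypothetical.

```python
def ideal_pipelined_cycles(n_instructions, n_stages=5):
    """Cycle count for a stall-free pipeline with one cycle per stage."""
    if n_instructions == 0:
        return 0
    return n_stages + n_instructions - 1


def cycles_with_branches(n_instructions, n_branches, n_stages=5, resolve_stage=4):
    """Add the wasted cycles incurred after each conditional branch.

    A branch fetched in cycle t reaches stage `resolve_stage` in cycle
    t + resolve_stage - 1. The next fetch would normally happen in cycle
    t + 1 but must wait until cycle t + resolve_stage, so each branch
    wastes resolve_stage - 1 cycles.
    """
    wasted_per_branch = resolve_stage - 1
    return ideal_pipelined_cycles(n_instructions, n_stages) + n_branches * wasted_per_branch
```

With these figures, a stream of 100 instructions containing 20 conditional branches takes 164 cycles rather than the ideal 104.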
Another method of increasing the speed at which at least certain types of programs run is to implement a plurality of processor units in parallel. A particular type of parallel arrangement is the very long instruction word (VLIW) processor, a simple processor architecture in which the instruction for each functional unit is contained within a dedicated field of a VLIW. Such a processor is “simple” because, although it processes a number of operations in parallel, all dependency and scheduling issues are handled by a program compiler, rather than in hardware. However, as with pipelined processors, VLIW processors can be relatively inefficient users of resources when running programs having particular operational characteristics.
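The arrangement of dedicated fields can be sketched as follows. This is an illustrative model only: the three functional units, the register names and the dictionary encoding of an instruction word are assumptions made for the example and do not describe any particular VLIW processor.

```python
import operator

# Hypothetical functional units; each owns one dedicated field of the word.
UNITS = {"alu0": operator.add, "alu1": operator.sub, "mul": operator.mul}


def execute_word(word, regs):
    """Issue every field of one instruction word in the same cycle.

    `word` maps a unit name to (dest, src_a, src_b), or to None when the
    compiler found nothing for that unit to do. All sources are read from
    the old register state, so the fields cannot see each other's results;
    keeping them independent is the compiler's job, not the hardware's.
    """
    results = {}
    for unit, slot in word.items():
        if slot is None:  # empty field: this unit idles for the cycle
            continue
        dest, src_a, src_b = slot
        results[dest] = UNITS[unit](regs[src_a], regs[src_b])
    new_regs = dict(regs)
    new_regs.update(results)
    return new_regs


def utilisation(program):
    """Fraction of fields across the program that hold a real operation."""
    slots = [slot for word in program for slot in word.values()]
    return sum(slot is not None for slot in slots) / len(slots)


# Two words: the multiply depends on both ALU results, so it must wait
# for the second word and most fields of that word stay empty.
program = [
    {"alu0": ("r2", "r0", "r1"), "alu1": ("r3", "r0", "r1"), "mul": None},
    {"alu0": None, "alu1": None, "mul": ("r4", "r2", "r3")},
]

regs = {"r0": 2, "r1": 3, "r2": 0, "r3": 0, "r4": 0}
for word in program:
    regs = execute_word(word, regs)
```

In this small program only three of the six available fields carry an operation, which illustrates how dependent operations can leave functional units idle.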
In the context of graphics operations, generally the same operation is applied over and over again to each element of an input data stream. Often, the operations applied are complex and are composed of, or expressed as, a tree of primitive operations. An example of this is compositing, where several layers of objects are used to provide a final output image. For each output pixel, there is a multi-level compositing tree, made of primitive compositing operators. Likewise, calculation of colour space conversion requires a short computation tree, and convolution requires a hierarchy of multiplications and additions for each output pixel. Unfortunately, single level VLIW processors and single width pipelined processors do not necessarily provide an optimal solution to the need for additional processing speed in such applications.
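The notion of a tree of primitive operations applied to every element of a stream can be sketched as follows, using a three-tap convolution as the example. The tuple encoding of the tree, the kernel weights (1, 2, 1) and the function names are assumptions chosen purely for illustration.

```python
def evaluate(node, inputs):
    """Recursively evaluate a tree of primitive operations for one element.

    A node is a named input (str), a constant (int or float), or a tuple
    (op, left, right) where op is "add" or "mul".
    """
    if isinstance(node, str):
        return inputs[node]
    if isinstance(node, (int, float)):
        return node
    op, left, right = node
    lhs = evaluate(left, inputs)
    rhs = evaluate(right, inputs)
    if op == "add":
        return lhs + rhs
    if op == "mul":
        return lhs * rhs
    raise ValueError(f"unknown primitive operation: {op}")


# Hierarchy of multiplications and additions: (p0*1 + p1*2) + p2*1
CONV3 = ("add", ("add", ("mul", "p0", 1), ("mul", "p1", 2)), ("mul", "p2", 1))


def convolve(pixels):
    """Apply the same tree to every interior pixel of a one-dimensional stream."""
    out = []
    for i in range(1, len(pixels) - 1):
        window = {"p0": pixels[i - 1], "p1": pixels[i], "p2": pixels[i + 1]}
        out.append(evaluate(CONV3, window))
    return out
```

The same tree is walked once per output pixel, so the depth and shape of the tree determine how much work could in principle proceed in parallel for each element.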
It is an object of the present invention to provide a processor that overcomes or at least ameliorates one or more of the disadvantages of the prior art.