As semiconductor technology continues to inch closer to practical limitations in terms of increases in clock speed, architects are increasingly focusing on parallelism in processor architectures to obtain performance improvements. At the chip level, multiple processor cores are often disposed on the same chip, functioning in much the same manner as separate processor chips, or to some extent, as completely separate computers. In addition, even within cores, parallelism is employed through the use of multiple execution units that are specialized to handle certain types of operations. Pipelining is also employed in many instances so that certain operations that may take multiple clock cycles to perform are broken up into stages, enabling other operations to be started prior to completion of earlier operations. Multithreading is also employed to enable multiple instruction streams to be processed in parallel, enabling more overall work to be performed in any given clock cycle.
One area where parallelism continues to be exploited is execution units, e.g., fixed point or floating point execution units. Many floating point execution units, for example, are deeply pipelined. However, while pipelining can improve performance, pipelining is most efficient when the instructions processed by a pipeline are not dependent on one another, e.g., where a later instruction does not use the result of an earlier instruction. Whenever an instruction operates on the result of another instruction, typically the later instruction cannot enter the pipeline until the earlier instruction has exited the pipeline and calculated its result. The later instruction is said to be dependent on the earlier instruction, and the phenomenon of stalling the later instruction while it waits for the result of an earlier instruction is said to introduce “bubbles,” or cycles where no productive operations are being performed, into the pipeline.
One technique that may be used to extract higher utilization from a pipelined execution unit and remove unused bubbles is to introduce multithreading. In this way, other threads are able to issue instructions into the unused slots in the pipeline, which drives the utilization and hence the aggregate throughput up. Another popular technique for increasing performance is to use a single instruction multiple data (SIMD) architecture, which is also referred to as ‘vectorizing’ the data. In this manner, operations are performed on multiple data elements at the same time, and in response to the same SIMD instruction. A vector execution unit typically includes multiple processing lanes that handle different datapoints in a vector and perform similar operations on all of the datapoints at the same time. For example, for an architecture that relies on quadword (four-word) vectors, a vector execution unit may include four processing lanes that perform identical operations on the four words in each vector. The aforementioned techniques may also be combined, resulting in a multithreaded vector execution unit architecture that enables multiple threads to issue SIMD instructions to a vector execution unit to process “vectors” of datapoints at the same time.
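The four-lane arrangement described above can be modeled in scalar C as a rough sketch. The type `vec4` and the function `vec4_add` are hypothetical names introduced here for illustration only; in actual vector hardware the four lanes would operate at the same time in response to a single SIMD instruction, rather than in a loop.

```c
#define LANES 4  /* assumed vector width: four words per vector */

/* Hypothetical four-lane vector type (illustration only). */
typedef struct {
    float lane[LANES];
} vec4;

/* Models one SIMD add: every lane applies the same operation to its
 * own pair of elements. Hardware would process the lanes in parallel;
 * this scalar loop only reproduces the result. */
static vec4 vec4_add(vec4 a, vec4 b)
{
    vec4 r;
    for (int i = 0; i < LANES; i++) {
        r.lane[i] = a.lane[i] + b.lane[i];
    }
    return r;
}
```

Calling `vec4_add` on the vectors (1, 2, 3, 4) and (10, 20, 30, 40) yields (11, 22, 33, 44), one sum per lane.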
Vector-based floating point execution units, in particular, are optimally configured for many image processing applications, as many of the floating point arithmetic operations performed in image processing may be performed on multiple datapoints at a time, e.g., on each coordinate of a vertex from a graphical primitive. As a result, from both a performance and a logical standpoint, it is often beneficial to use vectors to store and manipulate various types of image-related data.
Despite the performance improvements that can be obtained in vector-based floating point execution units, however, such execution units are not always optimally configured for certain types of algorithms. Floating point algorithms that rely on loops, or that otherwise require conditional branches and decision making, for example, may not perform optimally in some conventional hardware designs, particularly designs that incorporate branch prediction. In many conventional hardware designs that rely on branch prediction, a branch prediction unit is tightly coupled with a fixed point execution unit, while the vector floating point execution unit is logically and physically separate therefrom on the chip. Vector compares often require special synchronization with the branch unit, in part due to the fact that the condition register upon which the branch conditions are evaluated is local to the branch unit, which delays processing due to the communication between these distinct units. In addition, branching based on vector compare results can lead to costly branch mispredicts, resulting in comparatively large performance penalties due in part to the need to flush the floating point execution pipeline whenever a branch is mispredicted. In a vector floating point execution unit with a pipeline depth of six, for example, a branch mispredict may result in as much as a 20-30 cycle performance penalty.
One example of an image processing algorithm that may perform sub-optimally in some conventional hardware designs is a z-buffer test algorithm. In three-dimensional (3D) graphics applications, great care must be taken to avoid drawing objects that would not be visible, such as when an opaque object is closer to the camera than another object. In such a case, the object closer to the camera would block the farther object, and a 3D application that is attempting to draw this scene must not draw the farther object. Often 3D computer graphics rasterization applications will employ what is called a “z-buffer” to handle this case. The z-buffer is a set of values that represent distance from the camera (sometimes called depth) for each pixel. Every time the rasterization algorithm is ready to draw a pixel, it compares the depth of the pixel it is attempting to draw with the depth stored in the z-buffer for that pixel. If the z-buffer value indicates that the existing pixel is closer to the camera, the new pixel is not drawn and the z-buffer value is not updated. In contrast, if the new pixel to be drawn is closer to the camera, the new pixel is drawn and the z-buffer is updated with the new depth associated with that pixel.
Table I, for example, illustrates exemplary pseudocode suitable for testing the z-buffer in association with drawing a new pixel:
TABLE I
Z-buffer Test Pseudocode

  if (previous_zval_for_pixel(x,y) > zi)
  {
    zbuffer(x,y,zi);
  }
Typically, implementation of such pseudocode in a conventional processing unit incorporates a comparison of the depth of the current pixel (zi) with the value stored in the z-buffer, followed by a conditional branch that skips a subsequent write instruction if the comparison indicates that the new pixel is at a greater depth than that stored in the z-buffer, and is thus occluded by the existing pixel. Branch mispredicts for the conditional branch typically incur substantial performance penalties due to the need to flush the vector floating point execution pipeline, while even in the absence of mispredicts, a performance penalty is typically incurred due to the need for synchronization between the vector floating point execution unit and the branch unit. Considering that this test may be performed millions of times as each new pixel is processed, the overall performance impact can be substantial.
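For illustration, the branch-based form of the test in Table I might be written in C roughly as follows. This is a minimal sketch under stated assumptions: the `zbuffer_t` type, the 4x4 frame size, and the functions `zbuffer_clear` and `zbuffer_test_and_set` are hypothetical and do not appear in the pseudocode, and smaller depth values are assumed to mean closer to the camera.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define WIDTH  4  /* assumed tiny frame size, for illustration only */
#define HEIGHT 4

/* Hypothetical z-buffer: one depth value per pixel. */
typedef struct {
    float depth[HEIGHT][WIDTH];
} zbuffer_t;

/* Initializes every entry to far_depth, e.g. before a frame is drawn. */
static void zbuffer_clear(zbuffer_t *zb, float far_depth)
{
    for (size_t y = 0; y < HEIGHT; y++) {
        for (size_t x = 0; x < WIDTH; x++) {
            zb->depth[y][x] = far_depth;
        }
    }
}

/* Branch-based z-test. Returns true if the new pixel is closer than the
 * stored one, in which case the stored depth is updated; returns false,
 * leaving the buffer unchanged, if the new pixel is occluded. */
static bool zbuffer_test_and_set(zbuffer_t *zb, size_t x, size_t y, float zi)
{
    assert(x < WIDTH && y < HEIGHT);  /* caller must stay inside the frame */
    if (zb->depth[y][x] > zi) {       /* compare, then conditional branch */
        zb->depth[y][x] = zi;         /* write is skipped when occluded */
        return true;
    }
    return false;
}
```

The `if` statement is the point at which a compiler would typically emit the compare and conditional branch discussed above; whether that branch is predicted well depends on the depth data, which is generally not known in advance.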
An alternative method for implementing such an algorithm in a processing unit relies on predication, where one instruction sets a predicate register in the vector floating point execution unit, and then a following instruction looks at the predicate register, and takes different actions based on its value. For example, Table II below illustrates an implementation of a z-buffer test routine that uses predication, written in PowerPC VMX assembly code:
TABLE II
Z-buffer Test PowerPC VMX Assembly Code

  loop:
    lvx       vr1,ra,rb
    vcmpgtfp  vr2,vr1,vr0
    vsel      vr3,vr2,vr1,vr0
It is assumed that, in this code, vr0 contains the new x,y,zi coordinate under test. The lvx instruction loads the previous z-buffer value for the current (x,y) coordinate, the vcmpgtfp instruction performs a greater than compare between the previous z-buffer value and the current zi coordinate, and sets vr2 to all 1's if true. The vsel instruction then writes either the new zi coordinate or the old z-buffer value to vr3 based upon the comparison.
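The compare-then-select pattern described above can be modeled in portable C as a rough sketch. The function names `cmp_gt_mask` and `select_by_mask` are hypothetical, the model assumes a 32-bit `float`, and it reproduces only the described behavior (an all-ones mask where the old depth is greater, followed by a bitwise select), not the exact operand encoding of the VMX instructions. A compiler is also free to implement the comparison with a branch, so this shows the data flow rather than guaranteeing branch-free machine code.

```c
#include <stdint.h>
#include <string.h>

#define LANES 4  /* assumed vector width: four words per vector */

_Static_assert(sizeof(float) == sizeof(uint32_t), "model assumes 32-bit float");

/* Models a per-lane greater-than compare: each mask lane becomes
 * all ones where a > b, and all zeros otherwise. */
static void cmp_gt_mask(const float a[LANES], const float b[LANES],
                        uint32_t mask[LANES])
{
    for (int i = 0; i < LANES; i++) {
        mask[i] = (a[i] > b[i]) ? 0xFFFFFFFFu : 0u;
    }
}

/* Models a bitwise select: each output bit comes from if_set where the
 * mask bit is 1, and from if_clear where the mask bit is 0. */
static void select_by_mask(const uint32_t mask[LANES],
                           const float if_set[LANES],
                           const float if_clear[LANES],
                           float out[LANES])
{
    for (int i = 0; i < LANES; i++) {
        uint32_t s, c, r;
        memcpy(&s, &if_set[i], sizeof s);    /* reinterpret float bits safely */
        memcpy(&c, &if_clear[i], sizeof c);
        r = (s & mask[i]) | (c & ~mask[i]);
        memcpy(&out[i], &r, sizeof r);
    }
}
```

To mirror the sequence described above, the mask would be computed from the old depths and the new depths, and the select would then pick the new depth where the mask is set and keep the old depth elsewhere. The select cannot begin until the mask is available, which is the dependency at issue.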
The use of predication eliminates the need to use the condition register and thus the risk of mispredicts by a decoupled branch unit, but consumes a temporary register in the vector floating point execution unit. In addition, a dependency exists between the vcmpgtfp and vsel instructions, requiring the vsel instruction to be stalled from executing until the result of the compare is complete. Assuming, for example, a vector floating point execution unit pipeline depth of six, the vcmpgtfp and vsel instructions would require 12 cycles to complete. Again considering that this test may be performed millions of times as each new pixel is processed, the overall performance impact of predication due to the introduction of dependencies can be significant.
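The 12-cycle figure follows from a simple latency model, sketched below in C. The model is an idealization assumed here for illustration: at most one instruction issues per cycle, every instruction takes `depth` cycles from issue to result, and a dependent instruction cannot issue until its predecessor's result is available. The function names are hypothetical.

```c
/* Cycles to complete a chain of n instructions in which each one
 * depends on the result of the one before it: no overlap is possible,
 * so the latencies simply add up. */
static unsigned dependent_chain_cycles(unsigned n, unsigned depth)
{
    return n * depth;
}

/* Cycles to complete n mutually independent instructions issued on
 * consecutive cycles: the first takes `depth` cycles, and each later
 * one finishes one cycle after its predecessor. */
static unsigned independent_cycles(unsigned n, unsigned depth)
{
    return (n == 0) ? 0 : depth + (n - 1);
}
```

With a depth of six, `dependent_chain_cycles(2, 6)` gives the 12 cycles cited above for the dependent compare and select, whereas under the same idealized model two independent instructions would complete in 7 cycles. The difference is made up of pipeline bubbles.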
Therefore, a substantial need continues to exist in the art for a manner of improving the performance of z-buffer test algorithms, as well as other similar algorithms that may otherwise introduce dependencies and/or branch mispredicts in conventional hardware designs.