As semiconductor technology continues to approach practical limits on increases in clock speed, architects are increasingly focusing on parallelism in processor architectures to obtain performance improvements. At the chip level, multiple processor cores are often disposed on the same chip, functioning in much the same manner as separate processor chips, or to some extent, as completely separate computers. In addition, even within cores, parallelism is employed through the use of multiple execution units that are specialized to handle certain types of operations. Pipelining is also employed in many instances so that certain operations that may take multiple clock cycles to perform are broken up into stages, enabling other operations to be started prior to completion of earlier operations. Multithreading is also employed to enable multiple instruction streams to be processed in parallel, enabling more overall work to be performed in any given clock cycle.
One area where parallelism continues to be exploited is that of execution units, e.g., fixed point or floating point execution units. Many floating point execution units, for example, are deeply pipelined. However, while pipelining can improve performance, pipelining is most efficient when the instructions processed by a pipeline are not dependent on one another, e.g., where a later instruction does not use the result of an earlier instruction. Whenever an instruction operates on the result of another instruction, typically the later instruction cannot enter the pipeline until the earlier instruction has exited the pipeline and calculated its result. The later instruction is said to be dependent on the earlier instruction, and the phenomenon of stalling the later instruction while it waits for the result of the earlier instruction is said to introduce “bubbles,” or cycles where no productive operations are being performed, into the pipeline.
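The cost of such bubbles can be illustrated with a simplified cycle count. The following is a minimal sketch only, assuming an idealized pipeline that accepts one instruction per cycle, has a fixed depth, and offers no result forwarding, so that a dependent instruction waits until its predecessor has exited the pipeline. The function names and the model itself are hypothetical and do not describe any particular execution unit.

```python
def cycles_independent(n_instructions: int, depth: int) -> int:
    """Cycles to finish n independent instructions in a pipeline of the given depth.

    Assumed model: one instruction enters per cycle, so the last one
    exits depth - 1 cycles after it enters.
    """
    if n_instructions <= 0:
        return 0
    return depth + n_instructions - 1


def cycles_dependent_chain(n_instructions: int, depth: int) -> int:
    """Cycles when each instruction needs the result of the one before it.

    Assumed model: an instruction cannot enter the pipeline until its
    predecessor has exited, so the instructions run back to back.
    """
    if n_instructions <= 0:
        return 0
    return n_instructions * depth


def bubble_cycles(n_instructions: int, depth: int) -> int:
    """Extra cycles lost to stalls in the fully dependent case."""
    return (cycles_dependent_chain(n_instructions, depth)
            - cycles_independent(n_instructions, depth))
```

Under these assumptions, two dependent instructions in a four-stage pipeline take eight cycles rather than five, so three cycles are lost to bubbles.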
One technique that may be used to extract higher utilization from a pipelined execution unit and remove unused bubbles is to introduce multithreading. In this way, other threads are able to issue instructions into the unused slots in the pipeline, which drives the utilization and hence the aggregate throughput up. Another popular technique for increasing performance is to use a single instruction multiple data (SIMD) architecture, which is also referred to as ‘vectorizing’ the data. In this manner, operations are performed on multiple data elements at the same time, and in response to the same SIMD instruction. A vector execution unit typically includes multiple processing lanes that handle different datapoints in a vector and perform similar operations on all of the datapoints at the same time. For example, for an architecture that relies on quad(4)word vectors, a vector execution unit may include four processing lanes that perform the identical operations on the four words in each vector. The aforementioned techniques may also be combined, resulting in a multithreaded vector execution unit architecture that enables multiple threads to issue SIMD instructions to a vector execution unit to process “vectors” of data points at the same time. In addition, multiple execution units may be used to permit independent operations to be performed in parallel, further increasing overall performance.
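The lane-based behavior of a vector execution unit can be modeled in software as follows. This is a minimal sketch, assuming four lanes to match the four-word vector example above; the names LANES and simd_apply are hypothetical, and the sequential loop only stands in for lanes that real hardware would operate at the same time.

```python
from typing import Callable, List, Sequence

LANES = 4  # assumed lane count, matching the four-word vector example


def simd_apply(op: Callable[[float, float], float],
               a: Sequence[float],
               b: Sequence[float]) -> List[float]:
    """Apply the same operation to every lane of two vectors.

    Models a single SIMD instruction: one operation, LANES data elements.
    """
    if len(a) != LANES or len(b) != LANES:
        raise ValueError("each operand must hold exactly LANES elements")
    # Each loop iteration corresponds to one processing lane.
    return [op(x, y) for x, y in zip(a, b)]
```

For example, simd_apply called with an addition operation on two four-word vectors yields the four lane-wise sums in response to what would be a single vector add instruction.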
Nonetheless, a number of different types of calculations still present performance problems for conventional processing units. For example, several computer graphics shading effects rely on one minus dot product vector floating point calculations that can limit performance in a processing unit.
Two of these computer graphics shading effects, the Fresnel effect and the “electron microscope” effect, seek to improve the realism of an image by highlighting the edges of objects. Both effects require calculations that increase the intensity of pixels as their surface normals in 3D space grow more perpendicular to the viewer. Typically, to calculate the intensity, both of these techniques take the 3-word dot product of the surface normal with the view vector, and subtract that result from 1.0, a calculation that is referred to hereinafter as a one minus dot product vector floating point calculation.
Conventionally, the one minus dot product vector floating point calculation requires two separate calculations, each initiated by a separate floating point instruction. The first calculation is the dot product calculation, and the second calculation is a subtraction calculation, in which the result of the dot product calculation is subtracted from 1.0. Furthermore, since the result of the first calculation is used in the second calculation, the second instruction used to perform the subtraction calculation is dependent on the first instruction used to perform the dot product calculation.
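The conventional two-step sequence may be sketched in software as follows. This is an illustrative sketch only, assuming three-component vectors that have already been normalized; the function names dot3 and one_minus_dot are hypothetical, and each function body stands in for what would be a separate floating point instruction.

```python
from typing import Sequence


def dot3(normal: Sequence[float], view: Sequence[float]) -> float:
    """First calculation: the 3-word dot product of the two vectors."""
    return (normal[0] * view[0]
            + normal[1] * view[1]
            + normal[2] * view[2])


def one_minus_dot(normal: Sequence[float], view: Sequence[float]) -> float:
    """One minus dot product, performed as two dependent steps."""
    d = dot3(normal, view)  # step 1: dot product
    return 1.0 - d          # step 2: depends on the result of step 1
```

With unit vectors, a surface normal aligned with the view vector yields 0.0, while a surface normal perpendicular to the view vector yields 1.0, which corresponds to the greatest intensity at the edges of an object.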
A one minus dot product vector floating point calculation is typically performed per pixel for each object in a frame. Thus, for each viewable pixel in an object, two dependent instructions must be performed, causing the one minus dot product floating point calculation to be performance-critical.
A need therefore exists in the art for a manner of improving the performance of a processing unit in performing one minus dot product vector floating point calculations.