As semiconductor technology continues to inch closer to practical limitations in terms of increases in clock speed, architects are increasingly focusing on parallelism in processor architectures to obtain performance improvements. At the chip level, multiple processor cores are often disposed on the same chip, functioning in much the same manner as separate processor chips, or to some extent, as completely separate computers. In addition, even within cores, parallelism is employed through the use of multiple execution units that are specialized to handle certain types of operations. Pipelining is also employed in many instances so that certain operations that may take multiple clock cycles to perform are broken up into stages, enabling other operations to be started prior to completion of earlier operations. Multithreading is also employed to enable multiple instruction streams to be processed in parallel, enabling more overall work to be performed in any given clock cycle.
One area where parallelism continues to be exploited is that of execution units, e.g., fixed point or floating point execution units. Many floating point execution units, for example, are deeply pipelined. However, while pipelining can improve performance, pipelining is most efficient when the instructions processed by a pipeline are not dependent on one another, e.g., where a later instruction does not use the result of an earlier instruction. Whenever an instruction operates on the result of another instruction, typically the later instruction cannot enter the pipeline until the earlier instruction has exited the pipeline and calculated its result. The later instruction is said to be dependent on the earlier instruction, and the phenomenon of stalling the later instruction while it waits for the result of an earlier instruction is said to introduce “bubbles,” or cycles where no productive operations are being performed, into the pipeline.
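By way of illustration only, the effect of dependencies on a pipeline may be sketched with a simplified model. The model below assumes a single-issue, in-order pipeline of fixed depth in which a result becomes available only when its instruction exits the pipeline; the function name `count_bubbles` and the `deps` encoding are hypothetical and are not taken from any particular architecture.

```python
def count_bubbles(deps, depth):
    """Count empty issue slots in a simplified in-order pipeline.

    deps[i] is the index of the earlier instruction whose result
    instruction i consumes, or None if instruction i is independent.
    depth is the assumed number of cycles an instruction spends in the
    pipeline before its result is available.
    """
    if not deps:
        return 0
    issue_cycle = []
    for producer in deps:
        # At most one instruction may issue per cycle, in program order.
        earliest = issue_cycle[-1] + 1 if issue_cycle else 0
        if producer is not None:
            # A dependent instruction waits until its producer has exited.
            earliest = max(earliest, issue_cycle[producer] + depth)
        issue_cycle.append(earliest)
    total_slots = issue_cycle[-1] + 1
    return total_slots - len(deps)


independent = [None, None, None, None]
chained = [None, 0, 1, 2]  # each instruction uses the previous result
print(count_bubbles(independent, depth=4))
print(count_bubbles(chained, depth=4))
```

Under these assumptions, four independent instructions issue back to back with no bubbles, whereas a chain of four dependent instructions leaves most issue slots empty.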
One technique that may be used to extract higher utilization from a pipelined execution unit and remove unused bubbles is to introduce multi-threading. In this way, other threads are able to issue instructions into the unused slots in the pipeline, which drives the utilization and hence the aggregate throughput up. Another popular technique for increasing performance is to use a single instruction multiple data (SIMD) architecture, which is also referred to as ‘vectorizing’ the data. In this manner, operations are performed on multiple data elements at the same time, and in response to the same SIMD instruction. A vector execution unit typically includes multiple processing lanes that handle different datapoints in a vector and perform similar operations on all of the datapoints at the same time. For example, for an architecture that relies on quad(4)word vectors, a vector execution unit may include four processing lanes that perform the identical operations on the four words in each vector.
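The manner in which multi-threading fills otherwise unused pipeline slots may be sketched as follows. This is a minimal model under stated assumptions: each thread is taken to be a chain of fully dependent instructions, one instruction may issue per cycle, and a simple round-robin policy selects among ready threads. The function name `pipeline_utilization` and the round-robin policy are illustrative assumptions rather than a description of any particular issue logic.

```python
def pipeline_utilization(num_threads, instrs_per_thread, depth):
    """Fraction of issue slots filled when threads share one pipeline.

    Every instruction in a thread is assumed to depend on the previous
    instruction of the same thread, so a thread may issue again only
    `depth` cycles after its last issue.
    """
    ready_at = [0] * num_threads  # cycle at which each thread may issue next
    remaining = [instrs_per_thread] * num_threads
    total = num_threads * instrs_per_thread
    issued = 0
    cycle = 0
    next_thread = 0
    while issued < total:
        # Round-robin: start the search at the thread after the last issuer.
        for offset in range(num_threads):
            t = (next_thread + offset) % num_threads
            if remaining[t] > 0 and ready_at[t] <= cycle:
                remaining[t] -= 1
                ready_at[t] = cycle + depth
                issued += 1
                next_thread = (t + 1) % num_threads
                break
        cycle += 1  # a cycle with no ready thread is a bubble
    return issued / cycle


print(pipeline_utilization(num_threads=1, instrs_per_thread=4, depth=4))
print(pipeline_utilization(num_threads=4, instrs_per_thread=4, depth=4))
```

In this model a single thread leaves most slots empty, while four threads on a four-stage pipeline keep every issue slot occupied, which is the aggregate throughput gain described above.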
The aforementioned techniques may also be combined, resulting in a multi-threaded vector execution unit architecture that enables multiple threads to issue SIMD instructions to a vector execution unit to process “vectors” of data points at the same time. Typically, a scheduling algorithm is utilized in connection with issue logic to ensure that each thread is able to proceed at a reasonable rate, with the number of bubbles in the execution unit pipeline kept at a minimum.
It has been found, however, that while this configuration is highly desirable for a significant amount of code, there are certain algorithms that benefit greatly from higher scalar (i.e., single datapoint) issue availability. A conventional vector execution unit may be used to perform scalar math; however, only one of the multiple processing lanes is used, which results in suboptimal performance and significant underutilization of processing resources.
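The underutilization may be quantified with a simple sketch. The model below assumes a four-lane vector execution unit in which a vector instruction occupies all lanes and a scalar instruction occupies a single lane; the constant `LANES` and the function `lane_utilization` are hypothetical names introduced for illustration.

```python
LANES = 4  # assumed lane count, matching the four-word vector example


def lane_utilization(instruction_kinds):
    """Fraction of lane-slots doing useful work for a stream of instructions.

    instruction_kinds is a sequence of "vector" or "scalar" strings. A vector
    instruction is assumed to use every lane; a scalar instruction uses one.
    """
    if not instruction_kinds:
        return 0.0
    active = sum(LANES if kind == "vector" else 1 for kind in instruction_kinds)
    return active / (LANES * len(instruction_kinds))


print(lane_utilization(["vector"] * 8))
print(lane_utilization(["scalar"] * 8))
print(lane_utilization(["vector", "scalar"]))
```

Under these assumptions, a purely scalar instruction stream leaves three of every four lanes idle.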
One such algorithm that benefits from high scalar multithread throughput is rasterization. Rasterization is a process in 3D graphics where three-dimensional geometry that has been projected onto a screen is “filled in” with pixels of the appropriate color and intensity. Often the task of rasterizing a piece of geometry is parallelized, sometimes by splitting up the task across several units or threads based on the section of the screen in which each pixel resides.
To interpolate various parameters between the vertices of an object, often the barycentric coordinates must be calculated for each pixel. These are generally three scalar values that correspond to how near a pixel is to each vertex. If the object is to be texture mapped, the texture coordinates of the pixel must be calculated by multiplying the barycentric coordinates with their associated vertex texture coordinates for each vertex, and obtaining the sum from those results. These texture coordinates are then used to load the correct pixels from the texture image to be drawn in the correct rasterized image pixel. The algorithm usually does not see a large benefit from using vector floating point instructions over scalar instructions. Often much of the algorithm calculation time is occupied with waiting for the texture image data to come back from the load, or stalling due to a register dependency. Other threads have the capability to continue progress during this stalling, but with a conventional vector execution unit typically only one vector or scalar instruction may enter the execution pipeline per cycle.
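The per-pixel calculation described above may be sketched as follows. This is a minimal illustration, assuming a two-dimensional screen-space triangle, barycentric coordinates computed from signed areas (edge functions), and simple affine interpolation without perspective correction; the function names `edge`, `barycentric`, and `texture_coords` are hypothetical.

```python
def edge(a, b, p):
    """Twice the signed area of triangle (a, b, p) in screen space."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def barycentric(v0, v1, v2, p):
    """Return the three scalar weights of pixel p relative to each vertex."""
    area = edge(v0, v1, v2)  # assumed non-zero (non-degenerate triangle)
    w0 = edge(v1, v2, p) / area
    w1 = edge(v2, v0, p) / area
    w2 = edge(v0, v1, p) / area
    return w0, w1, w2


def texture_coords(weights, uv0, uv1, uv2):
    """Multiply each weight by its vertex texture coordinate and sum."""
    w0, w1, w2 = weights
    u = w0 * uv0[0] + w1 * uv1[0] + w2 * uv2[0]
    v = w0 * uv0[1] + w1 * uv1[1] + w2 * uv2[1]
    return u, v


# Example triangle and per-vertex texture coordinates (illustrative values).
v0, v1, v2 = (0.0, 0.0), (4.0, 0.0), (0.0, 4.0)
uv0, uv1, uv2 = (0.0, 0.0), (1.0, 0.0), (0.0, 1.0)

weights = barycentric(v0, v1, v2, (1.0, 1.0))
print(weights)
print(texture_coords(weights, uv0, uv1, uv2))
```

Each weight and each texture coordinate here is a scalar result, and the subsequent texture load depends on those results, which is consistent with the observation that the algorithm gains little from vector instructions.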
Other algorithms in 3D graphics, such as model-to-world transform algorithms, greatly benefit from using vector instructions. Therefore, in a system intended to perform high performance 3D graphics, it would be beneficial to have the capability to process both vector instructions and scalar instructions in a more efficient manner than has heretofore been possible.
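For contrast with rasterization, a model-to-world transform may be sketched as a 4x4 matrix applied to a four-component vertex. The sketch below assumes homogeneous coordinates and a row-major matrix; the function name `transform_vertex` and the example translation matrix are illustrative assumptions. Each output component is a four-wide multiply followed by a sum, which maps naturally onto four processing lanes.

```python
def transform_vertex(matrix, vertex):
    """Apply a 4x4 row-major matrix to a 4-component homogeneous vertex.

    Each row produces one output component as a four-element
    multiply-and-sum, the kind of operation a four-lane vector unit
    can perform on all elements at once.
    """
    return [sum(m * v for m, v in zip(row, vertex)) for row in matrix]


# Illustrative model-to-world matrix: translate by (2, 3, 4).
model_to_world = [
    [1.0, 0.0, 0.0, 2.0],
    [0.0, 1.0, 0.0, 3.0],
    [0.0, 0.0, 1.0, 4.0],
    [0.0, 0.0, 0.0, 1.0],
]

print(transform_vertex(model_to_world, [1.0, 1.0, 1.0, 1.0]))
```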