As semiconductor technology approaches practical limits on further increases in clock speed, architects are increasingly focusing on parallelism in processor architectures to obtain performance improvements. At the chip level, multiple processor cores are often disposed on the same chip, functioning in much the same manner as separate processor chips, or to some extent, as completely separate computers. In addition, even within cores, parallelism is employed through the use of multiple execution units that are specialized to handle certain types of operations. Pipelining is also employed in many instances so that certain operations that may take multiple clock cycles to perform are broken up into stages, enabling other operations to be started prior to completion of earlier operations. Multithreading is also employed to enable multiple instruction streams to be processed in parallel, enabling more overall work to be performed in any given clock cycle.
One area where parallelism continues to be exploited is in the area of execution units, e.g., fixed point or floating point execution units. Many floating point execution units, for example, are deeply pipelined. However, while pipelining can improve performance, pipelining is most efficient when the instructions processed by a pipeline are not dependent on one another, e.g., where a later instruction does not use the result of an earlier instruction. Whenever an instruction operates on the result of another instruction, typically the later instruction cannot enter the pipeline until the earlier instruction has exited the pipeline and calculated its result. The later instruction is said to be dependent on the earlier instruction, and the phenomenon of stalling the later instruction while waiting for the result of an earlier instruction is said to introduce “bubbles,” or cycles where no productive operations are being performed, into the pipeline.
One technique that may be used to extract higher utilization from a pipelined execution unit and eliminate bubbles is to introduce multithreading. In this way, other threads are able to issue instructions into the unused slots in the pipeline, which drives the utilization and hence the aggregate throughput up. Another popular technique for increasing performance is to use a single instruction multiple data (SIMD) architecture, which is also referred to as ‘Vectorizing’ the data. In this manner, operations are performed on multiple data elements at the same time, and in response to the same SIMD instruction. A vector execution unit typically includes multiple processing lanes that handle different data points in a vector and perform similar operations on all of the data points at the same time. For example, for an architecture that relies on quad(4) word vectors, a vector execution unit may include four processing lanes that perform the identical operations on the four words in each vector.
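The lane-per-element behavior of such a quad-word vector unit can be illustrated with a minimal sketch (the `simd_add` function and list representation are hypothetical models for illustration, not part of any particular architecture):

```python
def simd_add(a, b):
    # On hardware, each of the four processing lanes computes a[i] + b[i]
    # simultaneously in response to a single SIMD instruction; the
    # comprehension below stands in for the four parallel lanes.
    assert len(a) == len(b) == 4  # quad-word operand vectors
    return [x + y for x, y in zip(a, b)]

print(simd_add([1, 2, 3, 4], [10, 20, 30, 40]))  # prints [11, 22, 33, 44]
```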
The aforementioned techniques may also be combined, resulting in a multithreaded vector execution unit architecture that enables multiple threads to issue SIMD instructions to a vector execution unit to process “vectors” of data points at the same time. Typically, a scheduling algorithm is utilized in connection with issue logic to ensure that each thread is able to proceed at a reasonable rate, with the number of bubbles in the execution unit pipeline kept at a minimum.
It has been found that with vector execution units, it is often desirable to provide support for programmatically shuffling, or permuting, individual elements in an operand vector for certain types of arithmetic operations. For example, in the area of 3D image processing, texture processing is often performed during rasterization of a graphical image. Rasterization is a process in 3D graphics where three dimensional geometry that has been projected onto a screen is “filled in” with pixels of the appropriate color and intensity. A texture mapping algorithm is typically incorporated into a rasterization process to paint a texture onto geometric objects placed into a scene, and it has been found that texture mapping algorithms are readily adaptable to vector-based processing due to the ability to vectorize much of the data that is operated upon by a texture mapping algorithm, particularly with regard to the coordinates of objects and textures. However, a number of calculations performed by such texture mapping algorithms have been found to be implemented most efficiently when the operand vectors are shuffled.
In order to paint a texture onto a placed object in a scene, the pixels in each primitive making up the object are typically transformed from 3D scene or world coordinates (e.g., x, y and z) to 2D coordinates relative to a procedural or bitmapped texture (e.g., u and v). The fundamental elements in a texture are referred to as texels (or texture pixels), and each texel is associated with a single color. Due to differences in orientation and distance of the surfaces of placed geometric primitives relative to the viewer, a pixel in an image buffer will rarely correspond to a single texel in a texture. As a result, texture filtering is typically performed to determine a color to be assigned to a pixel based upon the colors of multiple texels in proximity to the texture mapped position of the pixel.
A number of texture filtering algorithms may be used to determine a color for a pixel, including simple interpolation, bilinear filtering, trilinear filtering, and anisotropic filtering, among others. With many texture filtering algorithms, weights are calculated for a number of adjacent texels to a pixel, the weights are used to scale the colors of the adjacent texels, and a color for the pixel is assigned by summing the scaled colors of the adjacent texels. The color is then either stored at the pixel location in a frame buffer, or used to update a color that is already stored at the pixel location.
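The weight-and-sum step common to these filtering algorithms can be sketched as follows (an illustrative Python model; the `filter_pixel` name and the tuple color representation are assumptions, not a specific implementation):

```python
def filter_pixel(texel_colors, weights):
    # texel_colors: list of (r, g, b) tuples for the texels adjacent to
    # the pixel's texture-mapped position.
    # weights: per-texel weights computed by the filtering algorithm,
    # expected to sum to 1.0 so the result stays in range.
    r = sum(w * c[0] for w, c in zip(weights, texel_colors))
    g = sum(w * c[1] for w, c in zip(weights, texel_colors))
    b = sum(w * c[2] for w, c in zip(weights, texel_colors))
    return (r, g, b)

# Two texels blended equally: red and blue average to purple.
print(filter_pixel([(1.0, 0.0, 0.0), (0.0, 0.0, 1.0)], [0.5, 0.5]))
```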
Bilinear filtering, for example, uses the coordinates of a texture sample to perform a weighted average of the four adjacent texels, weighted according to how close the sample coordinates are to the centers of those texels. Bilinear filtering often can reduce the blockiness of closer details, but often does little to reduce the noise that is often found in distant details.
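A minimal single-channel sketch of this weighting, assuming the sample lies within a 2x2 texel cell (the `bilinear` function and its parameter names are illustrative):

```python
def bilinear(c00, c10, c01, c11, fu, fv):
    # c00..c11: colors of the four adjacent texels (single channel here).
    # fu, fv: fractional position of the sample within the 2x2 texel cell
    # (0.0 = at the left/top texel centers, 1.0 = at the right/bottom).
    w00 = (1.0 - fu) * (1.0 - fv)
    w10 = fu * (1.0 - fv)
    w01 = (1.0 - fu) * fv
    w11 = fu * fv
    # The four weights sum to 1.0, so the result is a true weighted average.
    return w00 * c00 + w10 * c10 + w01 * c01 + w11 * c11

# Sample exactly midway between all four texels: each weight is 0.25.
print(bilinear(0.0, 1.0, 0.0, 1.0, 0.5, 0.5))  # prints 0.5
```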
Trilinear filtering involves using MIP mapping, which uses a set of prefiltered texture images that are scaled to successively lower resolutions. The algorithm uses texture samples from the high resolution textures for portions of the geometry near to the camera, and low resolution textures for the portions distant to the camera. MIP mapping often reduces nearby pixelation and distant noise; however, detail in the distance is often lost and needlessly blurred. The blurriness is due to the texture samples being taken from a MIP level of the texture that has been pre-scaled to a low resolution in both the x and y dimensions uniformly, such that resolution is lost in the direction perpendicular to the direction that the texture is most compressed.
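The level selection and blending in trilinear filtering can be sketched roughly as follows (a simplified model: the `trilinear` function, the scalar footprint measure, and the precomputed per-level samples are all illustrative assumptions):

```python
import math

def trilinear(mip_samples, footprint):
    # mip_samples[i]: a bilinearly filtered sample from MIP level i, where
    # level 0 is full resolution and each successive level halves it.
    # footprint: approximate number of texels covered by one screen pixel
    # along each axis; larger footprints select lower-resolution levels.
    level = min(max(math.log2(footprint), 0.0), len(mip_samples) - 1)
    lo = int(level)
    hi = min(lo + 1, len(mip_samples) - 1)
    frac = level - lo
    # Blend between the two nearest prefiltered levels.
    return (1.0 - frac) * mip_samples[lo] + frac * mip_samples[hi]

# A footprint of 2 texels per pixel selects MIP level 1 exactly.
print(trilinear([1.0, 0.5, 0.25], 2.0))  # prints 0.5
```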
Anisotropic filtering involves taking multiple samples along a “line of anisotropy” which runs in the direction that the texture is most compressed. Each of these samples may be bilinear or trilinear filtered, and the results are then averaged together. This algorithm allows the compression to occur in only one direction. By doing so, less blurring often occurs in more distant features.
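The sampling-and-averaging structure of anisotropic filtering can be sketched as below (a hypothetical model: the `anisotropic` function, the `(du, dv)` representation of the line of anisotropy, and the fixed sample count are assumptions for illustration):

```python
def anisotropic(sample_fn, u, v, du, dv, n):
    # sample_fn(u, v): returns a bilinear- or trilinear-filtered sample
    # at texture coordinates (u, v).
    # (du, dv): direction and extent of the line of anisotropy in texture
    # space; n samples are taken along it, centered on (u, v), and averaged.
    total = 0.0
    for i in range(n):
        t = (i + 0.5) / n - 0.5  # sample positions centered on (u, v)
        total += sample_fn(u + t * du, v + t * dv)
    return total / n

# With a constant texture every sample is identical, so the average is 1.0.
print(anisotropic(lambda u, v: 1.0, 0.5, 0.5, 0.2, 0.0, 4))  # prints 1.0
```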
In each of these types of filtering algorithms, permuting the elements of the vectors being operated upon can improve the performance of such algorithms. Conventionally, permuting elements of a vector has been performed using a permute instruction, which operates on an operand vector stored in a register in a register file by shuffling the elements of the operand vector and storing the shuffled operand vector back into the same or a different register in the register file. If each element of each four element vector is labeled x, y, z and w, respectively, the vector elements are initially laid out in the vector register file in that order. The aforementioned permute instructions multiplex the elements into their different positions and store the shuffled elements back into the register file in preparation for vector operations to be performed later. Thus, for example, an operand vector with x, y, z and w words could be permuted by a permute instruction to generate a shuffled operand vector with the words ordered as y, z, x, w. Conventional permute instructions operate on single operand vectors, and as such, a separate permute instruction is typically required for each operand vector.
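The element shuffling performed by such a permute instruction can be modeled with a short sketch (the `permute` function and its string pattern encoding are illustrative, not a particular instruction set's semantics):

```python
def permute(vec, pattern):
    # vec: four-element operand vector laid out in x, y, z, w order.
    # pattern: string naming the source element for each result position,
    # e.g. "yzxw" selects y, z, x, w from the operand vector.
    index = {"x": 0, "y": 1, "z": 2, "w": 3}
    return [vec[index[p]] for p in pattern]

# The example from the text: (x, y, z, w) shuffled to (y, z, x, w).
print(permute([1.0, 2.0, 3.0, 4.0], "yzxw"))  # prints [2.0, 3.0, 1.0, 4.0]
```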
The conventional approach, however, has a number of drawbacks. First, since the permute instruction writes back into the register file, it occupies valuable register file space that could be used for other temporary storage. Second, the permute instruction's write back of the shuffled operand vector into the register file causes a “read after write” dependency hazard condition for the later vector arithmetic instruction, as the later instruction is required to wait for the permute instruction to fully flow through the pipeline before it can retrieve the shuffled operand vector from the register file, which causes the issue logic to stall newer dependent instructions until the permute result is ready. This stalling causes cycles to go unused in the pipeline where stages are not filled, and particularly for deeply pipelined execution units, performance can be significantly degraded.
Another approach for shuffling elements of operand vectors relies on swizzle instructions. Conventional swizzle instructions may precede other vector instructions in an instruction stream to shuffle operand vector elements in an execution pipeline for subsequent processing by vector instructions. Swizzle instructions have the benefit of not requiring shuffled operands to be written back to the register file prior to use, which reduces the number of registers being used, and avoids the read after write dependencies in the execution pipeline. However, conventional designs require a swizzle instruction to be issued before each sequence of vector instructions that requires a custom word ordering, as each swizzle instruction only specifies the custom word ordering for the immediately subsequent sequence of arithmetic instructions in the instruction stream. In addition, the use of such swizzle instructions has been found to unnecessarily swell the code size of instruction streams that use the same word ordering for multiple arithmetic instruction sequences, and therefore also degrades performance.
A need therefore continues to exist in the art for a manner of optimizing the permutation of operand vectors in a vector execution unit.