As semiconductor technology approaches practical limits on further increases in clock speed, architects are increasingly turning to parallelism in processor architectures to obtain performance improvements. At the chip level, multiple processor cores are often disposed on the same chip, functioning in much the same manner as separate processor chips, or to some extent, as completely separate computers. In addition, even within cores, parallelism is employed through the use of multiple execution units that are specialized to handle certain types of operations. Pipelining is also employed in many instances so that certain operations that may take multiple clock cycles to perform are broken up into stages, enabling other operations to be started prior to completion of earlier operations. Multithreading is also employed to enable multiple instruction streams to be processed in parallel, enabling more overall work to be performed in any given clock cycle.
One area where parallelism continues to be exploited is execution units, e.g., fixed point or floating point execution units. Many floating point execution units, for example, are deeply pipelined. However, while pipelining can improve performance, pipelining is most efficient when the instructions processed by a pipeline are not dependent on one another, e.g., where a later instruction does not use the result of an earlier instruction. Whenever an instruction operates on the result of another instruction, typically the later instruction cannot enter the pipeline until the earlier instruction has exited the pipeline and calculated its result. The later instruction is said to be dependent on the earlier instruction, and the phenomenon of stalling the later instruction while it waits for the result of an earlier instruction is said to introduce “bubbles,” or cycles where no productive operations are being performed, into the pipeline.
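The cost of such dependencies can be illustrated with a toy model. The sketch below assumes a single in-order pipeline of `depth` stages that issues at most one instruction per cycle and has no result forwarding, so a dependent instruction issues only after its predecessor has exited the pipeline. The function names and the model itself are illustrative assumptions and do not describe any particular processor.

```python
def issue_cycles(depends_on_previous, depth):
    """Return the issue cycle of each instruction in a toy in-order pipeline.

    depends_on_previous[i] is True when instruction i consumes the result of
    instruction i - 1. Assumption: a result becomes available `depth` cycles
    after its producer issues, and there is no forwarding.
    """
    cycles = []
    for i, dependent in enumerate(depends_on_previous):
        if i == 0:
            cycles.append(0)
            continue
        earliest = cycles[-1] + 1  # at most one issue per cycle
        if dependent:
            # Wait until the producer has left the pipeline.
            earliest = max(earliest, cycles[-1] + depth)
        cycles.append(earliest)
    return cycles


def bubble_count(cycles):
    """Count issue slots left empty between the first and last issue."""
    return (cycles[-1] - cycles[0] + 1) - len(cycles)


# Four independent instructions versus a chain of four dependent ones,
# both on a hypothetical four-stage pipeline.
independent = issue_cycles([False, False, False, False], depth=4)
chained = issue_cycles([False, True, True, True], depth=4)
print(independent, bubble_count(independent))
print(chained, bubble_count(chained))
```

In this model the independent stream issues back to back, while the dependent chain leaves three empty issue slots between each pair of instructions.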
One technique that may be used to extract higher utilization from a pipelined execution unit and remove unused bubbles is to introduce multithreading. In this way, other threads are able to issue instructions into the unused slots in the pipeline, which drives the utilization and hence the aggregate throughput up. Another popular technique for increasing performance is to use a single instruction multiple data (SIMD) architecture, which is also referred to as ‘vectorizing’ the data. In this manner, operations are performed on multiple data elements at the same time, and in response to the same SIMD instruction. A vector execution unit typically includes multiple processing lanes that handle different datapoints in a vector and perform similar operations on all of the datapoints at the same time. For example, for an architecture that relies on quad(4)word vectors, a vector execution unit may include four processing lanes that perform the identical operations on the four words in each vector.
The aforementioned techniques may also be combined, resulting in a multithreaded vector execution unit architecture that enables multiple threads to issue SIMD instructions to a vector execution unit to process “vectors” of data points at the same time. Typically, a scheduling algorithm is utilized in connection with issue logic to ensure that each thread is able to proceed at a reasonable rate, with the number of bubbles in the execution unit pipeline kept at a minimum.
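The effect of multithreading on pipeline utilization can be sketched with a small simulation. The example assumes a round-robin issue policy, a pipeline of `depth` stages, and threads that each execute a chain of dependent instructions (the worst case for a single thread). The round-robin policy and all names here are assumptions chosen for illustration, not a description of any specific scheduling algorithm.

```python
def utilization(num_threads, depth, cycles):
    """Fraction of issue slots filled over `cycles` cycles.

    Assumption: every instruction in a thread depends on that thread's
    previous instruction, so a thread may issue again only `depth` cycles
    after its last issue. Threads are polled in round-robin order.
    """
    ready_at = [0] * num_threads  # earliest cycle each thread may issue
    next_thread = 0
    issued = 0
    for cycle in range(cycles):
        for offset in range(num_threads):
            t = (next_thread + offset) % num_threads
            if ready_at[t] <= cycle:
                ready_at[t] = cycle + depth
                next_thread = (t + 1) % num_threads
                issued += 1
                break  # only one instruction issues per cycle
    return issued / cycles


for threads in (1, 2, 4):
    print(threads, utilization(threads, depth=4, cycles=400))
```

Under these assumptions a single thread fills one slot in four, and utilization rises with the thread count until the number of threads matches the pipeline depth.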
It has been found that with vector execution units, it is often desirable to provide support for programmatically shuffling, or permuting, individual elements in a vector operand for certain types of arithmetic operations. For example, in the area of 3D image processing, backface culling can often be accelerated through the use of vector permutation. Backface culling is the process of determining which triangles that make up a 3D object face the camera, and thus, are visible in a scene. Determining which triangles are visible allows the computer graphics software to spend most of its time dealing with only visible faces of objects, such that performance can be maximized. As an example, with any 3D cube, having six faces, at most three faces will be visible from any given camera position, so it is known that at least three faces will not be visible in a scene and can be ignored from the standpoint of later graphical operations such as applying textures to the faces.
To determine if a surface of an object is facing the camera, a dot product of the vector that denotes where the camera is pointing, and the surface normal vector of the surface, is calculated. Often, 3D objects are split up into triangles with the points of each triangle vertex stored with coordinates x, y and z in a three element vector. The surface normals of each triangle are not usually pre-calculated. To find this surface normal, and thus whether the triangle faces the camera, a cross product operation is typically performed between two vectors that make up two sides of the triangle, using the following equation:
$$
\mathbf{C} \times \mathbf{T} =
\begin{vmatrix}
\hat{x} & \hat{y} & \hat{z} \\
x_c & y_c & z_c \\
x_t & y_t & z_t
\end{vmatrix}
= \hat{x}\,(y_c z_t - y_t z_c) + \hat{y}\,(x_t z_c - x_c z_t) + \hat{z}\,(x_c y_t - x_t y_c)
$$
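As a minimal sketch of the facing test described above, the following code computes the surface normal with the cross product equation and then takes its dot product with the camera direction. The sign convention (a negative dot product means the triangle faces the camera) depends on the vertex winding order and is an assumption of this example, as are the function names.

```python
def cross(c, t):
    """Cross product C x T, term by term as in the equation above."""
    xc, yc, zc = c
    xt, yt, zt = t
    return (yc * zt - yt * zc,
            xt * zc - xc * zt,
            xc * yt - xt * yc)


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def edge(a, b):
    """Vector from vertex a to vertex b."""
    return tuple(q - p for p, q in zip(a, b))


def faces_camera(v0, v1, v2, view_dir):
    """True when the triangle (v0, v1, v2) faces a camera looking along view_dir.

    Assumption: counter-clockwise winding, so the normal of a visible
    triangle points against the viewing direction.
    """
    normal = cross(edge(v0, v1), edge(v0, v2))
    return dot(normal, view_dir) < 0


triangle = ((0, 0, 0), (1, 0, 0), (0, 1, 0))  # normal is (0, 0, 1)
print(faces_camera(*triangle, view_dir=(0, 0, -1)))  # camera looks down -z
print(faces_camera(*triangle, view_dir=(0, 0, 1)))   # camera looks down +z
```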
Conventionally, permuting elements of a vector has been performed using a permute instruction, which operates on a vector operand stored in a register in a register file, shuffles the elements of the vector operand, and stores the shuffled vector operand back into the same or a different register in the register file. Thus, a cross product may be computed by a conventional vector floating point multiply add pipeline by first performing several permute instructions to move the vector elements into the desired positions for the multiply, then performing a first set of multiplies, then a second set, and finally performing an add instruction with the multiply results. In order to move the vector elements into the proper positions for the first set of multiplies, the following permute instructions may be used:
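$$
\begin{aligned}
[\,x_t \;\; y_t \;\; z_t \;\; w_t\,] &\;\overset{\text{permute1}}{\Longrightarrow}\; [\,z_t \;\; x_t \;\; y_t \;\; w_t\,] \\
[\,x_c \;\; y_c \;\; z_c \;\; w_c\,] &\;\overset{\text{permute2}}{\Longrightarrow}\; [\,z_c \;\; x_c \;\; y_c \;\; w_c\,] \\
[\,x_t \;\; y_t \;\; z_t \;\; w_t\,] &\;\overset{\text{permute3}}{\Longrightarrow}\; [\,y_t \;\; z_t \;\; x_t \;\; w_t\,] \\
[\,x_c \;\; y_c \;\; z_c \;\; w_c\,] &\;\overset{\text{permute4}}{\Longrightarrow}\; [\,y_c \;\; z_c \;\; x_c \;\; w_c\,]
\end{aligned}
$$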
If each element of each four element vector is labeled x, y, z and w, respectively, the vector elements are initially laid out in the vector register file in that order. The aforementioned permute instructions multiplex the elements into the different positions shown above in preparation for the multiply and add operations performed later. Of note, the permute1 and permute2 instructions specify the same word ordering as one another, as do the permute3 and permute4 instructions. Conventional permute instructions, however, operate on single vector operands, and as such, a separate permute instruction is required for each vector operand.
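The permute, multiply and add sequence described above can be modeled in a few lines. In this sketch each permute is an index selection on a four-element list, and the final step is written as a lane-wise subtraction. The pairing of permuted operands in each multiply is inferred from the cross product terms and is not stated explicitly in the text, and all function and constant names are hypothetical.

```python
# Word orders used by the four permutes (indices into [x, y, z, w]).
ZXYW = (2, 0, 1, 3)  # permute1 and permute2
YZXW = (1, 2, 0, 3)  # permute3 and permute4


def permute(v, order):
    """Return a copy of v with its elements shuffled into `order`."""
    return [v[i] for i in order]


def multiply(a, b):
    """Lane-wise multiply, one product per processing lane."""
    return [p * q for p, q in zip(a, b)]


def subtract(a, b):
    """Lane-wise subtract (an add with the second operand negated)."""
    return [p - q for p, q in zip(a, b)]


def cross_with_permutes(c, t):
    t_zxy = permute(t, ZXYW)  # permute1: [zt, xt, yt, wt]
    c_zxy = permute(c, ZXYW)  # permute2: [zc, xc, yc, wc]
    t_yzx = permute(t, YZXW)  # permute3: [yt, zt, xt, wt]
    c_yzx = permute(c, YZXW)  # permute4: [yc, zc, xc, wc]
    first = multiply(c_yzx, t_zxy)   # [yc*zt, zc*xt, xc*yt, wc*wt]
    second = multiply(c_zxy, t_yzx)  # [zc*yt, xc*zt, yc*xt, wc*wt]
    return subtract(first, second)   # x, y, z of C x T; w lane cancels to 0


print(cross_with_permutes([1, 2, 3, 0], [4, 5, 6, 0]))
```

Four permutes are issued even though only two distinct word orders, `ZXYW` and `YZXW`, appear.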
As noted above, a conventional permute instruction is processed by reading a vector operand from a register in a register file, shuffling the operand elements, and writing the result back into a register in the register file. The shuffling is performed within the execution pipeline using a set of multiplexers. A vector arithmetic instruction then reads the shuffled vector operand from the register file and performs the vector arithmetic.
The conventional approach, however, has a number of drawbacks. First, since the permute instruction writes back into the register file, it occupies valuable register file space that could be used for other temporary storage. Second, writing the shuffled vector operand back into the register file creates a “read after write” dependency hazard for the later vector arithmetic instruction: the later instruction must wait for the permute instruction to flow fully through the pipeline before it can retrieve the shuffled vector operand from the register file, so the issue logic stalls newer dependent instructions until the permute result is ready. This stalling leaves pipeline stages unfilled for a number of cycles, and particularly for deeply pipelined execution units, performance can be significantly degraded.
Another approach for shuffling elements of vector operands relies on swizzle instructions. Conventional swizzle instructions may precede vector arithmetic instructions in an instruction stream to shuffle vector operand elements in an execution pipeline for subsequent processing by vector arithmetic instructions. Swizzle instructions have the benefit of not requiring shuffled operands to be written back to the register file prior to use, which reduces the number of registers being used and avoids the read after write dependencies in the execution pipeline. However, conventional designs require a swizzle instruction to be issued before each arithmetic instruction that requires a custom word ordering, as each swizzle instruction only specifies the custom word ordering for the immediately subsequent arithmetic instruction in the instruction stream. The use of such instructions has been found to unnecessarily swell the code size of instruction streams that use the same word ordering for multiple arithmetic instructions, and therefore also degrades performance. In the backface culling example discussed above, four swizzle instructions would be required to implement the four permutes needed for the calculation, even though only two unique word orders are required.
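The code size overhead can be made concrete with a short count. The sketch below assumes each list entry is one operand reordering that needs its own swizzle instruction under the conventional scheme, with `"xyzw"` standing for the default layout. The string encoding of word orders and the function name are illustrative assumptions.

```python
def swizzle_cost(word_orders):
    """Return (swizzles_issued, unique_custom_orders) for a list of orderings.

    Assumption: one swizzle instruction is issued per custom ordering,
    even when the same ordering has already been used.
    """
    custom = [order for order in word_orders if order != "xyzw"]
    return len(custom), len(set(custom))


# Word orders of permute1 through permute4 in the backface culling example.
orders = ["zxyw", "zxyw", "yzxw", "yzxw"]
issued, unique = swizzle_cost(orders)
print(issued, unique)
```

The difference between the two counts is the number of swizzle instructions that only repeat an ordering already specified earlier in the stream.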
A need therefore continues to exist in the art for a manner of optimizing the permutation of operand vectors in a vector execution unit.