Field of Invention
The present invention pertains to the computing sciences generally, and, more specifically, to an apparatus and method for improved permute instructions.
Background
FIG. 1 shows a high level diagram of a processing core 100 implemented with logic circuitry on a semiconductor chip. The processing core includes a pipeline 101. The pipeline consists of multiple stages each designed to perform a specific step in the multi-step process needed to fully execute a program code instruction. These typically include at least: 1) instruction fetch and decode; 2) data fetch; 3) execution; 4) write-back. The execution stage performs a specific operation identified by an instruction that was fetched and decoded in prior stage(s) (e.g., in step 1) above) upon data identified by the same instruction and fetched in another prior stage (e.g., step 2) above). The data that is operated upon is typically fetched from (general purpose) register storage space 102. New data that is created at the completion of the operation is also typically “written back” to register storage space (e.g., at stage 4) above).
The logic circuitry associated with the execution stage is typically composed of multiple “execution units” or “functional units” 103_1 to 103_N that are each designed to perform their own unique subset of operations (e.g., a first functional unit performs integer math operations, a second functional unit performs floating point instructions, a third functional unit performs load/store operations from/to cache/memory, etc.). The collection of all operations performed by all the functional units corresponds to the “instruction set” supported by the processing core 100.
Two types of processor architectures are widely recognized in the field of computer science: “scalar” and “vector”. A scalar processor is designed to execute instructions that perform operations on a single set of data, whereas a vector processor is designed to execute instructions that perform operations on multiple sets of data. FIGS. 2A and 2B present a comparative example that demonstrates the basic difference between a scalar processor and a vector processor.
FIG. 2A shows an example of a scalar AND instruction in which a single operand set, A and B, is ANDed together to produce a singular (or “scalar”) result C (i.e., A.AND.B=C). By contrast, FIG. 2B shows an example of a vector AND instruction in which two operand sets, A/B and D/E, are respectively ANDed together in parallel to simultaneously produce a vector result C, F (i.e., A.AND.B=C and D.AND.E=F). As a matter of terminology, a “vector” is a data element having multiple “elements”. For example, a vector V=Q, R, S, T, U has five different elements: Q, R, S, T and U. The “size” of the exemplary vector V is five (because it has five elements).
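The distinction between FIGS. 2A and 2B can be illustrated with a minimal software sketch. The function names scalar_and and vector_and are hypothetical and do not correspond to any actual instruction; the sketch models only the element-wise result, not the parallel hardware execution.

```python
def scalar_and(a: int, b: int) -> int:
    """Model a scalar AND: one operand set (a, b) yields one result."""
    return a & b


def vector_and(xs, ys):
    """Model a vector AND: each pair of elements is ANDed independently.

    A list comprehension runs sequentially; real vector hardware would
    produce all result elements in parallel.
    """
    if len(xs) != len(ys):
        raise ValueError("input vectors must have the same size")
    return [x & y for x, y in zip(xs, ys)]


# Scalar case: A.AND.B = C
c = scalar_and(0b1100, 0b1010)

# Vector case: A.AND.B = C and D.AND.E = F, produced as one result vector
c_f = vector_and([0b1100, 0b1111], [0b1010, 0b0101])
```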
FIG. 1 also shows the presence of vector register space 107 that is different than general purpose register space 102. Specifically, general purpose register space 102 is nominally used to store scalar values. As such, when any of the execution units perform scalar operations they nominally use operands called from (and write results back to) general purpose register storage space 102. By contrast, when any of the execution units perform vector operations they nominally use operands called from (and write results back to) vector register space 107. Different regions of memory may likewise be allocated for the storage of scalar values and vector values.
Note also the presence of masking logic 104_1 to 104_N and 105_1 to 105_N at the respective inputs to and outputs from the functional units 103_1 to 103_N. In various implementations, only one of these layers is actually implemented—although that is not a strict requirement. For any instruction that employs masking, input masking logic 104_1 to 104_N and/or output masking logic 105_1 to 105_N may be used to control which elements are effectively operated on for the vector instruction. Here, a mask vector is read from a mask register space 106 (e.g., along with input data vectors read from vector register storage space 107) and is presented to at least one of the masking logic 104, 105 layers.
Over the course of executing vector program code each vector instruction need not require a full data word. For example, the input vectors for some instructions may only be 8 elements, the input vectors for other instructions may be 16 elements, the input vectors for other instructions may be 32 elements, etc. Masking layers 104/105 are therefore used to identify a set of elements of a full vector data word that apply for a particular instruction so as to effect different vector sizes across instructions. Typically, for each vector instruction, a specific mask pattern kept in mask register space 106 is called out by the instruction, fetched from mask register space and provided to either or both of the mask layers 104/105 to “enable” the correct set of elements for the particular vector operation.
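The effect of a mask on a vector operation can be sketched as follows. The function name masked_vector_and is hypothetical. The sketch also assumes that masked-off positions keep the prior content of the destination; the text above does not state what happens to those positions, and other behaviors (such as zeroing) are possible.

```python
def masked_vector_and(xs, ys, mask, dest):
    """AND xs and ys element-wise, but only where the mask enables it.

    mask holds one bit (0 or 1) per element. ASSUMPTION: a disabled
    element keeps the value already present in dest.
    """
    if not (len(xs) == len(ys) == len(mask) == len(dest)):
        raise ValueError("all vectors must have the same size")
    return [(x & y) if m else d for x, y, m, d in zip(xs, ys, mask, dest)]


# Only the two lowest elements are enabled, so the operation behaves
# like a 2-element vector instruction on a 4-element data word.
result = masked_vector_and([3, 3, 3, 3], [1, 2, 3, 0], [1, 1, 0, 0], [9, 9, 9, 9])
```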
FIGS. 3a to 3e show the logical operation of prior art VINSERT, VEXTRACT and VPERMUTE instructions. Note that the names of the instructions have been abbreviated or otherwise simplified as compared to their actual name.
FIG. 3a shows the logical operation of a prior art VINSERT instruction. As observed in FIG. 3a, a first input operand corresponds to 128 bits of information 301_A and a second input operand corresponds to a 256 bit vector 302_A. A third, immediate input operand (not shown) specifies which half (lower half or upper half) of the 256 bit vector 302_A is to be replaced by the 128 bits of information of the first input operand 301_A. The resulting structure is stored in a destination/result vector having a size of 256 bits. The 128 bits of information 301_A, input vector 302_A and result are floating point values that can be 32 bits or 64 bits in size.
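The VINSERT behavior of FIG. 3a can be modeled with a short sketch. The function name vinsert is hypothetical, the vectors are modeled as lists of 32 bit elements with index 0 as the lowest ordered element, and the use of immediate bit 0 to select the half (0 for lower, 1 for upper) is an assumption of this sketch.

```python
def vinsert(src128, src256, imm):
    """Replace one 128 bit half of src256 with src128.

    src128: four 32 bit elements; src256: eight 32 bit elements.
    ASSUMPTION: bit 0 of imm selects the half (0 = lower, 1 = upper).
    """
    if len(src128) != 4 or len(src256) != 8:
        raise ValueError("expected a 4-element and an 8-element vector")
    result = list(src256)          # the input vector itself is not modified
    start = 4 if (imm & 1) else 0  # first element index of the chosen half
    result[start:start + 4] = src128
    return result


upper = vinsert([10, 11, 12, 13], [0, 1, 2, 3, 4, 5, 6, 7], 1)
lower = vinsert([10, 11, 12, 13], [0, 1, 2, 3, 4, 5, 6, 7], 0)
```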
FIG. 3b shows the logical operation of a prior art VEXTRACT instruction. As observed in FIG. 3b, a first input operand corresponds to a 256 bit vector 301_B. A second, immediate input operand (not shown) specifies which half (lower half or upper half) of the 256 bit input vector 301_B is to be written over the lowest ordered 128 bits of a 256 bit vector stored in a destination register 302_B. Input vector 301_B is structured as floating point values that are 32 bits or 64 bits in size. The instruction format may alternatively specify 128 bits in memory as the destination rather than the destination register 302_B.
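A comparable sketch for the register-destination form of VEXTRACT is given below. The function name vextract is hypothetical, vectors are modeled as lists of 32 bit elements with index 0 as the lowest ordered element, and the immediate encoding (bit 0 selects the half) is an assumption. The sketch further assumes that the upper 128 bits of the destination are left unchanged, which the description above does not specify.

```python
def vextract(src256, imm, dest256):
    """Copy one 128 bit half of src256 over the lowest 128 bits of dest256.

    ASSUMPTION: bit 0 of imm selects the half (0 = lower, 1 = upper).
    ASSUMPTION: the upper half of the destination is left untouched.
    """
    if len(src256) != 8 or len(dest256) != 8:
        raise ValueError("expected two 8-element vectors")
    start = 4 if (imm & 1) else 0
    result = list(dest256)
    result[0:4] = src256[start:start + 4]
    return result


extracted = vextract([0, 1, 2, 3, 4, 5, 6, 7], 1, [90] * 8)
```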
FIGS. 3c through 3e show the respective logical operations of three different VPERMUTE instructions (VPERMILPS, VPERMILPD, VPERM2F128).
FIG. 3c shows the logical operation of the VPERMILPS instruction. As observed in FIG. 3c, the VPERMILPS instruction accepts an input operand 301_C that corresponds to a 256 bit input vector having eight 32 bit (single precision) floating point values. The result is also a 256 bit vector having eight 32 bit single precision floating point values as its elements 302_C. A second input vector (not shown) uniquely specifies, for each of the four elements in the lower half of the result, which of the four elements 301_C_1 through 301_C_4 in the lower half of the input vector 301_C is to provide the output element with its content.
FIG. 3c shows the operation for only output elements 302_C_1 and 302_C_5. Here, the content of output element 302_C_1 can be “filled” with the content of any of input elements 301_C_1 through 301_C_4. Which one of input elements 301_C_1 through 301_C_4 is selected to fill output element 302_C_1 is articulated in a (not shown) second input vector. Here, the second input vector contains a separate 2 bit control field for each of the eight elements in the output vector. The source for an output element in the lower half of the result 302_C must be chosen from the lower half of input vector 301_C. Likewise, the source for an output element in the upper half of the result 302_C must be chosen from the upper half of input vector 301_C.
Although not explicitly shown in FIG. 3c, the content of each of output elements 302_C_2 through 302_C_4 is uniquely specified as any of input elements 301_C_1 through 301_C_4 by way of the information contained in the second input vector. Similarly, as observed in FIG. 3c, the content of output element 302_C_5 is “filled” with the content of any of input elements 301_C_5 through 301_C_8. Again, which one of input elements 301_C_5 through 301_C_8 is selected to fill output element 302_C_5 is also articulated in the (not shown) second input vector. The content of each of output elements 302_C_6 through 302_C_8 is uniquely specified as any of input elements 301_C_5 through 301_C_8 by the (not shown) second input vector.
Another version of the VPERMILPS instruction uses an immediate operand instead of the second input vector to choose the selection pattern of the input vector 301_C. Here, the input element selection pattern for the lower half of the destination matches the input element selection pattern for the upper half of the destination.
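Both versions of the VPERMILPS instruction can be modeled with the following sketch. The function names are hypothetical, and the vectors are modeled as lists with index 0 as the lowest ordered element. For the immediate version, the sketch assumes that the 2 bit field for element i of each half occupies immediate bits 2i+1 and 2i; the description above does not give that layout.

```python
def vpermilps(src, ctrl):
    """Permute eight 32 bit elements within each 128 bit half.

    ctrl holds one 2 bit control field per output element. An output
    element in the lower half may only select from src[0..3]; an output
    element in the upper half may only select from src[4..7].
    """
    if len(src) != 8 or len(ctrl) != 8:
        raise ValueError("expected two 8-element vectors")
    result = []
    for i, field in enumerate(ctrl):
        base = 0 if i < 4 else 4            # first element of the same half
        result.append(src[base + (field & 0b11)])
    return result


def vpermilps_imm(src, imm8):
    """Immediate version: one selection pattern applied to both halves.

    ASSUMPTION: the field for element i is imm8 bits [2i+1 : 2i].
    """
    fields = [(imm8 >> (2 * i)) & 0b11 for i in range(4)]
    return vpermilps(src, fields + fields)


src = [10, 11, 12, 13, 14, 15, 16, 17]
by_vector = vpermilps(src, [3, 2, 1, 0, 0, 0, 0, 0])
by_imm = vpermilps_imm(src, 0b00011011)  # reverses the elements in each half
```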
FIG. 3d shows the logical operation of the VPERMILPD instruction. As observed in FIG. 3d, the VPERMILPD instruction accepts an input operand 301_D that corresponds to a 256 bit input vector having four 64 bit (double precision) floating point values. The result is also a 256 bit vector 302_D having four 64 bit double precision floating point values as its elements. A second input vector (not shown) uniquely specifies, for each of the two elements in the lower half of the result, which of the two elements 301_D_1 and 301_D_2 in the lower half of the input vector 301_D is to provide the output element with its content.
As observed in FIG. 3d, each of output elements 302_D_1 and 302_D_2 can be uniquely “filled” with either of input elements 301_D_1 or 301_D_2. Likewise, each of output elements 302_D_3 and 302_D_4 can be uniquely “filled” with either of input elements 301_D_3 or 301_D_4. Which input element is selected to fill a specific output element is articulated in a (not shown) second input vector. Here, the second input vector contains a separate 2 bit control field for each of the four elements in the output vector.
Another version of the VPERMILPD instruction uses an immediate operand instead of the second input vector to choose the selection pattern of the input vector 301_D. Here, the input element selection pattern for the lower half of the destination matches the input element selection pattern for the upper half of the destination.
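The vector-controlled VPERMILPD operation can be sketched in the same manner. The function name vpermilpd is hypothetical. Each control entry is modeled simply as a selector value of 0 or 1; how that selector is encoded within the control field of the second input vector is not modeled here.

```python
def vpermilpd(src, sel):
    """Permute four 64 bit elements within each 128 bit half.

    sel holds one selector (0 or 1) per output element. Outputs 0 and 1
    select from src[0..1]; outputs 2 and 3 select from src[2..3].
    ASSUMPTION: the selector is already decoded from the control field.
    """
    if len(src) != 4 or len(sel) != 4:
        raise ValueError("expected two 4-element vectors")
    result = []
    for i, s in enumerate(sel):
        if s not in (0, 1):
            raise ValueError("each selector must be 0 or 1")
        base = 0 if i < 2 else 2   # first element of the same half
        result.append(src[base + s])
    return result


permuted = vpermilpd([1.0, 2.0, 3.0, 4.0], [1, 0, 1, 1])
```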
For both the VPERMILPS and VPERMILPD instructions, the result is stored in a vector register specified in the instruction format of the instruction. The source of the first input vector is specified in the instruction format and corresponds to a vector register when the second input vector is utilized to determine the selection pattern. In this case, the source of the second input vector is also specified in the instruction format and corresponds to either a second vector register or a memory location. By contrast, if the immediate operand is used to determine the selection pattern, the source of the first input vector is specified in the instruction format and may be a vector register or a memory location.
FIG. 3e shows the logical operation of the VPERM2F128 instruction. As observed in FIG. 3e, the VPERM2F128 instruction accepts two separate 256 bit vector input operands 301_E, 302_E. Both the lower and upper 128 bit halves 303_E_1, 303_E_2 of a 256 bit result 303_E can be filled with any of the lower or upper halves 301_E_1, 301_E_2, 302_E_1, 302_E_2 of both input vectors 301_E, 302_E. The result is stored in a vector register specified in the instruction format of the instruction. The sources of both input vectors 301_E, 302_E are specified in the instruction format and may correspond to a pair of vector registers or one vector register and one memory location.
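The half-selection behavior of VPERM2F128 can be sketched as follows. The function name vperm2f128 and the selector numbering (0 and 1 for the lower and upper halves of the first input, 2 and 3 for the lower and upper halves of the second input) are assumptions of this sketch; the description above does not state how the selection is encoded. Vectors are modeled as lists of eight 32 bit elements with index 0 as the lowest ordered element.

```python
def vperm2f128(a, b, lo_sel, hi_sel):
    """Build a 256 bit result from any two 128 bit halves of a and b.

    ASSUMPTION: selector values map to halves as
    0 = a lower, 1 = a upper, 2 = b lower, 3 = b upper.
    """
    if len(a) != 8 or len(b) != 8:
        raise ValueError("expected two 8-element vectors")
    if lo_sel not in range(4) or hi_sel not in range(4):
        raise ValueError("selectors must be in the range 0..3")
    halves = [a[:4], a[4:], b[:4], b[4:]]
    # The lower half of the result comes first in the list.
    return list(halves[lo_sel]) + list(halves[hi_sel])


a = [0, 1, 2, 3, 4, 5, 6, 7]
b = [10, 11, 12, 13, 14, 15, 16, 17]
mixed = vperm2f128(a, b, 3, 0)  # lower result half from b upper, upper from a lower
```

Note that the same input half may be selected for both halves of the result, for example vperm2f128(a, b, 1, 1).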