FIG. 1 shows a high level diagram of a processing core 100 implemented with logic circuitry on a semiconductor chip. The processing core includes a pipeline 101. The pipeline consists of multiple stages each designed to perform a specific step in the multi-step process needed to fully execute a program code instruction. These typically include at least: 1) instruction fetch and decode; 2) data fetch; 3) execution; 4) write-back. The execution stage performs a specific operation identified by an instruction that was fetched and decoded in prior stage(s) (e.g., in step 1) above) upon data identified by the same instruction and fetched in another prior stage (e.g., step 2) above). The data that is operated upon is typically fetched from (general purpose) register storage space 102. New data that is created at the completion of the operation is also typically “written back” to register storage space (e.g., at stage 4) above).
The logic circuitry associated with the execution stage is typically composed of multiple “execution units” or “functional units” 103_1 to 103_N, each designed to perform its own unique subset of operations (e.g., a first functional unit performs integer math operations, a second functional unit performs floating point instructions, a third functional unit performs load/store operations from/to cache/memory, etc.). The collection of all operations performed by all the functional units corresponds to the “instruction set” supported by the processing core 100.
Two types of processor architectures are widely recognized in the field of computer science: “scalar” and “vector”. A scalar processor is designed to execute instructions that perform operations on a single set of data, whereas a vector processor is designed to execute instructions that perform operations on multiple sets of data. FIGS. 2a and 2b present a comparative example that demonstrates the basic difference between a scalar processor and a vector processor.
FIG. 2a shows an example of a scalar AND instruction in which a single operand set, A and B, is ANDed together to produce a singular (or “scalar”) result C (i.e., A.AND.B=C). By contrast, FIG. 2b shows an example of a vector AND instruction in which two operand sets, A/B and D/E, are respectively ANDed together in parallel to simultaneously produce a vector result C, F (i.e., A.AND.B=C and D.AND.E=F). As a matter of terminology, a “vector” is a data item having multiple “elements”. For example, a vector V=Q, R, S, T, U has five different elements: Q, R, S, T and U. The “size” of the exemplary vector V is five (because it has five elements).
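By way of illustration only, the difference between the scalar AND of FIG. 2a and the vector AND of FIG. 2b can be sketched in software as follows. The function names scalar_and and vector_and and the operand values are hypothetical; a list comprehension stands in for what hardware would perform in parallel.

```python
def scalar_and(a: int, b: int) -> int:
    """Scalar AND: one operand set (a, b) yields one result."""
    return a & b


def vector_and(x: list[int], y: list[int]) -> list[int]:
    """Vector AND: same-positioned elements of x and y are ANDed together.

    A software loop stands in for the parallel hardware operation.
    """
    if len(x) != len(y):
        raise ValueError("input vectors must have the same size")
    return [p & q for p, q in zip(x, y)]


# Hypothetical operand values for the two operand sets A/B and D/E.
A, B, D, E = 0b1100, 0b1010, 0b0111, 0b0101

C = scalar_and(A, B)                 # A.AND.B = C
C_F = vector_and([A, D], [B, E])     # [A.AND.B, D.AND.E] = [C, F]
```

Here the vector form packs A and D into one input vector and B and E into the other, so a single instruction yields both C and F.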
FIG. 1 also shows the presence of vector register space 107 that is different from general purpose register space 102. Specifically, general purpose register space 102 is nominally used to store scalar values. As such, when any of the execution units perform scalar operations they nominally use operands called from (and write results back to) general purpose register storage space 102. By contrast, when any of the execution units perform vector operations they nominally use operands called from (and write results back to) vector register space 107. Different regions of memory may likewise be allocated for the storage of scalar values and vector values.
Note also the presence of masking logic 104_1 to 104_N and 105_1 to 105_N at the respective inputs to and outputs from the functional units 103_1 to 103_N. In various implementations, only one of these layers is actually implemented—although that is not a strict requirement. For any instruction that employs masking, input masking logic 104_1 to 104_N and/or output masking logic 105_1 to 105_N may be used to control which elements are effectively operated on for the vector instruction. Here, a mask vector is read from a mask register space 106 (e.g., along with input data vectors read from vector register storage space 107) and is presented to at least one of the masking logic 104, 105 layers.
Over the course of executing vector program code, not every vector instruction requires a full data word. For example, the input vectors for some instructions may only be 8 elements, the input vectors for other instructions may be 16 elements, the input vectors for other instructions may be 32 elements, etc. Masking layers 104/105 are therefore used to identify a set of elements of a full vector data word that apply for a particular instruction so as to effect different vector sizes across instructions. Typically, for each vector instruction, a specific mask pattern kept in mask register space 106 is called out by the instruction, fetched from mask register space and provided to either or both of the mask layers 104/105 to “enable” the correct set of elements for the particular vector operation.
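The effect of masking can be sketched as follows, as a minimal model only. The function name masked_add, the eight-element vector width, and the choice of an add operation are hypothetical. The treatment of disabled elements is also an assumption: here they keep the prior destination value, whereas an implementation could instead zero them.

```python
def masked_add(a, b, dest, mask):
    """Add same-positioned elements of a and b only where mask is 1.

    ASSUMPTION: elements whose mask bit is 0 keep the value already
    held in dest; the text does not specify this behavior.
    """
    if not (len(a) == len(b) == len(dest) == len(mask)):
        raise ValueError("all vectors must have the same size")
    return [x + y if m else d for x, y, d, m in zip(a, b, dest, mask)]


FULL_WIDTH = 8                       # hypothetical full vector word size
a = list(range(FULL_WIDTH))          # [0, 1, ..., 7]
b = [10] * FULL_WIDTH
dest = [-1] * FULL_WIDTH             # prior destination contents

# Mask pattern enabling only the lower four elements.
mask_lower4 = [1, 1, 1, 1, 0, 0, 0, 0]
result = masked_add(a, b, dest, mask_lower4)
# result == [10, 11, 12, 13, -1, -1, -1, -1]
```

Supplying a different mask pattern to the same operation effects a different vector size without changing the width of the underlying data word.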
A number of vector instructions have been implemented and are already known. These include the VPBROADCAST, VPSUBB, VPADD, VPSHUFB, VPXOR, VINSERT, VEXTRACT and VPBLEND instructions. FIGS. 3a through 3g demonstrate basic operation of these instructions respectively.
As observed in FIG. 3a, the VPBROADCAST instruction accepts a single scalar value A as an input operand and produces a vector resultant R having A in each element of the vector. The VPBROADCAST instruction can also be used to provide a resultant that corresponds to values looked up from a lookup table if an address of the lookup table and an index into it from where the values are to be looked up are provided as input operands.
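The scalar form of the broadcast can be modeled with the following minimal sketch. The function name vpbroadcast and the default of 16 elements are hypothetical, and the lookup table form is not modeled here.

```python
def vpbroadcast(a, num_elements=16):
    """Return a resultant R holding the scalar a in every element.

    num_elements is a hypothetical parameter; the actual element count
    depends on the vector and element sizes of the implementation.
    """
    if num_elements <= 0:
        raise ValueError("num_elements must be positive")
    return [a] * num_elements


R = vpbroadcast(0x42, num_elements=4)
# R == [0x42, 0x42, 0x42, 0x42]
```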
As observed in FIG. 3b, the VPSUBB and VPADD instructions produce a vector resultant R whose constituent elements correspond to the respective subtraction/addition of same positioned elements in a pair of input vectors A, B.
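The element-wise behavior can be sketched as below. The function names are hypothetical, and the sketch assumes byte-sized elements whose results wrap around modulo 256; the text above does not state how overflow or underflow is handled.

```python
ELEMENT_BITS = 8                          # ASSUMPTION: byte-sized elements
ELEMENT_MASK = (1 << ELEMENT_BITS) - 1    # 0xFF


def vpadd(a, b):
    """R[i] = A[i] + B[i], wrapped to the element width (assumed)."""
    if len(a) != len(b):
        raise ValueError("input vectors must have the same size")
    return [(x + y) & ELEMENT_MASK for x, y in zip(a, b)]


def vpsubb(a, b):
    """R[i] = A[i] - B[i], wrapped to the element width (assumed)."""
    if len(a) != len(b):
        raise ValueError("input vectors must have the same size")
    return [(x - y) & ELEMENT_MASK for x, y in zip(a, b)]


sums = vpadd([1, 2, 250], [1, 2, 10])     # [2, 4, 4]   (260 wraps to 4)
diffs = vpsubb([5, 0], [3, 1])            # [2, 255]    (-1 wraps to 255)
```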
As observed in FIG. 3c, the VPSHUFB instruction produces a vector resultant R by “shuffling” elements of an input vector A according to a shuffle scheme defined by input vector B. Here, each element of B corresponds to a same positioned element in R. Each element of B effectively identifies which element of A is to be placed in its respective element of R.
As such, in the example of FIG. 3c, input vector B (as well as input vector A) are vectors whose respective elements are each eight bits (a byte) in size. For example, the notation “0x42”, as is understood in the art, is used to represent a byte whose upper four bits correspond to the value of 4 (i.e., “0100”) and whose lower four bits correspond to the value of 2 (i.e., “0010”). That is, the notation 0x42 represents a byte having bit sequence 01000010.
An implementation of the VPSHUFB instruction only uses a portion of each element of input vector B to identify an element of input vector A for inclusion in the resultant R. For example, one implementation only uses the lower half of an element of input vector B to identify a particular element of vector A. For instance, as observed in FIG. 3c, element 301 of input vector B is “0x02”. As such, the element 301 is specifying that the third (e.g., according to sequence, 0, 1, 2) element 302 in vector A is being selected for inclusion in the element of resultant R that corresponds to the same element location as element 301 in control vector B. Similarly, if an element of input vector B is 0x09 then the tenth element in input vector A is being selected for the same element location in resultant R.
In an embodiment where each vector is 128 bits and has 16 elements each of a byte in size, the lower half of an element of input vector B is four bits and can specify any one of the 16 elements of input vector A (e.g., using hexadecimal form, the lower half of any element of input vector B can be any value from 0 to f). The value of the upper half of each element of input vector B is irrelevant except for any of values 8 through f, which correspond to the most significant bit of the upper half of the element being equal to 1. In this case, the lower half of the element is ignored (i.e., it does not specify an element of vector A) and a value of 0 (00000000) is inserted into the element of the resultant R whose position corresponds to the element of input vector B having a most significant bit equal to 1.
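The shuffle rules described above for the 128 bit, 16 element embodiment can be modeled with the following minimal sketch. The function name vpshufb and the sample vector contents are hypothetical; the sketch follows only the behavior stated in the text (lower four bits select an element of A, a set most significant bit forces a zero).

```python
def vpshufb(a, b):
    """Shuffle the 16 byte elements of a under control of the 16 bytes of b.

    For each control byte in b:
      - if its most significant bit is 1, the resultant element is 0;
      - otherwise its lower four bits index the element of a to copy.
    """
    if len(a) != 16 or len(b) != 16:
        raise ValueError("this sketch models 16 byte-sized elements")
    r = []
    for ctrl in b:
        if ctrl & 0x80:                  # upper half is 8 through f
            r.append(0x00)
        else:
            r.append(a[ctrl & 0x0F])     # lower half selects element of A
    return r


# Hypothetical input: element i of A holds the value 0x10 + i.
A = [0x10 + i for i in range(16)]
B = [0x02, 0x09, 0x80, 0x42] + [0x00] * 12
R = vpshufb(A, B)
# R[0] == 0x12 (third element of A), R[1] == 0x19 (tenth element of A),
# R[2] == 0x00 (most significant bit set), R[3] == 0x12 (upper half ignored)
```

The control byte 0x42 selects the same element as 0x02 because its upper half (4) does not have its most significant bit set and is therefore irrelevant.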
The VPXOR instruction, as observed in FIG. 3d, provides in each element of resultant R the exclusive OR of same positioned elements in input vectors A and B.
The VINSERT instruction, as observed in FIG. 3e, prepares a resultant R by incorporating an input vector A into R and replacing either the higher half of elements or the lower half of elements in R with the lower half of elements of input vector B. In an implementation, whether the lower half of the elements of input vector B are inserted into the higher or lower half of R is determined by the setting of an immediate operand (e.g., if the immediate operand is a 1 the elements of B are inserted into the higher half of R, if the immediate operand is a 0 the elements of B are inserted into the lower half of R).
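The insert behavior can be sketched as follows. The function name vinsert is hypothetical, and the sketch assumes that list index 0 holds the lowest element, so that the first half of each list is the “lower half”.

```python
def vinsert(a, b, imm):
    """Copy a into R, then replace one half of R with the lower half of b.

    imm == 1: lower half of b goes into the higher half of R.
    imm == 0: lower half of b goes into the lower half of R.
    ASSUMPTION: index 0 is the lowest element.
    """
    if len(a) != len(b) or len(a) % 2:
        raise ValueError("vectors must have the same, even size")
    if imm not in (0, 1):
        raise ValueError("imm must be 0 or 1")
    half = len(a) // 2
    low_b = b[:half]
    if imm == 1:
        return a[:half] + low_b
    return low_b + a[half:]


A = [0, 1, 2, 3]
B = [10, 11, 12, 13]
high_insert = vinsert(A, B, 1)   # [0, 1, 10, 11]
low_insert = vinsert(A, B, 0)    # [10, 11, 2, 3]
```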
The VEXTRACT instruction, as observed in FIG. 3f, extracts, depending on the setting of an input parameter (in one implementation, an immediate operand), the higher half of elements or the lower half of elements of an input vector A and presents the extracted elements in the resultant R. For example, if input vector A is a 256 bit vector, resultant R will be the higher 128 bits or the lower 128 bits depending on whether the immediate operand is a 1 or a 0.
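For the 256 bit example just given, the extraction can be modeled on a Python integer as in the following minimal sketch. The function name vextract128 and the sample value are hypothetical.

```python
MASK_128 = (1 << 128) - 1


def vextract128(a: int, imm: int) -> int:
    """Return the higher (imm == 1) or lower (imm == 0) 128 bits of a."""
    if not 0 <= a < (1 << 256):
        raise ValueError("a must fit in 256 bits")
    if imm not in (0, 1):
        raise ValueError("imm must be 0 or 1")
    shift = 128 if imm == 1 else 0
    return (a >> shift) & MASK_128


# Hypothetical 256 bit input: 0xAAAA in the higher half, 0xBBBB in the lower.
A = (0xAAAA << 128) | 0xBBBB
higher = vextract128(A, 1)   # 0xAAAA
lower = vextract128(A, 0)    # 0xBBBB
```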
The VPBLEND instruction is akin to the VINSERT instruction but with finer-grained control. The VPBLEND instruction, as observed in FIG. 3g, prepares a resultant R by incorporating an input vector A into R and replacing specific elements of R on an element by element basis with corresponding (same positioned) elements of input vector B depending on the settings of a mask input vector M. For example, if A/R is a 256 bit vector there are 32 byte sized elements. M is a 32 bit input vector where each bit corresponds to a unique element in A/R and B. If M contains a value of 1 in a particular location, the corresponding byte in B is incorporated into the corresponding byte in R.
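The blend operation for 32 byte sized elements can be sketched as follows. The function name vpblend is hypothetical, and the sketch assumes that bit i of M (counting from the least significant bit) controls element i; the text does not fix the bit ordering.

```python
def vpblend(a, b, m):
    """R[i] = B[i] where bit i of m is 1, otherwise R[i] = A[i].

    ASSUMPTION: bit i of m, counted from the least significant bit,
    corresponds to element i of a, b and the resultant.
    """
    if len(a) != len(b):
        raise ValueError("input vectors must have the same size")
    if not 0 <= m < (1 << len(a)):
        raise ValueError("mask must have one bit per element")
    return [b[i] if (m >> i) & 1 else a[i] for i in range(len(a))]


A = [0x00] * 32          # 256 bit vector of 32 byte elements
B = [0xFF] * 32
M = 0b101                # select elements 0 and 2 from B
R = vpblend(A, B, M)
# R[0] == 0xFF, R[1] == 0x00, R[2] == 0xFF, all remaining elements 0x00
```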
A problem in previous processor implementations is that if a need arose to “roll” vector elements left or right, the compiler produced a long instruction stream that required at least one instruction to move each input element to its correct destination element location. For example, FIG. 4 shows a vector A and its constituent elements. If, for whatever reason, a compiler recognizes a need to “roll” the vector's elements to the left or right, at least one instruction will be constructed into the object code for each element in the vector. FIG. 4 shows an example where the vector A needs to be moved three elements to the left (in instruction execution parlance left and right directions may be reversed as compared to their representation on a hand drawn page). As such, at least one instruction is needed for each of operations 401_1 through 401_N-2 to create the needed vector N.
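The cost of the element by element approach can be illustrated with the following minimal sketch, which counts one move per element. The function name roll_left_elementwise is hypothetical. The sketch assumes that elements leaving one end wrap around to the other, and that “left” means toward higher element positions; neither is established by the text, and FIG. 4 governs the actual arrangement.

```python
def roll_left_elementwise(a, k):
    """Roll the elements of a by k positions, one move per element.

    ASSUMPTIONS: the roll wraps around, and "left" means toward higher
    element indices. Returns the rolled vector and the number of
    per-element moves performed, which stands in for the number of
    instructions in the emitted instruction stream.
    """
    n = len(a)
    r = [None] * n
    moves = 0
    for src in range(n):
        r[(src + k) % n] = a[src]    # one move "instruction" per element
        moves += 1
    return r, moves


A = [0, 1, 2, 3, 4, 5, 6, 7]
rolled, moves = roll_left_elementwise(A, 3)
# rolled == [5, 6, 7, 0, 1, 2, 3, 4], moves == 8
```

The move count grows linearly with the vector size, which is the instruction stream length the passage above identifies as a problem.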